Evolvability and Learning

This post investigates the paper Evolvability by computer scientist Leslie Valiant analyzing evolution and its limits through the lens of machine learning theory.

The first section gives an overview of Valiant’s now famous Probably Approximately Correct learning model, and the material is derived from Kearns and Vazirani’s short textbook Introduction to Computational Learning Theory. Valiant’s more recent notion of evolvability (explored in the second section) implies PAC learnability, but not conversely. In particular, the parity functions (i.e. the mod 2 linear functions ${\{0,1\}^n\rightarrow\{0,1\}}$) are learnable, but provably not evolvable (under Valiant’s or any similar notion of evolvability).

Throughout, I’ve included some fun pictures, produced by simple simulations in Python with matplotlib.

1. Probably Approximately Correct

As motivation for Valiant’s Probably Approximately Correct (PAC) learning model, consider the following scenario. There is some closed axis-aligned rectangle ${R=\prod_{i=1}^n[a_i,b_i]}$ (called the target) in the unit cube ${[0,1]^n}$ whose shape we’d like to learn from data. Say, we can draw examples from some oracle which outputs a sequence of pairs ${(\vec x_j, y_j)}$ with ${\vec x_j}$ drawn iid uniformly from ${[0,1]^n}$ and ${y_j}$ the boolean value of the proposition ${\vec x_j\in R}$.

We want a learning algorithm that can form a reasonable hypothesis for ${R}$‘s shape from a finite sample of data like the following.

Here positive and negative examples are labeled black and white, respectively. Ten points don’t constitute a lot of data, but it’s enough to suggest an obvious candidate guess for ${R}$‘s shape, namely the closest-fit rectangle around the positive examples.

A nice property of the closest-fit algorithm is that it always produces a hypothesis consistent with the data, assuming the oracle does its job perfectly. Of course, with such a small sample size, there may be substantial error (measured in area) between the closest-fit rectangle and the true target.

The hope is that this error quickly converges to ${0}$ as the number ${m}$ of samples grows. The pictures above and below are produced by this simulation code, and they indeed suggest convergence.

Here’s a graph of the error in the above simulation as a function of the number of samples.

Rerunning the simulation 100 times suggests this behavior is typical.

The closest-fit algorithm is very simple, and convergence should come as no surprise. But what does convergence mean here? If we’re very unlucky, we could see a long sequence of data points that doesn’t result in low error for the closest-fit algorithm.

The importance of Valiant’s work on the PAC model is that it provides a rigorous mathematical framework for proving the correctness of such algorithms. (All of the standard disclaimers about the model being unrealistically simple and so forth apply. Some such concerns, e.g. learning from a noisy oracle, have been studied, but this overview isn’t really the place for them.)

The PAC criterion essentially says that the error quickly converges to 0 in probability for any target ${R}$, and actually for any (Borel) probability distribution ${D}$ on ${{\mathbb R}^n}$. Explicitly, we mean:

• ${D}$ is both the training distribution used by the oracle to produce samples and the target distribution according to which the error of the algorithm is judged.
• Convergence in probability means for any ${\varepsilon>0}$ and ${\delta>0}$ there’s a number ${M}$ called the sample complexity such that after ${m\geq M}$ samples, the probability is at most ${\delta}$ that the algorithm will output a hypothesis with error exceeding ${\varepsilon}$. (So the chance of noticeable failure is slim in the face of a large amount of data.)
• By “quickly”, I mean that we can put an upper bound on the sample complexity which is polynomial in ${1/\varepsilon}$, ${1/\delta}$, and the ambient dimension ${n}$.
• The parameter ${\delta}$ corresponds to “probably”, and the parameter ${\varepsilon}$ corresponds to “approximately correct”. To the extent that we only care about bounds being polynomial, we could replace ${\varepsilon}$ and ${\delta}$ with their minimum, without changing the criterion of learnability.
• It’s sometimes worthwhile to allow the algorithm to output hypotheses from a slightly more general class than the target can actually come from.
• And there are a few more subtleties that I’m happy to brush under the rug.

Theorem 1 The closest-fit rectangle algorithm suffices to learn axis-aligned rectangles in ${{\mathbb R}^n}$ in the PAC model with sample complexity

$\displaystyle m\ \geq\ \frac{2n}{\varepsilon}\ln\left(\frac{2n}{\delta}\right).$

Proof: The main idea is that the algorithm can only output a hypothesis rectangle with error ${>\varepsilon}$ if at least one of the ${2n}$ sides of the target rectangle is not “well-defined” by the data by having a positive example nearby, and a reasonably large number of samples makes this occurrence rare.

Explicitly, for fixed target ${R=\prod_{i=1}^n[a_i,b_i]}$, we mean for each ${i}$, let ${a_i'>a_i}$ be least such that the sub-rectangle ${A_i\subseteq R}$ with ${i}$-th component ${[a_i',b_i]}$ has at least ${\varepsilon/2n}$ less area than ${R}$, relative to ${D}$. Note that if ${D}$ has a probability density function, i.e. is absolutely continuous, then the measure of ${R\backslash A_i}$ is exactly ${\varepsilon/2n}$.

Define ${b_i' and ${B_i\subseteq R}$ similarly.

If the sample ${\vec x_1,\ldots,\vec x_m}$ hits every ${R\backslash A_i}$ and every ${R\backslash B_i}$, the output hypothesis will have error ${\leq\varepsilon}$. So error ${>\varepsilon}$ can only occur if the ${m}$ samples miss at least one of these ${2n}$ sets, which each have measure at least ${\varepsilon/2n}$. That is,

$\displaystyle \mathop{\mathbb P}(\mathrm{error}_D>\varepsilon)\ \leq\ 2n(1-\varepsilon/2n)^m\ \leq\ 2n\exp(-\varepsilon m/2n),$

using the inequality ${1-x\leq\exp(-x)}$, valid for all real ${x}$ by linear approximation and convexity. Finish the proof by solving for ${m}$ to make the rightmost expression in the above ${\leq\delta}$. $\Box$

Corollary 2 The class of conjunctions of boolean literals, e.g. ${x_2\wedge x_3\wedge\neg x_7}$ is PAC-learnable with the same sample complexity as for axis-aligned rectangles.

Proof: Boolean conjunctions are the special case of axis-aligned rectangles where the target distribution ${D}$ is supported on the finite set of bit vectors ${\{0,1\}^n}$. Explicitly, the initial empty rectangle corresponds to the unsatisfiable conjunction ${x_1\wedge\neg x_1\wedge\ldots\wedge x_n\wedge\neg x_n}$, and expansion of the closest-fit rectangle corresponds exactly to dropping literals from the hypothesis conjunction when they’re inconsistent with the data:

• ${[a_i,b_i] = [1,1]}$ means ${x_i}$ is in the conjunction and ${\neg x_i}$ is not.
• ${[a_i,b_i] = [0,0]}$ means ${\neg x_i}$ is in the conjunction and ${x_i}$ is not.
• ${[a_i,b_i] = [0,1]}$ means neither ${x_i}$ nor ${\neg x_i}$ is in the conjunction.

$\Box$

One notable difference between rectangles and conjunctions is that there are only finitely many conjunctions (and finitely many bit vectors). There are exactly ${3^n}$ satisfiable boolean conjunctions, as for each ${i}$, at most one of ${x_i}$ and ${\neg x_i}$ can appear, and (functionally) there is only one unsatisfiable conjunction, containing some ${x_i}$ and its negation.

We mentioned that one desirable aspect of the closest-fit rectangle (and hence the closest-fit conjunction) algorithm is that it always produces a hypothesis consistent with the data. Conveniently, production of a consistent hypothesis is actually sufficient for PAC learning, for finite classes of functions which grow no faster than ${\exp(p(n))}$ for some polynomial ${p}$.

Theorem 3 (General Learning Bound for Finite Classes) A learning algorithm which always produces a consistent hypothesis from some finite set ${H_n}$ will, with probability at least ${1-\delta}$, have error ${<\varepsilon}$ for sample size

$\displaystyle m\ \geq\ \frac{1}{\varepsilon}\ln\left(\frac{|H_n|}{\delta}\right).$

In addition to simplifying (probable approximate) correctness proofs for many natural classes of functions ${\{0,1\}^n\rightarrow\{0,1\}}$, this result improves the bound for learning boolean conjunctions (without changing the algorithm), from ${O(n\log n)}$ dependence on dimension to ${\ln(3^n+1)=O(n)}$.

The proof is surprisingly simple.

Proof: Consider all hypotheses ${h\in H_n}$ with error at least ${\varepsilon}$. For such ${h}$ to remain consistent with ${m}$ labeled examples, each example must independently miss the region on which ${h}$ differs from the target, which we assumed has area ${\geq\varepsilon}$ under the target distribution. For individual ${h}$, this happens with probability at most ${(1-\varepsilon)^m\leq\exp(-\varepsilon m)}$, using ${1-x\leq\exp(-x)}$. Summing this bound over all ${h\in H_n}$, we see that the probability that there exists any hypothesis with error at least ${\varepsilon}$ remaining consistent with ${m}$ samples is at most ${|H_n|\exp(-\varepsilon m)}$. Solve for a lower bound on ${m}$ that makes this ${\leq\delta}$. $\Box$

The General Learning Bound is often billed as a version of Occam’s Razor, expressing the idea that simple explanations (or rather, explanations belonging to a small reference class ${H}$, which is basically the same thing) are more likely to be correct than elaborate ones, in the case that simple ones are sufficient for explaining the data.

A very important example is the class of parity functions ${\{0,1\}^n\rightarrow\{0,1\}}$. These are the mod-2 linear functions ${({\mathbb Z}/2{\mathbb Z})^n\rightarrow{\mathbb Z}/2{\mathbb Z}}$; equivalently, they are formed as xor’s of sets of positive boolean literals.

Corollary 4 The parity functions are PAC-learnable with sample complexity

$\displaystyle m\ \geq\ \frac{1}{\varepsilon}\left(n\ln2+\ln\left(\frac{1}{\delta}\right)\right).$

Proof: There are only ${2^n}$ parity functions, so by the General Learning Bound, it suffices to give an algorithm that produces results consistent with the data. By linearity over ${{\mathbb Z}/2{\mathbb Z}}$, values for any set of bit vectors ${\vec x_1,\ldots,\vec x_m}$ determine values for any bit vectors in the linear span of that set. So the algorithm should find a maximal linearly independent subset ${B}$ of the examples, extend it to a basis ${B^*}$ of all of ${({\mathbb Z}/2{\mathbb Z})^n}$, and output the linear function that sends any labeled ${\vec x_j\in B}$ to its appropriate label ${y_j}$, and sends basis vectors in ${B^*\backslash B}$ to ${0}$. $\Box$

To end this section, I’ll note that the General Learning Bound/Occam’s Razor actually admits a generalization to sufficiently simple infinite function classes, such as the axis-aligned rectangles in ${{\mathbb R}^n}$. Namely, for classes with finite Vapnik-Chervonenkis dimension ${d=d_n}$ growing only polynomially in ${n}$, identification of consistent hypotheses is sufficient for PAC learning. See Chapter 3 of Kearns and Vazirani for details and examples.

2. Evolvability

Now we’re ready to talk about Valiant’s idea that evolution can be studied via computational learning theory, particularly from within the PAC model.

Let’s look first at the negative result that the parity functions can’t feasibly be evolved under his definition, despite the fact they can be learned as in Corollary 4 above. (Valiant then provides a few modest positive results, so the negative result is non-vacuous.) The particulars of his definition are rather complex, but non-evolvability of the parity functions will still hold for any notion of evolvability bearing a few reasonable similarities.

Given a function class ${C}$ (of possible “genotypes”), a reasonable notion of ${C}$ being evolvable should include a biased random walk through ${C}$.1

While the set of possible successors of any element ${c\in C}$ can and should depend on the particular details of how ${c}$ is represented as a mathematical expression (analogous to small genetic mutations), the bias determining which is selected should not. Rather, selection should only be biased in terms of error relative to an unknown target function (“high-fitness phenotypes” are reliably selected for “in the wild”).

Then evolvability should require (at least) that a polynomial-length walk should have a non-negligible probability of producing a non-negligible increase in performance. Valiant asks for high probability of low error, but notes that it may be worthwhile to consider relaxations; for example fitness may improve for a while before getting stuck at a local optimum.

But the issue with parity functions is that there’s no such thing as approximating one with another: the difference (xor) of two parity functions is itself a parity function. So two parity functions either agree exactly, or they differ on half of all possible inputs.

Under the uniform distribution on inputs, no performance feedback is possible, short of accidentally stumbling upon the target function exactly. With ${2^n}$ parity functions, this won’t realistically happen in a polynomial-length walk. And so one can say that the generally misguided irreducible complexity argument actually does apply to parity functions!

But what about trying to make positive progress? How might we evolve axis-aligned rectangles, say under the uniform distribution on ${[0,1]^n}$? Under Valiant’s model (which is rather laborious to write down in full generality), it might look something like this.

• Initialize the hypothesis rectangle to all of ${[0,1]^n}$.
• At each time step (“for each generation”), find ${2n}$ neighboring rectangles by individually perturbing each side by ${\pm\rho}$ for some small ${\rho>0}$, possibly chosen randomly.
• For some predetermined number ${s}$, take a sample of uniform random points ${\vec x_1,\dots,\vec x_s}$, and judge all the hypotheses above based on their performance on this sample, relative to the correct labels. The number of times that a hypothesis gets a label wrong, divided by ${s}$, is called the empirical error.
• A “mutation” resulting in lower empirical error than the old hypothesis by at least some threshold amount ${t>0}$ is called beneficial. If there is a beneficial mutation, choose one of them uniformly at random as the new hypothesis.
• (If there is no beneficial mutation, then pick the new hypothesis uniformly at random from the “neutral mutations” whose empirical error has not increased by more than ${t}$. If the old hypothesis is included in the new generation, this step will never fail.)
• Repeat for some number of steps ${g}$.

Note that we can’t really expect the error to get much lower than the perturbation amount ${\rho}$, but hopefully with ${\rho}$ small, the other constants ${s}$, ${1/t}$, and ${g}$ don’t need to be too large for the error in the process above to reliably settle around something small in ${\rho}$.

Giving phrases like “reliably settle” and “not too large” meanings in terms of ${\varepsilon}$‘s, ${\delta}$‘s, and polynomial bounds as Valiant does, it’s clear the the above process can be simulated in the PAC framework so that evolvability (over a distribution) should imply PAC learnability (over that distribution). One subtlety is that the neighborhood around each hypothesis needs to depend on the desired precision and confidence, as through ${\rho=\rho(n,\varepsilon,\delta)}$ above.

I claim the above works reasonably well for what it claims to do. Figuring out all the necessary polynomial bounds to formally prove evolvability seems quite a bit more involved than proving PAC learnability, and not necessarily something I wanted to do. (Although the different approach in Valiant’s Proposition 5.1 on conjunctions might be possible to adapt to rectangles.)

Here’s an animation from a small simulation I ran (code here, animated gif created with imagemagick convert) in 2 dimensions with ${t=.01}$, ${s=100}$, ${g=20}$, and relatively large perturbation ${\rho}$ chosen uniformly randomly between ${0}$ and ${.2}$ for each mutation. As above, the hypothesis is shown in blue and the target in green. Click to start the animation.

It seems to stabilize relatively quickly around a small error.

A larger simulation with ${t=.002}$, ${s=100}$, ${g=200}$, and ${\rho}$ uniform random between ${0}$ and ${.02}$ shows similar behavior. The settling is, of course, slower with smaller perturbations, but the steady-state error is much less.

So it seems likely that this algorithm can evolve axis-aligned rectangles in the unit square ${[0,1]^2}$, over the uniform distribution, starting from ${[0,1]^2}$ itself. This, of course, not a proof, and I have not even considered the general ${n}$-dimensional case.

It would be nice if evolution could start from anywhere and still reliably reach the “correct” result. However, this seems to be too much to ask: evolution can get stuck at local optima, and this phenomenon indeed shows itself for rectangles. If the initial rectangle has little or no overlap with the target, error is decreased by simply shrinking the hypothesis to the empty rectangle!

The particular way I’ve written the simulation, it’s possible for a small or zero-area rectangle to “drift” by a sequence of neutral mutations toward the target, at which point expansion will start to decrease error again. But this requires a relatively specific sequence of mutations, so the local optimum at an empty rectangle is almost stable.

Valiant’s full, general definition of evolvability actually does make the requirement that evolution could proceed from any starting point, as well as the requirement that any target distribution will do. The idea, then, is that this distribution could even change at some point in time, and evolution could proceed in another direction, without having backed itself into a corner during the first phase. This seems like quite a bit to ask, and even Valiant’s two positive results each ignore one or both of these desiderata! They are:

• Theorem 5.1: Conjunctions of positive boolean literals (e.g. ${x_1\wedge x_7}$, but not ${x_1\wedge\neg x_7}$) are evolvable over the discrete uniform distribution, from any starting point.
• Proposition 5.1: Conjunctions of arbitrary boolean literals are evolvable over the discrete uniform distribution, starting from the (always true) conjunction of no literals, and actually assuming a slightly more general representation class. Specifically, Valiant uses (integer, conjunction, conjunction) triplets at each step, as kind of a hack to simulate “evolution with optimization”, which tries to pick the best mutation at each step.

Ultimately I think Valiant’s full wish list for evolvability is a bit much if any non-trivial class is to satisfy the whole thing, and I’d say the lack of completely satisfactory results in the paper is telling.

The most obvious issue to my mind is that distributions which are highly concentrated on a subset of extremely small measure will throw a wrench into reasonable mutation processes: extremely small genotypic mutations (and hence many generations) will be required to tune phenotypes accurately.

A reasonable relaxation of “evolvable over any distribution” should take such concentration of measure into account, perhaps only considering distributions which never differ from the uniform distribution (continuous or discrete version, as appropriate) by more than a constant multiplicative factor ${c}$ in their probability density/mass functions. Perhaps all relevant bounds on ${s}$, ${1/t}$, ${g}$ and so forth should be required to depend only polynomially on ${c}$ and ${1/c}$ as well. This unfortunately complicates the already large definition, but I do suspect something like this is necessary.

Notes

1. One might still try to evolve parity functions with some richer class ${H\supseteq C}$, where the argument we give no longer holds. Valiant’s full argument is slightly more sophisticated: it points out that evolvability under any representation class ${H}$ implies learnability in the Statistical Query model, which is a weaker version of PAC in which parity functions are famously not learnable.