## Counting bits with Vapnik and Chervonenkis

Machine learning is about enabling computers to improve their performance on a given task as they get more data. Can we express this intuition quantitatively using information-theoretic techniques? In this post, I will discuss a classic paper by David Haussler, Michael Kearns, and Robert Schapire that (to the best of my knowledge) took the first step in this direction. I will describe some of their results, recast in a more explicitly information-theoretic way.

We will stick to the setting of binary classification: we have a feature space $\mathsf X$ (which is completely arbitrary) and a binary label space $\mathsf Y = \{0,1\}$. Let $\mathcal F$ be a class of functions $f : \mathsf X \rightarrow \mathsf Y$, which we will call the hypothesis space. These functions are our models for the relationship between features and labels, which we wish to learn from data. In order to focus on the essential aspects of the problem, we will adopt the Bayesian approach and assume that the “true” model is a random element of $\mathcal F$ with a given distribution $Q$. We assume that the learning agent knows $Q$, so we can think of this distribution as the learner’s prior. Training data consist of input-output pairs $(x_t,Y_t)$, $t=1,2,\ldots$. Here, each $x_t$ is an arbitrary point of $\mathsf X$, and for now we will treat $\mathbf x = (x_1,x_2,\ldots)$ as an individual sequence. The labels $Y_t$, though, are random and given by $Y_t = F(x_t)$, where $F \sim Q$ is the model drawn by Nature from $Q$.

The learning agent receives the training data sequentially: at each time $t=1,2,\ldots$, it has access to $(x_1,Y_1),\ldots,(x_t,Y_t)$. Let us denote the pair $(x_t,Y_t)$ by $Z_t$. The learning task is: given the next input $x_{t+1}$, predict the value $Y_{t+1}$. A learning algorithm $\mathcal A$ is a sequence of possibly randomized mappings $a_t : (Z^{t-1}, x_t) \mapsto \widehat{Y}_t$: at each time $t$, the learning agent uses all available training data to generate a prediction for the label $Y_t$ of $x_t$. After generating the prediction, the learning agent observes the true $Y_t$. We are interested in the probability that $\mathcal A$ makes an error at time $t$:

$\displaystyle e_{\mathbf x}(\mathcal A,t) = Q\big[\widehat{Y}_t \neq Y_t\big] \equiv Q\big[a_t(Z^{t-1},x_t) \neq Y_t\big].$

We will refer to the function $t \mapsto e_{\mathbf x}(\mathcal A,t)$ as the learning curve of $\mathcal A$ [note that it depends on the sequence $\mathbf x$ of inputs supplied to the learning agent during learning].

How can we measure the learner’s progress? Let’s write out the causal ordering of all the random variables generated in the process of learning:

$\displaystyle F,\widehat{Y}_1,Z_1,\widehat{Y}_2,Z_2,\ldots,\widehat{Y}_t,Z_t,\ldots$

Here, $F$ has distribution $Q$, the $\widehat{Y}_t$ are determined by the learning algorithm $\mathcal A$, and the $Z_t$ are determined by $x_t$ and $F$. Let us denote the joint distribution of all of these variables by $\mathbb P$. When the learning agent observes $Z_t=z_t$ at time $t$, we can quantify its information gain relative to time $t-1$ by

$\displaystyle {\rm IG}(t) = D(\mathbb P_{F|Z^t=z^t} \,\|\, \mathbb P_F) - D(\mathbb P_{F|Z^{t-1}=z^{t-1}} \,\|\, \mathbb P_F).$

It’s not hard to see that ${\rm IG}(t)$ is nonnegative because $Z^t$ (information at time $t$) includes $Z^{t-1}$. The expected information gain is given by

$\displaystyle \mathbb E[{\rm IG}(t)] = I(F; Z_t | Z^{t-1}),$

the conditional mutual information between $F$ and $Z_t$ given $Z^{t-1}$. For our purposes, it will be convenient to express the information gain in terms of a certain quantity called the volume ratio, which quantifies the rate at which the learning agent can zero in on the true model as more observations keep arriving.

To that end, let us first observe that, at time $t$, the agent can localize the unknown model $F$ to the set

$\displaystyle \mathcal V_t = \left\{ f \in \mathcal F: f(x_s) = Y_s,\ 1 \le s \le t \right\}.$

This set is known in learning theory as the version space at time $t$. It is easy to see that the version spaces are nested: $\mathcal V_t \subseteq \mathcal V_{t-1} \subseteq \ldots \subseteq \mathcal V_1 \subseteq \mathcal V_0 \equiv \mathcal F$. Clearly, $F \in \mathcal V_t$ for all $t$, and for any $\mathcal E \subseteq \mathcal F$

$\displaystyle \mathbb P[F \in \mathcal E \,|\, Z^t=z^t] = Q(\mathcal E \,|\, \mathcal V_t) = \frac{Q(\mathcal E \cap \mathcal V_t)}{Q(\mathcal V_t)}.$

This implies, in particular, that

$\displaystyle D(\mathbb P_{F|Z^t=z^t} \,\|\, \mathbb P_F) = \log \frac{1}{Q(\mathcal V_t)},$

and therefore that

$\displaystyle {\rm IG}(t) = \log \frac{Q(\mathcal V_{t-1})}{Q(\mathcal V_t)},$

where $Q(\mathcal V_0) = Q(\mathcal F)=1$. Since $F \in \bigcap_t \mathcal V_t$, we expect that each training sample shrinks the version space, and that the learning curve of any good learning algorithm should be controlled by the rate at which $Q(\mathcal V_t)$ decreases. For future convenience, let us define the volume ratio at time $t$:

$\displaystyle \xi_t = \frac{Q(\mathcal V_t)}{Q(\mathcal V_{t-1})}.$

Note, by the way, that since $\mathcal V_t \subseteq \mathcal V_{t-1}$, we have

$\displaystyle \xi_t = \frac{Q(\mathcal V_t \cap \mathcal V_{t-1})}{Q(\mathcal V_{t-1})} = \mathbb P[F \in \mathcal V_t \,|\, F \in \mathcal V_{t-1}].$

Consequently,

$\displaystyle {\rm IG}(t) = - \log \xi_t.$

We will use binary logarithms throughout, so all information quantities are measured in bits.
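To make this concrete, here is a minimal sketch of the version space and the information gain for a finite hypothesis class under a uniform prior. The threshold class, the input sequence, and all names below are my own illustrative choices, not anything from the paper:

```python
import math

# Toy setting: X = {0,...,4} and F = all threshold functions f_k(x) = 1{x >= k},
# represented as tuples of labels; the prior Q is uniform over F.
def make_class(n=5):
    return [tuple(1 if x >= k else 0 for x in range(n)) for k in range(n + 1)]

def version_space(hypotheses, data):
    """Hypotheses consistent with every (x, y) pair observed so far."""
    return [f for f in hypotheses if all(f[x] == y for x, y in data)]

def info_gain_curve(hypotheses, true_f, xs):
    """Information gains IG(t) = -log2(xi_t); under a uniform prior,
    Q-volumes are just counts of consistent hypotheses."""
    data, gains = [], []
    prev = len(hypotheses)                     # Q(V_0) = 1
    for x in xs:
        data.append((x, true_f[x]))            # observe Z_t = (x_t, Y_t)
        cur = len(version_space(hypotheses, data))
        gains.append(-math.log2(cur / prev))   # IG(t) = log2(Q(V_{t-1})/Q(V_t))
        prev = cur
    return gains

F = make_class()
true_f = F[3]                                  # "Nature" draws the threshold k = 3
gains = info_gain_curve(F, true_f, xs=[0, 4, 2, 3])
print(gains)
```

The gains telescope: their sum equals $\log_2(1/Q(\mathcal V_T))$, here $\log_2 6$ bits, since the final version space is the single true hypothesis.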

Now let’s get down to business. Following Haussler et al., we will consider the following two learning algorithms:

1. The Bayes algorithm $\mathcal A^{\rm Bayes}$: at time $t$, it partitions the version space $\mathcal V_{t-1}$ into two disjoint subsets

$\displaystyle \mathcal S^y_t = \left\{ f \in \mathcal V_{t-1} : f(x_t) = y \right\}, \qquad y \in \{0,1\},$

and then predicts

$\displaystyle \widehat{Y}_t = \mathbf 1\left\{Q(\mathcal S^1_t)-Q(\mathcal S^0_t) \ge 0\right\}.$

The Bayes algorithm is optimal in the sense that it minimizes the probability of incorrectly predicting $Y_t$ given all available information. However, it is an improper learning algorithm in the sense that the function $x_t \mapsto \widehat{Y}_t$ will not, in general, be an element of $\mathcal F$.

2. The Gibbs algorithm $\mathcal A^{\rm Gibbs}$: at time $t$, it draws a random $\widehat{F}_t \sim Q(\cdot \,|\, \mathcal V_{t-1})$, i.e., from the posterior distribution of $F$ given the data observed so far, and predicts

$\displaystyle \widehat{Y}_t = \widehat{F}_t(x_t).$

The Gibbs algorithm is suboptimal, but, unlike the Bayes algorithm, the function $x_t \mapsto \widehat{Y}_t$ does belong to $\mathcal F$ by construction.
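For a finite class with a uniform prior, both predictors reduce to counting over the version space. The sketch below (threshold class, inputs, and helper names are all illustrative assumptions, not the paper's construction) contrasts the improper majority vote with the proper posterior draw:

```python
import random

# Same toy class as before: thresholds f_k(x) = 1{x >= k} on X = {0,...,4}.
F = [tuple(1 if x >= k else 0 for x in range(5)) for k in range(6)]

def bayes_predict(version, x):
    """Majority vote over the version space (uniform prior => counting),
    with ties broken in favor of label 1, matching the indicator rule."""
    ones = sum(f[x] for f in version)
    return 1 if 2 * ones >= len(version) else 0

def gibbs_predict(version, x, rng):
    """Draw a hypothesis from the posterior (uniform on the version space)
    and output its label: a proper prediction, realized by some f in F."""
    return rng.choice(version)[x]

rng = random.Random(0)
true_f = F[3]
version = list(F)
for x in [0, 4, 2, 3]:
    y_bayes, y_gibbs = bayes_predict(version, x), gibbs_predict(version, x, rng)
    y = true_f[x]                                  # the true label is revealed
    version = [f for f in version if f[x] == y]    # shrink the version space
    print(x, y, y_bayes, y_gibbs)
```

Note that the Bayes vote at different inputs need not be consistent with any single hypothesis, which is exactly why it is improper.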

Now let’s analyze their learning curves:

1. The Bayes algorithm will make a mistake whenever $Q(\mathcal V_t) < Q(\mathcal V_{t-1})/2$, i.e., whenever $\xi_t < 1/2$. Indeed, using the fact that, by definition of the version space,

$\displaystyle \mathcal S^{Y_t}_t = \left\{ f \in \mathcal V_{t-1} : f(x_t) = Y_t \right\} \equiv \mathcal V_t,$

we have

$\displaystyle Q(\mathcal V_t) < Q(\mathcal V_{t-1})/2 \qquad \Longrightarrow \qquad Q(\mathcal S^{Y_t}_t) < \frac{Q(\mathcal V_{t-1})}{2} = \frac{Q(\mathcal S^{Y_t}_t)+Q(\mathcal S^{1-Y_t}_t)}{2}.$

Thus, if $\xi_t < 1/2$, then $Q(\mathcal S^{Y_t}_t) < Q(\mathcal S^{1-Y_t}_t)$, and the Bayes algorithm will make a mistake. Conversely, if $\xi_t > 1/2$, the Bayes algorithm will be correct: $\widehat{Y}_t = Y_t$. Finally, if $\xi_t = 1/2$, then the Bayes algorithm will make a mistake with probability $1/2$. Therefore,

$\displaystyle e_{\mathbf x}(\mathcal A^{\rm Bayes},t) = \mathbb E[\vartheta(1/2-\xi_t)],$

where we have defined the function

$\displaystyle \vartheta(u) = \begin{cases} 0, & u < 0 \\ 1/2, & u = 0 \\ 1, & u > 0. \end{cases}$

2. The Gibbs algorithm will make a mistake whenever $\widehat{F}_t$ is not in the version space $\mathcal V_t$. Since $\widehat{F}_t$ is drawn from the posterior distribution of $F$ given $F \in \mathcal V_{t-1}$, we have

$\displaystyle e_{\mathbf x}(\mathcal A^{\rm Gibbs},t) = \mathbb P\left[\widehat{F}_t \notin \mathcal V_t\right] = \mathbb E[1-\xi_t].$

Before stating and proving the main result, we need to collect some preliminary facts. First of all, the volume ratio $\xi_t$ is always bounded between $0$ and $1$. Moreover, we have the following useful formula: for any function $g : [0,1] \rightarrow \mathbb R$,

$\displaystyle \mathbb E[g(\xi_t)] = \mathbb E[\xi_t g(\xi_t) + (1-\xi_t)g(1-\xi_t)] \qquad (1)$

(the proof is by conditioning on $Z^{t-1}$ and using the law of iterated expectation).
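Since the only randomness here is the draw of $F$ from the prior, identity (1) can be checked exactly by enumeration on a small finite class; the class and input sequence below are illustrative assumptions, not from the paper:

```python
import math

# Thresholds on X = {0,...,4}, uniform prior, fixed input sequence.
F = [tuple(1 if x >= k else 0 for x in range(5)) for k in range(6)]
xs = [0, 4, 2, 3]

def xi(t, f):
    """Volume ratio xi_t when the true model is f (uniform prior)."""
    def consistent(s):
        return [g for g in F if all(g[x] == f[x] for x in xs[:s])]
    return len(consistent(t)) / len(consistent(t - 1))

def both_sides(g, t):
    """E[g(xi_t)] versus E[xi_t g(xi_t) + (1 - xi_t) g(1 - xi_t)],
    averaging over the true model F ~ Q."""
    lhs = sum(g(xi(t, f)) for f in F) / len(F)
    rhs = sum(v * g(v) + (1 - v) * g(1 - v)
              for v in (xi(t, f) for f in F)) / len(F)
    return lhs, rhs

for t in range(1, 5):
    lhs, rhs = both_sides(lambda u: 1 - u, t)
    print(t, round(lhs, 6), round(rhs, 6))   # the two columns agree
```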

**Theorem 1** We have the following bounds for the Bayes and the Gibbs algorithms:

$\displaystyle e_{\mathbf x}(\mathcal A^{\rm Gibbs},t) \ge e_{\mathbf x}(\mathcal A^{\rm Bayes},t) \ge h^{-1}\left(I(F;Z_t|Z^{t-1})\right) \qquad (2)$

and

$\displaystyle \frac{1}{2}e_{\mathbf x}(\mathcal A^{\rm Gibbs},t) \le e_{\mathbf x}(\mathcal A^{\rm Bayes},t) \le e_{\mathbf x}(\mathcal A^{\rm Gibbs},t) \le \frac{I(F; Z_t|Z^{t-1})}{2}, \qquad (3)$

where $h^{-1} : [0,1] \rightarrow [0,1/2]$ is the inverse of the binary entropy function $h(u) = -u\log u - (1-u)\log(1-u)$.

**Remark 1** Equation (2) shows that the error probability of the Bayes algorithm at time $t$ can be bounded below by a function of the expected information gain at time $t$. It also shows that the error probability of the Gibbs algorithm is at least as large as that of the Bayes algorithm. On the other hand, Equation (3) gives an upper bound on the error performance of the Bayes algorithm in terms of the expected information gain, and it also shows that the error of the Gibbs algorithm is at most twice as large as that of the Bayes algorithm.

Proof: We start by obtaining the following alternative expressions for the error probabilities of the two algorithms, as well as for the expected information gain:

Using (1) with $g(u) = -\log u$, we have

$\displaystyle I(F; Z_t | Z^{t-1}) = \mathbb E[-\log \xi_t] = -\mathbb E[\xi_t \log \xi_t + (1-\xi_t)\log(1-\xi_t)] = \mathbb E[h(\xi_t)].$

This shows, in particular, that the average information gain due to receiving another training sample is no more than one bit.

With $g(u) = \vartheta(1/2-u)$, we have

$\displaystyle e_{\mathbf x}(\mathcal A^{\rm Bayes},t) = \mathbb E[\vartheta(1/2-\xi_t)] = \mathbb E[\xi_t \vartheta(1/2-\xi_t) + (1-\xi_t)\vartheta(\xi_t - 1/2)] = \mathbb E\left[\min\{\xi_t,1-\xi_t\}\right],$

where we have used the fact that, for any $u \in [0,1]$,

$\displaystyle u\vartheta(1/2-u) + (1-u)\vartheta(u-1/2) = \min\{u,1-u\}.$

Another way to arrive at this expression is by using the following fact: Suppose we have a biased coin with probability of heads equal to $p$. The best prediction of the outcome of a single toss of that coin, with respect to the Hamming loss $\ell(y,\widehat y) = \mathbf 1\{y \neq \widehat y\}$, is to predict heads if $p \ge 1-p$ and tails otherwise. The expected loss in that case is equal to $\min\{p,1-p\}$. The problem of predicting $Y_t$ given $x_t$ and $Z^{t-1}=z^{t-1}$ is equivalent to the above guessing problem with $p = \xi_t$.

Finally, with $g(u) = 1-u$, we have

$\displaystyle e_{\mathbf x}(\mathcal A^{\rm Gibbs},t) = \mathbb E[1-\xi_t] = 2\,\mathbb E[\xi_t(1-\xi_t)].$

Again, if we consider the problem of guessing the outcome of tossing a ${\rm Bernoulli}(p)$ coin, we can think of the following suboptimal scheme: guess $H$ with probability $p$ and $T$ with probability $1-p$ (i.e., just mentally simulate the coin toss). The probability of error is exactly $2p(1-p)$, which is also twice the variance of a ${\rm Bernoulli}(p)$ random variable. Reasoning conditionally on $Z^{t-1}=z^{t-1}$, we see that the Gibbs algorithm is exactly this scheme with $p = \xi_t$.

We will also use the following chain of easy-to-prove inequalities: for any $u \in [0,1]$,

$\displaystyle u(1-u) \le \min\{u,1-u\} \le 2u(1-u) \le \frac{h(u)}{2}. \qquad (4)$
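The chain (4) is elementary but easy to get wrong by a constant, so here is a quick numerical sanity check on a grid (binary logarithms, so all four expressions meet at $u = 1/2$); this is a spot check, not a proof:

```python
import math

def h(u):
    """Binary entropy, with the convention 0 log 0 = 0."""
    if u in (0.0, 1.0):
        return 0.0
    return -u * math.log2(u) - (1 - u) * math.log2(1 - u)

for i in range(1001):
    u = i / 1000
    a, b = u * (1 - u), min(u, 1 - u)
    c, d = 2 * u * (1 - u), h(u) / 2
    # u(1-u) <= min{u,1-u} <= 2u(1-u) <= h(u)/2
    assert a <= b + 1e-12 and b <= c + 1e-12 and c <= d + 1e-12
print("chain (4) holds on the grid")
```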

We start by proving the lower bound, Equation (2). First, note that, for any $u \in [0,1]$,

$\displaystyle h^{-1}(h(u)) = \min\{u,1-u\}.$

Moreover, the function $h^{-1} : [0,1] \rightarrow [0,1/2]$ is convex. Therefore, by Jensen’s inequality,

$\displaystyle e_{\mathbf x}(\mathcal A^{\rm Bayes},t) = \mathbb E\left[\min\{\xi_t,1-\xi_t\}\right] = \mathbb E[h^{-1}(h(\xi_t))] \ge h^{-1}\left(\mathbb E[h(\xi_t)]\right) = h^{-1}\left(I(F;Z_t|Z^{t-1})\right).$

The upper bound, Equation (3), follows by taking $u=\xi_t$ in (4) and then taking expectations. $\Box$

Now let’s see what happens when we have to do this over and over again: at each time $t$, the learning agent gives a prediction $\widehat{Y}_t$, observes $Y_t$, updates $Z^{t-1} \rightarrow Z^t$, etc. What can we say about the expected fraction of mistakes? Given a learning algorithm $\mathcal A$, the quantity of interest is

$\displaystyle \bar{e}_{\mathbf x}(\mathcal A,T) = \frac{1}{T}\sum^T_{t=1} e_{\mathbf x}(\mathcal A,t),$

for each $T=1,2,\ldots$. In order to get a handle on this quantity, we will need some Vapnik-Chervonenkis combinatorics. Given the sequence $\mathbf x$ of input instances, let $\mathcal D_T$ denote the set of all distinct elements of the tuple $x^T = (x_1,\ldots,x_T)$. Say that two hypotheses $f,f' \in \mathcal F$ are equivalent if $f(x)=f'(x)$ for all $x \in \mathcal D_T$. This defines an equivalence relation on $\mathcal F$, so we can split it into equivalence classes $\mathcal E_1,\ldots,\mathcal E_J$. Here, the number of equivalence classes $J$ depends on $\mathbf x$ and on $T$. How many such classes can we have? Let us order the elements of $\mathcal D_T$ arbitrarily, say as $x^{(1)},\ldots,x^{(N)}$, where $N \le T$. Each $f \in \mathcal F$ can be associated with a binary string $b(f,\mathcal D_T) = (f(x^{(1)}),\ldots,f(x^{(N)}))$. Then $f$ and $f'$ are in the same equivalence class if and only if they give rise to the same binary strings:

$\displaystyle b(f,\mathcal D_T) \equiv b(f',\mathcal D_T).$

Consequently,

$\displaystyle J = \left| \left\{ b(f,\mathcal D_T) : f \in \mathcal F \right\} \right|.$

For any subset $\mathcal D \subseteq \mathcal D_T$, let us define $b(f,\mathcal D)$ in the obvious way: if $\mathcal D = \{x^{(j_1)},\ldots,x^{(j_k)}\}$, then we let $b(f,\mathcal D) = (f(x^{(j_1)}),\ldots,f(x^{(j_k)}))$. Let $d_T(\mathbf x)$ be the largest integer $d \le N$ such that

$\displaystyle \left| \left\{ b(f,\mathcal D) : f \in \mathcal F \right\} \right| = 2^d$

for some $\mathcal D \subseteq \mathcal D_T$ with $|\mathcal D| = d$. A deep result, commonly known as the Sauer-Shelah lemma (see here for some fascinating history), then says that

$\displaystyle J \le \sum^{d_T(\mathbf x)}_{j=0} \binom{N}{j} \le \left(\frac{Ne}{d_T(\mathbf x)}\right)^{d_T(\mathbf x)} \le \left(\frac{Te}{d_T(\mathbf x)}\right)^{d_T(\mathbf x)}. \qquad (5)$
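Here is an illustrative check of (5), with a class of my own choosing rather than anything from the paper: "interval" classifiers $f_{a,b}(x) = \mathbf 1\{a \le x \le b\}$ on $N = 8$ points, which shatter pairs but no triple (the pattern $101$ is unrealizable). Counting distinct behaviors directly and comparing with the Sauer-Shelah bound:

```python
from itertools import combinations
from math import comb

N = 8
points = range(N)
behaviors = {tuple(0 for _ in points)}                 # the empty interval
for a in points:
    for b in range(a, N):
        behaviors.add(tuple(1 if a <= x <= b else 0 for x in points))

def shattered(subset):
    """True if the class realizes all 2^|subset| labelings on subset."""
    patterns = {tuple(beh[i] for i in subset) for beh in behaviors}
    return len(patterns) == 2 ** len(subset)

# Largest d such that some subset of size d is shattered.
d = max(k for k in range(N + 1)
        if any(shattered(s) for s in combinations(points, k)))

J = len(behaviors)
sauer = sum(comb(N, j) for j in range(d + 1))
print(J, sauer, d)   # -> 37 37 2: the bound J <= sum_{j<=d} C(N,j) is tight here
assert J <= sauer
```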

Now we are ready to state and prove the following:

**Theorem 2**

$\displaystyle \bar{e}_{\mathbf x}(\mathcal A^{\rm Bayes},T) \le \bar{e}_{\mathbf x}(\mathcal A^{\rm Gibbs},T) \le \frac{d_T(\mathbf x)}{2T} \log \left(\frac{Te}{d_T(\mathbf x)} \right). \qquad (6)$

Proof: We have

$\displaystyle \begin{array}{rcl} \bar{e}_{\mathbf x}(\mathcal A^{\rm Bayes},T) &\le& \bar{e}_{\mathbf x}(\mathcal A^{\rm Gibbs},T)\\ &\le& \frac{1}{2T}\sum^T_{t=1} \mathbb E[-\log \xi_t] \\ &=& \frac{1}{2T}\sum^T_{t=1} \mathbb E\left[\log \frac{Q(\mathcal V_{t-1})}{Q(\mathcal V_t)}\right] \\ &=& \frac{1}{2T}\mathbb E\left[\log \frac{Q(\mathcal V_0)}{Q(\mathcal V_T)}\right] \\ &=& \frac{1}{2T}\mathbb E\left[\log \frac{1}{Q(\mathcal V_T)}\right], \end{array}$

where the second inequality follows from (4), and the telescoping sum uses $Q(\mathcal V_T) = \prod^T_{t=1}\xi_t$.

From the definition of the version spaces, it is immediate that $\mathcal V_T$ is the equivalence class containing the “true” model $F \sim Q$. In other words, there exists some random $j(F) \in \{1,\ldots,J\}$ such that $\mathcal V_T = \mathcal E_{j(F)}$. Consequently, we can write

$\displaystyle \begin{array}{rcl} \mathbb E\left[\log \frac{1}{Q(\mathcal V_T)}\right] &=& \sum^J_{j=1} \mathbb E\left[\mathbf 1\{F \in \mathcal E_j\}\log\frac{1}{Q(\mathcal V_T)}\right] \\ &=& \sum^J_{j=1} \mathbb E\left[\mathbf 1\{F \in \mathcal E_j\}\log\frac{1}{Q(\mathcal E_j)}\right] \\ &=& -\sum^J_{j=1} Q(\mathcal E_j) \log Q(\mathcal E_j). \end{array}$

We can recognize the quantity in the last line as the Shannon entropy of the index $j(F)$: that is,

$\displaystyle \bar{e}_{\mathbf x}(\mathcal A^{\rm Bayes},T) \le \bar{e}_{\mathbf x}(\mathcal A^{\rm Gibbs},T) \le \frac{H(j(F))}{2T}.$

Now, since $j(F)$ takes values in $\{1,\ldots,J\}$, we can invoke the Sauer-Shelah estimate (5) to get

$\displaystyle H(j(F)) \le \log J \le d_T(\mathbf x) \log \left(\frac{Te}{d_T(\mathbf x)}\right).$

The claimed bound (6) follows. $\Box$

If $\mathcal F$ is a Vapnik-Chervonenkis class with VC dimension $d$, then we immediately obtain:

**Corollary 3**

$\displaystyle \bar{e}_{\mathbf x}(\mathcal A^{\rm Bayes},T) \le \bar{e}_{\mathbf x}(\mathcal A^{\rm Gibbs},T) \le \frac{d}{2T} \log \left(\frac{Te}{d} \right).$
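As a numerical sanity check of Corollary 3 (an illustrative setup of my own, not an experiment from the paper): for a finite threshold class, which has VC dimension $d = 1$, the exact Gibbs error $\mathbb E[1-\xi_t]$ can be computed by enumerating the true model under a uniform prior, and the running average indeed stays below the bound:

```python
import math

K = 32
F = [tuple(1 if x >= k else 0 for x in range(K)) for k in range(K + 1)]
xs = list(range(K))                  # the individual sequence of inputs

def gibbs_error(t):
    """Exact e_x(Gibbs, t) = E[1 - xi_t] under the uniform prior."""
    total = 0.0
    for f in F:                      # enumerate the "true" model F ~ Q
        v_prev = [g for g in F if all(g[x] == f[x] for x in xs[:t - 1])]
        v_cur = [g for g in v_prev if g[xs[t - 1]] == f[xs[t - 1]]]
        total += 1 - len(v_cur) / len(v_prev)
    return total / len(F)

d = 1                                # VC dimension of the threshold class
for T in (4, 8, 16, 32):
    avg = sum(gibbs_error(t) for t in range(1, T + 1)) / T
    bound = (d / (2 * T)) * math.log2(T * math.e / d)
    print(T, round(avg, 4), round(bound, 4))
    assert avg <= bound              # Corollary 3 holds in this example
```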