Counting bits with Vapnik and Chervonenkis

Machine learning is about enabling computers to improve their performance on a given task as they get more data. Can we express this intuition quantitatively using information-theoretic techniques? In this post, I will discuss a classic paper by David Haussler, Michael Kearns, and Robert Schapire that (to the best of my knowledge) took the first step in this direction. I will describe some of their results, recast in a more explicitly information-theoretic way.

We will stick to the setting of binary classification: we have a feature space $\mathsf{X}$ (which is completely arbitrary) and a binary label space $\mathsf{Y} = \{0,1\}$. Let $\mathcal{F}$ be a class of functions $f : \mathsf{X} \rightarrow \mathsf{Y}$, which we will call the hypothesis space. These functions are our models for the relationship between features and labels, which we wish to learn from data. In order to focus on the essential aspects of the problem, we will adopt the Bayesian approach and assume that the “true” model is a random element of $\mathcal{F}$ with a given distribution $Q$. We assume that the learning agent knows $Q$, so we can think of this distribution as the learner’s prior. Training data consist of input-output pairs $(x_t,Y_t)$, where $t=1,2,\ldots$. Here, each $x_t$ is an arbitrary point of $\mathsf{X}$, and for now we will treat $\mathbf{x} = (x_1,x_2,\ldots)$ as an individual sequence. The labels $Y_t$, though, are random and given by $Y_t = F(x_t)$, where $F \sim Q$ is the model drawn by Nature from $Q$.

The learning agent receives the training data sequentially: at each time $t=1,2,\ldots$, it has access to $(x_1,Y_1),\ldots,(x_t,Y_t)$. Let us denote the pair $(x_t,Y_t)$ by $Z_t$. The learning task is: given the next input $x_{t+1}$, predict the value $Y_{t+1}$. A learning algorithm $\mathcal{A}$ is a sequence of possibly randomized mappings $a_t : (Z^{t-1}, x_t) \mapsto \widehat{Y}_t$: at each time $t$, the learning agent uses all available training data to generate a prediction for the label $Y_t$ of $x_t$. After generating the prediction, the learning agent observes the true $Y_t$. We are interested in the probability that $\mathcal{A}$ makes an error at time $t$:

$$e_{\mathbf{x}}(\mathcal{A},t) = Q\big[\widehat{Y}_t \neq Y_t\big] \equiv Q\big[a_t(Z^{t-1},x_t) \neq Y_t\big].$$

We will refer to the function $t \mapsto e_{\mathbf{x}}(\mathcal{A},t)$ as the learning curve of $\mathcal{A}$ [note that it depends on the sequence $\mathbf{x}$ of inputs supplied to the learning agent during learning].

How can we measure the learner’s progress? Let’s write out the causal ordering of all the random variables generated in the process of learning:

$$F,\widehat{Y}_1,Z_1,\widehat{Y}_2,Z_2,\ldots,\widehat{Y}_t,Z_t,\ldots$$

Here, $F$ has distribution $Q$, the $\widehat{Y}_t$ are determined by the learning algorithm $\mathcal{A}$, and the $Z_t$ are determined by $x_t$ and $F$. Let us denote the joint distribution of all of these variables by $\mathbb{P}$. When the learning agent observes $Z_t = z_t$ at time $t$, we can quantify its information gain relative to time $t-1$ by

$$\mathrm{IG}(t) = D(\mathbb{P}_{F|Z^t=z^t} \,\|\, \mathbb{P}_F) - D(\mathbb{P}_{F|Z^{t-1}=z^{t-1}} \,\|\, \mathbb{P}_F).$$

It’s not hard to see that $\mathrm{IG}(t)$ is nonnegative because $Z^t$ (information at time $t$) includes $Z^{t-1}$. The expected information gain is given by

$$\mathbb{E}[\mathrm{IG}(t)] = I(F; Z_t \mid Z^{t-1}),$$

the conditional mutual information between $F$ and $Z_t$ given $Z^{t-1}$. For our purposes, it will be convenient to express the information gain in terms of a certain quantity called the volume ratio, which quantifies the rate at which the learning agent can zero in on the true model as more observations keep arriving.

To that end, let us first observe that, at time $t$, the agent can localize the unknown model $F$ to the set

$$\mathcal{V}_t = \left\{ f \in \mathcal{F} : f(x_s) = Y_s, \ 1 \le s \le t \right\}.$$

This set is known in learning theory as the version space at time $t$. It is easy to see that the version spaces are nested: $\mathcal{V}_t \subseteq \mathcal{V}_{t-1} \subseteq \ldots \subseteq \mathcal{V}_1 \subseteq \mathcal{V}_0 \equiv \mathcal{F}$. Clearly, $F \in \mathcal{V}_t$ for all $t$, and for any $\mathcal{E} \subseteq \mathcal{F}$

$$\mathbb{P}[F \in \mathcal{E} \mid Z^t = z^t] = Q(\mathcal{E} \mid \mathcal{V}_t) = \frac{Q(\mathcal{E} \cap \mathcal{V}_t)}{Q(\mathcal{V}_t)}.$$

This implies, in particular, that

$$D(\mathbb{P}_{F|Z^t=z^t} \,\|\, \mathbb{P}_F) = \log \frac{1}{Q(\mathcal{V}_t)},$$

and therefore that

$$\mathrm{IG}(t) = \log \frac{Q(\mathcal{V}_{t-1})}{Q(\mathcal{V}_t)},$$

where $Q(\mathcal{V}_0) = Q(\mathcal{F}) = 1$. Since $F \in \bigcap_t \mathcal{V}_t$, we expect that each training sample shrinks the version space, and that the learning curve of any good learning algorithm should be controlled by the rate at which $Q(\mathcal{V}_t)$ decreases. For future convenience, let us define the volume ratio at time $t$:

$$\xi_t = \frac{Q(\mathcal{V}_t)}{Q(\mathcal{V}_{t-1})}.$$

Note, by the way, that since $\mathcal{V}_t \subseteq \mathcal{V}_{t-1}$, we have

$$\xi_t = \frac{Q(\mathcal{V}_t \cap \mathcal{V}_{t-1})}{Q(\mathcal{V}_{t-1})} = \mathbb{P}[F \in \mathcal{V}_t \mid F \in \mathcal{V}_{t-1}].$$

In terms of the volume ratio, the information gain takes the simple form

$$\mathrm{IG}(t) = -\log \xi_t.$$

We are going to use binary logarithms, so the units of all information quantities are bits.
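To make these definitions concrete, here is a small numerical sketch (my own toy example, not from the paper): a class of six threshold functions on five points with a uniform prior $Q$, in which we track the version space $\mathcal{V}_t$, the volume ratio $\xi_t$, and the information gain $-\log_2 \xi_t$ as labeled examples arrive.

```python
import math

# Toy hypothesis class (my own example): threshold functions on 5 points.
# Hypothesis k labels x as 1 iff x >= k; the prior Q is uniform.
X = [0, 1, 2, 3, 4]
hypotheses = [lambda x, k=k: int(x >= k) for k in range(6)]
Q = [1.0 / len(hypotheses)] * len(hypotheses)

def version_space(data):
    """Indices of hypotheses consistent with all (x, y) pairs seen so far."""
    return [i for i, f in enumerate(hypotheses)
            if all(f(x) == y for x, y in data)]

def volume(V):
    """Prior mass Q(V) of a version space."""
    return sum(Q[i] for i in V)

# Suppose the true model F is the threshold at k = 3.
truth = hypotheses[3]
data = []
for t, x in enumerate([4, 0, 2, 3], start=1):
    V_prev = version_space(data)
    data.append((x, truth(x)))
    V = version_space(data)
    xi = volume(V) / volume(V_prev)      # volume ratio xi_t
    print(f"t={t}: |V_t|={len(V)}, xi_t={xi:.3f}, IG={-math.log2(xi):.3f} bits")
```

After the four examples, the version space has collapsed to the true threshold, and the accumulated information gain telescopes to $\log_2(1/Q(\mathcal{V}_4)) = \log_2 6$ bits.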

Now let’s get down to business. Following Haussler et al., we will consider the following two learning algorithms:

  1. The Bayes algorithm $\mathcal{A}^{\rm Bayes}$: at time $t$, it partitions the version space $\mathcal{V}_{t-1}$ into two disjoint subsets

    $$\mathcal{S}^y_t = \left\{ f \in \mathcal{V}_{t-1} : f(x_t) = y \right\}, \qquad y \in \{0,1\}$$

    and then predicts

    $$\widehat{Y}_t = \mathbf{1}\left\{ Q(\mathcal{S}^1_t) - Q(\mathcal{S}^0_t) \ge 0 \right\}.$$

    The Bayes algorithm is optimal in the sense that it minimizes the probability of incorrectly predicting $Y_t$ given all available information. However, it is an improper learning algorithm in the sense that the function $x_t \mapsto \widehat{Y}_t$ will not, in general, be an element of $\mathcal{F}$.

  2. The Gibbs algorithm $\mathcal{A}^{\rm Gibbs}$: at time $t$, it draws a random $\widehat{F}_t \sim Q(\cdot \mid \mathcal{V}_{t-1})$, i.e., from the posterior distribution of $F$ given $Z^{t-1}$, and predicts

    $$\widehat{Y}_t = \widehat{F}_t(x_t).$$

    The Gibbs algorithm is suboptimal, but, unlike the Bayes algorithm, the function $x_t \mapsto \widehat{Y}_t$ does belong to $\mathcal{F}$ by construction.
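Both algorithms can be sketched in code for a finite hypothesis class. Everything below (the label-vector representation of hypotheses and the helper names `posterior`, `bayes_predict`, `gibbs_predict`) is my own illustration, not notation from the paper:

```python
import random

# Finite hypothesis class given as label vectors over a finite input set:
# hypotheses[i][x] is the label that hypothesis i assigns to input x.
hypotheses = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
prior = {i: 0.25 for i in range(len(hypotheses))}   # uniform prior Q

def posterior(data):
    """Prior Q conditioned on the version space (hypotheses consistent with data)."""
    V = [i for i in prior if all(hypotheses[i][x] == y for x, y in data)]
    Z = sum(prior[i] for i in V)
    return {i: prior[i] / Z for i in V}

def bayes_predict(data, x):
    """Bayes algorithm: Q-weighted majority vote over the version space."""
    post = posterior(data)
    q1 = sum(p for i, p in post.items() if hypotheses[i][x] == 1)
    return int(q1 >= 0.5)                # ties broken in favor of label 1

def gibbs_predict(data, x, rng=random):
    """Gibbs algorithm: sample one hypothesis from the posterior and follow it."""
    post = posterior(data)
    i = rng.choices(list(post), weights=list(post.values()))[0]
    return hypotheses[i][x]
```

Note that the Bayes vote $x \mapsto \texttt{bayes\_predict}(\cdot, x)$ need not coincide with any single hypothesis (it is improper), whereas the Gibbs prediction is always computed by some $f \in \mathcal{F}$.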

Now let’s analyze their learning curves:

  1. The Bayes algorithm will make a mistake whenever $Q(\mathcal{V}_t) < Q(\mathcal{V}_{t-1})/2$, i.e., whenever $\xi_t < 1/2$. Indeed, using the fact that, by definition of the version space,

    $$\mathcal{S}^{Y_t}_t = \left\{ f \in \mathcal{V}_{t-1} : f(x_t) = Y_t \right\} \equiv \mathcal{V}_t,$$

    we have

    $$Q(\mathcal{V}_t) < Q(\mathcal{V}_{t-1})/2 \qquad \Longrightarrow \qquad Q(\mathcal{S}^{Y_t}_t) < \frac{Q(\mathcal{V}_{t-1})}{2} = \frac{Q(\mathcal{S}^{Y_t}_t)+Q(\mathcal{S}^{1-Y_t}_t)}{2}.$$

    Thus, if $\xi_t < 1/2$, then $Q(\mathcal{S}^{Y_t}_t) < Q(\mathcal{S}^{1-Y_t}_t)$, and the Bayes algorithm will make a mistake: $\widehat{Y}_t \neq Y_t$. Conversely, if $\xi_t > 1/2$, the Bayes algorithm will be correct: $\widehat{Y}_t = Y_t$. Finally, if $\xi_t = 1/2$, then the Bayes algorithm will make a mistake with probability $1/2$. Therefore,

    $$e_{\mathbf{x}}(\mathcal{A}^{\rm Bayes},t) = \mathbb{E}[\vartheta(1/2-\xi_t)],$$

    where we have defined the function

    $$\vartheta(u) = \begin{cases} 0, & u < 0 \\ 1/2, & u = 0 \\ 1, & u > 0 \end{cases}$$

  2. The Gibbs algorithm will make a mistake whenever $\widehat{F}_t$ is not in the version space $\mathcal{V}_t$. Since $\widehat{F}_t$ is drawn from the posterior distribution of $F$ given $F \in \mathcal{V}_{t-1}$, we have

    $$e_{\mathbf{x}}(\mathcal{A}^{\rm Gibbs},t) = \mathbb{P}\left[\widehat{F}_t \notin \mathcal{V}_t\right] = \mathbb{E}[1-\xi_t].$$

Before stating and proving the main result, we need to collect some preliminary facts. First of all, the volume ratio $\xi_t$ is always bounded between 0 and 1. Moreover, we have the following useful formula: for any function $g : [0,1] \rightarrow [0,1]$,

$$\mathbb{E}[g(\xi_t)] = \mathbb{E}[\xi_t g(\xi_t) + (1-\xi_t)g(1-\xi_t)] \qquad (1)$$

(the proof is by conditioning on $Z^{t-1}$ and using the law of iterated expectation).
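For readers who want the conditioning argument spelled out, here is a small algebraic check of my own: given $Z^{t-1}$, let $q_1$ be the posterior probability that $Y_t = 1$; then $\xi_t$ equals $q_1$ if $Y_t = 1$ and $1-q_1$ if $Y_t = 0$, and both sides of (1) reduce to the same conditional expectation.

```python
import math

# Conditioned on Z^{t-1}: Y_t = 1 with probability q1, in which case
# xi_t = q1; otherwise Y_t = 0 and xi_t = 1 - q1. We check that the two
# sides of identity (1) agree for any g, which is the heart of the proof.
def check_identity(q1, g):
    # E[g(xi_t) | Z^{t-1}]
    lhs = q1 * g(q1) + (1 - q1) * g(1 - q1)
    # E[xi_t g(xi_t) + (1 - xi_t) g(1 - xi_t) | Z^{t-1}]
    rhs = q1 * (q1 * g(q1) + (1 - q1) * g(1 - q1)) \
        + (1 - q1) * ((1 - q1) * g(1 - q1) + q1 * g(q1))
    return abs(lhs - rhs)

for q1 in (0.1, 0.3, 0.5, 0.9):
    for g in (lambda u: 1 - u, lambda u: u * u, math.sqrt):
        assert check_identity(q1, g) < 1e-12
```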

Theorem 1 We have the following bounds for the Bayes and the Gibbs algorithms:

$$e_{\mathbf{x}}(\mathcal{A}^{\rm Gibbs},t) \ge e_{\mathbf{x}}(\mathcal{A}^{\rm Bayes},t) \ge h^{-1}\left(I(F;Z_t \mid Z^{t-1})\right) \qquad (2)$$


$$\frac{1}{2}\, e_{\mathbf{x}}(\mathcal{A}^{\rm Gibbs},t) \le e_{\mathbf{x}}(\mathcal{A}^{\rm Bayes},t) \le e_{\mathbf{x}}(\mathcal{A}^{\rm Gibbs},t) \le \frac{I(F; Z_t \mid Z^{t-1})}{2}, \qquad (3)$$

where $h^{-1} : [0,1] \rightarrow [0,1/2]$ is the inverse of the binary entropy function $h(u) = -u\log u - (1-u)\log(1-u)$.

Remark 1 Equation (2) shows that the error probability of the Bayes algorithm at time t can be bounded below by a function of the expected information gain at time t. It also shows that the error probability of the Gibbs algorithm is at least as large as that of the Bayes algorithm. On the other hand, Equation (3) gives an upper bound on the error performance of the Bayes algorithm in terms of the expected information gain, and it also shows that the error of the Gibbs algorithm is at most twice as large as that of the Bayes algorithm.
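Since $h^{-1}$ has no closed form, the lower bound in (2) is most easily evaluated numerically. Here is one way to do it (a sketch with my own helper names; $h$ is in bits, and the inverse is taken on the branch $[0,1/2]$ by bisection):

```python
import math

def h(u):
    """Binary entropy in bits."""
    if u in (0.0, 1.0):
        return 0.0
    return -u * math.log2(u) - (1 - u) * math.log2(1 - u)

def h_inv(y, tol=1e-12):
    """Inverse of h restricted to [0, 1/2], where h is increasing (bisection)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# On [0, 1], h_inv(h(u)) recovers min(u, 1-u):
assert abs(h_inv(h(0.8)) - 0.2) < 1e-6
```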

Proof: We start by obtaining the following alternative expressions for the error probabilities of the two algorithms, as well as for the expected information gain:

Using (1) with $g(u) = -\log u$, we have

$$I(F; Z_t \mid Z^{t-1}) = \mathbb{E}[-\log \xi_t] = \mathbb{E}[-\xi_t \log \xi_t - (1-\xi_t)\log(1-\xi_t)] = \mathbb{E}[h(\xi_t)].$$

This shows, in particular, that the average information gain due to receiving another training sample is no more than one bit.

With $g(u) = \vartheta(1/2-u)$, we have

$$e_{\mathbf{x}}(\mathcal{A}^{\rm Bayes},t) = \mathbb{E}[\vartheta(1/2-\xi_t)] = \mathbb{E}[\xi_t \vartheta(1/2-\xi_t) + (1-\xi_t)\vartheta(\xi_t-1/2)] = \mathbb{E}\left[\min\{\xi_t,1-\xi_t\}\right],$$

where we have used the fact that, for any $u \in [0,1]$,

$$u\,\vartheta(1/2-u) + (1-u)\,\vartheta(u-1/2) = \min\{u,1-u\}.$$

Another way to arrive at this expression is by using the following fact: Suppose we have a biased coin with probability of heads equal to $p$. The best prediction of the outcome of a single toss of that coin, with respect to the Hamming loss $\ell(y,\widehat{y}) = \mathbf{1}\{y \neq \widehat{y}\}$, is to predict heads if $p \ge 1-p$ and tails otherwise. The expected loss in that case is equal to $\min\{p,1-p\}$. The problem of predicting $Y_t$ given $x_t$ and $Z^{t-1}=z^{t-1}$ is equivalent to the above guessing problem with $p = \xi_t$.

Finally, with $g(u) = 1-u$, we have

$$e_{\mathbf{x}}(\mathcal{A}^{\rm Gibbs},t) = \mathbb{E}[1-\xi_t] = 2\,\mathbb{E}[\xi_t(1-\xi_t)].$$

Again, if we consider the problem of guessing the outcome of tossing a ${\rm Bernoulli}(p)$ coin, we can think of the following suboptimal scheme: guess H with probability $p$ and T with probability $1-p$ (i.e., just mentally simulate the coin toss). The probability of error is exactly $2p(1-p)$, which is also twice the variance of a ${\rm Bernoulli}(p)$ random variable. Reasoning conditionally on $Z^{t-1}=z^{t-1}$, we see that the Gibbs algorithm is exactly this scheme with $p = \xi_t$.

We will also use the following chain of easy-to-prove inequalities: for any $u \in [0,1]$,

$$u(1-u) \le \min\{u,1-u\} \le 2u(1-u) \le \frac{h(u)}{2}. \qquad (4)$$
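These inequalities are easy to verify on a grid (a quick numerical check of my own, with $h$ in bits):

```python
import math

def h(u):
    """Binary entropy in bits."""
    if u in (0.0, 1.0):
        return 0.0
    return -u * math.log2(u) - (1 - u) * math.log2(1 - u)

# Check the chain (4): u(1-u) <= min(u, 1-u) <= 2u(1-u) <= h(u)/2 on [0, 1].
for k in range(1001):
    u = k / 1000
    a = u * (1 - u)
    b = min(u, 1 - u)
    c = 2 * u * (1 - u)
    d = h(u) / 2
    assert a <= b + 1e-12 and b <= c + 1e-12 and c <= d + 1e-12
```

All three inequalities are tight at $u = 1/2$, which is why no constant in (3) can be improved by this argument alone.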

We start by proving the lower bound, Equation (2). First, note that, for any $u \in [0,1]$,

$$h^{-1}(h(u)) = \min\{u,1-u\}.$$

Moreover, the function $h^{-1} : [0,1] \rightarrow [0,1/2]$ is convex. Therefore,

$$e_{\mathbf{x}}(\mathcal{A}^{\rm Bayes},t) = \mathbb{E}\left[\min\{\xi_t,1-\xi_t\}\right] = \mathbb{E}[h^{-1}(h(\xi_t))] \ge h^{-1}\left(\mathbb{E}[h(\xi_t)]\right) = h^{-1}\left(I(F;Z_t \mid Z^{t-1})\right).$$

The upper bound, Equation (3), follows by taking $u=\xi_t$ in (4) and then taking expectations. $\Box$

Now let’s see what happens when we have to do this over and over again: at each time $t$, the learning agent gives a prediction $\widehat{Y}_t$, observes $Y_t$, updates $Z^{t-1} \rightarrow Z^t$, etc. What can we say about the expected fraction of mistakes? Given a learning algorithm $\mathcal{A}$, the quantity of interest is

$$\bar{e}_{\mathbf{x}}(\mathcal{A},T) = \frac{1}{T}\sum^T_{t=1} e_{\mathbf{x}}(\mathcal{A},t),$$

for each $T=1,2,\ldots$. In order to get a handle on this quantity, we will need some Vapnik-Chervonenkis combinatorics. Given the sequence $\mathbf{x}$ of input instances, let $\mathcal{D}_T$ denote the set of all distinct elements of the tuple $x^T = (x_1,\ldots,x_T)$. Say that two hypotheses $f,f' \in \mathcal{F}$ are equivalent if $f(x)=f'(x)$ for all $x \in \mathcal{D}_T$. This defines an equivalence relation on $\mathcal{F}$, so we can split it into equivalence classes $\mathcal{E}_1,\ldots,\mathcal{E}_J$. Here, the number of equivalence classes $J$ depends on $\mathbf{x}$ and on $T$. How many such classes can we have? Let us order the elements of $\mathcal{D}_T$ arbitrarily, say as $x^{(1)},\ldots,x^{(N)}$, where $N \le T$. Each $f \in \mathcal{F}$ can be associated with a binary string $b(f,\mathcal{D}_T) = (f(x^{(1)}),\ldots,f(x^{(N)}))$. Then $f$ and $f'$ are in the same equivalence class if and only if they give rise to the same binary strings:

$$b(f,\mathcal{D}_T) \equiv b(f',\mathcal{D}_T).$$

Therefore, the number of equivalence classes is

$$J = \left| \left\{ b(f,\mathcal{D}_T) : f \in \mathcal{F} \right\} \right|.$$

For any subset $\mathcal{D} \subseteq \mathcal{D}_T$, let us define $b(f,\mathcal{D})$ in the obvious way: if $\mathcal{D} = \{x^{(j_1)},\ldots,x^{(j_k)}\}$, then we let $b(f,\mathcal{D}) = (f(x^{(j_1)}),\ldots,f(x^{(j_k)}))$. Let $d_T(\mathbf{x})$ be the largest integer $d \le N$ such that

$$\left| \left\{ b(f,\mathcal{D}) : f \in \mathcal{F} \right\} \right| = 2^d$$

for some $\mathcal{D} \subseteq \mathcal{D}_T$ with $|\mathcal{D}| = d$. A deep result, commonly known as the Sauer-Shelah lemma (see here for some fascinating history), then says that

$$J \le \sum^{d_T(\mathbf{x})}_{j=0} \binom{N}{j} \le \left(\frac{Ne}{d_T(\mathbf{x})}\right)^{d_T(\mathbf{x})} \le \left(\frac{Te}{d_T(\mathbf{x})}\right)^{d_T(\mathbf{x})}. \qquad (5)$$
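To see the Sauer-Shelah bound in action, here is a toy computation of my own (the helper names are hypothetical): for one-dimensional threshold functions on a handful of points, we count the realized label patterns $J$ directly, compute $d_T(\mathbf{x})$ by brute force, and check (5).

```python
import math
from itertools import combinations

def patterns(points, hypotheses):
    """Distinct binary strings b(f, D) realized by the class on `points`."""
    return {tuple(f(x) for x in points) for f in hypotheses}

def vc_dim_on(points, hypotheses):
    """Largest d such that some d-subset of `points` is shattered."""
    d = 0
    for k in range(1, len(points) + 1):
        if any(len(patterns(S, hypotheses)) == 2 ** k
               for S in combinations(points, k)):
            d = k
    return d

# Threshold functions x -> 1{x >= k}: a class with VC dimension 1.
thresholds = [lambda x, k=k: int(x >= k) for k in range(11)]
points = [1, 3, 5, 7, 9]                     # the distinct inputs D_T
J = len(patterns(points, thresholds))        # number of equivalence classes
d = vc_dim_on(points, thresholds)            # d_T(x)
N = len(points)
sauer = sum(math.comb(N, j) for j in range(d + 1))
assert J <= sauer <= (N * math.e / d) ** d
```

Here $J = N + 1$ grows only linearly in $N$ even though $2^N$ patterns are conceivable, which is exactly the polynomial-versus-exponential dichotomy that the lemma formalizes.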

Now we are ready to state and prove the following:

Theorem 2

$$\bar{e}_{\mathbf{x}}(\mathcal{A}^{\rm Bayes},T) \le \bar{e}_{\mathbf{x}}(\mathcal{A}^{\rm Gibbs},T) \le \frac{d_T(\mathbf{x})}{2T} \log\left(\frac{Te}{d_T(\mathbf{x})}\right). \qquad (6)$$

Proof: We have

$$\begin{array}{rcl} \bar{e}_{\mathbf{x}}(\mathcal{A}^{\rm Bayes},T) &\le& \bar{e}_{\mathbf{x}}(\mathcal{A}^{\rm Gibbs},T)\\ &\le& \displaystyle\frac{1}{2T}\sum^T_{t=1} \mathbb{E}[-\log \xi_t] \\ &=& \displaystyle\frac{1}{2T}\sum^T_{t=1} \mathbb{E}\left[\log \frac{Q(\mathcal{V}_{t-1})}{Q(\mathcal{V}_t)}\right] \\ &=& \displaystyle\frac{1}{2T}\,\mathbb{E}\left[\log \frac{Q(\mathcal{V}_0)}{Q(\mathcal{V}_T)}\right] \\ &=& \displaystyle\frac{1}{2T}\,\mathbb{E}\left[\log \frac{1}{Q(\mathcal{V}_T)}\right]. \end{array}$$

From the definition of the version spaces, it is immediate that $\mathcal{V}_T$ is the equivalence class containing the “true” model $F \sim Q$. In other words, there exists some random $j(F) \in \{1,\ldots,J\}$ such that $\mathcal{V}_T = \mathcal{E}_{j(F)}$. Consequently, we can write

$$\begin{array}{rcl} \mathbb{E}\left[\log \frac{1}{Q(\mathcal{V}_T)}\right] &=& \displaystyle\sum^J_{j=1} \mathbb{E}\left[\mathbf{1}\{F \in \mathcal{E}_j\}\log\frac{1}{Q(\mathcal{V}_T)}\right] \\ &=& \displaystyle\sum^J_{j=1} \mathbb{E}\left[\mathbf{1}\{F \in \mathcal{E}_j\}\log\frac{1}{Q(\mathcal{E}_j)}\right] \\ &=& \displaystyle-\sum^J_{j=1} Q(\mathcal{E}_j) \log Q(\mathcal{E}_j). \end{array}$$

We can recognize the quantity in the last line as the Shannon entropy of the index $j(F)$: that is,

$$\begin{array}{rcl} \bar{e}_{\mathbf{x}}(\mathcal{A}^{\rm Bayes},T) &\le& \bar{e}_{\mathbf{x}}(\mathcal{A}^{\rm Gibbs},T) \\ &\le& \displaystyle\frac{H(j(F))}{2T}. \end{array}$$

Now, since $j(F)$ takes values in $\{1,\ldots,J\}$, we can invoke the Sauer-Shelah estimate (5) to get

$$H(j(F)) \le \log J \le d_T(\mathbf{x}) \log\left(\frac{Te}{d_T(\mathbf{x})}\right).$$

The claimed bound (6) follows. $\Box$

If $\mathcal{F}$ is a Vapnik-Chervonenkis class with VC dimension $d$, then $d_T(\mathbf{x}) \le d$ for every $\mathbf{x}$ and $T$, and (since $x \mapsto x\log(Te/x)$ is nondecreasing on $(0,T]$) we immediately obtain:

Corollary 3

$$\bar{e}_{\mathbf{x}}(\mathcal{A}^{\rm Bayes},T) \le \bar{e}_{\mathbf{x}}(\mathcal{A}^{\rm Gibbs},T) \le \frac{d}{2T}\log\left(\frac{Te}{d}\right).$$
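As a sanity check on Corollary 3, we can evaluate the averaged Gibbs error exactly for a toy Vapnik-Chervonenkis class (my own construction: threshold functions on a grid with a uniform prior, so $d = 1$) and compare it against the bound; logarithms are base 2 throughout, in keeping with our convention.

```python
import math

# Toy check (my own construction): thresholds x -> 1{x >= k} on a grid have
# VC dimension d = 1. We compute the Cesaro-averaged Gibbs error
# (1/T) sum_t E[1 - xi_t] exactly, under a uniform prior Q.
M = 20
hypotheses = list(range(M + 1))        # hypothesis k predicts 1{x >= k}

def label(k, x):
    return int(x >= k)

x_seq = list(range(M))                 # the individual sequence of inputs

def avg_gibbs_error(T):
    total = 0.0
    for k in hypotheses:               # enumerate the true model F = k
        V = set(hypotheses)            # current version space
        for t in range(1, T + 1):
            x = x_seq[t - 1]
            V_next = {f for f in V if label(f, x) == label(k, x)}
            xi = len(V_next) / len(V)  # volume ratio (uniform prior)
            total += (1 - xi) / (len(hypotheses) * T)
            V = V_next
    return total

T = 16
bound = (1 / (2 * T)) * math.log2(T * math.e)   # (d/2T) log2(Te/d) with d = 1
assert avg_gibbs_error(T) <= bound
```

On this example the exact averaged error is comfortably below the $O\big(\tfrac{d}{T}\log\tfrac{T}{d}\big)$ bound, consistent with the corollary holding for every individual input sequence $\mathbf{x}$.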