Machine learning is about enabling computers to improve their performance on a given task as they get more data. Can we express this intuition quantitatively using information-theoretic techniques? In this post, I will discuss a classic paper by David Haussler, Michael Kearns, and Robert Schapire that (to the best of my knowledge) took the first step in this direction. I will describe some of their results, recast in a more explicitly information-theoretic way.

We will stick to the setting of binary classification: we have a feature space (which is completely arbitrary) and a binary label space . Let be a class of functions , which we will call the *hypothesis space*. These functions are our models for the relationship between features and labels, which we wish to learn from data. In order to focus on the essential aspects of the problem, we will adopt the Bayesian approach and assume that the “true” model is a random element of with a given distribution . We assume that the learning agent knows , so we can think of this distribution as the learner’s prior. Training data consist of input-output pairs , where . Here, each is an arbitrary point of , and for now we will treat as an *individual sequence*. The labels , though, are random and given by , where is the model drawn by Nature from .

The learning agent receives the training data sequentially: at each time , it has access to . Let us denote the pair by . The learning task is: given the next input , predict the value . A *learning algorithm* is a sequence of possibly randomized mappings : at each time , the learning agent uses all available training data to generate a prediction for the label of . After generating the prediction, the learning agent observes the true . We are interested in the probability that makes an error at time :

We will refer to the function as the *learning curve* of [note that it depends on the sequence of inputs supplied to the learning agent during learning].

How can we measure the learner’s progress? Let’s write out the causal ordering of all the random variables generated in the process of learning:

Here, has distribution , are determined by the learning algorithm , and are determined by and . Let us denote the joint distribution of all of these variables by . When the learning agent observes at time , we can quantify its information gain relative to time by

It’s not hard to see that is nonnegative because (information at time ) includes . The expected information gain is given by

the conditional mutual information between and given . For our purposes, it will be convenient to express the information gain in terms of a certain quantity called the *volume ratio*, which quantifies the rate at which the learning agent can zero in on the true model as more observations keep arriving.

To that end, let us first observe that, at time , the agent can localize the unknown model to the set

This set is known in learning theory as the *version space* at time . It is easy to see that the version spaces are nested: . Clearly, for all , and for any

This implies, in particular, that

and therefore that

where . Since , we expect that each training sample shrinks the version space, and that the learning curve of any good learning algorithm should be controlled by the rate at which decreases. For future convenience, let us define the *volume ratio* at time :

Note, by the way, that since , we have

Consequently,

We are going to use binary algorithms, so the units of all information quantities are bits.

Now let’s get down to business. Following Haussler et al., we will consider the following two learning algorithms:

- The
*Bayes algorithm*: at time , it partitions the version space into two disjoint subsetsand then predicts

The Bayes algorithm is optimal in the sense that it minimizes the probability of incorrectly predicting given all available information. However, it is an

*improper*learning algorithm in the sense that the function will not, in general, be an element of . - The
*Gibbs algorithm*: at time , it draws a random and predictsThe Gibbs algorithm is suboptimal, but, unlike the Bayes algorithm, the function does belong to by construction.

Now let’s analyze their learning curves:

- The Bayes algorithm will make a mistake whenever , i.e., whenever . Indeed, using the fact that, by definition of the version space,
we have

Thus, if , then , the Bayes algorithm will be correct: . Finally, if , then the Bayes algorithm will make a mistake with probability . Therefore,

where we have defined the function

- The Gibbs algorithm will make a mistake whenever is not in the version space . Since is drawn from the posterior distribution of given , we have

Before stating and proving the main result, we need to collect some preliminary facts. First of all, the volume ratio is always bounded between and . Moreover, we have the following useful formula: for any function ,

(the proof is by conditioning on and using the law of iterated expectation).

Theorem 1We have the following bounds for the Bayes and the Gibbs algorithms:

Remark 1Equation (2) shows that the error probability of the Bayes algorithm at time can be bounded below by a function of the expected information gain at time . It also shows that the error probability of the Gibbs algorithm is at least as large as that of the Bayes algorithm. On the other hand, Equation (3) gives an upper bound on the error performance of the Bayes algorithm in terms of the expected information gain, and it also shows that the error of the Gibbs algorithm is at most twice as large as that of the Bayes algorithm.

*Proof:* We start by obtaining the following alternative expressions for the error probabilities of the two algorithms, as well as for the expected information gain:

Using (1) with , we have

This shows, in particular, that the average information gain due to receiving another training sample is no more than one bit.

With , we have

where we have used the fact that, for any ,

Another way to arrive at this expression is by using the following fact: Suppose we have a biased coin with probability of heads equal to . The best prediction of the outcome of a single toss of that coin, with respect to the Hamming loss , is to predict heads if and tails otherwise. The expected loss in that case is equal to . The problem of predicting given and is equivalent to the above guessing problem with .

Finally, with , we have

Again, if we consider the problem of guessing the outcome of tossing a coin, we can think of the following suboptimal scheme: guess with probability and with probability (i.e., just mentally simulate the coin toss). The probability of error is exactly , which is also twice the variance of a random variable. Reasoning conditionally on , we see that the Gibbs algorithm is exactly this scheme with .

We will also use the following chain of easy-to-prove inequalities: for any ,

We start by proving the lower bound, Equation (2). First, note that, for any ,

Moreover, the function is convex. Therefore,

The upper bound, Equation (3), follows by taking in (4) and then taking expectations.

Now let’s see what happens when we have to do this over and over again: at each time , the learning agent gives a prediction , observes , updates , etc. What can we say about the expected fraction of mistakes? Given a learning algorithm , the quantity of interest is

for each . In order to get a handle on this quantity, we will need some Vapnik-Chervonenkis combinatorics. Given the sequence of input instances, let denote the set of all *distinct* elements of the tuple . Say that two hypotheses are equivalent if for all . This defines an equivalence relation on , so we can split it into equivalence classes . Here, the number of equivalence classes depends on and on . How many such classes can we have? Let us order the elements of arbitrarily, say by , where . Each can be associated with a binary string . Then and are in the same equivalence class if and only if they give rise to the same binary strings:

Consequently,

For any subset , let us define in the obvious way: if , then we let . Let be the largest integer , such that

for all with . A deep result, commonly known as the Sauer-Shelah lemma (see here for some fascinating history), then says that

Now we are ready to state and prove the following:

*Proof:* We have

From the definition of the version spaces, it is immediate that is the equivalence class containing the “true” model . In other words, there exists some random , such that . Consequently, we can write

We can recognize the quantity in the last line as the Shannon entropy of the index : that is,

Now, since takes values in , we can invoke the Sauer-Shelah estimate (5) to get

The claimed bound (6) follows.

If is a Vapnik-Chervonenkis class with VC dimension , then we immediately obtain:

Corollary 3