ISIT 2013: two plenaries on concentration of measure

Of the five plenary talks at this year’s ISIT, two were about concentration of measure: Katalin Marton’s Shannon lecture on “Distance-divergence inequalities” and Gabor Lugosi’s talk on “Concentration inequalities and the entropy method” the next morning. Since the topic of measure concentration is dear to my heart, I thought I would write down a few unifying themes.

When we speak of concentration of measure, we mean the following. Let $X$ be a random variable taking values in some metric space $(\mathsf{X},d)$. Then we say that $X$ (or, more correctly, the distribution of $X$) has the concentration property if, for any set $A \subseteq \mathsf{X}$ such that $\mathbb{P}(X \in A) \ge 1/2$, we have

$$\mathbb{P}\left( d(X,A) \le r \right) \xrightarrow{r \to \infty} 1. \qquad (1)$$

Here, $d(x,A)$ is the distance from the point $x \in \mathsf{X}$ to the set $A$:

$$d(x,A) := \inf_{y \in A} d(x,y).$$

Another way to express (1) is as follows: for any set $A \subseteq \mathsf{X}$ and any $r \ge 0$, define the $r$-blowup of $A$ by

$$A_r := \left\{ y \in \mathsf{X} : d(y,A) \le r \right\} \equiv \left\{ y \in \mathsf{X} : \exists x \in A \text{ such that } d(x,y) \le r \right\}.$$

Then $X$ has the concentration property if

$$\mathbb{P}(X \in A) \ge 1/2 \quad \Longrightarrow \quad \lim_{r \to \infty} \mathbb{P}(X \in A_r) = 1.$$

In other words, $X$ has the concentration property if any set containing $X$ with not too small a probability can be “blown up” to contain $X$ with near-certainty.

Here are two classic examples of concentration:

  1. Gaussian distribution in Euclidean space. Let $\mathsf{X} = \mathbb{R}^n$ and take $d(x,y) = \|x - y\|$, the usual Euclidean distance. Let $X$ be a standard $n$-dimensional Gaussian random variable, i.e., $X \sim N(0,I_n)$, where $I_n$ is the $n \times n$ identity matrix. Then for any $r \ge 0$ we have the following bound (a quick Monte Carlo sanity check of it appears right after these two examples):

    $$\mathbb{P}(X \in A) \ge 1/2 \quad \Longrightarrow \quad \mathbb{P}(X \in A_r) \ge 1 - e^{-r^2/2}.$$

  2. Uniform distribution in Hamming space. Let $\mathsf{X}$ be the Hamming cube $\{0,1\}^n$ equipped with the normalized Hamming distance

    $$d(x,y) = \frac{1}{n}\sum^n_{i=1} 1_{\{x_i \neq y_i\}}$$

    that counts the fraction of bits in which $x = (x_1,\ldots,x_n)$ and $y = (y_1,\ldots,y_n)$ disagree. Let $X$ have the uniform distribution on $\{0,1\}^n$, i.e., $\mathbb{P}(X=x) = 2^{-n}$ for all $x$. Then

    $$\mathbb{P}(X \in A) \ge 1/2 \quad \Longrightarrow \quad \mathbb{P}(X \in A_r) \ge 1 - e^{-2nr^2}.$$
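
Here is a minimal Monte Carlo sanity check of the Gaussian bound from example 1 (my own illustration, not from the talks). Take $A$ to be the half-space $\{x : x_1 \le 0\}$, so that $\mathbb{P}(X \in A) = 1/2$ and the blowup is simply $A_r = \{x : x_1 \le r\}$:

```python
# A minimal Monte Carlo check of the Gaussian bound in example 1, for the
# half-space A = {x : x_1 <= 0}: P(X in A) = 1/2 and A_r = {x : x_1 <= r}.
import numpy as np

rng = np.random.default_rng(0)
n, samples = 50, 200_000
X = rng.standard_normal((samples, n))   # X ~ N(0, I_n)

for r in [0.5, 1.0, 2.0, 3.0]:
    p_blowup = np.mean(X[:, 0] <= r)    # empirical P(X in A_r)
    bound = 1 - np.exp(-r**2 / 2)       # claimed lower bound 1 - e^{-r^2/2}
    print(f"r = {r}: P(A_r) ~ {p_blowup:.4f} >= {bound:.4f}")
```

For a half-space, $\mathbb{P}(X \in A_r) = \Phi(r)$, so this is just the standard Gaussian tail bound $1 - \Phi(r) \le e^{-r^2/2}$ in disguise; the point of the concentration statement is that the same bound holds for every set of probability at least $1/2$, not only for half-spaces.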

These two examples suggest that we should aim for “hard” statements in the form of sharp bounds on the concentration function

$$\alpha_X(r) := \sup_{A:\, \mathbb{P}(X \in A) \ge 1/2} \mathbb{P}(X \notin A_r)$$

as opposed to merely “soft” statements of the form $\alpha_X(r) \to 0$ as $r \to \infty$. The $64,000 question is: how do we get such bounds?

There are two ways to accomplish this goal, and the main idea underlying both is to replace sets with some other objects that are hopefully easier to handle. The first way is to replace sets by probability measures; the second is to replace them by functions. Here is what I mean:

Fix a set $A \subseteq \mathsf{X}$ with $\mathbb{P}(X \in A) > 0$. Let $P$ denote the distribution of $X$, and let $P_A$ denote the conditional distribution of $X$ given $X \in A$. That is, for any (measurable) set $B \subseteq \mathsf{X}$ we have

$$P_A(B) := \frac{P(A \cap B)}{P(A)}.$$

I am using the subscript notation $P_A$ instead of the more usual $P(\cdot|A)$ to indicate the fact that $P_A$ is a probability measure in its own right. In this way, we can associate to each non-null set $A$ a probability measure $P_A$. Now, here is a very simple observation that turns out to be very consequential:

$$D(P_A \| P) = \log \frac{1}{P(A)}. \qquad (2)$$

This is very easy to prove: for any set $B$ we have

$$P_A(B) = \frac{1}{P(A)}\int_B 1_A(x)\, P(dx),$$

so $P_A$ is absolutely continuous with respect to $P$ with the Radon–Nikodym derivative $dP_A/dP = 1_A/P(A)$. Therefore, by definition of the divergence,

$$D(P_A \| P) = \int dP_A \log \frac{dP_A}{dP} = \frac{1}{P(A)}\int_A dP\, \log\frac{1}{P(A)} = \log \frac{1}{P(A)}.$$

So if we are interested in bounding the probabilities of various sets $A$, we may hope to get some mileage out of the relationship (2).
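
As a quick illustration of identity (2) (my own toy example, not from either talk), here is a numerical check on a small discrete space:

```python
# Numerical check of identity (2): D(P_A || P) = log(1 / P(A)), for a small
# discrete distribution P and the set A = {0, 1}.  (Illustrative only.)
import numpy as np

P = np.array([0.1, 0.25, 0.3, 0.2, 0.15])        # a distribution on {0,...,4}
A = np.array([True, True, False, False, False])  # indicator of the set A

P_A = np.where(A, P, 0.0) / P[A].sum()           # conditional distribution P_A
kl = np.sum(P_A[A] * np.log(P_A[A] / P[A]))      # D(P_A || P), summed over A
print(kl, np.log(1 / P[A].sum()))                # both equal log(1/0.35)
```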

On the other hand, we may also associate to a set $A$ with $P(A) > 0$ the function

$$f_A(x) := d(x,A) \equiv \inf_{y \in A} d(x,y).$$

This function is Lipschitz: for any $x, x' \in \mathsf{X}$,

$$f_A(x) - f_A(x') = \inf_{y \in A} d(x,y) - \inf_{y \in A} d(x',y) \le \sup_{y \in A} \left[ d(x,y) - d(x',y) \right] \le d(x,x'),$$

where the last step is by the triangle inequality. Interchanging the roles of $x$ and $x'$, we get the Lipschitz property (a small numerical check of it appears right after the list below). Moreover, let us consider the random variable $Z = f_A(X)$, where $X$ is our original $\mathsf{X}$-valued random variable. Then we immediately notice two things:

  1. For any $r \ge 0$, $\mathbb{P}(Z \le r) = \mathbb{P}\left( d(X,A) \le r \right) = P(A_r)$.
  2. If $P(A) = \mathbb{P}(X \in A) \ge 1/2$, then $0$ is a median of $Z$, in the sense that

    $$\mathbb{P}(Z \le 0) = P(A) \ge 1/2 \qquad \text{and} \qquad \mathbb{P}(Z \ge 0) \ge 1/2$$

    (the second inequality is obviously true since $Z$ is nonnegative with probability 1).
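
Here is the promised numerical check of the 1-Lipschitz property of $f_A = d(\cdot,A)$, for a randomly generated finite set $A$ in the Euclidean plane (an illustrative sketch of my own):

```python
# Check that f_A(x) = d(x, A) is 1-Lipschitz, for a random finite set A in R^2.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 2))                 # a finite set A in R^2

def f_A(x):
    """Distance from the point x to the set A."""
    return np.min(np.linalg.norm(A - x, axis=1))

for _ in range(1000):
    x, xp = rng.standard_normal(2), rng.standard_normal(2)
    assert abs(f_A(x) - f_A(xp)) <= np.linalg.norm(x - xp) + 1e-12
print("1-Lipschitz property holds on all sampled pairs")
```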

These two observations suggest that we may obtain concentration bounds by bounding the probability that a given Lipschitz function of $X$ deviates from its median by more than $r$. In fact, it is easy to derive an alternative expression for the concentration function $\alpha_X$:

$$\alpha_X(r) = \sup_{\text{1-Lipschitz } f} \mathbb{P}\Big( f(X) > m_f + r \Big), \qquad (3)$$

where $m_f$ denotes any median of $f(X)$. We already showed, by passing from $A$ to $f_A = d(\cdot,A)$, that $\alpha_X$ is bounded from above by the quantity on the right-hand side of (3). To get the reverse inequality, take any 1-Lipschitz function $f$ and consider the set $A_f := \left\{ x \in \mathsf{X} : f(x) \le m_f \right\}$, where $m_f$ is any median of $f(X)$. Then, by definition,

$$\mathbb{P}(X \in A_f) = \mathbb{P}\Big( f(X) \le m_f \Big) \ge 1/2.$$

Moreover, if we consider the $r$-blowup

$$[A_f]_r = \Big\{ x \in \mathsf{X} : d(x,A_f) \le r \Big\},$$

then for any $x \in \mathsf{X}$ and any $y \in A_f$ we must have

$$f(x) - m_f \le f(x) - f(y) \le d(x,y),$$

where the last step is by the Lipschitz property of $f$. Taking the infimum over $y \in A_f$ gives $f(x) - m_f \le d(x,A_f)$. Consequently, by definition of the concentration function,

$$\mathbb{P}\Big( f(X) > m_f + r \Big) \le \mathbb{P}\Big( d(X,A_f) > r \Big) = 1 - P\left([A_f]_r\right) \le \alpha_X(r).$$

By passing to the functional viewpoint, we obtain another equivalent characterization of the concentration property: a random variable $X$ taking values in a metric space $(\mathsf{X},d)$ has the concentration property if real-valued Lipschitz functions of $X$ are “nearly constant.”
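
To see the functional viewpoint in action, here is a small Monte Carlo sketch (the choice of $f$ is mine, purely for illustration): in the Gaussian setting of example 1, the Euclidean norm $f(x) = \|x\|$ is 1-Lipschitz, so, combining example 1 with the argument above, its deviations above the median should be bounded by $e^{-r^2/2}$:

```python
# Functional viewpoint in the Gaussian setting of example 1: f(x) = ||x|| is
# 1-Lipschitz on R^n, so P(f(X) > m_f + r) should be at most e^{-r^2/2}.
import numpy as np

rng = np.random.default_rng(2)
n, samples = 30, 200_000
Z = np.linalg.norm(rng.standard_normal((samples, n)), axis=1)   # f(X) = ||X||
m = np.median(Z)                                                # a median m_f

for r in [0.5, 1.0, 2.0]:
    tail = np.mean(Z > m + r)              # empirical P(f(X) > m_f + r)
    bound = np.exp(-r**2 / 2)
    print(f"r = {r}: tail ~ {tail:.5f} <= {bound:.5f}")
```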

Let’s look at the first, probabilistic viewpoint, which was born out of a 1996 breakthrough paper by Marton. Given a metric space $(\mathsf{X},d)$, let us define the $L_1$ Wasserstein distance (or transportation distance) between any two probability measures $P$ and $Q$ on it:

$$W_1(P,Q) := \inf_{X \sim P,\, Y \sim Q} \mathbb{E}[d(X,Y)],$$

where the infimum is over all jointly distributed random variables $X, Y \in \mathsf{X}$ such that $P_X = P$ and $P_Y = Q$. Now consider a random variable $X \in \mathsf{X}$, for which we wish to establish concentration. What Marton showed is the following: Suppose the distribution $P$ of $X$ satisfies the $L_1$ transportation inequality

$$W_1(P,Q) \le \sqrt{2c\, D(Q \| P)} \qquad (4)$$

for some constant $c > 0$. Then $X$ has the concentration property; in fact, as we will see shortly, $\alpha_X(r)$ admits a Gaussian bound. Here is the argument. Take any two sets $A, B$ with $P(A), P(B) \neq 0$. Recalling our notation for conditional distributions, we can write

$$\begin{array}{rcl} W_1(P_A,P_B) &\le& W_1(P_A,P) + W_1(P_B,P) \\ &\le& \sqrt{2c\, D(P_A \| P)} + \sqrt{2c\, D(P_B \| P)} \\ &=& \sqrt{2c \log \frac{1}{P(A)}} + \sqrt{2c \log \frac{1}{P(B)}}, \end{array}$$

where in the first step we have used the triangle inequality, in the second we have used the fact that $P$ satisfies the transportation inequality (4), and in the last step we have used the formula (2). Now suppose that $P(A) \ge 1/2$ and let $B = (A_r)^c$ for some $r$, where the superscript $c$ denotes set-theoretic complement. Then we can show that $W_1(P_A,P_B) \ge d(A,B) \ge r$. On the other hand,

$$\log \frac{1}{P(A)} \le \log 2 \qquad \text{and} \qquad \log\frac{1}{P(B)} = \log\frac{1}{1-P(A_r)}.$$

Combining these facts gives us the bound

$$r \le \sqrt{2c \log 2} + \sqrt{2c \log \frac{1}{1-P(A_r)}}$$

that holds for all $r$. If $r > \sqrt{2c \log 2}$, then rearranging and exponentiating gives

$$P(A_r) \ge 1 - \exp\left( - \frac{1}{2c}\left(r - \sqrt{2c \log 2}\right)^2\right),$$

so we indeed have concentration and a sharp bound on $\alpha_X(r)$, at least for large enough values of $r$. The main message here is that, in order to study concentration, it suffices to work on the level of probability measures and to focus one’s effort on showing that the distribution of $X$ satisfies a suitable transportation inequality. Since Marton’s original work, there have been many refinements and extensions, which I will not go into here. One such result, due to Sergey Bobkov and Friedrich Götze, says that $P$ satisfying a transportation inequality of the form (4) is equivalent to the Gaussian concentration property

$$\alpha_X(r) \le e^{-r^2/2c}, \qquad \forall r \ge 0.$$
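
As a concrete sanity check of the transportation inequality (4) (again my own illustration): the standard Gaussian $P = N(0,1)$ on the real line satisfies (4) with $c = 1$, and for mean shifts $Q = N(m,1)$ the inequality is in fact tight, since $W_1(P,Q) = |m|$ and $D(Q\|P) = m^2/2$. A quick numerical confirmation, with $W_1$ estimated from samples:

```python
# Sanity check of (4) with c = 1 for P = N(0,1) and Q = N(m,1): W_1(P,Q) = |m|
# and sqrt(2*c*D(Q||P)) = sqrt(m^2) = |m|, so (4) holds with equality here.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
samples = 200_000
for m in [0.5, 1.0, 2.0]:
    P_samples = rng.standard_normal(samples)
    Q_samples = rng.standard_normal(samples) + m
    w1 = wasserstein_distance(P_samples, Q_samples)   # empirical W_1, ~ |m|
    rhs = np.sqrt(2 * 1.0 * m**2 / 2)                 # sqrt(2 c D(Q||P)), c = 1
    print(f"m = {m}: W1 ~ {w1:.3f}, sqrt(2cD) = {rhs:.3f}")
```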

Now let’s look at the complementary functional viewpoint. Recall that we seek tight upper bounds on deviation probabilities of the form

$$\mathbb{P}\Big( f(X) \ge m_f + r \Big), \qquad \forall r > 0.$$

It is easier to work with means instead of medians, and indeed it can be shown that concentration around the mean is equivalent to concentration around any median. So let’s focus on the mean. Let $X$, as before, be a random variable over some metric space $(\mathsf{X},d)$, and consider a Lipschitz function $f : \mathsf{X} \rightarrow \mathbb{R}$ such that $\mathbb{E}[f(X)] = 0$. We can apply the well-known Chernoff trick: for any $r, \lambda > 0$ we have

$$\mathbb{P}\Big( f(X) \ge r \Big) = \mathbb{P}\Big( e^{\lambda f(X)} \ge e^{\lambda r} \Big) \le e^{-\lambda r}\, \mathbb{E}[e^{\lambda f(X)}].$$

Now the whole affair hinges on the availability of tight upper bounds on the logarithmic moment-generating function $\Lambda(\lambda) := \log \mathbb{E}[e^{\lambda f(X)}]$. The entropy method, which was the subject of Gabor Lugosi’s plenary, is the name for a set of techniques for bounding $\Lambda(\lambda)$ by means of inequalities involving the relative entropy between tilted distributions derived from $P$ and $P$ itself. The entropy method has roots in the work of Michel Ledoux, who in turn distilled it from some very deep results of Michel Talagrand.

The simplest version of the entropy method goes something like this. Let us define, for any $t \in \mathbb{R}$, the tilted distribution $P^{(t)}$ via

$$\frac{dP^{(t)}}{dP}(x) = \frac{e^{tf(x)}}{e^{\Lambda(t)}}$$

(assuming, of course, that $\Lambda(t)$ exists and is finite). Then we have

$$\begin{array}{rcl} D(P^{(t)} \| P) &=& \int dP^{(t)} \left[ tf - \Lambda(t)\right] \\ &=& \frac{t}{e^{\Lambda(t)}}\, \mathbb{E}\left[f(X)e^{tf(X)}\right] - \Lambda(t) \\ &=& t \Lambda'(t) - \Lambda(t) \\ &=& t^2 \frac{d}{dt} \left(\frac{\Lambda(t)}{t}\right). \end{array}$$

Integrating from $0$ to $\lambda$, and using the fact that $\Lambda(0) = 0$ and $\Lambda'(0) = \mathbb{E}[f(X)] = 0$ (so that $\Lambda(t)/t \to 0$ as $t \downarrow 0$), we get

$$\Lambda(\lambda) = \lambda \int^\lambda_0 \frac{D(P^{(t)} \| P)}{t^2}\, dt. \qquad (5)$$
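
Before continuing, here is a minimal numerical check (my own toy example) of the identity $D(P^{(t)}\|P) = t\Lambda'(t) - \Lambda(t)$ derived above and of the integral formula (5), on a small discrete space:

```python
# Numerical check of D(P^(t)||P) = t*Lambda'(t) - Lambda(t) and of formula (5)
# for a random discrete distribution P and a centered function f.
import numpy as np

rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(6))            # a distribution P on {0,...,5}
f = rng.standard_normal(6)
f -= P @ f                               # center f so that E[f(X)] = 0

def Lambda(t):
    return np.log(P @ np.exp(t * f))     # log E[exp(t f(X))]

def kl_tilted(t):
    Pt = P * np.exp(t * f - Lambda(t))   # tilted distribution P^(t)
    return np.sum(Pt * np.log(Pt / P))   # D(P^(t) || P)

# The identity, with Lambda'(t) approximated by a central finite difference.
t, h = 0.7, 1e-5
print(kl_tilted(t), t * (Lambda(t + h) - Lambda(t - h)) / (2 * h) - Lambda(t))

# Formula (5), with the integral evaluated by the midpoint rule.
lam, N = 1.5, 20_000
ts = (np.arange(N) + 0.5) * (lam / N)
print(Lambda(lam), lam * sum(kl_tilted(t) / t**2 for t in ts) * (lam / N))
```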

Now suppose that we can bound $D(P^{(t)} \| P) \le ct^2/2$ for some $c > 0$. Then from (5) we have

$$\Lambda(\lambda) \le \frac{c\lambda^2}{2}, \qquad \forall \lambda > 0,$$

which in turn gives the Gaussian tail bound

$$\mathbb{P}\Big( f(X) \ge r \Big) \le \inf_{\lambda > 0} \exp\left(-\lambda r + \frac{c\lambda^2}{2}\right) = \exp\left(-\frac{r^2}{2c}\right).$$

This is the so-called Herbst argument. Of course, I have glossed over the most nontrivial part, namely showing that we can bound the relative entropy $D(P^{(t)} \| P)$ by a quadratic function of $t$. Gabor’s talk contained numerous examples of how one could get such bounds in the special case when $X$ is an $n$-tuple of independent random variables taking values in some base space $\mathcal{X}$, $X = (X_1,\ldots,X_n) \in \mathcal{X}^n$, and the function $f$ is not “too sensitive” to local modifications of its arguments. This takes us back to probability on metric spaces.

Here is one classic example. Suppose that our function $f$ has the bounded difference property, i.e., there exist some constants $c_1,\ldots,c_n \ge 0$, such that changing the $i$th argument of $f$ while keeping the others constant will change the value of $f$ by at most $c_i$:

$$\sup_{x_1,\ldots,x_n}\sup_{x'_i}\left| f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x'_i,\ldots,x_n) \right| \le c_i.$$

We can express this more succinctly as a Lipschitz property of $f$ if we define the weighted Hamming metric

$$d(x,y) = \sum^n_{i=1} c_i 1_{\{x_i \neq y_i\}} \qquad (6)$$

(we can assume without loss of generality that all the $c_i$’s are strictly positive, because we can simply ignore those coordinates of $x$ that do not affect the value of $f$). Then the bounded difference property is equivalent to saying that $f$ is 1-Lipschitz with respect to this weighted Hamming metric. Moreover, it is possible to show that any product probability measure $P = P_1 \otimes \ldots \otimes P_n$ on the product space $\mathsf{X} = \mathcal{X}^n$ satisfies the transportation inequality

$$W_1(Q,P) \le \sqrt{2c\, D(Q \| P)},$$

where the Wasserstein distance is computed with respect to the weighted Hamming metric (6), and $c = \frac{1}{4}\sum^n_{i=1}c^2_i$. By the Bobkov–Götze result quoted above, this is equivalent to the concentration bound

$$\mathbb{P}\left( f(X) \ge \mathbb{E}[f(X)] + r \right) \le \exp\left( - \frac{r^2}{2c}\right) = \exp\left( - \frac{2r^2}{\sum^n_{i=1}c^2_i}\right).$$

This is McDiarmid’s well-known inequality, which was originally derived using martingale techniques, but here we have arrived at it through a completely different route that took us back to where we started: concentration of Lipschitz functions around their means and/or medians, which (as we saw) is the same thing as the original, “stochastic-geometric” view of the concentration of measure phenomenon.
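
To wrap up with some code, here is a quick Monte Carlo illustration of McDiarmid’s inequality in perhaps its simplest instance (my own choice of example): the empirical mean of $n$ i.i.d. Bernoulli(1/2) bits, for which $c_i = 1/n$ and the bound becomes $e^{-2nr^2}$, i.e., Hoeffding’s inequality.

```python
# Monte Carlo illustration of McDiarmid's inequality for f(X) = (1/n) sum_i X_i
# with i.i.d. Bernoulli(1/2) bits: c_i = 1/n, so the bound is exp(-2 n r^2).
import numpy as np

rng = np.random.default_rng(5)
n, samples = 100, 100_000
means = rng.integers(0, 2, size=(samples, n)).mean(axis=1)   # f(X), one per sample

for r in [0.05, 0.10, 0.15]:
    tail = np.mean(means >= 0.5 + r)       # empirical P(f(X) >= E[f(X)] + r)
    bound = np.exp(-2 * n * r**2)
    print(f"r = {r}: tail ~ {tail:.4f} <= {bound:.4f}")
```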