Conditional mutual information and the best Markov approximation

I came across a neat and useful result about conditional mutual information while reading a paper on quantum information theory.

Consider three jointly distributed random variables $X$, $Y$, and $Z$. If the conditional mutual information $I(X;Y|Z) = 0$, then $X$ and $Y$ are conditionally independent given $Z$ (i.e., $X \rightarrow Z \rightarrow Y$ is a Markov chain). In fact, $I(X;Y|Z) = 0$ if and only if $X \rightarrow Z \rightarrow Y$. Now, what if $I(X;Y|Z)$ is nonzero, but very small? Intuitively, one would expect that the joint distribution $P_{XYZ}$ is somehow “close” to being Markov. It turns out that this intuition is, indeed, correct: in fact, the conditional mutual information $I(X;Y|Z)$ is the divergence between $P_{XYZ}$ and the nearest distribution $Q_{XYZ}$ under which $X \rightarrow Z \rightarrow Y$ is a Markov chain.

Theorem 1. Let $X, Y, Z$ be three jointly distributed random variables. Then

$$
I(X;Y|Z) = \min\left\{ D(P_{XYZ} \,\|\, Q_{XYZ}) \,:\, X \rightarrow Z \rightarrow Y \text{ under } Q \right\}.
$$

Moreover, the minimum is achieved by

$$
Q^*_{XYZ}(x,y,z) = P_Z(z)\, P_{X|Z}(x|z)\, P_{Y|Z}(y|z).
$$
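Before going through the proof, here is a quick numerical sanity check (a sketch of my own, not from the post, using numpy with arbitrary small alphabets and a fixed random seed): for a randomly generated joint pmf $P_{XYZ}$, the conditional mutual information $I(X;Y|Z)$ in nats should coincide with $D(P_{XYZ} \| Q^*_{XYZ})$, with $Q^*_{XYZ}$ as defined above.

```python
# Numerical sanity check of Theorem 1 (my own sketch, not from the post):
# for a random joint pmf P[x, y, z] on small finite alphabets, I(X;Y|Z) in
# nats should equal D(P_XYZ || Q*_XYZ) with Q* as defined above.
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 3, 4, 2                          # arbitrary alphabet sizes

P = rng.random((nx, ny, nz))
P /= P.sum()                                  # random joint pmf P_XYZ

P_z  = P.sum(axis=(0, 1))                     # P_Z(z)
P_xz = P.sum(axis=1)                          # P_XZ(x, z)
P_yz = P.sum(axis=0)                          # P_YZ(y, z)

# Best Markov approximation Q*(x,y,z) = P_Z(z) P_{X|Z}(x|z) P_{Y|Z}(y|z)
Q_star = P_xz[:, None, :] * P_yz[None, :, :] / P_z[None, None, :]

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats (0 log 0 := 0)."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def cond_mi(P):
    """I(X;Y|Z) = sum_z P_Z(z) D(P_{XY|Z=z} || P_{X|Z=z} x P_{Y|Z=z})."""
    total = 0.0
    for z in range(P.shape[2]):
        pz = P[:, :, z].sum()
        joint = P[:, :, z] / pz
        indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
        total += pz * kl(joint, indep)
    return total

print(cond_mi(P))      # I(X;Y|Z)
print(kl(P, Q_star))   # D(P_XYZ || Q*_XYZ) -- agrees up to float error
```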

Proof: To keep things simple, I will assume that all random variables take values in finite sets. Consider any distribution $Q_{XYZ}$ for which $X \rightarrow Z \rightarrow Y$, so that $Q_{XYZ}(x,y,z) = Q_Z(z)\, Q_{X|Z}(x|z)\, Q_{Y|Z}(y|z)$. Then

$$
\begin{array}{rcl}
D(P_{XYZ} \| Q_{XYZ}) &=& \mathbb{E}\left[ \log \frac{P_{XYZ}(X,Y,Z)}{Q_{XYZ}(X,Y,Z)}\right] \\
&=& \mathbb{E}\left[ \log \left(\frac{P_Z(Z)}{Q_Z(Z)} \cdot \frac{P_{X|Z}(X|Z)}{Q_{X|Z}(X|Z)} \cdot \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right)\right] \\
&=& \mathbb{E}\left[\log \frac{P_Z(Z)}{Q_Z(Z)}\right] + \mathbb{E}\left[\log \frac{P_{X|Z}(X|Z)}{Q_{X|Z}(X|Z)}\right] + \mathbb{E}\left[\log \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right].
\end{array}
$$

The first two terms on the right-hand side are equal, respectively, to $D(P_Z \| Q_Z)$ and $D(P_{X|Z} \| Q_{X|Z} \mid P_Z)$. For the third term, we can write

$$
\begin{array}{rcl}
\mathbb{E}\left[\log \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right] &=& \mathbb{E}\left[ \log \frac{P_{XYZ}(X,Y,Z)}{P_{XZ}(X,Z)\, Q_{Y|Z}(Y|Z)}\right] \\
&=& \mathbb{E}\left[ \log \left( \frac{P_{XY|Z}(X,Y|Z)}{P_{X|Z}(X|Z)\, P_{Y|Z}(Y|Z)} \cdot \frac{P_{Y|Z}(Y|Z)}{Q_{Y|Z}(Y|Z)}\right)\right] \\
&=& I(X;Y|Z) + D(P_{Y|Z} \| Q_{Y|Z} \mid P_Z).
\end{array}
$$

Putting everything together, we have

$$
D(P_{XYZ} \| Q_{XYZ}) = D(P_Z \| Q_Z) + D(P_{X|Z} \| Q_{X|Z} \mid P_Z) + D(P_{Y|Z} \| Q_{Y|Z} \mid P_Z) + I(X;Y|Z). \qquad (1)
$$

Now we observe two things:

  1. The right-hand side of (1) is greater than or equal to $I(X;Y|Z)$.
  2. The first three terms on the right-hand side are zero if and only if $Q_{XYZ}$ is such that $P_Z = Q_Z$ and $P_{X|Z=z} = Q_{X|Z=z}$, $P_{Y|Z=z} = Q_{Y|Z=z}$ for all $z$ that satisfy $P_Z(z) > 0$.

Hence the minimum in Theorem 1 equals $I(X;Y|Z)$ and is attained at $Q^*_{XYZ}$. This concludes the proof. $\Box$
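As a quick numerical sanity check of the decomposition (1) (again a sketch of my own, with numpy, arbitrary alphabet sizes, and a fixed seed): draw a random $P_{XYZ}$ and a random Markov $Q_{XYZ} = Q_Z\, Q_{X|Z}\, Q_{Y|Z}$, compute the four terms on the right-hand side separately, and compare their sum with $D(P_{XYZ} \| Q_{XYZ})$.

```python
# Numerical check of the decomposition (1) (my own sketch, not from the post).
import numpy as np

rng = np.random.default_rng(1)
nx, ny, nz = 3, 3, 2

# Random joint P_XYZ and a random Markov Q_XYZ = Q_Z * Q_{X|Z} * Q_{Y|Z}
P = rng.random((nx, ny, nz)); P /= P.sum()
Q_z = rng.random(nz); Q_z /= Q_z.sum()
Q_x_given_z = rng.random((nx, nz)); Q_x_given_z /= Q_x_given_z.sum(axis=0)
Q_y_given_z = rng.random((ny, nz)); Q_y_given_z /= Q_y_given_z.sum(axis=0)
Q = Q_z[None, None, :] * Q_x_given_z[:, None, :] * Q_y_given_z[None, :, :]

# Marginals and conditionals of P
P_z = P.sum(axis=(0, 1))                       # P_Z(z)
P_x_given_z = P.sum(axis=1) / P_z[None, :]     # P_{X|Z}(x|z)
P_y_given_z = P.sum(axis=0) / P_z[None, :]     # P_{Y|Z}(y|z)

def kl(p, q):
    """D(p || q) in nats, with the convention 0 log 0 = 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# The divergence terms appearing in (1)
d_z  = kl(P_z, Q_z)
d_xz = sum(P_z[z] * kl(P_x_given_z[:, z], Q_x_given_z[:, z]) for z in range(nz))
d_yz = sum(P_z[z] * kl(P_y_given_z[:, z], Q_y_given_z[:, z]) for z in range(nz))

# I(X;Y|Z), computed as D(P || Q*) using Theorem 1
Q_star = P.sum(axis=1)[:, None, :] * P.sum(axis=0)[None, :, :] / P_z[None, None, :]
i_xy_given_z = kl(P, Q_star)

print(kl(P, Q))                                # left-hand side of (1)
print(d_z + d_xz + d_yz + i_xy_given_z)        # right-hand side of (1)
```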

Here is one interesting consequence. Consider three random variables $X$, $Y$ and $Z$, where $X$ is real-valued with $|X| \le 1$. Let

$$
\mathrm{MMSE}(X|Y) = \inf_f \mathbb{E}\left[(X - f(Y))^2\right] = \mathbb{E}\left[\left(X - \mathbb{E}[X|Y]\right)^2\right]
$$

be the minimum mean-square error (MMSE) achievable by any (measurable) estimator of $X$ based on $Y$, and define $\mathrm{MMSE}(X|Y,Z)$ in the same way. Then $\mathrm{MMSE}(X|Y,Z) \le \mathrm{MMSE}(X|Y)$, i.e., using more information for estimation never hurts. However, how much do we gain by using both $Y$ and $Z$? The following inequality, due to Yihong Wu and Sergio Verdú, gives an upper bound on the excess MMSE in terms of conditional mutual information:

$$
\mathrm{MMSE}(X|Y) - \mathrm{MMSE}(X|Y,Z) \le 2\, I(X;Z|Y), \qquad (2)
$$

where the mutual information is measured in nats. This inequality implies (a generalization of) Tao’s inequality,

$$
\mathbb{E}\big|\mathbb{E}[X|Y] - \mathbb{E}[X|Y,Z]\big| \le \sqrt{2\, I(X;Z|Y)}. \qquad (3)
$$
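Before deriving (3) from (2), here is a small numerical check of both inequalities (my own sketch, not from the Wu–Verdú paper): a random finite joint distribution of $(X,Y,Z)$, with the values of $X$ drawn from $[-1,1]$ so that $|X| \le 1$, and everything computed exactly from the pmf.

```python
# Quick numerical check of (2) and (3) (my own sketch, not from the paper):
# a random finite joint pmf P[x, y, z], with the values of X in [-1, 1].
import numpy as np

rng = np.random.default_rng(2)
nx, ny, nz = 4, 3, 3                           # arbitrary alphabet sizes
x_vals = rng.uniform(-1.0, 1.0, size=nx)       # values of X, so |X| <= 1

P = rng.random((nx, ny, nz))
P /= P.sum()

def kl(p, q):
    """D(p || q) in nats (0 log 0 := 0)."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def cond_mi_xz_given_y(P):
    """I(X;Z|Y) in nats for a pmf indexed as P[x, y, z]."""
    total = 0.0
    for y in range(P.shape[1]):
        py = P[:, y, :].sum()
        joint = P[:, y, :] / py                # P_{XZ|Y=y}
        indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
        total += py * kl(joint, indep)
    return total

# Conditional means E[X|Y], E[X|Y,Z] and the corresponding MMSEs
P_y  = P.sum(axis=(0, 2))
P_yz = P.sum(axis=0)
E_x_given_y  = (x_vals[:, None] * P.sum(axis=2)).sum(axis=0) / P_y
E_x_given_yz = (x_vals[:, None, None] * P).sum(axis=0) / P_yz
mmse_y  = ((x_vals[:, None, None] - E_x_given_y[None, :, None]) ** 2 * P).sum()
mmse_yz = ((x_vals[:, None, None] - E_x_given_yz[None, :, :]) ** 2 * P).sum()
l1_gap  = (np.abs(E_x_given_y[None, :, None] - E_x_given_yz[None, :, :]) * P).sum()

i_xz_given_y = cond_mi_xz_given_y(P)
print(mmse_y - mmse_yz, "<=", 2 * i_xz_given_y)       # inequality (2)
print(l1_gap, "<=", np.sqrt(2 * i_xz_given_y))        # inequality (3)
```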

Indeed, the square root of the left-hand side of (2) is greater than or equal to the left-hand side of (3):

$$
\begin{array}{rcl}
\mathrm{MMSE}(X|Y) - \mathrm{MMSE}(X|Y,Z) &=& \mathbb{E}\left[(X-\mathbb{E}[X|Y])^2\right] - \mathbb{E}\left[(X-\mathbb{E}[X|Y,Z])^2\right] \\
&=& \mathbb{E}\big[\mathbb{E}[X|Y,Z]^2\big] - \mathbb{E}\big[\mathbb{E}[X|Y]^2\big] \\
&=& \mathbb{E}\left[\left( \mathbb{E}[X|Y] - \mathbb{E}[X|Y,Z] \right)^2\right] \\
&\ge& \left( \mathbb{E}\big|\mathbb{E}[X|Y] - \mathbb{E}[X|Y,Z]\big|\right)^2,
\end{array}
$$

where the third line is due to the orthogonality principle, whereby

$$
\mathbb{E}\big[(X-\mathbb{E}[X|Y,Z])\,\mathbb{E}[X|Y]\big] = 0,
$$

and the last line uses Jensen’s inequality. Combining (2) and (3) with Theorem 1, we see that the divergence between $P_{XYZ}$ and any other distribution $Q_{XYZ}$ under which $X \rightarrow Y \rightarrow Z$ is a Markov chain can be lower-bounded as

$$
\begin{array}{rcl}
D(P_{XYZ} \| Q_{XYZ}) &\ge& \frac{1}{2}\left[ \mathrm{MMSE}(X|Y) - \mathrm{MMSE}(X|Y,Z) \right] \\
&\ge& \frac{1}{2}\left( \mathbb{E}\big|\mathbb{E}[X|Y] - \mathbb{E}[X|Y,Z]\big| \right)^2.
\end{array}
$$

Therefore, if the addition of $Z$ improves the MMSE significantly, then the joint distribution $P_{XYZ}$ is far from the set of all distributions $Q_{XYZ}$ under which $X \rightarrow Y \rightarrow Z$ is a Markov chain.
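As a final sanity check (mine, under the same finite-alphabet setup as the sketch above), one can take $Q_{XYZ}$ to be the best Markov ($X \rightarrow Y \rightarrow Z$) approximation of $P_{XYZ}$, which by Theorem 1 makes $D(P_{XYZ} \| Q_{XYZ})$ as small as it can possibly be, and verify the chain of inequalities numerically.

```python
# Final check (my own sketch, same finite setup as above): for a Markov
# Q_XYZ with X -> Y -> Z, D(P||Q) >= (1/2)[MMSE(X|Y) - MMSE(X|Y,Z)]
#                                 >= (1/2)(E|E[X|Y] - E[X|Y,Z]|)^2.
import numpy as np

rng = np.random.default_rng(3)
nx, ny, nz = 4, 3, 3
x_vals = rng.uniform(-1.0, 1.0, size=nx)       # values of X, so |X| <= 1
P = rng.random((nx, ny, nz))
P /= P.sum()

def kl(p, q):
    """D(p || q) in nats (0 log 0 := 0)."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Best Markov (X -> Y -> Z) approximation of P: Q(x,y,z) = P_XY P_YZ / P_Y.
# By Theorem 1, any other Markov Q only makes D(P||Q) larger.
P_y  = P.sum(axis=(0, 2))
P_xy = P.sum(axis=2)
P_yz = P.sum(axis=0)
Q = P_xy[:, :, None] * P_yz[None, :, :] / P_y[None, :, None]

# Conditional means, MMSE gap, and L1 gap, exactly as in the derivation above
E_x_given_y  = (x_vals[:, None] * P_xy).sum(axis=0) / P_y
E_x_given_yz = (x_vals[:, None, None] * P).sum(axis=0) / P_yz
mmse_gap = (((x_vals[:, None, None] - E_x_given_y[None, :, None]) ** 2
             - (x_vals[:, None, None] - E_x_given_yz[None, :, :]) ** 2) * P).sum()
l1_gap = (np.abs(E_x_given_y[None, :, None] - E_x_given_yz[None, :, :]) * P).sum()

print(kl(P, Q), ">=", 0.5 * mmse_gap, ">=", 0.5 * l1_gap ** 2)
```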