## Conditional mutual information and the best Markov approximation

I came across a neat and useful result about conditional mutual information while reading a paper on quantum information theory.

Consider three jointly distributed random variables $X$, $Y$, $Z$. If the conditional mutual information $I(X;Y|Z) = 0$, then $X$ and $Y$ are conditionally independent given $Z$ (i.e., $X \rightarrow Z \rightarrow Y$ is a Markov chain). In fact, $I(X;Y|Z) = 0$ if and only if $X \rightarrow Z \rightarrow Y$. Now, what if $I(X;Y|Z)$ is nonzero, but very small? Intuitively, one would expect that the joint distribution $P_{XYZ}$ is somehow “close” to being Markov. It turns out that this intuition is, indeed, correct: in fact, the conditional mutual information $I(X;Y|Z)$ is exactly the divergence between $P_{XYZ}$ and the nearest distribution $Q_{XYZ}$ under which $X \rightarrow Z \rightarrow Y$ is a Markov chain.

**Theorem 1** Let $X,Y,Z$ be three jointly distributed random variables. Then

$\displaystyle I(X;Y|Z) = \min\left\{ D(P_{XYZ} \| Q_{XYZ}) \,:\, X \rightarrow Z \rightarrow Y \text{ under } Q \right\}.$

Moreover, the minimum is achieved by

$\displaystyle Q^*_{XYZ}(x,y,z) = P_Z(z)\, P_{X|Z}(x|z)\, P_{Y|Z}(y|z).$

Proof: To keep things simple, I will assume that all random variables take values in finite sets. Consider any distribution $Q_{XYZ}$ for which $X \rightarrow Z \rightarrow Y$ is a Markov chain. Then

$\displaystyle \begin{array}{rcl} D(P_{XYZ} \| Q_{XYZ}) &=& \mathbb{E}\left[ \log \frac{P_{XYZ}(X,Y,Z)}{Q_{XYZ}(X,Y,Z)}\right] \\ &=& \mathbb{E}\left[ \log \left(\frac{P_Z(Z)}{Q_Z(Z)} \cdot \frac{P_{X|Z}(X|Z)}{Q_{X|Z}(X|Z)} \cdot \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right)\right] \\ &=& \mathbb{E}\left[\log \frac{P_Z(Z)}{Q_Z(Z)}\right] + \mathbb{E}\left[\log \frac{P_{X|Z}(X|Z)}{Q_{X|Z}(X|Z)}\right] + \mathbb{E}\left[\log \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right], \end{array}$

where the second line uses the factorization $P_{XYZ} = P_Z P_{X|Z} P_{Y|XZ}$ together with the Markov factorization $Q_{XYZ} = Q_Z Q_{X|Z} Q_{Y|Z}$.

The first two terms on the right-hand side are equal, respectively, to $D(P_Z \| Q_Z)$ and the conditional divergence $D(P_{X|Z} \| Q_{X|Z} \,|\, P_Z)$. For the third term, we can write

$\displaystyle \begin{array}{rcl} \mathbb{E}\left[\log \frac{P_{Y|XZ}(Y|X,Z)}{Q_{Y|Z}(Y|Z)}\right] &=& \mathbb{E}\left[ \log \frac{P_{XYZ}(X,Y,Z)}{P_{XZ}(X,Z)\, Q_{Y|Z}(Y|Z)}\right] \\ &=& \mathbb{E}\left[ \log \left( \frac{P_{XYZ}(X,Y,Z)}{P_{XZ}(X,Z)\, P_{Y|Z}(Y|Z)} \cdot \frac{P_{Y|Z}(Y|Z)}{Q_{Y|Z}(Y|Z)}\right)\right] \\ &=& I(X;Y | Z) + D(P_{Y|Z} \| Q_{Y|Z} \,|\, P_Z). \end{array}$

Putting everything together, we have

$\displaystyle D(P_{XYZ} \| Q_{XYZ}) = D(P_Z \| Q_Z) + D(P_{X|Z} \| Q_{X|Z} \,|\, P_Z) + D(P_{Y|Z} \| Q_{Y|Z} \,|\, P_Z) + I(X;Y|Z). \ \ \ \ \ (1)$

Now we observe two things:

1. The right-hand side of (1) is greater than or equal to $I(X;Y|Z)$.
2. The first three terms on the right-hand side are zero if and only if $Q_{XYZ}$ is such that $P_Z = Q_Z$ and $P_{X|Z=z} = Q_{X|Z=z}$, $P_{Y|Z=z} = Q_{Y|Z=z}$ for all $z$ that satisfy $P_Z(z) > 0$.

Since $Q^*_{XYZ}$ defined above is Markov and matches $P$ in all three of these terms, the minimum is achieved by $Q^*_{XYZ}$ and equals $I(X;Y|Z)$. This concludes the proof. $\Box$
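Theorem 1 is easy to check numerically for small alphabets. Here is a minimal sketch (the $2\times 2\times 2$ alphabet, the random joint pmf, and all variable names are my own illustrative choices): it builds the minimizer $Q^*$ from a random $P_{XYZ}$, confirms that $D(P_{XYZ} \| Q^*_{XYZ}) = I(X;Y|Z)$, and checks that some other Markov distribution does no better.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint pmf P(x, y, z) on a small 2x2x2 alphabet (illustrative choice).
P = rng.random((2, 2, 2))
P /= P.sum()

P_z = P.sum(axis=(0, 1))   # P(z)
P_xz = P.sum(axis=1)       # P(x, z)
P_yz = P.sum(axis=0)       # P(y, z)

# The best Markov approximation Q*(x,y,z) = P(z) P(x|z) P(y|z).
Q_star = np.einsum('xz,yz,z->xyz', P_xz / P_z, P_yz / P_z, P_z)

def kl(p, q):
    """Relative entropy D(p || q) in nats (full support assumed)."""
    return np.sum(p * np.log(p / q))

# I(X;Y|Z) computed directly from its definition.
I = np.sum(P * np.log(P * P_z / (P_xz[:, None, :] * P_yz[None, :, :])))

# Equality at the minimizer, per Theorem 1.
assert np.isclose(kl(P, Q_star), I)

# Any other Markov Q = Q_Z Q_{X|Z} Q_{Y|Z} does at least as badly.
R = rng.random((2, 2, 2))
R /= R.sum()
R_z, R_xz, R_yz = R.sum(axis=(0, 1)), R.sum(axis=1), R.sum(axis=0)
Q_markov = np.einsum('xz,yz,z->xyz', R_xz / R_z, R_yz / R_z, R_z)
assert kl(P, Q_markov) >= I
```

Note that $Q^*$ simplifies to $P_{XZ}(x,z)P_{YZ}(y,z)/P_Z(z)$, which is why $D(P\|Q^*)$ and $I(X;Y|Z)$ have term-by-term identical integrands.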

Here is one interesting consequence. Consider three random variables $X$, $Y$ and $Z$, where $X$ is real-valued with $\mathbb{E}[X^2] < \infty$. Let

$\displaystyle {\rm MMSE}(X|Y) = \inf_f \mathbb{E}\left[(X-f(Y))^2\right] = \mathbb{E}\left[\left(X - \mathbb{E}[X|Y]\right)^2\right]$

be the minimum mean-square error (MMSE) achievable by any (measurable) estimator of $X$ based on $Y$, and do the same for ${\rm MMSE}(X|Y,Z)$. Then ${\rm MMSE}(X|Y,Z) \le {\rm MMSE}(X|Y)$, i.e., using more information for estimation never hurts. However, how much do we gain by using both $Y$ and $Z$? The following inequality, due to Yihong Wu and Sergio Verdú, gives an upper bound on the excess MMSE in terms of conditional mutual information:

$\displaystyle {\rm MMSE}(X|Y) - {\rm MMSE}(X|Y,Z) \le 2\, I(X;Z|Y), \ \ \ \ \ (2)$

where the mutual information is measured in nats. This inequality implies (a generalization of) Tao’s inequality,

$\displaystyle \mathbb{E}\big|\mathbb{E}[X|Y]-\mathbb{E}[X|Y,Z]\big| \le \sqrt{2\, I(X;Z|Y)} \ \ \ \ \ (3)$

— indeed, the square root of the left-hand side of (2) is greater than or equal to the left-hand side of (3):

$\displaystyle \begin{array}{rcl} {\rm MMSE}(X|Y) - {\rm MMSE}(X|Y,Z) &=& \mathbb{E}\left[(X-\mathbb{E}[X|Y])^2\right] - \mathbb{E}\left[(X-\mathbb{E}[X|Y,Z])^2\right] \\ &=& \mathbb{E}\left[\mathbb{E}[X|Y,Z]^2\right] - \mathbb{E}\left[\mathbb{E}[X|Y]^2\right] \\ &=& \mathbb{E}\left[\left( \mathbb{E}[X|Y] - \mathbb{E}[X|Y,Z] \right)^2\right] \\ &\ge& \left( \mathbb{E}\big|\mathbb{E}[X|Y] - \mathbb{E}[X|Y,Z]\big|\right)^2, \end{array}$

where the third line is due to the orthogonality principle, whereby

$\displaystyle \mathbb{E}\big[(X-\mathbb{E}[X|Y,Z])\,\mathbb{E}[X|Y]\big] = 0,$

and the last line uses Jensen’s inequality. Combining (2) and (3) with Theorem 1, we see that the divergence between $P_{XYZ}$ and any other distribution $Q_{XYZ}$ under which $X \rightarrow Y \rightarrow Z$ is a Markov chain can be lower-bounded as

$\displaystyle \begin{array}{rcl} D(P_{XYZ} \| Q_{XYZ}) &\ge& \frac{1}{2}\left[ {\rm MMSE}(X|Y) - {\rm MMSE}(X|Y,Z) \right] \\ &\ge& \frac{1}{2}\left( \mathbb{E}\big|\mathbb{E}[X|Y] - \mathbb{E}[X|Y,Z]\big| \right)^2. \end{array}$

Therefore, if the addition of $Z$ improves the MMSE significantly, then the joint distribution $P_{XYZ}$ is far from the set of all distributions $Q_{XYZ}$ under which $X \rightarrow Y \rightarrow Z$ is a Markov chain.
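Inequalities (2) and (3) can also be sanity-checked numerically. The sketch below uses an illustrative random joint pmf on $\{0,1\}^3$ (the alphabet, seed, and variable names are my own assumptions, not part of the Wu–Verdú result), with $X$ read as a real number so that the MMSE is well defined:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choice: a random joint pmf for (X, Y, Z) on {0,1}^3.
P = rng.random((2, 2, 2))
P /= P.sum()
x = np.array([0.0, 1.0])   # X read as a real number

P_y = P.sum(axis=(0, 2))   # P(y)
P_xy = P.sum(axis=2)       # P(x, y)
P_yz = P.sum(axis=0)       # P(y, z)

# Conditional means E[X | Y = y] and E[X | Y = y, Z = z].
E_x_y = (x[:, None] * P_xy).sum(axis=0) / P_y          # shape (2,)
E_x_yz = (x[:, None, None] * P).sum(axis=0) / P_yz     # shape (2, 2)

# MMSE(X|Y) and MMSE(X|Y,Z) as expected squared estimation errors.
mmse_y = np.sum(P * (x[:, None, None] - E_x_y[None, :, None]) ** 2)
mmse_yz = np.sum(P * (x[:, None, None] - E_x_yz[None, :, :]) ** 2)

# I(X; Z | Y) in nats.
I = np.sum(P * np.log(P * P_y[None, :, None]
                      / (P_xy[:, :, None] * P_yz[None, :, :])))

# The Wu-Verdu bound (2): excess MMSE <= 2 I(X;Z|Y).
assert mmse_y - mmse_yz <= 2 * I + 1e-12

# Tao's inequality (3): E|E[X|Y] - E[X|Y,Z]| <= sqrt(2 I(X;Z|Y)).
gap = np.sum(P * np.abs(E_x_y[None, :, None] - E_x_yz[None, :, :]))
assert gap <= np.sqrt(2 * I) + 1e-12
```

The quantity `gap ** 2` never exceeds `mmse_y - mmse_yz`, which is exactly the Jensen step in the derivation above.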