I came across a neat and useful result about conditional mutual information while reading a paper on quantum information theory.

Consider three jointly distributed random variables . If the conditional mutual information , then and are conditionally independent given (i.e., is a Markov chain). In fact, if and only if . Now, what if is nonzero, but very small? Intuitively, one would expect that the joint distribution is somehow “close” to being Markov. It turns out that this intuition is, indeed, correct — in fact, the conditional mutual information is the divergence between and the nearest distribution under which is a Markov chain.

Theorem 1Let be three jointly distributed random variables. ThenMoreover, the minimum is achieved by

*Proof:* To keep things simple, I will assume that all random variables take values in finite sets. Consider any distribution for which . Then

The first two terms on the right-hand side are equal, respectively, to and . For the third term, we can write

Putting everything together, we have

- The right-hand side of (1) is greater than or equal to .
- The first three terms on the right-hand side are zero if and only if is such that and , for all that satisfy .

This concludes the proof.

Here is one interesting consequence. Consider three random variables , and , where . Let

be the minimum mean-square error (MMSE) achievable by any (measurable) estimator of based on , and do the same for . Then , i.e., using more information for estimation never hurts. However, how much do we gain by using both and ? The following inequality, due to Yihong Wu and Sergio Verdú, gives an upper bound on the excess MMSE in terms of conditional mutual information:

where the mutual information is measured in nats. This inequality implies (a generalization of) Tao’s inequality,

— indeed, the square root of the left-hand side of (2) is greater than or equal to the left-hand side of (3):

where the third line is due to the orthogonality principle, whereby

and the last line uses Jensen’s inequality. Combining (2) and (3) with Theorem 1, we see that the divergence between and any other distribution under which is a Markov chain can be lower-bounded as

Therefore, if the addition of improves the MMSE significantly, then the joint distribution is far from the set of all distributions under which is a Markov chain.