I came across a neat and useful result about conditional mutual information while reading a paper on quantum information theory.
Consider three jointly distributed random variables $X, Y, Z$. If the conditional mutual information $I(X; Y | Z) = 0$, then $X$ and $Y$ are conditionally independent given $Z$ (i.e., $X \to Z \to Y$ is a Markov chain). In fact, $I(X; Y | Z) = 0$ if and only if $X \to Z \to Y$. Now, what if $I(X; Y | Z)$ is nonzero, but very small? Intuitively, one would expect that the joint distribution $P_{XYZ}$ is somehow “close” to being Markov. It turns out that this intuition is, indeed, correct: the conditional mutual information $I(X; Y | Z)$ is the divergence between $P_{XYZ}$ and the nearest distribution under which $X \to Z \to Y$ is a Markov chain.
Theorem 1 Let $X, Y, Z$ be three jointly distributed random variables. Then

$$I(X; Y | Z) = \min_{Q_{XYZ} \,:\, X \to Z \to Y} D(P_{XYZ} \,\|\, Q_{XYZ}).$$
Moreover, the minimum is achieved by

$$Q^*_{XYZ}(x, y, z) = P_Z(z)\, P_{X|Z}(x|z)\, P_{Y|Z}(y|z).$$
Proof: To keep things simple, I will assume that all random variables take values in finite sets. Consider any distribution $Q_{XYZ}$ for which $X \to Z \to Y$, i.e., $Q_{XYZ}(x,y,z) = Q_Z(z)\, Q_{X|Z}(x|z)\, Q_{Y|Z}(y|z)$. Then

$$D(P_{XYZ} \,\|\, Q_{XYZ}) = \sum_{x,y,z} P_{XYZ}(x,y,z) \log \frac{P_Z(z)\, P_{X|Z}(x|z)\, P_{Y|XZ}(y|x,z)}{Q_Z(z)\, Q_{X|Z}(x|z)\, Q_{Y|Z}(y|z)}$$
$$= \sum_{x,y,z} P_{XYZ}(x,y,z) \log \frac{P_Z(z)}{Q_Z(z)} + \sum_{x,y,z} P_{XYZ}(x,y,z) \log \frac{P_{X|Z}(x|z)}{Q_{X|Z}(x|z)} + \sum_{x,y,z} P_{XYZ}(x,y,z) \log \frac{P_{Y|XZ}(y|x,z)}{Q_{Y|Z}(y|z)}.$$
The first two terms on the right-hand side are equal, respectively, to $D(P_Z \| Q_Z)$ and the conditional divergence $D(P_{X|Z} \| Q_{X|Z} | P_Z)$. For the third term, we can write

$$\sum_{x,y,z} P_{XYZ}(x,y,z) \log \frac{P_{Y|XZ}(y|x,z)}{Q_{Y|Z}(y|z)} = \sum_{x,y,z} P_{XYZ}(x,y,z) \log \frac{P_{Y|XZ}(y|x,z)}{P_{Y|Z}(y|z)} + \sum_{x,y,z} P_{XYZ}(x,y,z) \log \frac{P_{Y|Z}(y|z)}{Q_{Y|Z}(y|z)} = I(X; Y | Z) + D(P_{Y|Z} \| Q_{Y|Z} | P_Z).$$

Putting everything together,

$$D(P_{XYZ} \,\|\, Q_{XYZ}) = D(P_Z \| Q_Z) + D(P_{X|Z} \| Q_{X|Z} | P_Z) + D(P_{Y|Z} \| Q_{Y|Z} | P_Z) + I(X; Y | Z). \qquad (1)$$
Now we observe two things:
- The right-hand side of (1) is greater than or equal to $I(X; Y | Z)$, since each divergence term is nonnegative.
- The first three terms on the right-hand side are zero if and only if $Q_{XYZ}$ is such that $Q_Z = P_Z$, $Q_{X|Z}(\cdot|z) = P_{X|Z}(\cdot|z)$ and $Q_{Y|Z}(\cdot|z) = P_{Y|Z}(\cdot|z)$, for all $z$ that satisfy $P_Z(z) > 0$.
This concludes the proof.
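Theorem 1 is easy to sanity-check numerically on a small alphabet. Here is a quick sketch (the $3 \times 3 \times 3$ alphabet and all variable names are arbitrary choices for illustration): for a randomly drawn joint pmf, the divergence $D(P_{XYZ} \| Q^*_{XYZ})$ matches $I(X; Y | Z)$, while any other Markov $Q_{XYZ}$ gives a larger divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random joint pmf P(x, y, z) on a 3 x 3 x 3 alphabet (illustrative choice)
P = rng.random((3, 3, 3))
P /= P.sum()

Pz = P.sum(axis=(0, 1))   # P(z)
Pxz = P.sum(axis=1)       # P(x, z)
Pyz = P.sum(axis=0)       # P(y, z)

def kl(p, q):
    """Divergence D(p || q) in nats, with the convention 0 log 0 = 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Best Markov approximation: Q*(x, y, z) = P(z) P(x|z) P(y|z)
Qstar = np.einsum('xz,yz->xyz', Pxz, Pyz) / Pz

# I(X;Y|Z) = sum_z P(z) * D( P(x,y|z) || P(x|z) P(y|z) )
I = sum(
    Pz[z] * kl((P[:, :, z] / Pz[z]).ravel(),
               np.outer(Pxz[:, z] / Pz[z], Pyz[:, z] / Pz[z]).ravel())
    for z in range(3)
)

# Any other Markov chain Q(x,y,z) = Q(z) Q(x|z) Q(y|z) does worse
Qz = rng.random(3); Qz /= Qz.sum()
Qx_z = rng.random((3, 3)); Qx_z /= Qx_z.sum(axis=0)   # columns are Q(x|z)
Qy_z = rng.random((3, 3)); Qy_z /= Qy_z.sum(axis=0)   # columns are Q(y|z)
Q = np.einsum('z,xz,yz->xyz', Qz, Qx_z, Qy_z)

print(I, kl(P.ravel(), Qstar.ravel()))   # these two numbers agree
print(kl(P.ravel(), Q.ravel()) >= I)     # True
```

The decomposition (1) is visible here as well: the gap $D(P_{XYZ} \| Q_{XYZ}) - I(X; Y | Z)$ is exactly the sum of the three divergence terms, which vanish only at $Q^*_{XYZ}$.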
Here is one interesting consequence. Consider three random variables $X$, $Y$ and $Z$, where $|X| \le 1$. Let

$$\mathrm{mmse}(X|Y) \triangleq \inf_f \mathbb{E}\big[(X - f(Y))^2\big] = \mathbb{E}\big[(X - \mathbb{E}[X|Y])^2\big]$$

be the minimum mean-square error (MMSE) achievable by any (measurable) estimator of $X$ based on $Y$, and define $\mathrm{mmse}(X|Y,Z)$ in the same way. Then $\mathrm{mmse}(X|Y,Z) \le \mathrm{mmse}(X|Y)$, i.e., using more information for estimation never hurts. However, how much do we gain by using both $Y$ and $Z$? The following inequality, due to Yihong Wu and Sergio Verdú, gives an upper bound on the excess MMSE in terms of the conditional mutual information $I(X; Z | Y)$ (in nats):

$$\mathrm{mmse}(X|Y) - \mathrm{mmse}(X|Y,Z) \le 2\, I(X; Z | Y). \qquad (2)$$
Indeed,

$$\mathrm{mmse}(X|Y) - \mathrm{mmse}(X|Y,Z) = \mathbb{E}\big[(X - \mathbb{E}[X|Y])^2\big] - \mathbb{E}\big[(X - \mathbb{E}[X|Y,Z])^2\big]$$
$$= \mathbb{E}\big[(\mathbb{E}[X|Y,Z] - \mathbb{E}[X|Y])^2\big]$$
$$\le 4\, \mathbb{E}\big[\|P_{X|Y,Z} - P_{X|Y}\|^2_{\rm TV}\big]$$
$$\le 2\, \mathbb{E}\big[D(P_{X|Y,Z} \,\|\, P_{X|Y})\big]$$
$$= 2\, I(X; Z | Y), \qquad (3)$$

where the second line is due to the orthogonality principle, whereby

$$\mathbb{E}\big[(X - \mathbb{E}[X|Y])^2\big] = \mathbb{E}\big[(X - \mathbb{E}[X|Y,Z])^2\big] + \mathbb{E}\big[(\mathbb{E}[X|Y,Z] - \mathbb{E}[X|Y])^2\big],$$

the third line uses $|X| \le 1$, and the fourth line is Pinsker's inequality. Combining (2) with Theorem 1 (applied with the roles of $Y$ and $Z$ exchanged), we see that the divergence between $P_{XYZ}$ and any other distribution $Q_{XYZ}$ under which $X \to Y \to Z$ is a Markov chain can be lower-bounded as

$$D(P_{XYZ} \,\|\, Q_{XYZ}) \ge I(X; Z | Y) \ge \frac{1}{2}\Big[\mathrm{mmse}(X|Y) - \mathrm{mmse}(X|Y,Z)\Big]. \qquad (4)$$
Therefore, if the addition of $Z$ improves the MMSE significantly, then the joint distribution $P_{XYZ}$ is far from the set of all distributions under which $X \to Y \to Z$ is a Markov chain.
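The chain of inequalities above can likewise be sanity-checked numerically. The sketch below (alphabet sizes and the support of $X$ are arbitrary choices, subject only to $|X| \le 1$) draws a random joint pmf and verifies that the excess MMSE is nonnegative and bounded by $2\, I(X; Z | Y)$, as in (2):

```python
import numpy as np

rng = np.random.default_rng(1)

xvals = np.array([-1.0, 0.25, 1.0])      # support of X; note |X| <= 1
P = rng.random((3, 4, 4))                # joint pmf P(x, y, z), axes (x, y, z)
P /= P.sum()

Py = P.sum(axis=(0, 2))                  # P(y)
Pxy = P.sum(axis=2)                      # P(x, y)
Pyz = P.sum(axis=0)                      # P(y, z)

# Conditional means E[X | Y = y] and E[X | Y = y, Z = z]
Ex_y = (xvals[:, None] * Pxy).sum(axis=0) / Py
Ex_yz = (xvals[:, None, None] * P).sum(axis=0) / Pyz

# mmse(X|Y) = E[(X - E[X|Y])^2] and mmse(X|Y,Z) = E[(X - E[X|Y,Z])^2]
mmse_y = float((((xvals[:, None] - Ex_y) ** 2) * Pxy).sum())
mmse_yz = float((((xvals[:, None, None] - Ex_yz) ** 2) * P).sum())

# I(X;Z|Y) = sum_y P(y) * D( P(x,z|y) || P(x|y) P(z|y) ), in nats
I = 0.0
for y in range(4):
    Pxz_y = P[:, y, :] / Py[y]
    ref = np.outer(Pxz_y.sum(axis=1), Pxz_y.sum(axis=0))
    m = Pxz_y > 0
    I += Py[y] * float((Pxz_y[m] * np.log(Pxz_y[m] / ref[m])).sum())

excess = mmse_y - mmse_yz
print(0 <= excess <= 2 * I)   # True: Z helps, but by no more than 2 I(X;Z|Y)
```

By (4), the quantity `excess / 2` is then also a lower bound on the divergence from $P_{XYZ}$ to every distribution under which $X \to Y \to Z$ is Markov.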