Information flow on graphs

Models of complex systems built from simple, locally interacting components arise in many fields, including statistical physics, biology, artificial intelligence, and communication networks. The quest to understand and to quantify the fundamental limits on the ability of such systems to store and process information has led to a variety of interesting and insightful results that draw upon probability, combinatorics, information theory, and the theory of discrete and continuous dynamical systems. In this post, I would like to focus on a model of distributed storage that was analyzed in 1975 by Donald Dawson in a very nice paper, which deserves to be more widely known.

The system Dawson analyzed is an instance of a Probabilistic Cellular Automaton (PCA): a system consisting of a discrete collection of nodes, each of which has a discrete state. The configuration of the automaton at each time is specified by listing the states of all the nodes. The state of each node at time $t+1$ is determined stochastically by its own state and the states of its neighbors at time $t$, independently of everything else. One of the key questions in the theory of PCAs is whether a given automaton is ergodic, i.e., whether its configuration eventually “forgets” the initial condition. As we will see, Dawson’s result gives a sufficient condition for the ergodicity of a PCA.

Now let’s move on to precise definitions. Consider an undirected graph $G = (\mathcal{V}, \mathcal{E})$. The neighborhood of a vertex $v \in \mathcal{V}$, denoted by $\partial v$, is the set of all vertices $u \in \mathcal{V}$ that are connected to $v$ by an edge. We will also denote by $\partial_+ v$ the set consisting of $v$ and all of its neighbors. Let us fix a finite alphabet $\mathsf{X}$; for each vertex $v \in \mathcal{V}$, let $\mathsf{X}_v$ denote a copy of $\mathsf{X}$. A configuration is an assignment of a symbol $x_v \in \mathsf{X}_v$ to each vertex $v \in \mathcal{V}$. Thus, $\Omega = \mathsf{X}^{\mathcal{V}}$ is the space of all configurations. We will denote a generic configuration by $x = (x_v)_{v \in \mathcal{V}}$; for any set $\mathcal{J} \subset \mathcal{V}$, we will denote by $x_{\mathcal{J}}$ the natural projection of $x$ onto $\mathsf{X}^{\mathcal{J}} = \prod_{v \in \mathcal{J}} \mathsf{X}_v$: $x_{\mathcal{J}} = (x_v)_{v \in \mathcal{J}}$.

Now let’s consider the following model of a distributed one-bit memory cell. Let $W$ be a $\mathrm{Bernoulli}(1/2)$ random variable, which is the bit we wish to store. We also pick two probability measures $\mu_0$ and $\mu_1$ on $\Omega$. Given $W = w$, we draw the initial ($t=1$) configuration $X_1 = (X_{v,1})_{v \in \mathcal{V}}$ at random according to $\mu_w$. We allow arbitrary correlations between the different $X_{v,1}$’s; it is in this sense that our storage is distributed. Now let’s suppose that the state of each vertex evolves stochastically in discrete time according to a local Markov model: for each $v$, we have a channel (random transformation) $K_v$ with input alphabet $\mathsf{X}^{\partial_+ v} = \prod_{u \in \partial_+ v} \mathsf{X}_u$ and output alphabet $\mathsf{X}_v$. Then the configuration $X_t = (X_{v,t})_{v \in \mathcal{V}}$ at time $t$ evolves according to the law

$$ \mathbb{P}[X_{t+1} = x_{t+1} \mid W = w, X^t = x^t] = \prod_{v \in \mathcal{V}} K_v(x_{v,t+1} \mid x_{\partial_+ v,\, t}), \qquad t = 1, 2, \ldots \qquad (1) $$

with the initial conditions $\mathbb{P}[W = w] = 1/2$, $w \in \{0,1\}$, and $\mathbb{P}[X_1 = x_1 \mid W = w] = \mu_w(x_1)$, $x_1 \in \Omega$. Thus, we end up with a Markov chain

$$ W \longrightarrow X_1 \longrightarrow X_2 \longrightarrow X_3 \longrightarrow \ldots, $$

where at each time $t$ the evolution from $X_t$ to $X_{t+1}$ is also Markov “in space:” the state of vertex $v$ at time $t+1$ is determined only by the states of $v$ and its neighbors at time $t$, i.e., for each $v$ and $t$, we have a Markov chain

$$ \left( W;\ X^{t-1};\ X_{\mathcal{V} \setminus \partial_+ v,\, t} \right) \longrightarrow X_{\partial_+ v,\, t} \longrightarrow X_{v,\, t+1}. $$
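To make the model concrete, here is a minimal Python sketch of the update law (1), simulating a PCA on a ring of $n$ binary nodes. Everything about the example — the ring topology, the initial measures $\mu_w$ (i.i.d. bits biased toward $w$), and the local channel (neighborhood majority vote flipped with probability $p$) — is an illustrative assumption of mine, not something taken from Dawson’s paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_step(x, p, rng):
    """One application of the update law (1) on a ring: each node v draws
    its next state from a local channel K_v whose input is (x_{v-1}, x_v,
    x_{v+1}). Here K_v outputs the neighborhood majority, flipped with
    probability p (an illustrative choice of channel)."""
    left, right = np.roll(x, 1), np.roll(x, -1)
    majority = ((left + x + right) >= 2).astype(int)
    flips = rng.random(x.size) < p
    return np.where(flips, 1 - majority, majority)

def run(w, n, t, p, q, rng):
    """Draw X_1 from mu_w (each node equals w independently with prob. q),
    then evolve up to time t; returns the configuration X_t."""
    x = (rng.random(n) < (q if w == 1 else 1 - q)).astype(int)
    for _ in range(t - 1):
        x = pca_step(x, p, rng)
    return x

print(run(w=1, n=16, t=10, p=0.1, q=0.9, rng=rng))
```

With a small flip probability, a typical run stays close to the all-$w$ configuration for a while; the question posed next is whether any trace of $w$ survives as $t \to \infty$.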

Now the question is: does the configuration at an arbitrary time $t$ retain any information about $W$? One way to answer this is to pick an arbitrary finite set $\mathcal{J} \subset \mathcal{V}$ and to examine the evolution of the mutual information $I(W; X_{\mathcal{J},t})$, where

$$ X_{\mathcal{J},t} = (X_{v,t})_{v \in \mathcal{J}}. $$

If we can show that there exists some $\varepsilon > 0$ such that $I(W; X_{\mathcal{J},t}) \ge \varepsilon$ for all $t$ and at least one nonempty $\mathcal{J}$, then our memory cell is capable of storing one bit. If, on the other hand, $I(W; X_{\mathcal{J},t}) \rightarrow 0$ as $t \rightarrow \infty$ for every finite $\mathcal{J}$, then eventually the memory cell will forget $W$.
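We can probe this question numerically for the toy automaton above. The sketch below (again my own illustration, not part of Dawson’s argument) estimates $I(W; X_{\mathcal{J},t})$ by Monte Carlo, using the naive plug-in estimator over the empirical joint distribution of $(W, X_{\mathcal{J},t})$; it reuses `run` and `rng` from the previous sketch and is only reasonable for small $\mathcal{J}$.

```python
from collections import Counter
import numpy as np

def plugin_mi(pairs):
    """Plug-in estimate (in nats) of I(W; X) from a list of samples (w, x)."""
    n = len(pairs)
    joint = Counter(pairs)
    pw = Counter(w for w, _ in pairs)
    px = Counter(x for _, x in pairs)
    return sum(c / n * np.log(c * n / (pw[w] * px[x]))
               for (w, x), c in joint.items())

def estimate_info(J, t, n_nodes, p, q, n_samples, rng):
    """Sample W ~ Bernoulli(1/2), run the PCA to time t, record X_{J,t}."""
    pairs = []
    for _ in range(n_samples):
        w = int(rng.random() < 0.5)
        x = run(w, n_nodes, t, p, q, rng)   # `run` from the previous sketch
        pairs.append((w, tuple(x[list(J)])))
    return plugin_mi(pairs)

for t in (1, 2, 4, 8, 16):
    print(t, estimate_info(J=(0, 1), t=t, n_nodes=16, p=0.1, q=0.9,
                           n_samples=20000, rng=rng))
```

Whether the printed values decay toward zero or level off depends on the noise level $p$; the theorem below gives a sufficient condition for decay.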

1. Dawson’s impossibility result

Roughly speaking, Dawson proved the following: if the local channels $K_v$ are sufficiently noisy and if the graph $G$ is sufficiently sparse, then the mutual information $I(W; X_{\mathcal{J},t})$ decays to zero exponentially fast. In order to state his result precisely, we need some preliminary definitions. First of all, in order to quantify the noisiness of each channel $K_v$, Dawson introduces the following quantity, which he calls the transmission coefficient of $K_v$:

$$ \eta_v := \sup_{W \rightarrow Y \rightarrow Z} \frac{I(W; Z)}{I(W; Y)}, \qquad (2) $$

where the supremum is over all Markov chains $W \rightarrow Y \rightarrow Z$, where $W \sim \mathrm{Bernoulli}(1/2)$, the channel from $W$ to $Y \in \mathsf{X}^{\partial_+ v}$ is arbitrary, and the channel from $Y$ to $Z \in \mathsf{X}_v$ is given by $K_v$. Moreover, let $\eta := \sup_{v \in \mathcal{V}} \eta_v$. Clearly, $\eta_v \le 1$ for every $v \in \mathcal{V}$, by the data processing inequality. Consequently, $\eta \le 1$.
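Computing $\eta_v$ exactly requires optimizing over all input channels $W \rightarrow Y$, but a crude numerical lower bound is easy to obtain by random search: sample random pairs $(\nu_0, \nu_1)$ of conditional laws of $Y$ given $W$, form the joint laws of $(W, Y)$ and $(W, Z)$ with $W \sim \mathrm{Bernoulli}(1/2)$, and keep the largest observed ratio $I(W;Z)/I(W;Y)$. The sketch below does this for a binary symmetric channel with crossover probability $p$, standing in for a generic $K_v$; the Dirichlet sampling and all parameter values are my own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def mi(joint):
    """Mutual information (in nats) of a 2-D joint probability array."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

def eta_lower_bound(K, trials=10000):
    """Random-search lower bound on the transmission coefficient (2).
    K[y, z] = K(z | y) is the fixed channel from Y to Z; W ~ Bernoulli(1/2),
    and the channel W -> Y (the pair nu_0, nu_1) is sampled at random."""
    best = 0.0
    for _ in range(trials):
        nu = rng.dirichlet(np.ones(K.shape[0]), size=2)  # nu[w] = law of Y | W=w
        joint_wy = 0.5 * nu                              # joint law of (W, Y)
        joint_wz = joint_wy @ K                          # marginalize out Y
        iwy = mi(joint_wy)
        if iwy > 1e-12:                                  # avoid a 0/0 ratio
            best = max(best, mi(joint_wz) / iwy)
    return best

p = 0.1
K_bsc = np.array([[1 - p, p], [p, 1 - p]])
print(eta_lower_bound(K_bsc))
```

Every sampled chain is feasible for (2), so the printed value is a genuine lower bound on $\eta_v$; for the BSC it should come out close to (but not above) the strong data processing constant $(1-2p)^2 = 0.64$, which upper-bounds Dawson’s $\eta_v$ here.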

Remark 1 Dawson’s definition of the transmission coefficient is reminiscent of the best constant in the data processing inequality of Erkip and Cover (corrected by Anantharam et al.): to define that constant in our setting, one would fix a distribution $P_Y$ and let

$$ \delta^*_v(P_Y) := \sup_{U \rightarrow Y \rightarrow Z} \frac{I(U; Z)}{I(U; Y)}, \qquad (3) $$

where the supremum is over all Markov chains $U \rightarrow Y \rightarrow Z$ satisfying the constraints $Y \sim P_Y$ and $P_{Z|Y} = K_v$. The difference is that in (2) we constrain the marginal distribution of $W$ and the stochastic kernel $P_{Z|Y}$, whereas in (3) we constrain the marginal of $Y$ and the kernel $P_{Z|Y}$. As shown by Anantharam et al.,

$$ \delta^*_v(P_Y) = \sup_{Q_Y \neq P_Y} \frac{D(Q_Z \| P_Z)}{D(Q_Y \| P_Y)}, $$

where the supremum is over all marginal distributions $Q_Y$ of $Y$ distinct from $P_Y$, and $Q_Z$ (resp., $P_Z$) denotes the marginal distribution of $Z$ when $Y \sim Q_Y$ (resp., $Y \sim P_Y$). Both $\eta_v$ and $\delta^*_v$ measure the “noisiness” of $K_v$, but in different ways. In particular, the motivation behind Dawson’s definition is as follows: at any time $t$, the state $X_{v,t+1}$ of a given node $v$ at time $t+1$ depends only on $X_{\partial_+ v,\, t}$ and not on anything else. Thus, if the boundary condition, i.e., the configuration of all the nodes except for $v$ and its neighbors, is “frozen” at some value $x_{\mathcal{V} \setminus \partial_+ v}$, then the amount of information about $W$ that node $v$ will receive at time $t+1$ will be controlled by the quantity

$$ \frac{I(W; X_{v,t+1} \mid X_{\mathcal{V} \setminus \partial_+ v,\, t} = x_{\mathcal{V} \setminus \partial_+ v})}{I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{V} \setminus \partial_+ v,\, t} = x_{\mathcal{V} \setminus \partial_+ v})}, $$

which, by definition, is no more than $\eta_v$. We will elaborate on this point later on.

Also, Dawson assumes that $G$ has bounded degree, i.e.,

$$ \Delta := \sup_{v \in \mathcal{V}} |\partial_+ v| < \infty. $$

Now we are in a position to state the following:

Theorem 1 (Dawson) For any finite set $\mathcal{J} \subset \mathcal{V}$,

$$ I(W; X_{\mathcal{J},t}) \le |\mathcal{J}|\, (\Delta\eta)^{t-1} \log 2. \qquad (4) $$

Thus, if $\Delta\eta < 1$, then $I(W; X_{\mathcal{J},t}) \rightarrow 0$ as $t \rightarrow \infty$, and the convergence is exponentially fast.
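To get a feel for the rate, here is a quick evaluation of the bound (4) with illustrative numbers of my choosing ($\Delta = 3$, i.e., degree at most $2$, and $\eta = 0.2$, so that $\Delta\eta = 0.6 < 1$):

```python
import math

Delta, eta, J_size = 3, 0.2, 5   # illustrative values with Delta * eta < 1
for t in range(1, 6):
    print(t, J_size * (Delta * eta) ** (t - 1) * math.log(2))
```

After five steps the bound is already below $0.45$ nats, and each further step shrinks it by the factor $\Delta\eta$.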

2. The proof

The proof of Theorem 1 consists of several steps. The first step gives an upper bound on the amount of information carried by the state of a given node when the states of some other nodes have already been observed. This bound is stated in Lemmas 2 and 3 below.

Lemma 2 Fix two arbitrary, possibly empty finite sets $\mathcal{J}, \mathcal{K} \subset \mathcal{V}$. Then, for any $v \notin \mathcal{J}$,

$$ I(W; X_{v,t+1} \mid X_{\mathcal{J},t+1}, X_{\mathcal{K},t}) \le \eta\, I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{J},t+1}, X_{\mathcal{K},t}). \qquad (5) $$

Proof: We begin by noting that, since $v \notin \mathcal{J}$,

$$ (W,\ X_{\mathcal{K} \,\cup\, \partial_+ v,\, t},\ X_{\mathcal{J},t+1}) \rightarrow X_{\partial_+ v,\, t} \rightarrow X_{v,t+1} \qquad (6) $$

is a Markov chain. Now, for any fixed choices of the configurations $x_{\mathcal{K},t}$ and $x_{\mathcal{J},t+1}$, consider two probability distributions $\nu_0$ and $\nu_1$ of $Y \in \mathsf{X}^{\partial_+ v}$:

$$ \nu_w(y) := \mathbb{P}\Big[ X_{\partial_+ v,\, t} = y \,\Big|\, W = w,\ X_{\mathcal{K},t} = x_{\mathcal{K},t},\ X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1} \Big], \quad w \in \{0,1\}. $$

From (6), if we consider a channel from $W$ to $Y$ with transition probabilities given by $\nu_w$, and then let $Z \in \mathsf{X}_v$ be the output of $K_v$ with input $Y$, then

$$ \begin{array}{rcl} I(W;Y) &=& I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t} = x_{\mathcal{K},t}) \\ I(W;Z) &=& I(W; X_{v,t+1} \mid X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t} = x_{\mathcal{K},t}), \end{array} $$

and $W \rightarrow Y \rightarrow Z$ is a Markov chain satisfying the conditions in the definition (2) of the transmission coefficient $\eta_v$. Therefore,

$$ \begin{array}{rcl} I(W; X_{v,t+1} \mid X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t} = x_{\mathcal{K},t}) &\le& \eta_v\, I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t} = x_{\mathcal{K},t}) \\ &\le& \eta\, I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t} = x_{\mathcal{K},t}). \end{array} $$

Taking the expectation with respect to $X_{\mathcal{J},t+1}$ and $X_{\mathcal{K},t}$, we obtain (5). $\Box$

Lemma 3 For any finite set $\mathcal{J} \subset \mathcal{V}$, any $v \notin \mathcal{J}$, and an arbitrary (possibly empty) finite set $\mathcal{K} \subset \mathcal{V}$,

$$ \begin{array}{rcl} I(W; X_{\mathcal{J} \cup \{v\},\, t+1} \mid X_{\mathcal{K},t}) &\le& \eta\, I(W; X_{\mathcal{J},t+1} \mid X_{\mathcal{K} \,\cup\, \partial_+ v,\, t}) \\ && \quad +\ \eta\, I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{K},t}) \\ && \quad +\ (1-\eta)\, I(W; X_{\mathcal{J},t+1} \mid X_{\mathcal{K},t}). \end{array} $$

Proof: Using the chain rule for mutual information and Lemma 2,

$$ \begin{array}{rcl} I(W; X_{\mathcal{J} \cup \{v\},\, t+1} \mid X_{\mathcal{K},t}) &=& I(W; X_{\mathcal{J},t+1} \mid X_{\mathcal{K},t}) + I(W; X_{v,t+1} \mid X_{\mathcal{J},t+1}, X_{\mathcal{K},t}) \\ &\le& I(W; X_{\mathcal{J},t+1} \mid X_{\mathcal{K},t}) + \eta\, I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{J},t+1}, X_{\mathcal{K},t}). \end{array} $$

Now let’s examine the mutual information $I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{J},t+1}, X_{\mathcal{K},t})$. Let us use the chain rule to expand the conditional mutual information $I(W; X_{\partial_+ v,\, t}, X_{\mathcal{J},t+1} \mid X_{\mathcal{K},t})$ in two ways:

$$ \begin{array}{rcl} I(W; X_{\partial_+ v,\, t}, X_{\mathcal{J},t+1} \mid X_{\mathcal{K},t}) &=& I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{K},t}) + I(W; X_{\mathcal{J},t+1} \mid X_{\mathcal{K} \,\cup\, \partial_+ v,\, t}) \\ &=& I(W; X_{\mathcal{J},t+1} \mid X_{\mathcal{K},t}) + I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{J},t+1}, X_{\mathcal{K},t}). \end{array} $$

Therefore, we can write

$$ I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{J},t+1}, X_{\mathcal{K},t}) = I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{K},t}) + I(W; X_{\mathcal{J},t+1} \mid X_{\mathcal{K} \,\cup\, \partial_+ v,\, t}) - I(W; X_{\mathcal{J},t+1} \mid X_{\mathcal{K},t}). $$

Substituting this into our bound on $I(W; X_{\mathcal{J} \cup \{v\},\, t+1} \mid X_{\mathcal{K},t})$, we obtain the statement of the lemma. $\Box$

The main message of Lemma 3 can be stated more succinctly if we introduce the shorthand notation

$$ I^+_t(\mathcal{J} \mid \mathcal{K}) := I(W; X_{\mathcal{J},t} \mid X_{\mathcal{K},t-1}) \qquad (7) $$

and

$$ I_t(\mathcal{J} \mid \mathcal{K}) := I(W; X_{\mathcal{J},t} \mid X_{\mathcal{K},t}) \qquad (8) $$

for all finite, possibly empty sets $\mathcal{J}, \mathcal{K} \subset \mathcal{V}$ and for all $t = 1, 2, \ldots$, where we also let

$$ \begin{array}{rcl} && I^+_t(\emptyset \mid \mathcal{K}) = I_t(\emptyset \mid \mathcal{K}) = 0; \\ && I^+_t(\mathcal{J} \mid \emptyset) = I(W; X_{\mathcal{J},t}); \\ && I^+_1(\mathcal{J} \mid \mathcal{K}) = I(W; X_{\mathcal{J},1}) \end{array} $$

for all finite sets $\mathcal{J}, \mathcal{K} \subset \mathcal{V}$. A nice way to think about the above definitions is in terms of the prediction and correction steps in Bayesian filtering: Eq. (7) corresponds to prediction, since it quantifies the amount of information about $W$ that the nodes in $\mathcal{J}$ will contain at time $t$ given what we know about the states of the nodes in some other set $\mathcal{K}$ at time $t-1$, while Eq. (8) quantifies the impact of knowing the states of the nodes in $\mathcal{K}$ at time $t$ itself, and so represents correction. Moreover, we can write the bound of Lemma 3 as

$$ I^+_{t+1}(\mathcal{J} \cup \{v\} \mid \mathcal{K}) \le (1-\eta)\, I^+_{t+1}(\mathcal{J} \mid \mathcal{K}) + \eta\, I^+_{t+1}(\mathcal{J} \mid \mathcal{K} \cup \partial_+ v) + \eta\, I_t(\partial_+ v \mid \mathcal{K}), \qquad (9) $$

for any $\mathcal{J}, \mathcal{K}$, any $v \notin \mathcal{J}$, and any $t$.
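Since every term in (9) is a conditional mutual information, the inequality can be sanity-checked numerically by exact enumeration on a tiny example. The sketch below (entirely my own construction, not Dawson’s) does this for $t = 1$ on a triangle graph with binary states, where each $K_v$ takes the majority vote of the full neighborhood and flips it with probability $p$. One caveat: we do not know $\eta_v$ exactly, but for this channel the BSC strong data processing constant $(1-2p)^2$ upper-bounds it, and the right-hand side of (9) is nondecreasing in $\eta$ (by the proof of Lemma 3, it equals $I^+_{t+1}(\mathcal{J} \mid \mathcal{K})$ plus $\eta$ times the nonnegative quantity $I(W; X_{\partial_+ v,\, t} \mid X_{\mathcal{J},t+1}, X_{\mathcal{K},t})$), so checking it with an upper bound on $\eta$ is legitimate.

```python
import itertools
from collections import defaultdict
import numpy as np

p = 0.1                    # flip probability of the local channel
eta = (1 - 2 * p) ** 2     # valid upper bound on eta_v for this channel
q = 0.9                    # under mu_w, each node equals w independently w.p. q
V = (0, 1, 2)              # triangle: partial_+ v = {0, 1, 2} for every v

def K(z, y):
    """Local channel K_v: majority of the full neighborhood, flipped w.p. p."""
    maj = int(sum(y) >= 2)
    return 1 - p if z == maj else p

# Exact joint law of (W, X_1, X_2): 2 * 8 * 8 atoms.
joint = {}
for w in (0, 1):
    for x1 in itertools.product((0, 1), repeat=3):
        pr1 = 0.5 * np.prod([q if b == w else 1 - q for b in x1])
        for x2 in itertools.product((0, 1), repeat=3):
            pr2 = np.prod([K(x2[v], x1) for v in V])
            joint[w, x1, x2] = pr1 * pr2

def cond_mi(f_a, f_b):
    """I(W; A | B) in nats, with A = f_a(w, x1, x2) and B = f_b(w, x1, x2)."""
    pb, pab, pwb, pwab = (defaultdict(float) for _ in range(4))
    for (w, x1, x2), pr in joint.items():
        a, b = f_a(w, x1, x2), f_b(w, x1, x2)
        pb[b] += pr; pab[a, b] += pr; pwb[w, b] += pr; pwab[w, a, b] += pr
    return sum(pr * np.log(pr * pb[b] / (pab[a, b] * pwb[w, b]))
               for (w, a, b), pr in pwab.items() if pr > 0)

# Check (9) with t = 1, J = {0}, v = 1, K = {2}; note partial_+ 1 = {0,1,2}.
sub = lambda xs, idx: tuple(xs[i] for i in idx)
lhs = cond_mi(lambda w, x1, x2: sub(x2, (0, 1)), lambda w, x1, x2: sub(x1, (2,)))
rhs = ((1 - eta) * cond_mi(lambda w, x1, x2: sub(x2, (0,)),
                           lambda w, x1, x2: sub(x1, (2,)))
       + eta * cond_mi(lambda w, x1, x2: sub(x2, (0,)),
                       lambda w, x1, x2: x1)
       + eta * cond_mi(lambda w, x1, x2: x1,
                       lambda w, x1, x2: sub(x1, (2,))))
print(lhs, "<=", rhs)
```

The left number should come out no larger than the right one; varying $\mathcal{J}$, $\mathcal{K}$, $v$, and $p$ gives further spot checks.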

The second step of the proof is to show that the exponential decay of information, according to Eq. (4), is a consequence of (9). To that end, Dawson employs a clever inductive argument. Let $2^{\mathcal{V}}_0$ denote the collection of all finite subsets of $\mathcal{V}$, and for any $t$ define a set function $I_t : 2^{\mathcal{V}}_0 \rightarrow \mathbb{R}$ by

$$ I_t(\mathcal{J}) := I(W; X_{\mathcal{J},t}) \equiv I\big(W;\ X_{v,t},\, v \in \mathcal{J}\big). $$

The quantity $I_t(\cdot \mid \cdot)$ defined in Eq. (8) can be computed in terms of $I_t(\cdot)$: indeed, by the chain rule we have

$$ \begin{array}{rcl} I_t(\mathcal{J} \mid \mathcal{K}) &=& I(W; X_{\mathcal{J},t} \mid X_{\mathcal{K},t}) \\ &=& I(W; X_{\mathcal{J} \cup \mathcal{K},\, t}) - I(W; X_{\mathcal{K},t}) \\ &=& I_t(\mathcal{J} \cup \mathcal{K}) - I_t(\mathcal{K}). \end{array} $$

In addition, if we know $I^+_{t+1}(\cdot \mid \cdot)$, we can also compute $I_{t+1}(\mathcal{J} \mid \mathcal{K})$:

$$ \begin{array}{rcl} I_{t+1}(\mathcal{J} \mid \mathcal{K}) &=& I(W; X_{\mathcal{J},t+1} \mid X_{\mathcal{K},t+1}) \\ &=& I(W; X_{\mathcal{J} \cup \mathcal{K},\, t+1}) - I(W; X_{\mathcal{K},t+1}) \\ &=& I_{t+1}(\mathcal{J} \cup \mathcal{K} \mid \emptyset) - I_{t+1}(\mathcal{K} \mid \emptyset) \\ &=& I^+_{t+1}(\mathcal{J} \cup \mathcal{K} \mid \emptyset) - I^+_{t+1}(\mathcal{K} \mid \emptyset). \end{array} $$

Moreover, using our knowledge of $I_t(\cdot)$ and the fact that $I^+_{t+1}(\emptyset \mid \mathcal{K}) = 0$ for all $\mathcal{K}$, we can do the following: Let us fix an arbitrary node $u \in \mathcal{V}$. Then, from (9) with $\mathcal{J} = \emptyset$ and $v = u$, we derive

$$ \begin{array}{rcl} I^+_{t+1}(\{u\} \mid \mathcal{K}) &\le& (1-\eta)\, \underbrace{I^+_{t+1}(\emptyset \mid \mathcal{K})}_{=0} + \eta\, \underbrace{I^+_{t+1}(\emptyset \mid \mathcal{K} \cup \partial_+ u)}_{=0} + \eta\, I_t(\partial_+ u \mid \mathcal{K}) \\ &=& \eta\, I_t(\partial_+ u \mid \mathcal{K}). \end{array} $$

Fixing another node $v \neq u$, we have

$$ \begin{array}{rcl} I^+_{t+1}(\{u,v\} \mid \mathcal{K}) &\le& (1-\eta)\, I^+_{t+1}(\{u\} \mid \mathcal{K}) + \eta\, I^+_{t+1}(\{u\} \mid \mathcal{K} \cup \partial_+ v) + \eta\, I_t(\partial_+ v \mid \mathcal{K}) \\ &\le& (1-\eta)\eta\, I_t(\partial_+ u \mid \mathcal{K}) + \eta^2\, I_t(\partial_+ u \mid \mathcal{K} \cup \partial_+ v) + \eta\, I_t(\partial_+ v \mid \mathcal{K}) \\ &=& (1-\eta)\eta \left[ I_t(\partial_+ u \mid \mathcal{K}) + I_t(\partial_+ v \mid \mathcal{K}) \right] + \eta^2\, I_t(\partial_+ u \mid \mathcal{K} \cup \partial_+ v) \\ && \qquad -\ (1-\eta)\eta\, I_t(\partial_+ v \mid \mathcal{K}) + \eta\, I_t(\partial_+ v \mid \mathcal{K}) \\ &=& (1-\eta)\eta \left[ I_t(\partial_+ u \mid \mathcal{K}) + I_t(\partial_+ v \mid \mathcal{K}) \right] + \eta^2 \left[ I_t(\partial_+ v \mid \mathcal{K}) + I_t(\partial_+ u \mid \mathcal{K} \cup \partial_+ v) \right] \\ &=& (1-\eta)\eta \left[ I_t(\partial_+ u \mid \mathcal{K}) + I_t(\partial_+ v \mid \mathcal{K}) \right] + \eta^2\, I_t(\partial_+ u \cup \partial_+ v \mid \mathcal{K}), \end{array} $$

where the first step uses (9); the second uses our bound on $I^+_{t+1}(\{u\} \mid \mathcal{K})$; in the third, we add and subtract $(1-\eta)\eta\, I_t(\partial_+ v \mid \mathcal{K})$; and the last step uses the chain rule. Continuing inductively, we see that, for any $\mathcal{J}, \mathcal{K} \in 2^{\mathcal{V}}_0$,

$$ I^+_{t+1}(\mathcal{J} \mid \mathcal{K}) \le \sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k} (1-\eta)^k \sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J} \setminus \mathcal{J}'| = k} I_t\Bigg( \bigcup_{v \in \mathcal{J}'} \partial_+ v\ \Bigg|\ \mathcal{K} \Bigg). $$

In particular, using the formula $I_{t+1}(\mathcal{J} \mid \mathcal{K}) = I^+_{t+1}(\mathcal{J} \cup \mathcal{K} \mid \emptyset) - I^+_{t+1}(\mathcal{K} \mid \emptyset)$ with $\mathcal{K} = \emptyset$, we obtain

$$ I(W; X_{\mathcal{J},t+1}) \le \sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k} (1-\eta)^k \sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J} \setminus \mathcal{J}'| = k} I\Bigg(W;\ X_{w,t},\, w \in \bigcup_{v \in \mathcal{J}'} \partial_+ v \Bigg). \qquad (10) $$

The final step is to complete the proof using the fact that $\Delta\eta < 1$. Since $W$ is binary, $I(W; X_{\mathcal{J},1}) \le H(W) \le \log 2$. Therefore,

$$ I(W; X_{\mathcal{J},1}) \le |\mathcal{J}| \log 2, \qquad \forall \mathcal{J} \in 2^{\mathcal{V}}_0. \qquad (11) $$

This can be interpreted as follows: if we define the specific (i.e., per-node) information at time $t$ by

$$ U_t := \sup_{\emptyset \neq \mathcal{J} \in 2^{\mathcal{V}}_0} \frac{I(W; X_{\mathcal{J},t})}{|\mathcal{J}|}, $$

then $U_1 \le \log 2$. Now, we write

$$ \begin{array}{rcl} I(W; X_{\mathcal{J},2}) &\le& \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J} \setminus \mathcal{J}'| = k} I\Bigg(W;\ X_{w,1},\, w \in \bigcup_{v \in \mathcal{J}'} \partial_+ v \Bigg) \\ &\le& \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J} \setminus \mathcal{J}'| = k} \Bigg| \bigcup_{v \in \mathcal{J}'} \partial_+ v \Bigg| \log 2 \\ &\le& \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J} \setminus \mathcal{J}'| = k} \sum_{v \in \mathcal{J}'} |\partial_+ v| \log 2 \\ &\le& \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \binom{|\mathcal{J}|}{k} (|\mathcal{J}|-k) \Delta \log 2 \\ &=& |\mathcal{J}| \Delta\eta \log 2\ \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \binom{|\mathcal{J}|-1}{k} \eta^{|\mathcal{J}|-1-k}(1-\eta)^k \\ &=& |\mathcal{J}| \Delta\eta \log 2, \end{array} $$

where the first step follows from (10); the second uses (11); the third step is by (over)counting; the fourth step uses the fact that $|\partial_+ v| \le \Delta$ for every $v$, together with the fact that there are $\binom{|\mathcal{J}|}{k}$ sets $\mathcal{J}' \subseteq \mathcal{J}$ with $|\mathcal{J} \setminus \mathcal{J}'| = k$; and the remaining steps are simple combinatorics (the identity $\binom{|\mathcal{J}|}{k}(|\mathcal{J}|-k) = |\mathcal{J}|\binom{|\mathcal{J}|-1}{k}$ followed by the binomial theorem). What we have shown is that the specific information at $t=2$ can be bounded as $U_2 \le \Delta\eta \log 2$. Thus, if $\Delta\eta < 1$, then in one step the specific information will strictly decrease. Iterating, we get the bound

$$ I(W; X_{\mathcal{J},t}) \le |\mathcal{J}|\, (\Delta\eta)^{t-1} \log 2. $$

If $\Delta\eta < 1$, then $U_t \le (\Delta\eta)^{t-1} \log 2 \rightarrow 0$ as $t \rightarrow \infty$. This completes the proof of Theorem 1.
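As a final check on the combinatorics in the last chain of (in)equalities, here is a three-line verification (with arbitrary test values of my choosing) of the identity $\binom{n}{k}(n-k) = n\binom{n-1}{k}$ and of the binomial-theorem step that collapses the final sum to $1$:

```python
from math import comb

n, eta = 7, 0.3   # arbitrary test values standing in for |J| and eta
assert all(comb(n, k) * (n - k) == n * comb(n - 1, k) for k in range(n))
print(sum(comb(n - 1, k) * eta ** (n - 1 - k) * (1 - eta) ** k
          for k in range(n)))   # (eta + (1 - eta))**(n - 1) = 1.0
```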