## Information flow on graphs

Models of complex systems built from simple, locally interacting components arise in many fields, including statistical physics, biology, artificial intelligence, and communication networks. The quest to understand and to quantify the fundamental limits on the ability of such systems to store and process information has led to a variety of interesting and insightful results that draw upon probability, combinatorics, information theory, and the theory of discrete and continuous dynamical systems. In this post, I would like to focus on a model of distributed storage that was analyzed in 1975 by Donald Dawson in a very nice paper, which deserves to be more widely known.

The system Dawson analyzed is an instance of a Probabilistic Cellular Automaton (PCA), i.e., a system consisting of a discrete collection of nodes, each of which has a discrete state. The configuration of the automaton at each time is specified by listing the state of each node. The state of each node at time $t+1$ is determined stochastically by its own state, as well as by the states of its neighbors, at time $t$, independently of everything else. One of the key questions in the theory of PCAs is whether a given automaton is ergodic, i.e., whether its configuration eventually “forgets” the initial condition. Thus, Dawson’s result is a sufficient condition for ergodicity of a PCA.

Now let’s move on to precise definitions. Consider an undirected graph $G = (\mathcal{V},\mathcal{E})$. The neighborhood of a vertex $v \in \mathcal{V}$, denoted by $\partial v$, is the set of all vertices $u \in \mathcal{V}$ that are connected to $v$ by an edge. We will also denote by $\partial_+ v$ the set consisting of $v$ and all of its neighbors. Let us fix a finite alphabet $\mathsf{X}$; for each vertex $v \in \mathcal{V}$, let $\mathsf{X}_v$ denote a copy of $\mathsf{X}$. A configuration is an assignment of a symbol $x_v \in \mathsf{X}_v$ to each vertex $v \in \mathcal{V}$. Thus, $\Omega = \mathsf{X}^{\mathcal{V}}$ is the space of all configurations. We will denote a generic configuration by $x = (x_v)_{v \in \mathcal{V}}$; for any set $\mathcal{J} \subset \mathcal{V}$, we will denote by $x_{\mathcal{J}} = (x_v)_{v \in \mathcal{J}}$ the natural projection of $x$ onto $\mathsf{X}^{\mathcal{J}} = \prod_{v \in \mathcal{J}}\mathsf{X}_v$.

Now let’s consider the following model of a distributed one-bit memory cell. Let $W$ be a ${\rm Bernoulli}(1/2)$ random variable, which is the bit we wish to store. We also pick two probability measures $\mu_0$ and $\mu_1$ on $\Omega$. Given $W=w$, we draw the initial $(t=1)$ configuration $X_1 = (X_{v,1})_{v \in \mathcal{V}}$ at random according to $\mu_w$. We allow arbitrary correlations between the different $X_{v,1}$‘s, so our storage is distributed in this sense. Now let’s suppose that the state of each vertex evolves stochastically in discrete time according to a local Markov model: for each $v$, we have a channel (random transformation) $K_v$ with input alphabet $\mathsf{X}^{\partial_+ v} = \prod_{u \in \partial_+ v} \mathsf{X}_u$ and output alphabet $\mathsf{X}_v$. Then the configuration $X_t = (X_{v,t})_{v \in \mathcal{V}}$ at time $t$ evolves according to the law

$\displaystyle \mathbb{P}[X_{t+1}=x_{t+1}\,|\,W=w, X^t = x^t] = \prod_{v \in \mathcal{V}} K_v(x_{v,t+1}\,|\,x_{\partial_+ v,t}), \qquad t=1,2,\ldots \qquad (1)$

with the initial conditions $\mathbb{P}[W=w]=1/2$, $w \in \{0,1\}$, and $\mathbb{P}[X_1 = x_1\,|\,W=w]=\mu_w(x_1)$, $x_1 \in \Omega$. Thus, we end up with a Markov chain

$\displaystyle W \longrightarrow X_1 \longrightarrow X_2 \longrightarrow X_3 \longrightarrow \ldots,$

where at each time $t$ the evolution from $X_t$ to $X_{t+1}$ is also Markov “in space:” the state of vertex $v$ at time $t+1$ is determined only by the states of $v$ and its neighbors at time $t$, i.e., for each $v$ and $t$, we have a Markov chain

$\displaystyle \left( W; X^{t-1}; X_{\mathcal{V} \backslash \partial_+ v,t} \right) \longrightarrow X_{\partial_+ v,t} \longrightarrow X_{v,t+1}.$

Now the question is: does the configuration at an arbitrary time $t$ retain any information about $W$? One way to answer this is to pick an arbitrary finite set $\mathcal{J} \subset \mathcal{V}$ and to examine the evolution of the mutual information $I(W; X_{\mathcal{J},t})$, where

$\displaystyle X_{\mathcal{J},t} = (X_{v,t})_{v \in \mathcal{J}}.$

If we can show that there exists some $\varepsilon > 0$ such that $I(W; X_{\mathcal{J},t}) \ge \varepsilon$ for all $t$ and at least one nonempty $\mathcal{J}$, then our memory cell is capable of storing one bit. If, on the other hand, $I(W; X_{\mathcal{J},t}) \rightarrow 0$ as $t \rightarrow \infty$ for every finite $\mathcal{J}$, then eventually the memory cell will forget $W$.
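Before diving into the theory, it may help to watch this forgetting phenomenon in simulation. Here is a minimal sketch (my own illustration, not from Dawson’s paper), under the following assumptions: a ring of binary nodes, each $K_v$ taking a majority vote over the closed neighborhood $\partial_+ v$ and then flipping the result with probability $\delta$, and initial measures $\mu_0, \mu_1$ that make the nodes i.i.d. with a bias toward $w$. A plug-in estimate of $I(W; X_{v,t})$ at a single node then exhibits the decay:

```python
import math
import random

random.seed(0)

# Toy parameters (my choice, purely for illustration).
N, DELTA, T, RUNS = 15, 0.2, 8, 4000

def step(x):
    """One application of the update law (1) on a ring."""
    n = len(x)
    out = []
    for v in range(n):
        s = x[(v - 1) % n] + x[v] + x[(v + 1) % n]
        b = 1 if s >= 2 else 0            # majority over partial_+ v
        if random.random() < DELTA:       # then a BSC(DELTA) flip
            b = 1 - b
        out.append(b)
    return out

def run(w):
    """Draw X_1 ~ mu_w (i.i.d. nodes, each equal to w w.p. 0.9) and
    return the trajectory of node 0 at times t = 1, ..., T."""
    x = [w if random.random() < 0.9 else 1 - w for _ in range(N)]
    traj = [x[0]]
    for _ in range(T - 1):
        x = step(x)
        traj.append(x[0])
    return traj

def mi_binary(p0, p1):
    """I(W; B) in nats for W ~ Bernoulli(1/2), P(B = 1 | W = w) = p_w."""
    pb1 = 0.5 * (p0 + p1)
    total = 0.0
    for pw in (p0, p1):
        for cond, marg in ((pw, pb1), (1 - pw, 1 - pb1)):
            if cond > 0:
                total += 0.5 * cond * math.log(cond / marg)
    return total

# Plug-in estimate of I(W; X_{0,t}) from empirical frequencies;
# ones[i][w] counts runs with X_{0,t} = 1 at time t = i + 1, given W = w.
ones = [[0, 0] for _ in range(T)]
for w in (0, 1):
    for _ in range(RUNS):
        for i, b in enumerate(run(w)):
            ones[i][w] += b
info = [mi_binary(ones[i][0] / RUNS, ones[i][1] / RUNS) for i in range(T)]
for t, val in enumerate(info, start=1):
    print(f"t = {t}: I(W; X_0,t) ~ {val:.4f} nats")
```

The estimated information decreases with $t$. Of course, a plug-in estimate at one node only illustrates the phenomenon; it does not prove ergodicity.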

1. Dawson’s impossibility result

Roughly speaking, Dawson proved the following: if the local channels $K_v$ are sufficiently noisy and if the graph $G$ is sufficiently sparse, then the mutual information $I(W; X_{\mathcal{J},t})$ will decay to zero exponentially fast. In order to state his result precisely, we need some preliminary definitions. First of all, in order to quantify the noisiness of each channel $K_v$, Dawson introduces the following quantity, which he calls the transmission coefficient of $K_v$:

$\displaystyle \eta_v := \sup_{W \rightarrow Y \rightarrow Z} \frac{I(W; Z)}{I(W; Y)}, \qquad (2)$

where the supremum is over all Markov chains $W \rightarrow Y \rightarrow Z$ in which $W \sim {\rm Bernoulli}(1/2)$, the channel from $W$ to $Y \in \mathsf{X}^{\partial_+ v}$ is arbitrary, and the channel from $Y$ to $Z \in \mathsf{X}_v$ is given by $K_v$. Moreover, let $\eta := \sup_{v \in \mathcal{V}} \eta_v$. Clearly, $\eta_v \le 1$ for every $v \in \mathcal{V}$, by the data processing inequality. Consequently, $\eta \le 1$.
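To get a feel for the transmission coefficient, here is a small numerical sketch (again an illustration, not anything from the paper) for the simplest possible case: $\partial_+ v$ a single node and $K_v$ a binary symmetric channel ${\rm BSC}(\delta)$. A grid search over binary channels $W \rightarrow Y$ gives a lower estimate of the supremum in (2); the estimate approaches $(1-2\delta)^2$, the known contraction coefficient of the BSC, which by the strong data processing inequality also upper-bounds the ratio:

```python
import math

def mi(p0, p1):
    """I(W; Y) in nats for W ~ Bernoulli(1/2) and P(Y = 1 | W = w) = p_w."""
    py1 = 0.5 * (p0 + p1)
    total = 0.0
    for pw in (p0, p1):
        for cond, marg in ((pw, py1), (1 - pw, 1 - py1)):
            if cond > 0:
                total += 0.5 * cond * math.log(cond / marg)
    return total

def eta_bsc(delta, grid=200):
    """Grid-search estimate (a lower bound) of the supremum in (2) when
    the neighborhood is a single node and K_v = BSC(delta)."""
    best = 0.0
    for i in range(grid + 1):
        for j in range(grid + 1):
            a, b = i / grid, j / grid      # P(Y=1 | W=0), P(Y=1 | W=1)
            iy = mi(a, b)
            if iy < 1e-12:
                continue
            # Push Y through the BSC(delta) to get Z.
            za = a * (1 - delta) + (1 - a) * delta
            zb = b * (1 - delta) + (1 - b) * delta
            best = max(best, mi(za, zb) / iy)
    return best

delta = 0.1
est = eta_bsc(delta)
print(f"grid estimate of eta: {est:.4f}; (1 - 2*delta)^2 = {(1 - 2*delta)**2:.4f}")
```

The supremum is approached in the limit of a nearly uninformative channel $W \rightarrow Y$ (both conditional distributions close to uniform), which is why a finite grid gives a value just below $(1-2\delta)^2$.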

Remark 1 Dawson’s definition of the transmission coefficient is reminiscent of the best constant in the data processing inequality of Erkip and Cover (corrected by Anantharam et al.): to define that constant in our setting, one would fix a distribution $P_Y$ and let

$\displaystyle \delta^*_v(P_Y) := \sup_{U \rightarrow Y \rightarrow Z} \frac{I(U; Z)}{I(U;Y)}, \qquad (3)$

where the supremum is over all Markov chains $U \rightarrow Y \rightarrow Z$ satisfying the constraints $Y \sim P_Y$ and $P_{Z|Y} = K_v$. The difference is that in (2) we constrain the marginal distribution of $W$ and the stochastic kernel $P_{Z|Y}$, whereas in (3) we constrain the marginal of $Y$ and the kernel $P_{Z|Y}$. As shown by Anantharam et al.,

$\displaystyle \delta^*_v(P_Y) = \sup_{Q_Y \neq P_Y} \frac{D(Q_Z \| P_Z)}{D(Q_Y \| P_Y)},$

where the supremum is over all marginal distributions $Q_Y$ of $Y$ distinct from $P_Y$, and $Q_Z$ (resp., $P_Z$) denotes the marginal distribution of $Z$ when $Y \sim Q_Y$ (resp., $Y \sim P_Y$). Both $\eta_v$ and $\delta^*_v$ measure the “noisiness” of $K_v$, but in different ways. In particular, the motivation behind Dawson’s definition is as follows: At any time $t$, the state $X_{v,t+1}$ of a given node at time $t+1$ depends only on $X_{\partial_+ v,t}$ and not on anything else. Thus, if the boundary condition, i.e., the configuration of all the nodes except for $v$ and its neighbors, is “frozen” at some value $x_{\mathcal{V}\backslash\partial_+ v}$, then the amount of information about $W$ that node $v$ will receive at time $t+1$ will be controlled by the quantity

$\displaystyle \frac{I(W; X_{v,t+1} \,|\, X_{\mathcal{V}\backslash \partial_+ v,t} = x_{\mathcal{V}\backslash\partial_+ v})}{I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{V}\backslash \partial_+ v,t} = x_{\mathcal{V}\backslash\partial_+ v})},$

which, by definition, is no more than $\eta_v$. We will elaborate on this point later on.

Also, Dawson assumes that $G$ has bounded degree, i.e.,

$\displaystyle \Delta := \sup_{v \in \mathcal{V}}|\partial_+ v| < \infty.$

Now we are in a position to state the following:

Theorem 1 (Dawson) For any finite set $\mathcal{J} \subset \mathcal{V}$,

$\displaystyle I(W; X_{\mathcal{J},t}) \le |\mathcal{J}|\, (\Delta\eta)^{t-1} \log 2. \qquad (4)$

Thus, if $\Delta \eta < 1$, then $I(W; X_{\mathcal{J},t}) \rightarrow 0$ as $t \rightarrow \infty$, and the convergence is exponentially fast.
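The bound (4) is easy to evaluate numerically. A hypothetical example (my numbers, not Dawson’s): on a ring we have $\Delta = 3$ (counting $v$ itself), so any $\eta < 1/3$ gives exponential decay:

```python
import math

def dawson_bound(j_size, Delta, eta, t):
    """Right-hand side of (4): |J| * (Delta * eta)**(t-1) * log 2, in nats."""
    return j_size * (Delta * eta) ** (t - 1) * math.log(2)

# |J| = 10 observed nodes, Delta = 3, eta = 0.3, so Delta * eta = 0.9 < 1.
times = (1, 5, 10, 20)
bounds = [dawson_bound(10, 3, 0.3, t) for t in times]
for t, b in zip(times, bounds):
    print(f"t = {t:2d}: I(W; X_J,t) <= {b:.4f} nats")
```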

2. The proof

The proof of Theorem 1 consists of several steps. The first step gives an upper bound on the information value of the state at a given node if we have already observed the states of some other nodes. This bound is stated in Lemmas 2 and 3 below.

Lemma 2 Fix two arbitrary, possibly empty finite sets $\mathcal{J},\mathcal{K} \subset \mathcal{V}$. Then, for any $v \notin \mathcal{J}$,

$\displaystyle I(W; X_{v,t+1} \,|\, X_{\mathcal{J},t+1}, X_{\mathcal{K},t}) \le \eta\, I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{J},t+1}, X_{\mathcal{K},t}). \qquad (5)$

Proof: We begin by noting that, since $v \notin \mathcal{J}$,

$\displaystyle (W, X_{\mathcal{K} \,\cup\, \partial_+ v,t}, X_{\mathcal{J},t+1}) \rightarrow X_{\partial_+ v,t} \rightarrow X_{v,t+1} \qquad (6)$

is a Markov chain. Now, for any fixed choices of the configurations $x_{\mathcal{K},t}$ and $x_{\mathcal{J},t+1}$, consider two probability distributions $\nu_0$ and $\nu_1$ of $Y = X_{\partial_+ v,t} \in \mathsf{X}^{\partial_+ v}$:

$\displaystyle \nu_w(y) := \mathbb{P}\Big[ Y = y\,\Big|\, W=w, X_{\mathcal{K},t} = x_{\mathcal{K},t}, X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}\Big], \quad w \in \{0,1\}.$

From (6), if we consider a channel from $W$ to $Y$ with transition probabilities given by $\nu_w$, and then let $Z \in \mathsf{X}_v$ be the output of $K_v$ with input $Y$, then

$\displaystyle \begin{array}{rcl} I(W;Y) &=& I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t}=x_{\mathcal{K},t}), \\ I(W;Z) &=& I(W; X_{v,t+1} \,|\, X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t} = x_{\mathcal{K},t}), \end{array}$

and $W \rightarrow Y \rightarrow Z$ is a Markov chain satisfying the conditions in the definition (2) of the transmission coefficient $\eta_v$. Therefore,

$\displaystyle \begin{array}{rcl} && I(W; X_{v,t+1} \,|\, X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t} = x_{\mathcal{K},t}) \\ &&\quad \le \eta_v\, I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t}=x_{\mathcal{K},t}) \\ &&\quad \le \eta\, I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{J},t+1} = x_{\mathcal{J},t+1}, X_{\mathcal{K},t}=x_{\mathcal{K},t}). \end{array}$

Taking the expectation with respect to $X_{\mathcal{J},t+1}$ and $X_{\mathcal{K},t}$, we obtain (5). $\Box$
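Since Lemma 2 involves only a finite collection of conditional mutual informations, it can be checked by exact enumeration on a toy instance. The sketch below (an illustration under my own assumptions, not from the paper) uses the complete graph on three binary nodes, so $\partial_+ v = \{0,1,2\}$ for every $v$, with $K_v$ = majority vote followed by a ${\rm BSC}(\delta)$ and i.i.d. biased initial measures. Because $K_v$ factors as a deterministic map followed by a BSC, its transmission coefficient is at most $(1-2\delta)^2$, which we use in place of $\eta$:

```python
import math
from collections import defaultdict
from itertools import product

DELTA = 0.15                  # BSC crossover probability (my choice)
ETA = (1 - 2 * DELTA) ** 2    # upper bound on the transmission coefficient
P_INIT = 0.8                  # P(X_{v,1} = w | W = w), i.i.d. across nodes

def majority(bits):
    return 1 if sum(bits) >= 2 else 0

# Exact joint law of (W, X_1, X_2) on the 3-node complete graph;
# K_v = majority over partial_+ v = {0,1,2}, then BSC(DELTA).
joint = {}
for w in (0, 1):
    for x1 in product((0, 1), repeat=3):
        p1 = 1.0
        for b in x1:
            p1 *= P_INIT if b == w else 1 - P_INIT
        m = majority(x1)
        for x2 in product((0, 1), repeat=3):
            p2 = 1.0
            for b in x2:
                p2 *= (1 - DELTA) if b == m else DELTA
            joint[(w, x1, x2)] = 0.5 * p1 * p2

def cond_mi(f_a, f_b, f_c):
    """I(A; B | C) in nats; f_a, f_b, f_c extract A, B, C from an atom."""
    pabc, pac, pbc, pc = (defaultdict(float) for _ in range(4))
    for atom, p in joint.items():
        a, b, c = f_a(atom), f_b(atom), f_c(atom)
        pabc[(a, b, c)] += p
        pac[(a, c)] += p
        pbc[(b, c)] += p
        pc[c] += p
    return sum(p * math.log(p * pc[c] / (pac[(a, c)] * pbc[(b, c)]))
               for (a, b, c), p in pabc.items() if p > 0)

# Lemma 2 with t = 1, v = 2, J = {0}, K = {1}:
# I(W; X_{2,2} | X_{0,2}, X_{1,1}) <= eta * I(W; X_{partial_+ 2,1} | X_{0,2}, X_{1,1})
lhs = cond_mi(lambda s: s[0], lambda s: s[2][2], lambda s: (s[2][0], s[1][1]))
rhs = ETA * cond_mi(lambda s: s[0], lambda s: s[1], lambda s: (s[2][0], s[1][1]))
print(f"lhs = {lhs:.6f} <= {rhs:.6f} = eta * (conditional information at time 1)")
```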

Lemma 3 For any finite set $\mathcal{J}\subset\mathcal{V}$, any $v \notin \mathcal{J}$, and an arbitrary (possibly empty) finite set $\mathcal{K}$,

$\displaystyle \begin{array}{rcl} I(W; X_{\mathcal{J} \cup \{v\},t+1}\,|\,X_{\mathcal{K},t}) &\le& \eta\, I(W; X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K} \,\cup\, \partial_+ v,t}) \\ && \qquad \qquad +\, \eta\, I(W; X_{\partial_+ v,t}\,|\,X_{\mathcal{K},t}) \\ && \qquad \qquad +\, (1-\eta)\, I(W; X_{\mathcal{J},t+1}\,|\,X_{\mathcal{K},t}). \end{array}$

Proof: Using the chain rule for mutual information and Lemma 2,

$\displaystyle \begin{array}{rcl} I(W; X_{\mathcal{J} \cup \{v\},t+1}\,|\,X_{\mathcal{K},t}) &=& I(W; X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K},t}) + I(W; X_{v,t+1} \,|\, X_{\mathcal{J},t+1}, X_{\mathcal{K},t}) \\ &\le& I(W; X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K},t}) + \eta\, I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{J},t+1},X_{\mathcal{K},t}). \end{array}$

Now let’s examine the mutual information $I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{J},t+1}, X_{\mathcal{K},t})$. Let us use the chain rule to expand the conditional mutual information $I(W; X_{\partial_+ v,t}, X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K},t})$ in two ways:

$\displaystyle \begin{array}{rcl} I(W; X_{\partial_+ v,t}, X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K},t}) &=& I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{K},t}) + I(W; X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K} \cup \partial_+ v,t}) \\ &=& I(W; X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K},t}) + I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{J},t+1},X_{\mathcal{K},t}). \end{array}$

Therefore, we can write

$\displaystyle I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{J},t+1},X_{\mathcal{K},t}) = I(W; X_{\partial_+ v,t} \,|\, X_{\mathcal{K},t}) + I(W; X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K} \cup \partial_+ v,t}) - I(W; X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K},t}).$

Substituting this into our bound on $I(W; X_{\mathcal{J} \cup \{v\},t+1}\,|\,X_{\mathcal{K},t})$, we obtain the statement of the lemma. $\Box$

The main message of Lemma 3 can be stated more succinctly if we introduce the shorthand notation

$\displaystyle I^+_t(\mathcal{J} \,|\,\mathcal{K}) := I(W; X_{\mathcal{J},t} \,|\, X_{\mathcal{K},t-1}) \qquad (7)$

and

$\displaystyle I_t(\mathcal{J} \,|\, \mathcal{K}) := I(W; X_{\mathcal{J},t} \,|\, X_{\mathcal{K},t}) \qquad (8)$

for all finite, possibly empty sets $\mathcal{J},\mathcal{K}\subset\mathcal{V}$ and for all $t=1,2,\ldots$, where we also let

$\displaystyle \begin{array}{rcl} && I^+_t(\emptyset\,|\,\mathcal{K}) = I_t(\emptyset\,|\,\mathcal{K}) = 0;\\ && I^+_t(\mathcal{J} \,|\, \emptyset) = I(W; X_{\mathcal{J},t}); \\ && I^+_1(\mathcal{J} \,|\,\mathcal{K}) = I(W; X_{\mathcal{J},1}) \end{array}$

for all finite sets $\mathcal{J},\mathcal{K} \subset \mathcal{V}$. A nice way to think about the above definitions is in terms of the prediction and correction steps in Bayesian filtering: Eq. (7) corresponds to prediction, since it quantifies the amount of information about $W$ that the nodes in $\mathcal{J}$ will contain at time $t$ given what we know about the state of the nodes in some other set $\mathcal{K}$ at time $t-1$, while Eq. (8) quantifies the impact of knowing the state of the nodes in $\mathcal{K}$ at time $t$, and so represents correction. Moreover, we can write the bound of Lemma 3 as

$\displaystyle I^+_{t+1}(\mathcal{J} \cup \{v\} \,|\,\mathcal{K}) \le (1-\eta)\, I^+_{t+1}(\mathcal{J} \,|\, \mathcal{K}) + \eta\, I^+_{t+1}(\mathcal{J} \,|\, \mathcal{K} \cup \partial_+ v) + \eta\, I_t (\partial_+ v\,|\,\mathcal{K}), \qquad (9)$

for any $\mathcal{J},\mathcal{K}$, any $v \notin\mathcal{J}$, and any $t$.

The second step of the proof is to show that the exponential decay of information, according to Eq. (4), is a consequence of (9). To that end, Dawson employs a clever inductive argument. Let $2^{\mathcal{V}}_0$ denote the collection of all finite subsets of $\mathcal{V}$, and for any $t$ define a set function $I_t : 2^{\mathcal{V}}_0 \rightarrow \mathbb{R}$ by

$\displaystyle I_t(\mathcal{J}) := I(W; X_{\mathcal{J},t}) \equiv I\big(W; X_{v,t},\, v \in \mathcal{J}\big).$

The quantity $I_t(\cdot\,|\,\cdot)$ defined in Eq. (8) can be computed in terms of $I_t(\cdot)$: indeed, by the chain rule we have

$\displaystyle \begin{array}{rcl} I_t(\mathcal{J}\,|\,\mathcal{K}) &=& I(W; X_{\mathcal{J},t} \,|\, X_{\mathcal{K},t}) \\ &=& I(W; X_{\mathcal{J} \cup \mathcal{K},t}) - I(W; X_{\mathcal{K},t}) \\ &=& I_t(\mathcal{J} \cup \mathcal{K}) - I_t(\mathcal{K}). \end{array}$
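This chain-rule identity holds for any joint distribution, and it is easy to sanity-check numerically. A quick illustration (mine, not from the post) on a random joint pmf of $(W, X_a, X_b)$: the conditional mutual information computed directly from the conditional distributions agrees with the difference of unconditional mutual informations.

```python
import math
import random
from collections import defaultdict

random.seed(1)

# A random joint pmf over (w, xa, xb), all binary.
atoms = [(w, a, b) for w in (0, 1) for a in (0, 1) for b in (0, 1)]
weights = [random.random() for _ in atoms]
total = sum(weights)
joint = {atom: wt / total for atom, wt in zip(atoms, weights)}

def mi(f_a, f_b):
    """I(A; B) in nats, with A = f_a(atom) and B = f_b(atom)."""
    pab, pa, pb = defaultdict(float), defaultdict(float), defaultdict(float)
    for atom, p in joint.items():
        a, b = f_a(atom), f_b(atom)
        pab[(a, b)] += p
        pa[a] += p
        pb[b] += p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0)

def cond_mi(f_a, f_b, f_c):
    """I(A; B | C), computed directly from the joint (not via the identity)."""
    pabc, pac, pbc, pc = (defaultdict(float) for _ in range(4))
    for atom, p in joint.items():
        a, b, c = f_a(atom), f_b(atom), f_c(atom)
        pabc[(a, b, c)] += p
        pac[(a, c)] += p
        pbc[(b, c)] += p
        pc[c] += p
    return sum(p * math.log(p * pc[c] / (pac[(a, c)] * pbc[(b, c)]))
               for (a, b, c), p in pabc.items() if p > 0)

# I(W; Xa | Xb) should equal I(W; Xa, Xb) - I(W; Xb).
lhs = cond_mi(lambda s: s[0], lambda s: s[1], lambda s: s[2])
rhs = mi(lambda s: s[0], lambda s: (s[1], s[2])) - mi(lambda s: s[0], lambda s: s[2])
print(f"I(W; Xa | Xb) = {lhs:.6f}, I(W; Xa, Xb) - I(W; Xb) = {rhs:.6f}")
```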

In addition, if we know $I^+_{t+1}(\cdot\,|\,\cdot)$, we can also compute $I_{t+1}(\mathcal{J}\,|\,\mathcal{K})$:

$\displaystyle \begin{array}{rcl} I_{t+1}(\mathcal{J} \,|\, \mathcal{K}) &=& I(W; X_{\mathcal{J},t+1} \,|\, X_{\mathcal{K},t+1}) \\ &=& I(W; X_{\mathcal{J} \cup \mathcal{K},t+1}) - I(W; X_{\mathcal{K},t+1}) \\ &=& I_{t+1}(\mathcal{J} \cup \mathcal{K} \,|\, \emptyset) - I_{t+1}(\mathcal{K} \,|\, \emptyset) \\ &=& I^+_{t+1}(\mathcal{J} \cup \mathcal{K} \,|\, \emptyset) - I^+_{t+1}(\mathcal{K} \,|\, \emptyset). \end{array}$

Moreover, using our knowledge of $I_t(\cdot)$ and the fact that $I^+_{t+1}(\emptyset\,|\,\mathcal{K}) = 0$ for all $\mathcal{K}$, we can do the following: Let us fix an arbitrary node $u \in \mathcal{V}$. Then, from (9), we derive

$\displaystyle \begin{array}{rcl} I^+_{t+1}(\{u\} \,|\, \mathcal{K}) &\le& (1-\eta)\, \underbrace{I^+_{t+1}(\emptyset \,|\, \mathcal{K})}_{=0} + \eta\, \underbrace{I^+_{t+1}(\emptyset \,|\, \mathcal{K} \cup \partial_+ u)}_{=0} + \eta\, I_t(\partial_+ u \,|\, \mathcal{K}) \\ &=& \eta\, I_t(\partial_+ u \,|\, \mathcal{K}). \end{array}$

Fixing another node $v \neq u$, we have

$\displaystyle \begin{array}{rcl} I^+_{t+1}(\{u,v\} \,|\, \mathcal{K}) &\le& (1-\eta)\,I^+_{t+1}(\{u\}\,|\,\mathcal{K}) + \eta\, I^+_{t+1}(\{u\} \,|\, \mathcal{K} \cup \partial_+ v) + \eta\, I_t(\partial_+ v \,|\, \mathcal{K}) \\ &\le& (1-\eta)\eta\, I_t(\partial_+ u \,|\, \mathcal{K}) + \eta^2\, I_t(\partial_+ u \,|\, \mathcal{K} \cup \partial_+ v) + \eta\, I_t(\partial_+ v \,|\, \mathcal{K}) \\ &=& (1-\eta)\eta \left[I_t(\partial_+ u \,|\, \mathcal{K}) + I_t(\partial_+ v \,|\, \mathcal{K})\right] + \eta^2\, I_t(\partial_+ u \,|\, \mathcal{K} \cup \partial_+ v) \\ && \qquad \qquad \qquad -\, (1-\eta)\eta\, I_t(\partial_+ v \,|\, \mathcal{K}) + \eta\, I_t(\partial_+ v \,|\, \mathcal{K}) \\ &=& (1-\eta)\eta \left[I_t(\partial_+ u \,|\, \mathcal{K}) + I_t(\partial_+ v \,|\, \mathcal{K})\right] + \eta^2 \left[I_t(\partial_+ v \,|\, \mathcal{K}) + I_t(\partial_+ u \,|\, \mathcal{K} \cup \partial_+ v) \right] \\ &=& (1-\eta)\eta \left[I_t(\partial_+ u \,|\, \mathcal{K}) + I_t(\partial_+ v \,|\, \mathcal{K})\right] + \eta^2\, I_t(\partial_+ u \cup \partial_+ v \,|\, \mathcal{K}), \end{array}$

where the first step uses (9); the second uses our bound on $I^+_{t+1}(\{u\}\,|\,\cdot)$; in the third, we add and subtract $(1-\eta)\eta\, I_t(\partial_+ v\,|\,\mathcal{K})$; and the last step uses the chain rule. Continuing inductively, we see that, for any $\mathcal{J}, \mathcal{K} \in 2^{\mathcal{V}}_0$,

$\displaystyle I^+_{t+1}(\mathcal{J} \,|\, \mathcal{K}) \le \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \displaystyle\sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J}'| = |\mathcal{J}|-k} I_t\Bigg( \bigcup_{v \in \mathcal{J}'} \partial_+ v \,\Bigg|\, \mathcal{K}\Bigg).$

In particular, using the formula $I_{t+1}(\mathcal{J}\,|\,\mathcal{K}) = I^+_{t+1}(\mathcal{J} \cup \mathcal{K}\,|\,\emptyset) - I^+_{t+1}(\mathcal{K}\,|\,\emptyset)$ with $\mathcal{K} = \emptyset$, we obtain

$\displaystyle I(W; X_{\mathcal{J},t+1}) \le \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \displaystyle\sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J}'| = |\mathcal{J}|-k} I\Bigg(W; X_{w,t},\, w \in \bigcup_{v \in \mathcal{J}'} \partial_+ v \Bigg). \qquad (10)$

The final step is to complete the proof using the fact that $\Delta \eta < 1$. Since $W$ is binary, $I(W; X_{\mathcal{J},1}) \le H(W) \le \log 2$. Therefore,

$\displaystyle I(W; X_{\mathcal{J},1}) \le |\mathcal{J}| \log 2, \qquad \forall \mathcal{J} \in 2^{\mathcal{V}}_0. \qquad (11)$

This can be interpreted as follows: if we define the specific information at time $t$ by

$\displaystyle U_t := \sup_{\mathcal{J} \neq \emptyset} \frac{I(W; X_{\mathcal{J},t})}{|\mathcal{J}|},$

then $U_1 \le \log 2$. Now, we write

$\displaystyle \begin{array}{rcl} I(W; X_{\mathcal{J},2}) &\le& \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \displaystyle\sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J}'|=|\mathcal{J}|-k} I\Bigg(W; X_{w,1},\, w \in \bigcup_{v \in \mathcal{J}'} \partial_+ v \Bigg) \\ &\le& \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \displaystyle\sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J}'|=|\mathcal{J}|-k} \left| \bigcup_{v \in \mathcal{J}'} \partial_+ v \right| \log 2 \\ &\le& \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \displaystyle\sum_{\mathcal{J}' \subseteq \mathcal{J}:\, |\mathcal{J}'|=|\mathcal{J}|-k} \displaystyle\sum_{v \in \mathcal{J}'} |\partial_+ v| \log 2 \\ &\le& \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \eta^{|\mathcal{J}|-k}(1-\eta)^k \binom{|\mathcal{J}|}{|\mathcal{J}|-k} (|\mathcal{J}|-k)\, \Delta \log 2 \\ &=& |\mathcal{J}|\, \Delta \eta \log 2\, \displaystyle\sum^{|\mathcal{J}|-1}_{k=0} \binom{|\mathcal{J}|-1}{k} \eta^{|\mathcal{J}|-1-k}(1-\eta)^k \\ &=& |\mathcal{J}|\, \Delta \eta \log 2, \end{array}$

where the first step follows from (10); the second uses (11); the third step is by (over)counting; the fourth step uses the fact that $|\partial_+ v| \le \Delta$ for every $v$; and the remaining steps are simple combinatorics. What we have shown is that the specific information at $t=2$ can be bounded as $U_2 \le \Delta \eta \log 2$. Thus, if $\Delta \eta < 1$, then in one step the specific information will strictly decrease. Iterating, we get the bound

$\displaystyle I(W; X_{\mathcal{J},t}) \le |\mathcal{J}|\, (\Delta \eta)^{t-1} \log 2.$

If $\Delta \eta < 1$, then $U_t \le (\Delta\eta)^{t-1} \log 2 \rightarrow 0$ as $t \rightarrow \infty$. This completes the proof of Theorem 1.
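As a closing sanity check (mine, not part of the original argument), the combinatorial identity behind the last two steps of the long display above, $\sum_{k=0}^{n-1} \binom{n}{n-k}(n-k)\,\eta^{n-k}(1-\eta)^k = n\eta$ with $n = |\mathcal{J}|$ (substituting $j = n-k$, this is just the mean of a ${\rm Binomial}(n,\eta)$ distribution), can be verified numerically:

```python
import math

def binom_sum(n, eta):
    """Sum_{k=0}^{n-1} C(n, n-k) * (n-k) * eta^(n-k) * (1-eta)^k."""
    return sum(math.comb(n, n - k) * (n - k) * eta ** (n - k) * (1 - eta) ** k
               for k in range(n))

# The sum should equal n * eta for every n >= 1 and eta in [0, 1].
errors = [abs(binom_sum(n, eta) - n * eta)
          for n in (1, 2, 5, 12) for eta in (0.1, 0.3, 0.9)]
print("max deviation from n * eta:", max(errors))
```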