Undirected models are better at sampling

For a simple reason, the best directed models should always be worse at generating samples than the best undirected models, even if their log likelihoods are similar.

If we have an undirected model, then it defines a probability distribution by the equation

$$p(x;\theta) = \frac{\exp(G(x;\theta))}{Z(\theta)}, \qquad Z(\theta) = \sum_x \exp(G(x;\theta)),$$

where $G(x;\theta)$ is the goodness the model assigns to the configuration $x$ and $Z(\theta)$ is the partition function.
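
To make this concrete, here is a minimal sketch in Python of such a model. Everything specific in it is an illustrative assumption: the variables are binary, the goodness is a pairwise $G(x;W)=x^\top W x$, and the partition function is computed by brute-force enumeration, which is only feasible for tiny $n$.

```python
import itertools

import numpy as np

def goodness(x, W):
    # G(x; theta): an illustrative pairwise (Boltzmann-machine-style) goodness.
    return x @ W @ x

def log_partition(W, n):
    # log Z(theta) = log sum_x exp(G(x; theta)), by brute-force enumeration.
    xs = np.array(list(itertools.product([0, 1], repeat=n)))
    return np.log(sum(np.exp(goodness(x, W)) for x in xs))

def log_prob(x, W):
    # log p(x; theta) = G(x; theta) - log Z(theta).
    return goodness(x, W) - log_partition(W, len(x))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))
print(log_prob(np.array([1, 0, 1, 1]), W))
```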

As always, the standard objective of unsupervised learning is to find a distribution $p(x;\theta)$ such that the average log probability of the data distribution, $E_{x\sim D(x)}[\log p(x;\theta)]$, is as large as possible.

In theory, if we learn successfully, we should reach a local maximum of the average log probability. Since $\nabla_\theta \log p(x;\theta) = \nabla_\theta G(x;\theta) - E_{x'\sim p(x';\theta)}[\nabla_\theta G(x';\theta)]$ (the second term is the gradient of $\log Z(\theta)$), taking the derivative of the objective and setting it to zero yields

$$E_{x\sim D(x)}[\nabla_\theta G(x;\theta^*)] = E_{x\sim p(x;\theta^*)}[\nabla_\theta G(x;\theta^*)]$$

(here $\theta^*$ are the maximum likelihood parameters). Notice that this equation is a statement about the samples produced by the distribution $p(x;\theta^*)$: the gradient of the goodness $\nabla_\theta G(x;\theta^*)$ averaged over the data distribution $D(x)$ is equal to the same gradient averaged over the model’s distribution $p(x;\theta^*)$. Therefore, the samples from $p(x;\theta^*)$ must somehow be related to the samples from the data distribution $D(x)$. This is a “promise” made to us by the learning objective of unsupervised learning.
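
Here is a small numerical check of this promise on the pairwise toy model from above, under the same illustrative assumptions (binary variables, exact enumeration, and a made-up data distribution $D$ over the $2^n$ states). Since $\nabla_W G = xx^\top$ for $G(x;W)=x^\top W x$, the promise says the data and model second moments coincide at $\theta^*$:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n = 4
xs = np.array(list(itertools.product([0, 1], repeat=n)))  # all 2^n configurations

def model_dist(W):
    # Exact p(x; W) for G(x; W) = x^T W x, normalized by enumeration.
    g = np.einsum('ki,ij,kj->k', xs, W, xs)
    p = np.exp(g - g.max())
    return p / p.sum()

def moment(q):
    # E_q[grad_W G] = E_q[x x^T], since grad_W (x^T W x) = x x^T.
    return np.einsum('k,ki,kj->ij', q, xs, xs)

D = rng.dirichlet(np.ones(len(xs)))  # an arbitrary made-up data distribution D(x)

W = np.zeros((n, n))
for _ in range(20000):  # exact gradient ascent on E_D[log p(x; W)]
    W += 0.1 * (moment(D) - moment(model_dist(W)))

# ~0 (up to optimization tolerance): at the optimum, E_D[grad G] = E_p[grad G].
print(np.max(np.abs(moment(D) - moment(model_dist(W)))))
```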

However, directed models offer no such guarantee; instead, the objective promises that the conditional distributions of the data distribution will be similar to the conditional distributions of the model’s distribution when the variables being conditioned on are sampled from the data distribution. This is the critical point.

More formally, a directed model defines a distribution $p(x;\theta)=\prod_j p(x_j|x_{<j};\theta)$. Plugging this into the objective of maximizing the average log probability of the data distribution $D(x)$, we get the following:

$$\sum_j E_{x\sim D(x)}[\log p(x_j|x_{<j};\theta)],$$

which is a sum of independent problems.
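
Here is a sketch of this decomposition, assuming fully tabular (hence non-shared) conditionals over binary vectors; the random data below is a stand-in for samples from $D(x)$. Each table maximizes its own term $E_{x\sim D(x)}[\log p(x_j|x_{<j})]$, namely the empirical conditional frequencies, and is fit with no reference to the other positions:

```python
from collections import Counter, defaultdict

import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(1000, 4))  # stand-in samples from D(x)

# One independent table per position j, keyed by the prefix x_<j. The empirical
# conditional frequencies maximize E_D[log p(x_j | x_<j)] for that j alone.
tables = []
for j in range(data.shape[1]):
    counts = defaultdict(Counter)
    for row in data:
        counts[tuple(row[:j])][row[j]] += 1
    tables.append({prefix: {v: c / sum(cnt.values()) for v, c in cnt.items()}
                   for prefix, cnt in counts.items()})

print(tables[2][(0, 1)].get(1, 0.0))  # p(x_3 = 1 | x_1 = 0, x_2 = 1)
```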

If the $p(x_j|x_{<j};\theta)$’s don’t share parameters across different $j$’s, then the problems are truly independent and can be solved completely separately. So let’s say we found a $\theta^*$ that makes all these objectives happy. Then each $E_{x_{<j}\sim D(x_{<j})}\left[E_{x_j\sim D(x_j|x_{<j})}[\log p(x_j|x_{<j};\theta^*)]\right]$ will be happy, which means that $p(x_j|x_{<j};\theta^*)$ is, more or less, similar to $D(x_j|x_{<j})$ for $x_{<j}$ sampled from $D(x_{<j})$. This is the critical implicit promise made by the maximum likelihood objective applied to directed models.

Why is it a problem when generating samples? It’s bad because this objective makes no “promises” about the behaviour of $p(x_j|x_{<j};\theta^*)$ when $x_{<j}\sim p(x_{<j};\theta^*)$. It is easy to imagine that $p(x_1;\theta^*)$ will be somewhat different from $D(x_1)$; suppose $x_1$ was sampled from $p(x_1;\theta^*)$. Then $p(x_2|x_1;\theta^*)$ will freak out, having never seen anything like this $x_1$, which will make the sample $(x_1,x_2)$ look even less like a sample from $D(x_1,x_2)$. And so on. This “chain reaction” will likely cause the directed model to produce worse-looking samples than an undirected model with a similar log probability.
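
A toy version of this chain reaction, with everything hand-made for illustration: $D$ puts $x_1=1$ almost surely and makes $x_2$ copy $x_1$, the fitted $p(x_1;\theta^*)$ is imagined to be somewhat off, and the coin-flip fallback for prefixes the model has never seen is purely an assumption of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tabular conditionals as above, for a D where x_1 = 1 with probability 0.99
# and x_2 copies x_1. The prefix (0,) essentially never occurs under D, so
# p(x_2 | x_1 = 0) was never constrained by the objective.
tables = [
    {(): {1: 0.7, 0: 0.3}},  # p(x_1; theta^*) fit imperfectly; D has p(x_1 = 1) = 0.99
    {(1,): {1: 1.0}},        # no entry at all for the unseen prefix (0,)
]

def ancestral_sample(tables, rng):
    # Each conditional is queried on the MODEL's own prefix, not on a prefix
    # drawn from D; this is the regime about which the objective made no promise.
    x = []
    for table in tables:
        probs = table.get(tuple(x))
        if probs is None:
            # The "freak out": an off-distribution prefix, undefined conditional.
            probs = {1: 0.5}  # illustrative coin-flip fallback
        x.append(int(rng.random() < probs.get(1, 0.0)))
    return x

# A sizable fraction of samples start with x_1 = 0 and degenerate from there.
print([ancestral_sample(tables, rng) for _ in range(5)])
```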

But something should seem odd: after all, any undirected model (or any distribution, for that matter) can be decomposed with the chain rule, $p(x_1,\ldots,x_n)=\prod_j p(x_j|x_{<j})$. Why doesn’t the above argument apply to an undirected model, which I claim is superior at sampling? An answer can be given, but it involves lots of handwaving.

If an undirected model is expressed as a directed model using the chain rule, then the conditional probabilities will involve massive marginalizations. What’s more, all the conditional distributions $p(x_j|x_{<j})$ will share parameters in a very complicated way across different values of $j$. In all likelihood (and that’s the weak part of the argument), the parameterization is so complex that it is not possible to make all the objectives $E_{x_{<j}\sim D(x_{<j})}\left[E_{x_j\sim D(x_j|x_{<j})}[\log p(x_j|x_{<j})]\right]$ happy for all $j$ simultaneously; that is, the undirected model will not necessarily make $p(x_j|x_{<j})$ similar to $D(x_j|x_{<j})$ when $x_{<j}\sim D(x_{<j})$. This is why I assumed above that the little conditionals don’t share parameters.
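
Here is a sketch of what one of those chain-rule conditionals costs for the pairwise toy model from before: evaluating $p(x_j|x_{<j};W)$ means summing $\exp(G)$ over every completion of the remaining coordinates (exponential in $n-j$), and every such conditional depends on all of $W$ at once:

```python
import itertools

import numpy as np

def conditional(W, prefix, xj):
    # p(x_j = xj | x_<j; W) for G(x; W) = x^T W x. The partition function
    # cancels in the ratio, but each term still marginalizes over all x_>j.
    n = W.shape[0]

    def mass(head):
        tails = itertools.product([0, 1], repeat=n - len(head))
        return sum(np.exp(np.array(head + t) @ W @ np.array(head + t)) for t in tails)

    return mass(prefix + (xj,)) / mass(prefix)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))
print(conditional(W, (1, 0), 1))  # p(x_3 = 1 | x_1 = 1, x_2 = 0)
```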

So to summarize: directed models are worse at sampling because of the sequential nature of their sampling procedure. By sampling in sequence, the directed model is “fed” data unlike its training distribution, causing it to freak out. In contrast, sampling from an undirected model requires an expensive Markov chain, which ensures the “self-consistency” of the sample. And intuitively, since we invest more work into obtaining the sample, it must be better.
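
For contrast, here is a minimal Gibbs sampler for the same pairwise toy model; the number of sweeps is an illustrative choice, and in general the chain only approaches $p(x;\theta)$ in the limit:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))  # the pairwise model from the sketches above

def gibbs_sample(W, rng, sweeps=1000):
    # Repeatedly resample each x_i given all the others. Unlike ancestral
    # sampling, earlier coordinates keep getting revised; this is the
    # "self-consistency" referred to above.
    n = W.shape[0]
    x = rng.integers(0, 2, size=n)
    for _ in range(sweeps):
        for i in range(n):
            x[i] = 1
            g1 = x @ W @ x
            x[i] = 0
            g0 = x @ W @ x
            # p(x_i = 1 | x_{-i}) = exp(g1) / (exp(g1) + exp(g0))
            x[i] = int(rng.random() < 1.0 / (1.0 + np.exp(g0 - g1)))
    return x

print(gibbs_sample(W, rng))
```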