Larry Wasserman’s recent post about misinterpretation of p-values is a good reminder about a fundamental distinction anyone working in information theory, control or machine learning should be aware of — namely, the distinction between stochastic kernels and conditional probability distributions.
Roughly speaking, stochastic kernels are building blocks, objects that have to be interconnected in order to instantiate stochastic systems. Conditional probability distributions, on the other hand, arise only when we apply Bayes’ theorem to joint probability distributions induced by these interconnections.
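This distinction can be made concrete in a few lines of code. The following is a minimal discrete sketch (the particular numbers and the two-state, two-outcome setup are purely illustrative): a kernel is just an indexed family of distributions, and a conditional distribution only appears once we wire the kernel to a prior, form the joint, and apply Bayes’ theorem.

```python
import numpy as np

# A stochastic kernel: for each state x, a distribution over outcomes y.
# Row x is the distribution P_x; the rows are the "building blocks".
K = np.array([[0.9, 0.1],   # distribution over y when x = 0
              [0.2, 0.8]])  # distribution over y when x = 1

# On its own, K is not a conditional distribution of anything -- there is
# no joint distribution yet. Interconnecting it with a prior over states
# instantiates a stochastic system with a joint distribution:
prior = np.array([0.5, 0.5])       # distribution over states x
joint = prior[:, None] * K         # joint[x, y] = prior(x) * K[x, y]

# Only now does Bayes' theorem yield a conditional (posterior)
# distribution over states given an observed outcome:
marginal_y = joint.sum(axis=0)     # p(y)
posterior = joint / marginal_y     # posterior[x, y] = p(x | y)

print(posterior[:, 1])             # p(x | y = 1)
```

Note that changing the prior changes the posterior even though the kernel K is untouched, which is exactly the point: the kernel is a fixed building block, while the conditional distribution is a derived object that depends on the whole interconnection.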
At a very high level of abstraction, we may imagine a space $\mathsf{Y}$ of observations or outcomes and a space $\mathsf{X}$ of states or inputs. Each possible state $x \in \mathsf{X}$ induces a probability distribution over $\mathsf{Y}$ — let’s denote it by $P_x$. The interpretation is that if the state is $x$, then the probability that we observe an outcome in some set $B \subseteq \mathsf{Y}$ is $P_x(B)$. Notice that this stipulation has the flavor of a conditional statement: if A, then B. Mathematical statisticians (going back to Abraham Wald, and greatly elaborated by Lucien Le Cam and his followers) like to think of the collection $\{P_x : x \in \mathsf{X}\}$ as an experiment that reveals something about the state in $\mathsf{X}$ through a random observation in $\mathsf{Y}$. Note that $\{P_x : x \in \mathsf{X}\}$ is not a set of probability distributions, but an indexed collection — two distinct $x$’s may carry two identical $P_x$’s (which would indicate that these two $x$’s are statistically indistinguishable on the basis of observations); or, in the simplest case of a binary state space