Markov Chain Monte Carlo (MCMC)

Markov chain Monte Carlo is a variant of the Monte Carlo, where samples are no longer independent but instead are sampled from a Markov chain. This can be useful in Bayesian statistics, or when we sequentially adjust a small number of parameters for a more complex combined distribution.

We cover MCMC, its use in Bayesian statistics, Metropolis-Hastings, Gibbs Sampling, and Simulated Annealing.

Sometimes the underlying distribution $\mu$ that we wish to sample from can be complex. However, even though it may not be possible to obtain I.I.D. samples from $\mu$ , it may be possible to construct a Markov chain $(\theta_t : t\in \mathbb Z_+)$ whose stationary distribution is $\mu$ . Again if we wish to estimate

$\begin{aligned} \label{MCMC:f_theta} \mathbb E_{\theta \sim \mu} [f(\theta)] , \end{aligned}$

then we can use the estimator

$\begin{aligned} \hat{\mathbb E}_N [ f(\theta)] = \frac{1}{N} \sum_{i=1}^N f(\theta_i) \, .\end{aligned}$

This procedure of estimating in this way is called Markov Chain Monte-Carlo (MCMC).

As before, under mild conditions¹ the Markov chain is ergodic, meaning with probability $1$ ,

$\begin{aligned} \hat{\mathbb E}_N [ f(\theta)] := \frac{1}{N} \sum_{i=1}^N f(\theta_i) \xrightarrow[N\rightarrow\infty]{ } \mathbb E_{\theta \sim \mu} [ f(\theta)]\,.\end{aligned}$

Under suitable conditions² it often holds that

$\begin{aligned} \sqrt{N} \left\{ \hat{ \mathbb E}_N \left[ f(\theta) \right] - { \mathbb E}_\mu \left[ f(\theta) \right] \right\} \xrightarrow[N\rightarrow \infty ]{\mathcal D} \mathcal N (0,\sigma^2) \end{aligned}$

(where the above limit denotes convergence in distribution).
Assuming second moments remain bounded for all $N$ , this then implies

$\begin{aligned} \mathbb E \left[ \left( \hat{ \mathbb E}_N [f(\theta)] - \mathbb E_{\mu} [f(\theta)] \right)^2 \right]^{\frac{1}{2}} \sim \frac{\sigma^2}{\sqrt{N}}\, .\end{aligned}$

Again errors go down like $1/\sqrt{N}$ .

MCMC in Bayesian Statistics.

Perhaps the most important example where MCMC is used is time implement Bayes rule in Bayesian statistics. Recall that in Bayesian statistics we start with a prior distribution $p(\theta)$ ; We receive data $D$ here $P(D|\theta)$ is the probability of data $D$ under parameter $\theta$ , and we use Bayes’ rule to update to the posterior distribution

$\begin{aligned} p(\theta | D) = \frac{ p(D|\theta) p(\theta) }{ \int p(D|\phi) p(\phi) d\phi }\, .\end{aligned}$

Above, we know $p(D|\phi)$ and $p(\phi)$ (because these are usually chosen in advance) but it is rare to have a closed form expression for the integral

$\begin{aligned} \int p(D|\phi) p(\phi) d \phi \, .\end{aligned}$

This is where MCMC comes in. We can sample from $p(\theta|D)$ with Markov chain Monte Carlo. This is because we can calculate the ratio

$\begin{aligned} \label{MCMC:PD} \frac{p(\theta | D)}{p(\phi| D)} = \frac{ p(D|\theta) p(\theta) }{ p(D|\phi) p(\phi) } \, .\end{aligned}$

Note each term on the right hand side is assumed known. Given one easy way to sample from $p(\theta | D)$ is with the Metropolis-Hastings Algorithm.

Metropolis-Hastings.

Metropolis Hastings is the simplest and most popular MCMC algorithm. Suppose we want to sample from a probability distribution

$\begin{aligned} \label{MCMC:muc} \mu(\theta) = \frac{ c(\theta) }{ \int c(\phi) d \phi }\, .\end{aligned}$

Here $c(\theta)$ is known but the integral in is not. (Note in the Bayesian stats example above $c(\theta) = p(D|\theta) p(\theta)$ .)

The Metropolis-Hastings Markov chain is determined by the function $c(\theta)$ given above and a “proposal” probability distribution $q(\cdot | \theta)$ for each $\theta$ . The state of the Markov chain at time $t$ is $\theta_t$

Propose. Now propose a new value $\theta'$ with probability $q(\theta'|\theta_t)$ .
Select. Set $\begin{aligned} \theta_{t+1} = \begin{cases} \theta_t & \text{with probability } 1- \frac{ c(\theta') q(\theta' | \theta_t) }{ c(\theta_t) q(\theta_t | \theta') } , \\ \theta' & \text{otherwise.} \end{cases}\end{aligned}$

A few remarks.
1. Note the probability above could be negative in which case the probability of staying at $\theta_t$ is set to be zero.
2. Noticed we have not really specified $q(\theta'|\theta)$ . This is because it is a design choice. So long as there is a non-zero chance of visiting each $\theta \in \Theta$ then the algorithm will work.
3. A choice for $\theta'$ would be to uniformly sample from $\Theta$ . Another choice would be to uniformly sample from the neighborhood of $\theta$ . In both cases it would hold that

$\begin{aligned} q(\theta |\theta') = q(\theta' | \theta) \end{aligned}$

and thus the function $q$ will not feature in the selection step.

We now verified that the Markov chain $(\theta_t: t \in\mathbb Z_+)$ constructed is such that if $\theta$ is sampled with probability $\mu$ .

Proposition. The Metropolis-Hastings Markov chain is reversible and has a stationary distribution given by $\mu(\theta)$ .

Proof. Recall that a Markov chain with transition matrix $P(\theta| \phi)$ is reversible with stationary distribution $\mu$ if

$\begin{aligned} \mu(\theta) P(\phi | \theta) = \mu (\theta) P(\theta | \phi) \end{aligned}$

for all $\theta, \phi\in \Theta$ . Notice for $\mu(\theta)\propto c(\theta )$ as given in . The probability of being in $\theta$ and going to $\theta'$ is

$\begin{aligned} \mu(\theta) q(\theta' | \theta) \left( 1 \wedge \frac{c(\theta')q(\theta'|\theta)}{c(\theta) q(\theta|\theta')} \right)\end{aligned}$

while going from $\theta'$ to $\theta$ it is

$\begin{aligned} \mu(\theta') q(\theta | \theta') \left( 1 \wedge \frac{c(\theta)q(\theta|\theta')}{c(\theta') q(\theta'|\theta)} \right)\end{aligned}$

Notice that $\mu(\theta) \propto c(\theta)$ above both sides are equal to

$\begin{aligned} c(\theta) q(\theta | \theta' ) \wedge c(\theta') q(\theta' | \theta)\end{aligned}$

Thus, as required, the Metropolis-Hastings sampler is reversible Markov chain with stationary distribution

$\begin{aligned} \mu(\theta) = \frac{ c(\theta) }{ \int c(\phi) d \phi }\, .\end{aligned}$

$\square$

Gibbs Sampler

The Gibbs sampler is a special case of Metropolis Hastings. It applies when we can sample marginal distributions. I.e. Here $\theta$ is a vector $\theta= (\theta_i : i \in\mathcal I)$ and we can sample from the distribution of $\begin{aligned} \theta_i \sim \mu(\cdot | \theta_{-i})\end{aligned}$ (Here $\theta_{-i}=(\theta_j : j \in\mathcal I\backslash\{ i\})$ and here $\mu(\cdot | \theta_{-i})$ is the conditional distribution of $\mu$ given $\theta_{-i}$ . )

When there is independence or conditional independence between a limited set of parameters (like in a Bayesian network and/or an Ising model), then it often turns our that $P(\theta_i | \theta_{-i})$ has a simple form.

The Gibbs sampler does the following, at time $t$ in state $\theta(t)$ :

Chose an index $i$
Sample $latex
\theta’_i \sim \mu(\cdot | \theta_{-i} )
$
Set $\begin{aligned} \theta_i(t+1) = \theta'_i\end{aligned}$ and keep all other components the same, $\theta_{-i}(t+1) = \theta_{-i}(t)$ .

We have not been too specific about how $i$ is sampled. The easiest method is to one-by-one go through each index. Also random sampling will work. (Anything works really as long as the over all policy $(\theta(t),i(t))$ is Markov and we sample each index a positive proportion of the time.)

Gibbs is Metropolis-Hastings. $^\star$

It is worth noting that Gibbs sampling is really just a special case of the Metropolis-Hastings algorithm.

To see this, notice that if $\mu( \theta)$ is our target distribution and $q({ \theta}'| \theta)= \mu(\theta'_i| \theta_{-i})$ is our proposal distribution then assuming index $i$ is chosen for transferring between $\theta$ and and $\theta' = (\theta_i', \theta_{-i})$ then the Metropolis Hastings condition is such that

$\begin{aligned} \frac{c(\theta) q(\theta'| \theta) }{c(\theta') q(\theta| \theta')} & = \frac{\mu(\theta) \mu (\theta_i'| \theta_{-i})}{\mu(\theta') \mu (\theta_i| \theta'_{-i})} \\ & = \frac{\mu(\theta) \mu (\theta_i'| \theta_{-i})}{\mu(\theta') \mu (\theta_i| \theta_{-i})} \\ & = \frac{\mu(\theta) \mu (\theta_i')}{\mu(\theta') \mu (\theta_i)} =1\end{aligned}$

Thus we see that the Metropolis-Hastings rule with this choice of $\mu$ and $q$ is exactly the Gibbs sampler.

Simulated Annealing

Simulated Annealing is a further variant of Metropolis Hastings. It acts as a first order optimization algorithm. (Similar to the Cross-Entropy method it is somewhat heuristic.) Here we suppose that we want to solve an optimization problem

$\begin{aligned} \min_{\theta \in \Theta } E(\theta) \end{aligned}$

The following probability distribution, which is parameterized by $T$ , concentrates on the solution of the above optimization

$\begin{aligned} \pi_T(\theta) = \frac{e^{-E(\theta)/T}}{\sum_{\phi\in\Theta} e^{-E(\phi)/T}} \xrightarrow[T\rightarrow 0 ]{} \frac{\mathbb I [ \theta \in \Theta^\star]}{|\Theta^\star|}\, .\end{aligned}$

Here $\Theta^\star$ is the set of solutions to the above optimization. The above distribution is often interpreted in terms of physics applications, specifically, $\pi_T(\theta)$ is know as Boltzmann’s distribution, $E(\theta)$ is the energy associated with state $\theta$ and $T$ is the temperature of that state.

We can easily simulated the distribution $\pi_T(\cdot)$ with Metropolis-Hastings. To do this, we slowly decrease $T$ and then the solution should concentrate on the set of optimal parameters. Specifically, given the current state $\theta_t$ and the temperature $T_t$ , we do the following

Chose $\phi$ in some neighbourhood of $\theta$ and evaluate $E(\theta)$
Set
<img title=”\begin{aligned} \theta_{t+1} = \begin{cases} \phi & \text{if } E(\phi)

We can then let $T\rightarrow 0$ while running our chain. We will never exactly sample from the distribution $\pi_T(\cdot)$ but given $T_t$ decreases sufficiently slowly then we should get an close approximation. There are conditions that guarentee convergence to optimality, specifically that $T_t = c / \log t$ . However, these rates are considered to be too slow for practical use and so in practice much quick rates of convergence to zero are used. For this reason, despite some supporting theory, the algorithm remains somewhat of a heuristic.

References

Again the text of Kroese et al. and Assmussen and Glynn provide good text book accounts. Also the Markov chains book of Bremaud. The convergence of simulated annealing is given by Hajek.

Kroese, Dirk P., Thomas Taimre, and Zdravko I. Botev. Handbook of monte carlo methods. Vol. 706. John Wiley & Sons, 2013.

Asmussen, Søren, and Peter W. Glynn. Stochastic simulation: algorithms and analysis. Vol. 57. Springer Science & Business Media, 2007.

Brémaud, Pierre. Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Vol. 31. Springer Science & Business Media, 2013.

Hajek, Bruce. “Cooling schedules for optimal annealing.” Mathematics of operations research 13.2 (1988): 311-329.

Being fully formal here, we need some extra property that the Markov chain is irreducible.↩
Being geometrically ergodic.↩