Monte-Carlo (MC)

Due to some projects, I’ve decided to start a set of posts on Monte-Carlo and its variants. These include Monte-Carlo (MC), Markov chain Monte-Carlo (MCMC), Sequential Monte-Carlo (SMC) and Multi-Level Monte-Carlo (MLMC). I’ll probably expand these posts further at a later point.

Here we cover “vanilla” Monte-Carlo, importance sampling and self-normalized importance sampling:

The idea of Monte-Carlo simulation is quite simple. You want to evaluate the expectation of $f(\theta)$, where $\theta$ is some random variable (or set of random variables) with distribution $\mu$ and $f$ is a real-valued function. Note that the expectation is really an integral:

$\begin{aligned} \mathbb E_{\theta \sim \mu} [ f(\theta)] = \int f(\theta) d \mu(\theta)\,.\end{aligned}$

So we can think of Monte-Carlo as evaluating an integral.

If you can sample $\theta_1,\ldots,\theta_N \sim \mu$, then the expectation $\mathbb{E} [f(\theta)]$ can be approximated by

$\begin{aligned} \hat{\mathbb E}_{N} \left[ f(\theta) \right] := \frac{1}{N} \sum_{i=1}^N f(\theta_i) \, .\end{aligned}$

In particular, if the samples are i.i.d., the strong law of large numbers gives that, with probability $1$,

$\begin{aligned} \frac{1}{N} \sum_{i=1}^N f(\theta_i) \xrightarrow[N\rightarrow \infty]{} \mathbb E_{\theta \sim \mu} [ f(\theta)] \, .\end{aligned}$

Also, it holds that

$\begin{aligned} \mathbb E \left[ \left( \hat{\mathbb E}_N [f(\theta)] - \mathbb E_{\theta \sim \mu} [f(\theta)] \right)^2 \right]^{\frac{1}{2}} \leq \frac{ \mathbb V(f(\theta))^{\frac{1}{2}} }{ \sqrt{N} }\end{aligned}$

(in fact with equality for i.i.d. samples, since the estimator is unbiased and the variance satisfies $\mathbb V (a X+ bY) = a^2\mathbb V(X) + b^2\mathbb V(Y)$ for $X$ and $Y$ independent). So here we see that the root-mean-square error goes down at rate $\frac{1}{\sqrt{N}}$.
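As a rough sketch of the estimator and its error rate, the following uses only the Python standard library. The function name `mc_estimate` and the test case ($\theta \sim U[0,1]$ with $f(\theta) = \theta^2$, so that $\mathbb E[f(\theta)] = 1/3$) are illustrative assumptions, not part of the discussion above.

```python
import math
import random


def mc_estimate(f, sampler, n, rng):
    """Plain Monte-Carlo: average f over n i.i.d. draws from sampler(rng)."""
    values = [f(sampler(rng)) for _ in range(n)]
    mean = sum(values) / n
    # Sample variance of f(theta); dividing by n and taking the square
    # root gives the estimated standard error of the average.
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)


rng = random.Random(0)
# Assumed example: theta ~ U[0, 1] and f(theta) = theta^2, so E[f(theta)] = 1/3.
for n in (100, 10_000, 1_000_000):
    est, se = mc_estimate(lambda t: t * t, lambda r: r.random(), n, rng)
    print(f"N={n:>9}  estimate={est:.5f}  std.err={se:.5f}")
```

Each hundredfold increase in $N$ should shrink the reported standard error by roughly a factor of ten, in line with the $1/\sqrt{N}$ rate.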

A Classical Example. Let $\theta = (\theta_1, \theta_2)$ where $\theta_1, \theta_2 \sim U[0,1]$ are independent. Let $\begin{aligned} f(\theta ) = \mathbb I [ \theta_1^2 + \theta_2^2 \leq 1]\,.\end{aligned}$ Note that the area of the quarter circle $\{\theta \in [0,1]^2 : \theta_1^2 + \theta_2^2 \leq 1 \}$ is $\pi/4$, so $\mathbb E[f(\theta)] = \pi/4$. Then, writing $\theta^{(i)} = (\theta^{(i)}_1, \theta^{(i)}_2)$ for the $i$th sample,

$\begin{aligned} \frac{1}{N} \sum_{i=1}^N \mathbb I \left[ \big(\theta^{(i)}_1\big)^2 + \big(\theta^{(i)}_2\big)^2 \leq 1 \right] \xrightarrow[N\rightarrow\infty ]{} \frac{\pi}{4} \,.\end{aligned}$

The rate of convergence in this example is pretty atrocious when compared with other numerical methods. However, the example gets the main idea across: there is some difficult-to-calculate quantity (namely $\pi$); we generate random variables $\theta$, do a calculation to get a random variable of interest $f(\theta)$, and then repeat until we get a good average. The method is extremely simple and generalizable (to situations where other numerical methods are not readily available).
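The quarter-circle example can be sketched in a few lines of Python. The function name `estimate_pi` and the seed are arbitrary choices made for illustration.

```python
import random


def estimate_pi(n, rng):
    """Estimate pi as 4 times the fraction of uniform points in the quarter circle."""
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()   # theta = (x, y) uniform on [0, 1]^2
        if x * x + y * y <= 1.0:            # f(theta) = I[x^2 + y^2 <= 1]
            hits += 1
    return 4.0 * hits / n


rng = random.Random(1)
for n in (1_000, 100_000):
    print(n, estimate_pi(n, rng))
```

Even with $10^5$ samples we should only expect two or three correct digits, which illustrates how slow the $1/\sqrt{N}$ rate is here.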

Importance Sampling.

If we want to calculate

$\begin{aligned} \mathbb E_{\mu} \left[ Z \right] \qquad\text{where}\qquad Z=f(\theta)\end{aligned}$

we don’t need to sample from $\mu$; we can sample from another distribution $\nu$ instead (and this can help improve convergence). We can use $\nu$ instead of $\mu$ because

$\begin{aligned} \mathbb E_{\theta \sim \mu} \left[ Z\right] & = \int f(\theta) d \mu (\theta) \notag \\ & = \int f(\theta) \frac{d \mu}{d \nu} (\theta) d\nu (\theta) \notag \\ &= \mathbb E_{\theta\sim \nu} [\tilde Z] \label{MC:EZ}\end{aligned}$

where

$\begin{aligned} \tilde Z=f(\theta) \frac{d \mu}{d \nu}(\theta)\end{aligned}$

(Above $\frac{d \mu}{d \nu}(\theta)$ is the probability density function (pdf) of $\mu$ over the pdf of $\nu$ for continuous random variables, or the probability mass function of $\mu$ over that of $\nu$ for discrete random variables, and in general is the Radon-Nikodym derivative. This requires that $\nu$ puts mass wherever $\mu$ does.)

Thus when applying importance sampling, we sample $\theta_1,\ldots,\theta_N \sim \nu$ and we form the estimate

$\begin{aligned} \hat{\mathbb E}_N[\tilde Z] =\frac{1}{N}\sum_{i=1}^N f(\theta_i) \frac{d \mu}{d \nu}(\theta_i)\end{aligned}$
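To make the estimator concrete, here is a small sketch for a rare-event probability. The setting is an assumption chosen for illustration: $\mu = N(0,1)$, $f(\theta) = \mathbb I[\theta > a]$ with $a = 4$, and proposal $\nu = N(a, 1)$, for which the density ratio is $\frac{d\mu}{d\nu}(\theta) = \exp(a^2/2 - a\theta)$. The function names are hypothetical.

```python
import math
import random


def tail_prob_mc(a, n, rng):
    """Plain Monte-Carlo estimate of P(theta > a) for theta ~ mu = N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) > a for _ in range(n)) / n


def tail_prob_is(a, n, rng):
    """Importance sampling estimate of the same probability, sampling from nu = N(a, 1)."""
    total = 0.0
    for _ in range(n):
        theta = rng.gauss(a, 1.0)       # draw from the proposal nu
        if theta > a:                   # f(theta) = I[theta > a]
            # d mu / d nu = exp(-theta^2 / 2) / exp(-(theta - a)^2 / 2)
            total += math.exp(a * a / 2.0 - a * theta)
    return total / n


rng = random.Random(0)
a, n = 4.0, 100_000
exact = 0.5 * math.erfc(a / math.sqrt(2.0))   # closed form for the normal tail
print("exact              ", exact)
print("plain Monte-Carlo  ", tail_prob_mc(a, n, rng))
print("importance sampling", tail_prob_is(a, n, rng))
```

Plain Monte-Carlo sees only a handful of points beyond $a = 4$ in $10^5$ draws, so its estimate is very noisy, whereas about half of the proposal's draws land in the region of interest.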

The following lemma, although not entirely practical, gives good insight into why importance sampling can help:

Lemma [the Perfect Importance Sampler] If $Z:=f(\theta)\geq 0$ and we sample from $\nu$ with

$\begin{aligned} \frac{d\mu}{d \nu} = \frac{\mathbb E_\mu [ Z ]}{Z}\end{aligned}$

then the estimator $\tilde Z=Z\frac{d \mu}{d \nu}$ is such that

$\begin{aligned} \mathbb V_{\nu}(\tilde Z) = 0\end{aligned}$

Proof.

$\begin{aligned} \mathbb E_{\nu} [ \tilde Z^2 ] = \mathbb E_\mu \left[ Z^2 \frac{\mathbb E_\mu [Z]}{Z} \right] = \mathbb E_\mu [ Z]^2 = \mathbb E_\nu [ \tilde Z]^2 \, ,\end{aligned}$

where the last equality follows from the importance sampling identity $\mathbb E_\mu [Z] = \mathbb E_\nu [\tilde Z]$ derived above. Therefore $\mathbb V_\nu(\tilde Z) = 0$. $\square$

The above suggests that we should choose $\theta$ with probability proportional to $|Z|=|f(\theta)|$ to get low variance. Of course, we don’t know $f(\theta)$ in advance (and the normalizing constant $\mathbb E_\mu[Z]$ is the very quantity we want to compute), so we cannot sample in this way. However, in practice, any sampling mechanism that concentrates samples around the area of interest is likely to improve performance. Indeed, importance sampling can substantially improve on sampling directly from the underlying distribution $\mu$.
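A toy case where the perfect sampler is available in closed form may help. The choice here is an assumption for illustration: $\mu$ is the exponential distribution with rate $1$ and $f(\theta) = \theta$, so $\mathbb E_\mu[Z] = 1$. The proposal with density proportional to $\theta e^{-\theta}$ is a Gamma distribution with shape $2$ and scale $1$, and $\frac{d\mu}{d\nu}(\theta) = 1/\theta = \mathbb E_\mu[Z]/Z$.

```python
import random
import statistics

rng = random.Random(0)
n = 10_000

# Plain Monte-Carlo: theta ~ mu = Exp(1), Z = f(theta) = theta, so E_mu[Z] = 1.
plain = [rng.expovariate(1.0) for _ in range(n)]

# "Perfect" proposal: nu has density theta * exp(-theta), i.e. Gamma(shape=2, scale=1),
# so d mu / d nu (theta) = exp(-theta) / (theta * exp(-theta)) = 1 / theta.
weighted = []
for _ in range(n):
    theta = rng.gammavariate(2.0, 1.0)
    weighted.append(theta * (1.0 / theta))   # Z * (d mu / d nu)

print("plain:   mean", statistics.mean(plain), "variance", statistics.pvariance(plain))
print("perfect: mean", statistics.mean(weighted), "variance", statistics.pvariance(weighted))
```

Every weighted sample equals $1$ up to floating-point rounding, so the variance vanishes. Note that writing down this $\nu$ already required knowing $\mathbb E_\mu[Z] = 1$, which is why the lemma is a guide rather than a recipe.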

Self-Normalized Importance Sampling.

In importance sampling, we apply a weight $w(\theta_i) = \frac{d \mu}{d \nu}(\theta_i)$ to each sample, and we know that $\mathbb E_{\theta \sim \nu} [ w(\theta)]=1$. However, sometimes we only know these weights up to a constant (i.e. we don’t know the correct normalizing constant, which happens a lot in Bayesian statistics). In that case, we can renormalize with the following self-normalized importance sampling estimator:

$\begin{aligned} \mathbb E_{\mu} [f(\theta) ] \approx \hat{\mathbb E}[f(\theta) ] = \frac{\sum_{i=1}^N f(\theta_i ) w(\theta_i) }{\sum_{i=1}^N w(\theta_i) } \, .\end{aligned}$

So long as the weights remain bounded, the rate of convergence is comparable to that of plain Monte-Carlo.
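Here is a minimal sketch of the self-normalized estimator. The setting is assumed for illustration: the target $\mu$ is known only through the unnormalized density $e^{-\theta^2/2}$ (a standard normal with its constant withheld), the proposal is $\nu = N(0, 2^2)$, and $f(\theta) = \theta^2$, so the true answer is $1$. Dropping all constants, the weight is $w(\theta) = e^{-\theta^2/2} / e^{-\theta^2/8} = e^{-3\theta^2/8}$. The function name `snis` is hypothetical.

```python
import math
import random


def snis(f, unnorm_weight, sampler, n, rng):
    """Self-normalized importance sampling with weights known up to a constant."""
    num = 0.0
    den = 0.0
    for _ in range(n):
        theta = sampler(rng)          # draw from the proposal nu
        w = unnorm_weight(theta)      # proportional to d mu / d nu
        num += f(theta) * w
        den += w
    return num / den


def sample_proposal(rng):
    return rng.gauss(0.0, 2.0)        # nu = N(0, 2^2)


def weight(theta):
    return math.exp(-3.0 * theta * theta / 8.0)


est = snis(lambda t: t * t, weight, sample_proposal, 100_000, random.Random(0))
# Rescaling the weights by any positive constant leaves the estimate unchanged
# (up to floating-point rounding), because the constant cancels in the ratio.
est_scaled = snis(lambda t: t * t, lambda t: 7.0 * weight(t),
                  sample_proposal, 100_000, random.Random(0))
print(est, est_scaled)
```

The second call is there only to show why the unknown normalizing constant does not matter.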

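Proposition. If the weights $w(\theta_i)$ are bounded, then for all bounded functions $f$ it holds that, with $f_{\max} := \sup_\theta |f(\theta)|$ and $w_{\max} := \sup_\theta w(\theta) / \mathbb E_\nu[w(\theta)]$,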

$\begin{aligned} ||\hat{\mathbb E}[f(\theta_i) ] - \mathbb E_\mu [ f(\theta)]||_{L_2} \leq 2 \frac{f_{\max} w_{\max}}{\sqrt{N}}\, .\end{aligned}$

Proof. Note that for $\bar w (\theta_i) := w(\theta_i) / \mathbb E[w(\theta_i)]$

$\begin{aligned} \frac{\sum_{i=1}^N f(\theta_i ) w(\theta_i) }{\sum_{i=1}^N w(\theta_i) } - \frac{\sum_{i=1}^N f(\theta_i ) \bar w(\theta_i) }{ N } = \left[ \frac{\sum_{i=1}^N f(\theta_i ) \bar w(\theta_i) }{\sum_{i=1}^N \bar w(\theta_i) } \right] \left\{ 1 - \frac{\sum_i \bar w(\theta_i)}{N} \right\}\end{aligned}$

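Note that the term in square brackets is a weighted average of the values $f(\theta_i)$ with nonnegative weights, so it is bounded in absolute value by $f_{\max}$. Thus, applying this equality together with the triangle inequality, we have that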

$\begin{aligned} & ||\hat{\mathbb E}[f(\theta_i) ] - \mathbb E_\mu [ f(\theta)]||_{L_2} \notag \\ \leq & f_{\max} \left\| 1 - \frac{\sum_i \bar w(\theta_i)}{N} \right\|_{L_2} + \left\| \frac{\sum_i \bar w(\theta_i) f(\theta_i)}{N} - \mathbb E [f(\theta) ] \right\|_{L_2} \label{MCMC:Ieq}\end{aligned}$

Now notice that, since $\mathbb E_\nu [ \bar w(\theta)f(\theta) ] = \mathbb E_\mu[f(\theta)]$ , we have that

$\begin{aligned} \left\| \frac{\sum_i \bar w(\theta_i) f(\theta_i)}{N} - \mathbb E [f(\theta) ] \right\|_{L_2} = \sqrt{ \frac{\mathbb V_{\nu}(\bar w(\theta)f(\theta))}{N} } \leq \frac{w_{\max} f_{\max} }{\sqrt{N}}\,.\end{aligned}$

The above inequality applies to both terms on the right-hand side of the preceding bound (for the first term, take $f=1$). Thus we have, as required, that

$\begin{aligned} ||\hat{\mathbb E}[f(\theta_i) ] - \mathbb E_\mu [ f(\theta)]||_{L_2} \leq 2 \frac{f_{\max} w_{\max}}{\sqrt{N}}\, .\end{aligned}$

$\square$
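The bound in the Proposition can be checked empirically on a toy problem. This is only a sanity check under assumed choices: target $\mu = N(0,1)$, proposal $\nu = N(0, 2^2)$, and $f = \cos$, for which $\mathbb E_\mu[\cos\theta] = e^{-1/2}$, $f_{\max} = 1$, and the normalized weights $2e^{-3\theta^2/8}$ are bounded by $w_{\max} = 2$.

```python
import math
import random


def snis_estimate(f, weight, sampler, n, rng):
    """Self-normalized importance sampling estimate from n proposal draws."""
    num = den = 0.0
    for _ in range(n):
        theta = sampler(rng)
        w = weight(theta)
        num += f(theta) * w
        den += w
    return num / den


rng = random.Random(0)
n, reps = 200, 500
truth = math.exp(-0.5)        # E[cos(theta)] for theta ~ N(0, 1)
f_max, w_max = 1.0, 2.0       # |cos| <= 1; normalized weights <= 2

sq_err = 0.0
for _ in range(reps):
    est = snis_estimate(
        math.cos,
        lambda t: math.exp(-3.0 * t * t / 8.0),   # unnormalized d mu / d nu
        lambda r: r.gauss(0.0, 2.0),              # proposal nu = N(0, 2^2)
        n,
        rng,
    )
    sq_err += (est - truth) ** 2

rmse = math.sqrt(sq_err / reps)                   # empirical L2 error
bound = 2.0 * f_max * w_max / math.sqrt(n)        # bound from the Proposition
print(f"empirical RMSE = {rmse:.4f}, bound = {bound:.4f}")
```

The empirical error should sit well below the bound, which is crude because it uses only the sup-norm bounds on $f$ and the weights.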

References

Monte-Carlo is by now a very standard method. Buffon’s Needle and the calculation of $\pi$ by Ulam and Von Neumann and coauthors are classical early examples; see Metropolis. The texts of Kroese et al. and Asmussen and Glynn provide good textbook accounts.

Metropolis, N. “The Beginning of the Monte Carlo Method.” Los Alamos Science 15 (1987): 125-130.

Asmussen, Søren, and Peter W. Glynn. Stochastic Simulation: Algorithms and Analysis. Vol. 57. Springer Science & Business Media, 2007.

Kroese, Dirk P., Thomas Taimre, and Zdravko I. Botev. Handbook of Monte Carlo Methods. Vol. 706. John Wiley & Sons, 2013.
