Monte-Carlo (MC)

Due to some projects, I’ve decided to start a set of posts on Monte-Carlo and its variants. These include Monte-Carlo (MC), Markov chain Monte-Carlo (MCMC), Sequential Monte-Carlo (SMC) and Multi-Level Monte-Carlo (MLMC). I’ll probably expand these posts further at a later point.

Here we cover “vanilla” Monte-Carlo, importance sampling and self-normalized importance sampling:

The idea of Monte-Carlo simulation is quite simple. You want to evaluate the expectation of f(\theta) where \theta is a some random variable (or set of random variables) with distribution \mu and f is a real-valued. Note that the expectation is really an integral:

So we can think of Monte-carlo as evaluating an integral.

Given you can sample \theta_1,...,\theta_N \sim \mu then the expectation \mathbb{E} [f(\theta)] can be approximated by

In particular, if the samples are i.i.d. the strong law of large numbers gives that, with probability 1,

Also it holds that

(since the variance satisfies \mathbb V (a X+ bY) = a^2\mathbb V(X) + b^2\mathbb V(Y) for X and Y independent). So here we see the errors go down at rate \frac{1}{\sqrt{N}}.

A Classical Example. Let \theta = (\theta_1, \theta_2) where \theta_1, \theta_2 \sim U[0,1] are independent. Let Note that the area of the quarter circle \{\theta \in [0,1]^2 : \theta_1^2 + \theta_2^2 \leq 1 \} is \pi/4. Then

The rate of convergence in this example is pretty atrocious when compared with numerical methods. 1 However the example gets the main idea across: there is some difficult to calculate quantity (namely \pi), we generate random variables \theta we do a calculation to get a random variable of interest f(\theta ) and then we repeat until we get a good average. The method is extremely simple and generalizable (to situations where other numerical methods are not readily available).

Importance Sampling.

If we want to calculate

we don’t need to sample from \mu, we can sample from another distribution \nu instead (and this can help improve convergence). We can use \nu instead of \mu because


(Above \frac{d \mu}{d \nu}(\theta) is the probability density function (pdf) of \mu over the pdf of \gamma for continuous random variables or is the probability mass function of \mu over \nu for discrete random variables, and in general is the Radon-Nikodym derivative.)

Thus when applying important sampling, we sample \theta_1,...,\theta_N and we perform the estimate

The following lemma, although not entirely practical, gives good insights as to why importance sampling can help

Lemma [the Perfect Importance Sampler] If Z:=f(\theta)\geq 0 and we sample from \nu with

then the estimator \tilde Z=Z\frac{d \mu}{d \nu} is such that


where the last inequality follows by . Therefore \mathbb V_\nu(\tilde Z) = 0. \square

The above suggest that we should choose \theta with probability proportional to |Z|=|f(\theta)| to get low variance.2 Of course, we don’t know f(\theta) in advance, so we cannot sample in this way. However, in practice, any sampling mechanism that concentrates selection around the area of interest would likely have a good impact on performance. Indeed importance sampling can substantially improve selection related to sampling from the underlying distribution \mu.

Self-Normalized Importance Sampling.

In importance sampling, we apply a weight w(\theta_i) = \frac{d \mu}{d \nu}(\theta_i) to each sample, here we know that \mathbb E_{\theta \sim \nu} [ w(\theta)]=1. However, sometimes we only know these weights upto some constant (i.e. we don’t know the correct normalizing constant which happens a lot in Bayesian statistics) In that case, we can renormalize with the following self-normalized importance sample:

So long as the weights remain bounded, the rate of convergence is comparable to that of MCMC.

Proposition. If the weights w(\theta_i) are bounded then for all bounded function f it holds that

Proof. Note that for \bar w (\theta_i) := w(\theta_i) / \mathbb E[w(\theta_i)]

Note that the term in square brackets is bounded by f_{\max}. Thus applying this equality we have that

Now notice that, since \mathbb E_\nu [ \bar w(\theta)f(\theta) ] = \mathbb E_\mu[f(\theta)], we have that

The above inequality applies to both terms in (by taking f=1). Thus we have, as required, that



Monte-Carlo is by now a very standard method. Buffon’s Needle and the calculation of \pi by Ulam and Von Neumann and coauthors are classical early examples, see Metropolis. The texts of Kroese et al. and Asmussen and Glynn provide good text book accounts.

Metropolis, N. “The Beginning of the Monto-Carlo Method.” Los Alamos Science 15 (1987): 125-130.

Asmussen, Søren, and Peter W. Glynn. Stochastic simulation: algorithms and analysis. Vol. 57. Springer Science & Business Media, 2007.

Kroese, Dirk P., Thomas Taimre, and Zdravko I. Botev. Handbook of monte carlo methods. Vol. 706. John Wiley & Sons, 2013.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s