Sequential Monte Carlo (SMC)

Sequential Monte-Carlo is a general method of sampling from a sequence of probability distributions \hat \eta_1,...,\hat \eta_t.

Here, we have the update equations

Notice if we take W_t(x) =\hat \eta_t(x)/\eta_t(x) then any sequence of distributions \hat \eta_1,...,\hat \eta_t can be realized by the above recursion.1 Except special cases, we cannot calculate the integrals above when the state space \mathcal X is infinite (or simply large).

However, we can approximately sample from the required distribution and use that as a proxy. We can then update with the same equations as above replacing the continuous \eta_t with the discrete \eta^N_t which is achieved by taking N samples. This is called Sequential Monte-Carlo (SMC).

More precisely, SMC does the following:

1) Resample. For i=1,...,N

2) Reweight.

where

It is worth noting that the resampled RVs x^i_t from the distribution \hat \eta_t and thus are a random subset of \{\hat x^i_t : i=1,..,N\}. Since we then apply transitions P(\hat x|x) we should, in general, recover N distinct points. We assume that it is easy to sample from \hat \eta_1. Further notice, the algorithm is always dealing in ratios of the distributions \eta_t. Thus we do not have to specify normalizing constants for these distributions if we do not know them (which is often the case in the world of Bayesian statistics). A final important point, is that although we assume \hat x^i_t to be a state in the set \mathcal X. We could allow it to be the set of previous states \hat{x}^i_t = ( \hat{x}^i_s : s\leq t). This is what we do in the hidden Markov models (discussed below).


Roughly why this works. Although we can say more precisely why things work out for this algorithm, let’s sketch out why it works. Suppose that \eta_t^N is a good approximation of \eta_t. So much so that we assume that \eta^N_t=\eta_t (and thus W_t(x) \propto \eta_t(x)). Notice that from infinitely many resamples from \hat \eta_t and P gives:

We can apply the strong law of large numbers to each of the two sums above. Notice that

This holds for f and f\equiv 1. Thus applying to , we see that

Thus we see that in the limit where N is large, we should be sampling from the correct distribution.

Hidden Markov Models

As an example SMC can be used for Hidden Markov models. Suppose that (\hat x_0, ...., \hat x_T) us a Markov chain with \hat x_0 \sim \lambda and transition distributions P(\hat x | x). Suppose that we receive observations O_t as a function of \hat x_t. That is I.e. the distribution of O_t is conditionally independent of (\hat x_0,...,\hat x_T) when we condition on \hat x_t.

Like with the Kalman filter we wish to calculate

  • \eta_t(x) := p(x_t | o_0,....,o_{t-1})

  • \hat \eta_t(x) := p(x_t | o_0,....,o_{t})

  • Z_t:=p(y_0,...,y_t)

Letting W_t(x_t) = q(o_t| x_t) and \eta_0(x)\sim \lambda and \eta_0 (x) =\lambda(x) and Z_0= \int Q_0(x) \eta_0(dx), holds that

If we are given P(x'|x) and q(o | x), and if the state space of our Markov chain \mathcal X is finite, then we can calculate all the distributions above. However, this is not possible when \mathcal X is not finite and instead we can for instance use Sequential Monte-Carlo which we define next.

Convergence Proof.

We now prove the convergence of SMC (in L_2 norm). Just like Monte-Carlo the standard deviation error goes down at a rate 1/\sqrt{N}. We require the weights W_t(x) to remain bounded by some value W_{\max}.

Theorem. For all bounded continuous f it holds that

where c_t is a positive constant.

Proof. The proof proceeds by induction applying the same bounds that we proved from standard Monte-Carlo and self-normalized importance sampling (cf. and Proposition [MC:SNI]).

Notice at time t=0, \hat x_0^i \sim \hat \eta_0. Thus by standard Monte-Carlo bounds:

So the result holds at t=0.

We now suppose the induction hypothesis that the result holds at time t and we prove that it holds at time t+1.

First notice that, similar to self-normalized importance sampling, we can replace the self-normalized sum with a non-self-normalized sum. Specifically,

Note that the term in square bracket is bounded by f_{\max}. Thus, recalling that the L_2 norm is defined by \|X \|_{L_2} := \mathbb E [X^2]^{1/2} , In the final inequality above, we now that \hat \eta_{t+1}(\hat x) = \int W(\hat x)P(\hat x | x) d \hat \eta_t (x).

We analyze the term , noting that the term is the same when we take f=1. Now note that

Thus for we have

The first term in follows by the same argument that we used to derive . The second term follows by our induction hypothesis applied to the function \tilde f(x) = \int f(\hat x) W( \hat x) P(\hat x|x)d\hat x. Thus, applying the above inequality to (and ), we see that

Thus we see that the require bound holds at time t+1 with This completes the induction step and the proof. \square


  1. Here we have the caveat that we require \eta_t(x)>0 for any \hat \eta_t(x)>0.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: