# Sequential Monte Carlo (SMC)

Sequential Monte-Carlo is a general method of sampling from a sequence of probability distributions $\hat \eta_1,...,\hat \eta_t$.

Here, we have the update equations

$$\hat \eta_t(dx) = \frac{W_t(x)\, \eta_t(dx)}{\int_{\mathcal X} W_t(x')\, \eta_t(dx')}\,, \qquad \eta_{t+1}(d\hat x) = \int_{\mathcal X} P(\hat x \mid x)\, \hat \eta_t(dx)\,.$$

Notice that if we take $W_t(x) =\hat \eta_t(x)/\eta_t(x)$ then any sequence of distributions $\hat \eta_1,...,\hat \eta_t$ can be realized by the above recursion.1 Except in special cases, we cannot calculate the integrals above when the state space $\mathcal X$ is infinite (or simply very large).

However, we can approximately sample from the required distribution and use those samples as a proxy. We then update with the same equations as above, replacing the distribution $\eta_t$ with the empirical distribution $\eta^N_t$ obtained from $N$ samples. This is called Sequential Monte-Carlo (SMC).
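As a one-step illustration of the reweighting idea, suppose (our own toy choice, not from the text) that $\eta = N(0,1)$ and $W(x) = e^{-(x-1)^2}$; completing the square shows $\hat \eta(x) \propto W(x)\,\eta(x)$ is $N(2/3, 1/3)$, and a self-normalized weighted average of samples from $\eta$ approximates expectations under $\hat \eta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# eta = N(0, 1); W(x) = exp(-(x - 1)^2), so hat_eta is proportional to
# W * eta, which is the normal N(2/3, 1/3).
N = 200_000
x = rng.normal(0.0, 1.0, size=N)   # N samples from eta
W = np.exp(-(x - 1.0) ** 2)        # unnormalized weights W(x)
w = W / W.sum()                    # self-normalized weights

mean_est = float(np.sum(w * x))    # approximates E_{hat_eta}[x] = 2/3
```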

More precisely, SMC does the following:

1) Resample. For $i=1,...,N$, sample
$$x^i_t \sim \hat \eta^N_t \qquad \text{and then} \qquad \hat x^i_{t+1} \sim P(\,\cdot \mid x^i_t)\,.$$

2) Reweight. Set
$$\hat \eta^N_{t+1} = \sum_{i=1}^N w^i_{t+1}\, \delta_{\hat x^i_{t+1}}\,,$$

where
$$w^i_{t+1} = \frac{W_{t+1}(\hat x^i_{t+1})}{\sum_{j=1}^N W_{t+1}(\hat x^j_{t+1})}\,.$$

It is worth noting that the resampled RVs $x^i_t$ are drawn from the distribution $\hat \eta^N_t$ and thus form a random subset of $\{\hat x^i_t : i=1,..,N\}$. Since we then apply the transition $P(\hat x\mid x)$, we should, in general, recover $N$ distinct points. We assume that it is easy to sample from $\hat \eta_1$. Further, notice that the algorithm only ever deals in ratios of the distributions $\eta_t$. Thus we do not have to specify normalizing constants for these distributions if we do not know them (which is often the case in the world of Bayesian statistics). A final important point is that, although we take $\hat x^i_t$ to be a state in the set $\mathcal X$, we could instead let it be the sequence of previous states $\hat{x}^i_t = ( \hat{x}^i_s : s\leq t)$. This is what we do for hidden Markov models (discussed below).
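The resample, propagate, reweight loop above can be sketched as follows. This is a minimal sketch under our own interface choices; the names `sample_init`, `transition`, and `weight` are hypothetical, and the toy model at the bottom is ours:

```python
import numpy as np

def smc(sample_init, transition, weight, T, N, rng):
    """Minimal SMC sketch. Particles x_hat with self-normalized weights w
    represent hat_eta^N_t at each time t = 0, 1, ..., T."""
    x_hat = sample_init(N, rng)              # hat_x^i_0, with equal weights
    w = np.full(N, 1.0 / N)
    for t in range(1, T + 1):
        idx = rng.choice(N, size=N, p=w)     # resample: x^i ~ hat_eta^N_{t-1}
        x_hat = transition(x_hat[idx], rng)  # propagate: hat_x^i_t ~ P(. | x^i)
        W = weight(t, x_hat)                 # unnormalized weights W_t(hat_x^i_t)
        w = W / W.sum()                      # reweight: gives hat_eta^N_t
    return x_hat, w

# Toy run: random-walk transitions with weights pulling particles toward 0.
# The model is symmetric about 0, so the filtered mean should be near 0.
rng = np.random.default_rng(1)
x_hat, w = smc(
    sample_init=lambda N, rng: rng.normal(0.0, 1.0, size=N),
    transition=lambda x, rng: 0.5 * x + rng.normal(0.0, 1.0, size=x.shape),
    weight=lambda t, x: np.exp(-0.5 * x ** 2),
    T=10, N=5_000, rng=rng,
)
mean_est = float(np.sum(w * x_hat))
```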

Roughly why this works. Although we can say more precisely why things work out for this algorithm, let's sketch why it works. Suppose that $\hat \eta_t^N$ is a good approximation of $\hat \eta_t$; so good that we assume $\hat \eta^N_t = \hat \eta_t$ (and thus resampling draws exactly from $\hat \eta_t$). Notice that resampling from $\hat \eta_t$ and then applying $P$ gives i.i.d. samples $\hat x^i_{t+1} \sim \eta_{t+1}$, and the reweighted estimate of $\hat \eta_{t+1}(f) := \int f \, d\hat \eta_{t+1}$ is

$$\hat \eta^N_{t+1}(f) = \frac{\frac{1}{N}\sum_{i=1}^N W_{t+1}(\hat x^i_{t+1})\, f(\hat x^i_{t+1})}{\frac{1}{N}\sum_{j=1}^N W_{t+1}(\hat x^j_{t+1})}\,.$$

We can apply the strong law of large numbers to each of the two sums above. Notice that

$$\frac{1}{N}\sum_{i=1}^N W_{t+1}(\hat x^i_{t+1})\, f(\hat x^i_{t+1}) \xrightarrow{\; N\to\infty \;} \int W_{t+1}(\hat x)\, f(\hat x)\, \eta_{t+1}(d\hat x)\,.$$

This holds both for $f$ and for $f\equiv 1$. Thus, applying this to the numerator and the denominator above, we see that

$$\hat \eta^N_{t+1}(f) \xrightarrow{\; N\to\infty \;} \frac{\int W_{t+1}(\hat x)\, f(\hat x)\, \eta_{t+1}(d\hat x)}{\int W_{t+1}(\hat x)\, \eta_{t+1}(d\hat x)} = \hat \eta_{t+1}(f)\,.$$

Thus we see that in the limit where $N$ is large, we should be sampling from the correct distribution.
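To see this limit concretely, here is a case where everything can be computed exactly: on a finite state space the update-equation integrals are just matrix-vector products, and the weighted particle distribution should approach the exact $\hat \eta_t$. The chain and the weights below are our own toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy finite chain: P[i, j] = P(next = j | current = i), with a fixed
# weight W per state applied at every step.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
W = np.array([1.0, 2.0, 0.5])
hat_eta = np.array([1.0, 1.0, 1.0]) / 3.0     # exact hat_eta_0 (uniform)

N = 100_000
particles = rng.integers(0, 3, size=N)        # hat_x^i_0 ~ hat_eta_0
weights = np.full(N, 1.0 / N)

for t in range(3):
    # Exact recursion: propagate through P, then reweight by W.
    eta = hat_eta @ P
    hat_eta = W * eta / np.sum(W * eta)
    # SMC: resample, propagate each particle, reweight.
    idx = rng.choice(N, size=N, p=weights)
    cdfs = np.cumsum(P[particles[idx]], axis=1)           # per-particle CDFs
    particles = (cdfs > rng.random(N)[:, None]).argmax(axis=1)
    weights = W[particles]
    weights = weights / weights.sum()

# Weighted particle frequencies vs the exact distribution hat_eta_3.
smc_dist = np.array([weights[particles == s].sum() for s in range(3)])
```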

## Hidden Markov Models

As an example, SMC can be used for hidden Markov models. Suppose that $(\hat x_0, ...., \hat x_T)$ is a Markov chain with $\hat x_0 \sim \lambda$ and transition distribution $P(\hat x \mid x)$. Suppose that we receive observations $O_t$ whose distribution $q(o_t \mid \hat x_t)$ depends only on $\hat x_t$; i.e., $O_t$ is conditionally independent of $(\hat x_0,...,\hat x_T)$ when we condition on $\hat x_t$.

As with the Kalman filter, we wish to calculate

• $\eta_t(x) := p(x_t | o_0,....,o_{t-1})$

• $\hat \eta_t(x) := p(x_t | o_0,....,o_{t})$

• $Z_t:=p(o_0,....,o_{t})$

Letting $W_t(x_t) = q(o_t\mid x_t)$ and $\eta_0 (x) =\lambda(x)$, it holds that

$$\hat \eta_t(x) = \frac{W_t(x)\, \eta_t(x)}{\int W_t(x')\, \eta_t(dx')}\,, \qquad \eta_{t+1}(x') = \int P(x' \mid x)\, \hat \eta_t(dx)\,, \qquad Z_t = Z_{t-1} \int W_t(x)\, \eta_t(dx)\,,$$

with $Z_0 = \int W_0(x)\, \eta_0(dx)$. That is, the filtering recursions for the hidden Markov model are exactly the update equations above.

If we are given $P(x'|x)$ and $q(o | x)$, and if the state space of our Markov chain $\mathcal X$ is finite, then we can calculate all the distributions above. However, this is not possible when $\mathcal X$ is not finite, and instead we can, for instance, use Sequential Monte-Carlo as defined above.
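For a linear-Gaussian hidden Markov model, the filtering means can be computed exactly with the Kalman filter mentioned above, which gives a useful check on the SMC approximation. The following sketch (the model parameters are our own choice) runs both on the same simulated observations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: x_0 ~ N(0, 1); x_{t+1} = a x_t + N(0, s2); o_t = x_t + N(0, r2).
a, s2, r2, T = 0.9, 0.5, 0.5, 10

# Simulate a state trajectory and its observations.
x = rng.normal(0.0, 1.0)
obs = []
for t in range(T + 1):
    obs.append(x + rng.normal(0.0, np.sqrt(r2)))   # o_t ~ q(. | x_t)
    x = a * x + rng.normal(0.0, np.sqrt(s2))       # x_{t+1} ~ P(. | x_t)

# Kalman filter: exact mean of hat_eta_t = p(x_t | o_0, ..., o_t).
m, v = 0.0, 1.0                       # moments of eta_0 = lambda
kf_means = []
for o in obs:
    k = v / (v + r2)                  # Kalman gain
    m, v = m + k * (o - m), (1 - k) * v
    kf_means.append(m)
    m, v = a * m, a * a * v + s2      # predict: hat_eta_t -> eta_{t+1}

# Bootstrap particle filter with weights W_t(x) = q(o_t | x).
N = 20_000
parts = rng.normal(0.0, 1.0, size=N)              # hat_x^i_0 ~ eta_0
pf_means = []
for o in obs:
    w = np.exp(-0.5 * (o - parts) ** 2 / r2)      # reweight by the likelihood
    w = w / w.sum()
    pf_means.append(float(np.sum(w * parts)))     # estimate of E[x_t | o_0..o_t]
    idx = rng.choice(N, size=N, p=w)              # resample
    parts = a * parts[idx] + rng.normal(0.0, np.sqrt(s2), size=N)  # propagate
```

With $N$ in the tens of thousands the two sequences of filtering means agree closely.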

## Convergence Proof

We now prove the convergence of SMC (in the $L_2$ norm). Just like standard Monte-Carlo, the standard-deviation error goes down at a rate of $1/\sqrt{N}$. We require the weights $W_t(x)$ to remain bounded by some constant $W_{\max}$.

Theorem. For all bounded continuous $f$ it holds that

$$\big\| \hat \eta^N_t(f) - \hat \eta_t(f) \big\|_{L_2} \leq \frac{c_t\, f_{\max}}{\sqrt N}\,,$$

where $c_t$ is a positive constant and $f_{\max} := \sup_x |f(x)|$.

Proof. The proof proceeds by induction, applying the same bounds that we proved for standard Monte-Carlo and self-normalized importance sampling (cf. Proposition [MC:SNI]).

Notice that at time $t=0$ the samples $\hat x_0^i \sim \hat \eta_0$ are i.i.d. Thus by standard Monte-Carlo bounds:

$$\Big\| \frac{1}{N} \sum_{i=1}^N f(\hat x^i_0) - \hat \eta_0(f) \Big\|_{L_2} \leq \frac{f_{\max}}{\sqrt N}\,.$$

So the result holds at $t=0$.

We now suppose the induction hypothesis that the result holds at time $t$ and we prove that it holds at time $t+1$.

First notice that, similar to self-normalized importance sampling, we can replace the self-normalized sum with a non-self-normalized sum. Specifically, writing $\eta^N_{t+1}(g) := \frac{1}{N}\sum_{i=1}^N g(\hat x^i_{t+1})$ and $W := W_{t+1}$,

$$\hat \eta^N_{t+1}(f) - \hat \eta_{t+1}(f) = \frac{\eta^N_{t+1}(Wf) - \eta_{t+1}(Wf)}{\eta_{t+1}(W)} + \left[ \frac{\eta^N_{t+1}(Wf)}{\eta^N_{t+1}(W)} \right] \frac{\eta_{t+1}(W) - \eta^N_{t+1}(W)}{\eta_{t+1}(W)}\,.$$

Note that the term in the square brackets is the self-normalized sum $\hat \eta^N_{t+1}(f)$ and so is bounded by $f_{\max}$. Thus, recalling that the $L_2$ norm is defined by $\|X \|_{L_2} := \mathbb E [X^2]^{1/2}$, the triangle inequality gives

$$\big\| \hat \eta^N_{t+1}(f) - \hat \eta_{t+1}(f) \big\|_{L_2} \leq \frac{\big\| \eta^N_{t+1}(Wf) - \eta_{t+1}(Wf) \big\|_{L_2}}{\eta_{t+1}(W)} + f_{\max}\, \frac{\big\| \eta^N_{t+1}(W) - \eta_{t+1}(W) \big\|_{L_2}}{\eta_{t+1}(W)}\,.$$

(Here we use that $W(\hat x)\, \eta_{t+1}(\hat x) = W(\hat x) \int P(\hat x \mid x)\, \hat \eta_t (dx)$ is, up to normalization, the density of $\hat \eta_{t+1}$.)

We analyze the term $\| \eta^N_{t+1}(Wf) - \eta_{t+1}(Wf) \|_{L_2}$, noting that the second term above is of the same form with $f \equiv 1$. Now note that

$$\eta^N_{t+1}(Wf) - \eta_{t+1}(Wf) = \Big[ \eta^N_{t+1}(Wf) - \hat \eta^N_t(\tilde f) \Big] + \Big[ \hat \eta^N_t(\tilde f) - \hat \eta_t(\tilde f) \Big]\,,$$

where $\tilde f(x) := \int f(\hat x)\, W(\hat x)\, P(\hat x \mid x)\, d\hat x$, so that $\eta_{t+1}(Wf) = \hat \eta_t(\tilde f)$. Thus we have

$$\big\| \eta^N_{t+1}(Wf) - \eta_{t+1}(Wf) \big\|_{L_2} \leq \frac{W_{\max}\, f_{\max}}{\sqrt N} + \frac{c_t\, W_{\max}\, f_{\max}}{\sqrt N}\,.$$

The first term follows by the same argument that we used at time $t=0$: conditional on the particles at time $t$, the $\hat x^i_{t+1}$ are i.i.d. with mean $\hat \eta^N_t(\tilde f)$, and $|Wf| \leq W_{\max} f_{\max}$. The second term follows by our induction hypothesis applied to the function $\tilde f$, which satisfies $|\tilde f| \leq W_{\max} f_{\max}$. Thus, applying the above inequality (for $f$ and for $f \equiv 1$) to the earlier decomposition, we see that

$$\big\| \hat \eta^N_{t+1}(f) - \hat \eta_{t+1}(f) \big\|_{L_2} \leq \frac{2 (1 + c_t)\, W_{\max}\, f_{\max}}{\eta_{t+1}(W_{t+1})\, \sqrt N}\,.$$

Thus we see that the required bound holds at time $t+1$ with $c_{t+1} = 2(1 + c_t)\, W_{\max} / \eta_{t+1}(W_{t+1})$. This completes the induction step and the proof. $\square$
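The $1/\sqrt N$ rate in the theorem can be checked empirically in the one-step self-normalized setting (our own toy choice of $\eta$, $W$, and target): increasing the sample size a hundredfold should cut the root-mean-square error by roughly a factor of ten.

```python
import numpy as np

rng = np.random.default_rng(4)

def rmse(N, repeats=300):
    """RMSE of the self-normalized estimate of E_{hat_eta}[x], where
    eta = N(0, 1), W(x) = exp(-(x - 1)^2), and hence hat_eta = N(2/3, 1/3)."""
    errs = []
    for _ in range(repeats):
        x = rng.normal(0.0, 1.0, size=N)
        w = np.exp(-(x - 1.0) ** 2)
        errs.append(np.sum(w * x) / w.sum() - 2.0 / 3.0)
    return float(np.sqrt(np.mean(np.square(errs))))

# 100x more samples should shrink the RMSE by about a factor of 10.
r_small, r_large = rmse(100), rmse(10_000)
```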

1. Here we have the caveat that we require $\eta_t(x)>0$ whenever $\hat \eta_t(x)>0$.