Mixing Times

In loose terms, the mixing time is the amount of time to wait before you can expect a Markov chain to be close to its stationary distribution. We give an upper bound for this.

We consider the following setup:

We consider a discrete time Markov chain on finite state space $\mathcal X$ and transition matrix $P$ .
We assume that the Markov chain is irreducible and aperiodic and has stationary distribution, $\pi$ .
We will, at the end, assume that our Markov chain is reversible,¹ that is for all $x,y\in\mathcal X$ it satifies detailed balance

$\pi_x P_{xy} = \pi_y P_{yx}.$

Here is a rough description of the main idea. We could diagonalize $P$ and get its eigenvalues $\lambda_1,...,\lambda_n$ , then the entries of $P=A^{-1}\Lambda A$ look like

$P_{xy}^n=\sum_i \alpha_i \lambda_i^n.$

where $\alpha_i$ could be perhaps be polynomial in $n$ . In this expression the stationary distribution $\pi_y$ is the coefficient of the eigenvalue of value $1$ . All other terms go to zero. The one that goes slowest is the largest eigenvalue not equal to $1$ . This determines the rate at which the Markov chain reaches stationarity. We now make this more formal.

We want to compare the marginal distribution of the Markov chain at a fixed time to the stationary distribution.

For two probability distribution $\nu$ and $\pi$ the total variation distance is

$|| \nu - \pi ||_{TV} = \frac{1}{2} \sum_{x\in\mX} | \nu_x - \pi_x| .$

For each $\epsilon>0$ , the mixing time with respect to the total variation norm is

$\tau_{{TV}}(\epsilon) = \min \Big\{ n \geq 0 : \Big|\Big| P^n_{x,\cdot} - \pi \Big|\Big|_{TV} \leq \epsilon, \forall x\in\mX \Big\}.$

We prove some results about a mixing times with respect to a norm which is much more taylored to our chain; we then go back to total variation.

For $f:\mathcal X\rightarrow\mathbb R$ and $g:\mathcal X\rightarrow\mathbb R$ , we define the norm $|| f ||_{\pi}$ and inner product $<f,g>_\pi$ by

$|| f ||_{\pi}^2= \sum_{x\in\mX} \pi_x f(x)^2,\qquad <f,g>_\pi = \sum_{x\in\mX} \pi_x f_x g_x.$

Note that the the distance $|| \cdot ||_{\pi}$ is convex. So in the above definition, once we have reached the mixing time from each initial state $x\in\mathcal X$ , the chain is mixed from all initial distributions.

For the following proposition, we define

$f^n_x(y) = \frac{P^n_{xy}}{\pi_y}\qquad\text{and}\qquad \tau_{\pi}(\epsilon) = \min \Big\{ n \geq 0 : \Big|\Big| f^n_x - 1 \Big|\Big|_{\pi} \leq \epsilon, \forall x\in\mX \Big\}.$

The mixing time $\tau_\pi(\epsilon)$ is intended to give the first time that the relative mass between $P_{xy}^n$ and $\pi_x$ equate within an $\epsilon$ error. The proof is boarders on being a direct consequence of our definitions and assumptions.

Prop 1. $\tau_{\pi}(\epsilon) \leq \frac{1}{1- ||{P}||} \log \frac{1}{\epsilon \sqrt{\pi_{\min}}} \qquad\text{ where }\quad \pi_{\min}= \min_i \pi_i\quad\text{ and }\quad || {P}|| = \sup_{\substack{f:\mX\rightarrow \bR\\ \bE f =0 }} \frac{||{P} f ||_{2,\pi}}{||f||_{2,\pi}}.\label{PNorm}$

Proof. Observe that $\sum_y P_{zy}f^n_x(y) = \sum_y \frac{P_{zy}}{\pi_y} P^n_{xy}= \sum_y P^n_{xy}\frac{P_{yz}}{\pi_z} = \frac{P^{n+1}_{xz}}{\pi_z} = f_x^{n+1}(z).$

So, in short, $Pf^n_x=f_x^{n+1}$ . Thus, iteratively applying this and the definitions of $||P||_\pi$ we have that $|| f_x^{n} - 1 ||_\pi = || P (f_x^{n-1} - 1) ||_\pi \leq || P ||_\pi \cdot || f_x^{n-1} -1 ||_\pi \leq ... \leq || P ||^n_\pi \cdot || f_x^{0} -1 ||_\pi \leq ||P||^n_\pi \cdot \frac{1}{\sqrt{\pi_{\min}}} .$ The final inequality is a straightforward bound on $|| f^0_x -1 ||^2_\pi$ . Thus $\tau_\pi$ has occured before for any $n$ satisfying

Screenshot 2019-04-18 at 20.12.13.png

rearranging for $n$ and using the bound $\log x\geq (x-1)$ gives the above result. $\square$

The above bound gives the main flavour, we expect convergence to stationarity to be exponentially fast. Still, there is work left: our distance metric depends on our stationary distribution and we need to guarentee that $||P||_\pi < 1$ . We will show that $||P||_\pi$ is an eigenvalue of $P$ , in particular, the magnitude of the largest eigenvalue (excluding the trival unit eigenvalue that all Markov chains must have).

We use the following lemma (which we don’t currently prove). The result for the better part is a statement of Spectral Theorem, which states an analogous result for symmetric matrices.² As a consequence, the lemma below involves making the observation that for a reversible chain $S_{xy}=\pi_x^{1/2}P_{xy}\pi_y^{-1/2}$ is a symmetric matrix.

Lemma 1. For the transition matrix of a reversible Markov chain, eigenvalues of are all real valued each with magnitude less than one (with one attaining this value) and their eigenvectors form an orthogonal basis (with repect to $||\cdot ||_\pi$ ).

However, we use it to prove the following lemma

Lemma 2.

$|| P ||_\pi = \lambda_{\max}\qquad \text{ where }\qquad \lambda_{\max}= \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } P \text{ and } \lambda \neq 1\}.$

Proof. The definition of $||P||_\pi$ , , is defined over the supremum of $f$ with $\mathbb E f =0$ . We write $f$ with repect to the orthogonal basis of eigenvectors $\{\nu^x\}_{x\in\mathcal X}$ , we have $f=\sum_x \phi_x \nu^x$ , Since $\mathbb E f = <f, 1>_\pi = 0$ , the componentent of $\phi$ corresponding to the largest eigenvalue must be zero, but all others may be chosen freely. Now, since each $\nu^x$ is an eigenvector with value $\lambda_x$ for $P$ , we have

$\frac{|| Pf ||^2_\pi}{||f||^2_\pi} = \frac{|| \sum_x \phi_x\lambda_x\nu^x||^2_\pi}{ || \sum_x \phi_x \nu^x ||^2_\pi} = \frac{\sum_x \phi_x^2 \lambda_x^2}{\sum_x \phi_x^2}.$

Clearly this is maximized by chosing all off the weight of $\phi_x$ on the eigenvalue of largest magnitude (excluding the unit one which we know must have weight be zero).

The above lemma extracts a definition for $||P||_\pi$ that can now in prinicple calculated, e.g. by a diagonalizing $P$ . This improves Proposition [Mix:Prop] but the metric used to define out mixing time is still depends on the stationary distribution $\pi$ . We can work with the Total Varaition norm thanks to the following lemma

$\tau_{TV}(\epsilon) \leq \tau_\pi (\epsilon).$

Basically $L^1$ norm is less than the $L^2$ norm (by Jensen’s for instance), so for $f_x^n(y)=P^n_{xy}/\pi_y$ we have $|| P^n_{x, \cdot} -\pi ||_{TV} = \frac{1}{2}\sum_{y\in\mX} \pi_y \Big| f^n_{x}(y) - 1 \Big| \leq \left( \sum_{y\in\mX} \pi_y \Big( f^n_{x}(y) - 1 \Big)^2 \right)^{\frac{1}{2}}= \Big|\Big| f^n_x - 1 \Big|\Big|_{\pi}.$

$\square$

Putting together the last two lemmas and Proposition [Mix:Prop], we have the following result.

Theorem. Screenshot 2019-04-18 at 20.15.20.png where $\lambda_{\max}= \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } P \text{ and } \lambda \neq 1\}$ and $\pi_{\min}= \min_i \pi_i$ .

Literature

The arguments here are based on reading the two sources:

Montenegro, Ravi, and Prasad Tetali. “Mathematical aspects of mixing times in Markov chains.” Foundations and Trends® in Theoretical Computer Science 1.3 (2006): 237-354.

Levin, David A., and Yuval Peres. Markov chains and mixing times. Vol. 107. American Mathematical Soc., 2017.

There is a wel developed theory for non-reversible chains, however, everything is way cleaner for reversible chains plus upper and lower bounds for mixing match.↩
Basically, symmetric matrices are just a strech of space with repect to the right orthogonal basis.↩

Share this:

Leave a comment Cancel reply