Mixing Times

In loose terms, the mixing time is how long you must wait before you can expect a Markov chain to be close to its stationary distribution. We give an upper bound on it.

We consider the following setup:

  • We consider a discrete-time Markov chain on a finite state space \mathcal X with transition matrix P.
  • We assume that the Markov chain is irreducible and aperiodic and has stationary distribution \pi.
  • We will, at the end, assume that our Markov chain is reversible,1 that is, for all x,y\in\mathcal X it satisfies detailed balance (checked numerically in the sketch after this list):

\pi_x P_{xy} = \pi_y P_{yx}.
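As a sanity check of this setup, here is a minimal numerical sketch using numpy and a toy three-state birth-death chain. Both the chain and the code are illustrative assumptions of mine, not taken from the sources below; birth-death chains always satisfy detailed balance, so the chain is reversible.

```python
import numpy as np

# Toy birth-death chain on three states (illustrative assumption).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.00, 0.50, 0.50]])

# Stationary distribution: the left eigenvector of P for eigenvalue 1,
# normalised to sum to one.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi /= pi.sum()
print(pi)                                   # [0.2 0.4 0.4]

# Detailed balance: pi_x P_xy == pi_y P_yx for all x, y.
flows = pi[:, None] * P
assert np.allclose(flows, flows.T)
```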

Here is a rough description of the main idea. We could diagonalize P as P=A^{-1}\Lambda A and get its eigenvalues \lambda_1,...,\lambda_{|\mathcal X|}; then the entries of P^n=A^{-1}\Lambda^n A look like

P^n_{xy} = \pi_y + \sum_{i : \lambda_i \neq 1} \alpha_i \lambda_i^n,

where the \alpha_i could perhaps be polynomial in n. In this expression the stationary distribution \pi_y is the coefficient of the eigenvalue of value 1. All other terms go to zero. The one that goes slowest is the one with the largest eigenvalue not equal to 1. This determines the rate at which the Markov chain reaches stationarity. We now make this more formal.
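To make the heuristic concrete, here is a small sketch (again with numpy and the toy chain, both my assumptions) showing that the worst-case error |P^n_{xy} - \pi_y| decays at the rate of the second-largest eigenvalue magnitude:

```python
import numpy as np

# Toy reversible chain from before, with stationary distribution pi.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.00, 0.50, 0.50]])
pi = np.array([0.2, 0.4, 0.4])

lam = np.linalg.eigvals(P)                  # eigenvalues 1.0, 0.5, -0.25
lam2 = np.sort(np.abs(lam))[-2]             # second-largest magnitude

# max_xy |P^n_xy - pi_y| shrinks like lam2^n, as the heuristic predicts.
for n in (1, 5, 10, 20):
    err = np.abs(np.linalg.matrix_power(P, n) - pi).max()
    print(n, err, lam2 ** n)
```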

We want to compare the marginal distribution of the Markov chain at a fixed time to the stationary distribution.

  • For two probability distributions \nu and \pi the total variation distance is

|| \nu - \pi ||_{TV} = \frac{1}{2} \sum_{x\in\mathcal X} | \nu_x - \pi_x |.

  • For each \epsilon>0, the mixing time with respect to the total variation norm is (both quantities are sketched numerically after this list)

\tau_{TV}(\epsilon) = \min\big\{ n \geq 0 : \max_{x\in\mathcal X} || P^n_{x\cdot} - \pi ||_{TV} \leq \epsilon \big\}.
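A minimal numerical sketch of both definitions, assuming numpy and the toy chain from above:

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.00, 0.50, 0.50]])
pi = np.array([0.2, 0.4, 0.4])

def tv_distance(nu, pi):
    """Total variation distance: half the L^1 distance between the vectors."""
    return 0.5 * np.abs(nu - pi).sum()

def mixing_time_tv(P, pi, eps):
    """Smallest n such that max_x || P^n_{x.} - pi ||_TV <= eps."""
    n, Pn = 0, np.eye(len(pi))
    while max(tv_distance(row, pi) for row in Pn) > eps:
        n, Pn = n + 1, Pn @ P
    return n

print(mixing_time_tv(P, pi, 0.01))          # worst case over starting states
```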

We prove some results about mixing times with respect to a norm which is much more tailored to our chain; we then go back to total variation.

  • For f:\mathcal X\rightarrow\mathbb R and g:\mathcal X\rightarrow\mathbb R, we define the norm || f ||_{\pi} and inner product <f,g>_\pi (coded up in the short sketch after this list) by

<f,g>_\pi = \sum_{x\in\mathcal X} \pi_x f(x) g(x), \qquad || f ||_{\pi} = \sqrt{<f,f>_\pi}.

  • Note that the distance || \cdot ||_{\pi} is convex. So in the above definition, once we have reached the mixing time from each initial state x\in\mathcal X, the chain is mixed from all initial distributions.
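Numerically these are one-liners; a sketch with numpy (my assumption), reusing the toy stationary distribution:

```python
import numpy as np

pi = np.array([0.2, 0.4, 0.4])              # toy stationary distribution

def inner_pi(f, g, pi):
    """<f, g>_pi = sum_x pi_x f(x) g(x)."""
    return float(np.sum(pi * f * g))

def norm_pi(f, pi):
    """|| f ||_pi = sqrt(<f, f>_pi)."""
    return inner_pi(f, f, pi) ** 0.5

f = np.array([1.0, -1.0, 2.0])
print(inner_pi(f, np.ones(3), pi))          # = E_pi f
print(norm_pi(f, pi))
```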

For the following proposition, we define

f^n_x(y) = \frac{P^n_{xy}}{\pi_y}, \qquad || P ||_\pi = \sup\big\{ || Pf ||_\pi : || f ||_\pi = 1,\ \mathbb E_\pi f = 0 \big\}, \qquad \tau_\pi(\epsilon) = \min\big\{ n \geq 0 : \max_{x\in\mathcal X} || f^n_x - 1 ||_\pi \leq \epsilon \big\}.

The mixing time \tau_\pi(\epsilon) is intended to give the first time that the relative masses P^n_{xy}/\pi_y are within an \epsilon error of one (a numerical sketch follows). The proof of the proposition below borders on being a direct consequence of our definitions and assumptions.
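Here is a minimal numerical sketch of \tau_\pi; numpy and the toy chain are my assumptions:

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.00, 0.50, 0.50]])
pi = np.array([0.2, 0.4, 0.4])

def tau_pi(P, pi, eps):
    """Smallest n with max_x || f^n_x - 1 ||_pi <= eps, f^n_x(y) = P^n_xy / pi_y."""
    n, Pn = 0, np.eye(len(pi))
    while True:
        f = Pn / pi                         # row x holds f^n_x
        worst = np.sqrt(((f - 1.0) ** 2 * pi).sum(axis=1)).max()
        if worst <= eps:
            return n
        n, Pn = n + 1, Pn @ P

print(tau_pi(P, pi, 0.01))
```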

Prop 1. For an irreducible, aperiodic, reversible Markov chain,

\tau_\pi(\epsilon) \leq \frac{1}{1 - ||P||_\pi} \log\left( \frac{1}{\epsilon \sqrt{\pi_{\min}}} \right),

where \pi_{\min} = \min_{x\in\mathcal X} \pi_x.

Proof. Observe that, by reversibility,

(P f^n_x)(y) = \sum_z P_{yz} \frac{P^n_{xz}}{\pi_z} = \sum_z \frac{P_{zy}}{\pi_y} P^n_{xz} = \frac{P^{n+1}_{xy}}{\pi_y} = f^{n+1}_x(y).

So, in short, Pf^n_x=f_x^{n+1}, and since P1 = 1, also P(f^n_x - 1) = f^{n+1}_x - 1. Each f^n_x - 1 satisfies \mathbb E_\pi [f^n_x - 1] = \sum_y P^n_{xy} - 1 = 0, so iteratively applying the above and the definition of ||P||_\pi we have that

|| f^n_x - 1 ||_\pi \leq ||P||_\pi\, || f^{n-1}_x - 1 ||_\pi \leq \cdots \leq ||P||^n_\pi\, || f^0_x - 1 ||_\pi \leq \frac{||P||^n_\pi}{\sqrt{\pi_{\min}}}.

The final inequality is a straightforward bound on || f^0_x - 1 ||^2_\pi = \frac{1}{\pi_x} - 1 \leq \frac{1}{\pi_{\min}}. Thus \tau_\pi(\epsilon) has occurred before any n satisfying

\frac{||P||^n_\pi}{\sqrt{\pi_{\min}}} \leq \epsilon;

rearranging for n and using the bound \log x \leq x-1 (so that -\log ||P||_\pi \geq 1-||P||_\pi) gives the above result. \square

The above bound gives the main flavour: we expect convergence to stationarity to be exponentially fast. Still, there is work left: our distance metric depends on our stationary distribution, and we need to guarantee that ||P||_\pi < 1. We will show that ||P||_\pi is the magnitude of an eigenvalue of P, in particular, the magnitude of the largest eigenvalue (excluding the trivial unit eigenvalue that all Markov chains must have).

We use the following lemma (which we don’t currently prove). The result is, for the better part, a statement of the Spectral Theorem, which gives an analogous result for symmetric matrices.2 Applying it involves the observation that, for a reversible chain, S_{xy}=\pi_x^{1/2}P_{xy}\pi_y^{-1/2} is a symmetric matrix.
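A quick numerical check of this observation on the toy chain (numpy assumed):

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.00, 0.50, 0.50]])
pi = np.array([0.2, 0.4, 0.4])

# S = D^{1/2} P D^{-1/2} with D = diag(pi); symmetric iff detailed balance holds.
D_half = np.diag(np.sqrt(pi))
D_half_inv = np.diag(1.0 / np.sqrt(pi))
S = D_half @ P @ D_half_inv
assert np.allclose(S, S.T)

# S is similar to P, so they share eigenvalues; S being symmetric makes
# them real, with an orthogonal eigenbasis (the Spectral Theorem).
print(np.linalg.eigvalsh(S))                # [-0.25  0.5   1.  ]
```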

Lemma 1. For the transition matrix of a reversible Markov chain, the eigenvalues are all real valued, each with magnitude at most one (with one eigenvalue attaining this value), and the eigenvectors form an orthogonal basis (with respect to <\cdot,\cdot>_\pi).

We use it, nonetheless, to prove the following lemma.

Lemma 2. ||P||_\pi = \lambda_{\max} := \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } P \text{ and } \lambda \neq 1 \}.

Proof. The norm ||P||_\pi is defined as a supremum over f with || f ||_\pi = 1 and \mathbb E_\pi f = 0. Writing f with respect to the orthogonal basis of eigenvectors \{\nu^x\}_{x\in\mathcal X} (which we may take orthonormal), we have f=\sum_x \phi_x \nu^x with \sum_x \phi_x^2 = 1. Since \mathbb E_\pi f = <f, 1>_\pi = 0, the component of \phi corresponding to the largest eigenvalue (whose eigenvector is the constant function 1) must be zero, but all others may be chosen freely. Now, since each \nu^x is an eigenvector of P with eigenvalue \lambda_x, we have

||Pf||^2_\pi = \Big|\Big| \sum_x \phi_x \lambda_x \nu^x \Big|\Big|^2_\pi = \sum_x \phi_x^2 \lambda_x^2.

Clearly this is maximized by choosing all of the weight of \phi on the eigenvalue of largest magnitude (excluding the unit one, which we know must have weight zero). \square
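A minimal sketch of this computation on the toy chain, assuming numpy (an illustrative assumption):

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.00, 0.50, 0.50]])

# By Lemma 2, ||P||_pi is the largest eigenvalue magnitude once the
# trivial eigenvalue 1 is excluded. For this aperiodic chain, eigenvalue 1
# has the top magnitude, so we take the second-largest magnitude.
lam = np.linalg.eigvals(P)
lam_max = np.sort(np.abs(lam))[-2]
print(lam_max)                              # 0.5 for the toy chain
```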

The above lemma extracts a formula for ||P||_\pi that can now in principle be calculated, e.g. by diagonalizing P, as in the sketch above. This improves Proposition 1, but the metric used to define our mixing time still depends on the stationary distribution \pi. We can get back to the total variation norm thanks to the following lemma.

Lemma 3. For all x\in\mathcal X and n\geq 0,

2 || P^n_{x\cdot} - \pi ||_{TV} \leq || f^n_x - 1 ||_\pi.

Proof. Basically, the L^1 norm is less than the L^2 norm (by Jensen’s inequality, for instance), so for f_x^n(y)=P^n_{xy}/\pi_y we have

2 || P^n_{x\cdot} - \pi ||_{TV} = \sum_y \pi_y | f^n_x(y) - 1 | \leq \Big( \sum_y \pi_y ( f^n_x(y) - 1 )^2 \Big)^{1/2} = || f^n_x - 1 ||_\pi. \square

Putting together the last two lemmas and Proposition 1, we have the following result.

Theorem.

\tau_{TV}(\epsilon) \leq \frac{1}{1 - \lambda_{\max}} \log\left( \frac{1}{2 \epsilon \sqrt{\pi_{\min}}} \right),

where \lambda_{\max}= \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } P \text{ and } \lambda \neq 1\} and \pi_{\min}= \min_{x\in\mathcal X} \pi_x.
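As a closing sanity check, here is a sketch comparing the theorem’s bound with the empirical mixing time on the toy chain; numpy, the chain, and \epsilon = 0.01 are all my assumptions:

```python
import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.00, 0.50, 0.50]])
pi = np.array([0.2, 0.4, 0.4])
eps = 0.01

# Theorem bound: log(1 / (2 eps sqrt(pi_min))) / (1 - lambda_max).
lam_max = np.sort(np.abs(np.linalg.eigvals(P)))[-2]
bound = np.log(1.0 / (2 * eps * np.sqrt(pi.min()))) / (1.0 - lam_max)

# Empirical mixing time: first n with worst-case TV distance at most eps.
n, Pn = 0, np.eye(len(pi))
while 0.5 * np.abs(Pn - pi).sum(axis=1).max() > eps:
    n, Pn = n + 1, Pn @ P

# The empirical mixing time comes out below the bound, as it should.
print(f"bound: {bound:.1f} steps, empirical: {n} steps")
```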

Literature

The arguments here are based on reading the two sources:

Montenegro, Ravi, and Prasad Tetali. “Mathematical aspects of mixing times in Markov chains.” Foundations and Trends® in Theoretical Computer Science 1.3 (2006): 237-354.

Levin, David A., and Yuval Peres. Markov chains and mixing times. Vol. 107. American Mathematical Soc., 2017.

  1. There is a well-developed theory for non-reversible chains; however, everything is way cleaner for reversible chains, plus upper and lower bounds for mixing match.
  2. Basically, symmetric matrices are just a stretch of space with respect to the right orthogonal basis.
