Talagrand’s Concentration Inequality

We prove a powerful inequality which provides very tight Gaussian tail bounds of the form e^{-ct^2} for probabilities on product spaces \Omega^n. Talagrand’s Inequality has found many applications in probability and combinatorial optimization and, when it can be applied, it generally outperforms inequalities such as Azuma–Hoeffding.

Theorem [Talagrand's Concentration Inequality] For A a measurable subset of the product space \Omega^n (equipped with a product probability measure) and X distributed according to that measure,

\mathbb E\Big[e^{d(X,A)^2/4}\Big] \;\leq\; \frac{1}{\mathbb P(A)},
and, consequently, for all t\geq 0,

\mathbb P(A)\,\mathbb P\big(d(X,A)\geq t\big) \;\leq\; e^{-t^2/4}.
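To spell out how the second bound follows from the first (a standard Markov/Chernoff step, added here for completeness):

\mathbb P\big(d(X,A)\geq t\big) \;=\; \mathbb P\big(e^{d(X,A)^2/4}\geq e^{t^2/4}\big) \;\leq\; e^{-t^2/4}\,\mathbb E\big[e^{d(X,A)^2/4}\big] \;\leq\; \frac{e^{-t^2/4}}{\mathbb P(A)},

and multiplying through by \mathbb P(A) gives the displayed bound.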

Here,

d(x,A) \;=\; \sup_{\alpha\geq 0,\ \|\alpha\|_2\leq 1}\ \inf_{y\in A}\ \sum_{k=1}^n \alpha_k\,\mathbb I[x_k\neq y_k] \;=\; \inf\big\{\, \|v\|_2 : v\in \langle V(x,A)\rangle \,\big\},
where V(x,A)=\{\, u : u_k\geq \mathbb I[x_k\neq y_k] \text{ for } k=1,\dots,n, \text{ for some } y\in A \,\} and \langle V(x,A)\rangle denotes its convex hull.
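As a toy illustration of these definitions (my own example, not from the original): take n=2, \Omega=\{0,1\}, A=\{(0,0)\} and x=(1,1). Then V(x,A)=\{u\in\mathbb R^2 : u_1\geq 1,\ u_2\geq 1\}, which is already convex, so

d(x,A) \;=\; \inf_{v\in\langle V(x,A)\rangle}\|v\|_2 \;=\; \|(1,1)\|_2 \;=\; \sqrt2,

which agrees with \sup_{\alpha\geq0,\ \|\alpha\|_2\leq1}(\alpha_1+\alpha_2)=\sqrt2, attained at \alpha=(1/\sqrt2,1/\sqrt2); by contrast, the unweighted Hamming distance from x to A is 2.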

Proof. The duality between the two characterizations of d(x,A) is given in the appendix to this section.

We will prove that the inequality holds by induction on n. Note that for the case n=1 we have d(x,A)=\mathbb I[x\notin A], so the result holds since \mathbb E\big[e^{d(x,A)^2/4}\big]=e^{1/4}\big(1-\mathbb P(A)\big) + \mathbb P(A) \leq \mathbb P(A)^{-1}.[1]
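One way to carry out the check in the footnote (my own working): write p=\mathbb P(A) and multiply both sides by p. Then

p\Big(e^{1/4}(1-p)+p\Big) \;=\; e^{1/4}p(1-p)+p^2, \qquad \frac{d}{dp}\Big(e^{1/4}p(1-p)+p^2\Big) \;=\; e^{1/4}+2p\big(1-e^{1/4}\big) \;\geq\; 2-e^{1/4} \;>\;0 \ \text{ for } p\in[0,1],

so the left-hand side is increasing in p and equals 1 at p=1, giving e^{1/4}(1-p)+p\leq p^{-1}.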

Now assuming our induction hypothesis up to n, we prove the result for the n+1 case. We do so by proving the following claim


Claim: For z=(w,x)\in\Omega^{n+1} with w\in\Omega and x\in\Omega^{n}, and for any 0\leq\lambda\leq1,

d(z,A)^2 \;\leq\; \lambda\, d\big(x,A(w)\big)^2 + (1-\lambda)\, d\big(x,\tilde A\big)^2 + (1-\lambda)^2,

where

A(w)=\{\, y\in\Omega^n : (w,y)\in A \,\} \qquad\text{and}\qquad \tilde A=\{\, y\in\Omega^n : (u,y)\in A \text{ for some } u\in\Omega \,\}.
  • In the inequality above, we overestimate d(z,A) by two distances on \Omega^n: the first, a refined distance using the information in w, d(x,A(w)); and the second, a cruder distance using only the information in x and A, d(x,\tilde{A}). Note that A(w)=\tilde{A} only when A is a product (rectangular) set.

Proof of claim: We prove the bound by considering convex combinations of points in V(x,A(w)) and V(x,\tilde{A}). Note that if v_w is a point of V(x,A(w)) then there exists y\in A(w) such that (v_w)_k \geq \mathbb I[x_k\neq y_k] for each k=1,\dots,n, while in the coordinate corresponding to w we have 0\geq \mathbb I[w\neq w]=0. So, if v_w\in V(x,A(w)) then (v_w,0)\in V(z,A). By a similarly easy argument, if \tilde{v}\in V(x,\tilde{A}) then (\tilde{v},1)\in V(z,A). These inclusions clearly extend to the convex hulls \langle V(x,A(w))\rangle and \langle V(x,\tilde{A})\rangle. Thus we see that if v_w\in \langle V(x,A(w))\rangle and \tilde{v}\in \langle V(x,\tilde{A})\rangle then, for any 0\leq\lambda\leq1,

\lambda\,(v_w,0) + (1-\lambda)\,(\tilde v,1) \;\in\; \langle V(z,A)\rangle.
Also, using convexity of \|\cdot\|_2^2,

\big\|\lambda(v_w,0)+(1-\lambda)(\tilde v,1)\big\|_2^2 \;=\; \big\|\lambda v_w+(1-\lambda)\tilde v\big\|_2^2 + (1-\lambda)^2 \;\leq\; \lambda\|v_w\|_2^2 + (1-\lambda)\|\tilde v\|_2^2 + (1-\lambda)^2.
Minimizing the left-hand side over \langle V(z,A)\rangle and the right-hand side over \langle V(x,A(w))\rangle and \langle V(x,\tilde{A})\rangle, we get, as required,

d(z,A)^2 \;\leq\; \lambda\, d\big(x,A(w)\big)^2 + (1-\lambda)\, d\big(x,\tilde A\big)^2 + (1-\lambda)^2.
This completes the proof of our claim and we continue with the proof of Talagrand’s Inequality.


Continuing to write z=(w,x)\in \Omega\times \Omega^n and applying the claim, we have for the conditional expectation that, for any 0\leq\lambda\leq1,

\mathbb E\Big[e^{d(z,A)^2/4}\,\Big|\,w\Big] \;\leq\; e^{(1-\lambda)^2/4}\,\mathbb E\Big[e^{\lambda d(x,A(w))^2/4}\,e^{(1-\lambda)d(x,\tilde A)^2/4}\Big]
\;\leq\; e^{(1-\lambda)^2/4}\,\mathbb E\Big[e^{d(x,A(w))^2/4}\Big]^{\lambda}\,\mathbb E\Big[e^{d(x,\tilde A)^2/4}\Big]^{1-\lambda}
\;\leq\; e^{(1-\lambda)^2/4}\,\mathbb P\big(A(w)\big)^{-\lambda}\,\mathbb P\big(\tilde A\big)^{-(1-\lambda)}
\;=\; \frac{1}{\mathbb P(\tilde A)}\, e^{(1-\lambda)^2/4}\, r^{-\lambda}
\;\leq\; \frac{1}{\mathbb P(\tilde A)}\,\big(2-r\big), \qquad\text{where } r:=\frac{\mathbb P(A(w))}{\mathbb P(\tilde A)}\leq 1.

For the second inequality above, we apply Hölder’s Inequality; in the third inequality, we apply our induction hypothesis (to A(w) and \tilde A as subsets of \Omega^n); and in the fourth inequality, after taking the infimum over \lambda, we use the fact that \inf_{0 \leq \lambda \leq 1} r^{-\lambda}e^{(1-\lambda)^2/4} \leq 2-r for 0\leq r\leq 1, which we prove in the appendix to this section.
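For completeness, the Hölder step (spelled out here with exponents p=1/\lambda and q=1/(1-\lambda); this elaboration is mine) reads

\mathbb E\Big[e^{\lambda d(x,A(w))^2/4}\;e^{(1-\lambda)d(x,\tilde A)^2/4}\Big] \;\leq\; \mathbb E\Big[e^{d(x,A(w))^2/4}\Big]^{\lambda}\;\mathbb E\Big[e^{d(x,\tilde A)^2/4}\Big]^{1-\lambda},

with the cases \lambda=0,1 being trivial.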

Now taking expectations over w (noting that \mathbb E_w\big[\mathbb P(A(w))\big]=\mathbb P(A) by Fubini), and observing that u(2-u)\leq 1 for every real u, we obtain the required result:

\mathbb E\Big[e^{d(z,A)^2/4}\Big] \;\leq\; \frac{1}{\mathbb P(\tilde A)}\,\mathbb E_w\left[2-\frac{\mathbb P(A(w))}{\mathbb P(\tilde A)}\right] \;=\; \frac{1}{\mathbb P(\tilde A)}\left(2-\frac{\mathbb P(A)}{\mathbb P(\tilde A)}\right) \;=\; \frac{1}{\mathbb P(A)}\cdot\frac{\mathbb P(A)}{\mathbb P(\tilde A)}\left(2-\frac{\mathbb P(A)}{\mathbb P(\tilde A)}\right) \;\leq\; \frac{1}{\mathbb P(A)}.
\square


Applications

Let’s informally discuss how Talagrand’s Inequality is applied and then give an example of how to apply the result. First we recall that the Hamming distance is d_H(x,y) =\sum_{i=1}^n \mathbb I[x_i\neq y_i]; Talagrand considers a weighted Hamming distance to the set A, d_{\alpha,H}(x,A):=\inf_{y\in A} \sum_{i=1}^n \alpha_i\mathbb I[x_i \neq y_i]. A result of a similar form, but for a fixed weight vector \alpha, was known prior to Talagrand’s.[2]

However, the bound is vastly improved by allowing us to optimize over \alpha given x: from an event A=\{ F(x)\leq s\}, it becomes much easier to deduce that, for suitable s' and t, \{F(x) \geq s'\} \subset \{ \sup_\alpha d_{\alpha,H} (x,A) \geq t\}. In particular, we are in a strong position if we can show a bound of one of the forms

F(y) \;\leq\; F(x) + \sum_{i=1}^n \alpha^x_i\,\mathbb I[x_i\neq y_i] \qquad\text{or}\qquad F(y) \;\geq\; F(x) - \sum_{i=1}^n \alpha^x_i\,\mathbb I[x_i\neq y_i], \qquad\text{for all } y,

for some non-negative weights \alpha^x=(\alpha^x_1,\dots,\alpha^x_n) that may depend on x.
Note that, by multiplying through by -1 (that is, replacing F with -F), the two bounds are equivalent. In both cases, we imagine that, if changing some of the components of x gives y, then we have a way of estimating the change in F for each component changed. Note the left bound above implies that, for any y,

\sum_{i=1}^n \frac{\alpha^x_i}{\|\alpha^x\|_2}\,\mathbb I[x_i\neq y_i] \;\geq\; \frac{F(y)-F(x)}{\|\alpha^x\|_2},

and \alpha^x/\|\alpha^x\|_2 is an admissible choice of \alpha in the supremum defining the convex distance.
Thus if we can bound \|\alpha^x\|_2 above by a (the smaller the better) and we take an event A=\{ F(x) \geq s\}, then d(x,A) \geq (s-F(x)) a^{-1}. Hence, for this choice of A, F(x) \leq s' implies d(x,A) \geq (s-s') a^{-1}, and applying Talagrand’s inequality gives

\mathbb P\big(F(x)\geq s\big)\,\mathbb P\big(F(x)\leq s'\big) \;\leq\; e^{-\frac{(s-s')^2}{4a^2}}.
For the median m, inspecting the above bound first with s=m and s'=m-t, and then with s=m+t and s'=m, and using \mathbb P(F(x)\geq m)\geq 1/2 and \mathbb P(F(x)\leq m)\geq 1/2, yields the bound

\mathbb P\big(|F(x)-m|\geq t\big) \;\leq\; 4\,e^{-\frac{t^2}{4a^2}}.
At this point, we have a pretty convincing concentration bound about the median of the distribution of F(x).

One might prefer to work with the mean of F(x), \mu, rather than the median, m. Note that Cantelli’s inequality implies that the mean is within one standard deviation of the median (see Lemma [Ineqs:MeanMedian]). We can estimate the variance of F(x) from the above concentration inequality: the expected squared deviation is minimized at the mean, so

\mathrm{Var}\big(F(x)\big) \;\leq\; \mathbb E\big[(F(x)-m)^2\big] \;=\; \int_0^\infty \mathbb P\big(|F(x)-m|\geq\sqrt{u}\big)\,du \;\leq\; \int_0^\infty 4\,e^{-\frac{u}{4a^2}}\,du \;=\; 16a^2.
So |\mu - m|\leq 4 a, which gives |F(x)-\mu| \leq |F(x)-m| + 4a and so

\mathbb P\big(|F(x)-\mu|\geq t+4a\big) \;\leq\; \mathbb P\big(|F(x)-m|\geq t\big) \;\leq\; 4\,e^{-\frac{t^2}{4a^2}}.
So once again we get a tail-bound with the same concentration behaviour.

Longest Increasing Subsequence

Given independent real-valued random variables X=(X_1,X_2,\dots,X_N), we consider the length of the longest increasing subsequence:

I(X) \;=\; \max\big\{\, k \,:\, \exists\, i_1<i_2<\dots<i_k \text{ with } X_{i_1}<X_{i_2}<\dots<X_{i_k} \,\big\}.
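The post contains no code, but here is a short Python sketch (entirely my own addition; the names and parameters are illustrative) that computes I(x) by the standard patience-sorting method and gives a quick empirical feel for how tightly I(X) concentrates for i.i.d. uniform samples.

    import bisect
    import random
    import statistics

    def lis_length(xs):
        """Length of the longest strictly increasing subsequence of xs, in O(N log N).

        tails[k] stores the smallest value that can end an increasing
        subsequence of length k+1 among the elements seen so far.
        """
        tails = []
        for x in xs:
            k = bisect.bisect_left(tails, x)  # first position with tails[k] >= x
            if k == len(tails):
                tails.append(x)   # x extends the longest subsequence found so far
            else:
                tails[k] = x      # x gives a smaller ending value for length k+1
        return len(tails)

    # Rough empirical check of concentration for i.i.d. uniforms: the sampled
    # values of I(X) should cluster within a few multiples of sqrt(median)
    # around the median.
    N, trials = 5000, 200
    samples = [lis_length([random.random() for _ in range(N)]) for _ in range(trials)]
    m = statistics.median(samples)
    print("median:", m, "max deviation:", max(abs(s - m) for s in samples))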

Comparing two (similar) sequences x and y, notice that a longest increasing subsequence of x, with index set S(x), can be used to build a (long) increasing subsequence of y: the indices i\in S(x) with x_i=y_i still index an increasing subsequence of y. That is,

I(y) \;\geq\; I(x) - \big|\{\, i\in S(x) : x_i\neq y_i \,\}\big|.
We take \alpha such that \alpha_i= {I(x)}^{-1/2} if i\in S(x) and \alpha_i=0 otherwise, so that \|\alpha\|_2=1. If we consider the distance d_\alpha(x,y) = {I(x)}^{-1/2} \,\big|\{\, i\in S(x) : x_i \neq y_i \,\}\big|, then we have

I(y) \;\geq\; I(x) - \sqrt{I(x)}\; d_\alpha(x,y),
which, on the event A=\{\, y : I(y) \leq s \,\}, gives

d_\alpha(x,y) \;\geq\; \frac{I(x)-s}{\sqrt{I(x)}} \quad\text{for every } y\in A, \qquad\text{and hence}\qquad d(x,A) \;\geq\; \frac{I(x)-s}{\sqrt{I(x)}}.
Note that, given we are working with an event of the form A=\{ I(x) \leq s\}, we look for an event \{I(x)\geq s'\} that forces d(x,A) to be of a certain size. Since the function \frac{z-s}{\sqrt{z}} is increasing for z\geq s (its derivative is \frac{z+s}{2z^{3/2}}>0), I(x) \geq s+t implies \frac{I(x)-s}{\sqrt{I(x)}} \geq \frac{t}{\sqrt{t+s}}. Applying this and then Talagrand’s inequality gives

\mathbb P\big(I(X)\leq s\big)\,\mathbb P\big(I(X)\geq s+t\big) \;\leq\; e^{-\frac{t^2}{4(s+t)}}.
For m the median of this distribution, setting s=m (for the upper tail) and then s=m-t (for the lower tail) gives the two tail bounds in the following result.

Theorem For m the median of I(X) and t\geq 0,

\mathbb P\big(I(X)\geq m+t\big) \;\leq\; 2\,e^{-\frac{t^2}{4(m+t)}} \qquad\text{and}\qquad \mathbb P\big(I(X)\leq m-t\big) \;\leq\; 2\,e^{-\frac{t^2}{4m}}.
Note the bounds do not look terribly symmetric but, in order to achieve a given guaranteed probability in both bounds, we must choose t of the same order, namely of order \sqrt{m}, in both.
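To make this concrete (a quick check of my own): take t=c\sqrt{m} for a constant 0<c\leq\sqrt{m}. Then

2\,e^{-\frac{t^2}{4m}} \;=\; 2\,e^{-c^2/4} \qquad\text{while}\qquad 2\,e^{-\frac{t^2}{4(m+t)}} \;=\; 2\,e^{-\frac{c^2}{4(1+c/\sqrt{m})}} \;\leq\; 2\,e^{-c^2/8},

so both tails are already small once t is a large multiple of \sqrt{m}.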

Appendix

We first show the equality of our primal-dual characterization for d(x,A).
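Since the argument itself does not appear below, here is one standard way to see the duality (a sketch of my own, glossing over the compactness details needed for the minimax theorem). For \alpha\geq 0 the map v\mapsto\langle\alpha,v\rangle is linear, so its infimum over \langle V(x,A)\rangle equals its infimum over V(x,A), which is \inf_{y\in A}\sum_k\alpha_k\,\mathbb I[x_k\neq y_k]; while if some \alpha_k<0 the infimum is -\infty, so such \alpha do not contribute. Hence

\sup_{\|\alpha\|_2\leq1,\ \alpha\geq0}\ \inf_{y\in A}\sum_{k=1}^n\alpha_k\,\mathbb I[x_k\neq y_k] \;=\; \sup_{\|\alpha\|_2\leq1}\ \inf_{v\in\langle V(x,A)\rangle}\langle\alpha,v\rangle \;=\; \inf_{v\in\langle V(x,A)\rangle}\ \sup_{\|\alpha\|_2\leq1}\langle\alpha,v\rangle \;=\; \inf_{v\in\langle V(x,A)\rangle}\|v\|_2,

the middle equality being the exchange of sup and inf.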

For 0\leq r\leq 1,

\inf_{0\leq\lambda\leq 1} r^{-\lambda} e^{(1-\lambda)^2/4} \;\leq\; 2-r.
Taking logarithms, it is clear that this is equivalent to showing that

\inf_{0\leq\lambda\leq1}\Big\{ \tfrac{(1-\lambda)^2}{4} - \lambda\log r \Big\} \;\leq\; \log(2-r).
For e^{-1/2}\leq r\leq 1 the left-hand side is minimized by \lambda= 1+2\log(r)\in[0,1]. (For r<e^{-1/2} this minimizer falls below 0, and taking \lambda=0 gives e^{1/4}\leq 2-r directly, since r<e^{-1/2}<2-e^{1/4}.) Thus, after substituting \lambda=1+2\log(r), we are required to show

-\log r - (\log r)^2 \;\leq\; \log(2-r), \qquad\text{equivalently}\qquad \log(2-r)+\log r+(\log r)^2 \;\geq\; 0.
We can equivalently express this inequality in integral form:

\log(2-r)+\log r+(\log r)^2 \;=\; -\int_r^1 \frac{2\big[\,(2\log x - x + 1) - x\log x\,\big]}{x(2-x)}\,dx \;\geq\; 0.
The term in the square brackets is negative: 2\log x-x+1 is concave and x\log x is convex, and the two are equal (in fact tangent, see the remark below) at x=1, so for all other x>0 we have 2\log x-x+1< x\log x. Since x(2-x)>0 on the range of integration, the integrand is negative and, with the leading minus sign, the right-hand side above is non-negative, as required. This completes the lemma. \square
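Remark (a detail added for completeness): the step "equal at x=1" really uses that the two curves are tangent there, since

\frac{d}{dx}\big(2\log x-x+1\big)\Big|_{x=1} \;=\; \Big(\tfrac{2}{x}-1\Big)\Big|_{x=1} \;=\; 1 \;=\; \big(\log x+1\big)\Big|_{x=1} \;=\; \frac{d}{dx}\big(x\log x\big)\Big|_{x=1},

so the concave function 2\log x-x+1 lies below the common tangent line while the convex function x\log x lies above it, giving 2\log x-x+1\leq x\log x for all x>0.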


  1. Hint: multiply both sides by \mathbb P(A); the left-hand side then becomes e^{1/4}\mathbb P(A)(1-\mathbb P(A))+\mathbb P(A)^2, which is increasing in \mathbb P(A) and equals the right-hand side, 1, at \mathbb P(A)=1.
  2. See Theorem 3.1 of McDiarmid, Colin, “Concentration”, in Probabilistic Methods for Algorithmic Discrete Mathematics, Springer, Berlin Heidelberg, 1998, pp. 195–248.
