Robbins-Monro

We review a method for finding fixed points then extend it to slightly more general, modern proofs. This is a much more developed version of an earlier post. We now cover the basic Robbin-Monro proof, Robbins-Siegmund Theorem, Stochastic Gradient Descent and Asynchronous update (as is required for Q-learning).

Often it is important to find a solution to the equation

by evaluating g at a sequence of points. For instance Newton’s method would perform the updates x_{n+1} = x_n - g(x_n)/g'(x_n). However, Robbins and Monro consider the setting where we cannot directly observe g but we might observe some random variable whose mean is g(x). Thus we observe

and hope solve for g(x) = 0. Notice in this setting, even if we can find g'(x), Newton’s method may not converge. The key idea of Robbins and Monro is to use a schema where

where we chose the sequence \{\alpha_n\}_{n=0}^\infty so that

Before proceeding here are a few different use cases:

  • Quartiles. We want to find x such that P(X \leq x ) = p for some fixed p. But we can only sample the random variable X.
  • Regression. We preform regression g(x) = \beta_0 + \beta_1 x, but rather than estimate \beta we want to know where g(x)=0.
  • Optimization. We want to optimize a convex function f(x) whose gradient is g(x). Assume that f(x) = \sum_{k=1}^K f_k(x) for some large K. To find the optimum at g(x)=0 we randomly sample (uniformly) f_k(x) whose gradient, g_k(x), is an bias estimate of g(x).

The following result contains the key elements of the Robbins-Monro proof

Prop 1. Suppose that z_{n} is a positive sequence such that

where a_n and c_n are positive sequences such that

then \lim_{n\rightarrow\infty} z_n = 0.

Proof. We can assume that equality holds, i.e., z_{n+1} = z_n (1-a_n) + c_n . We can achieve this by increasing a_n or decreasing c_n in the inequality ; neither of which effect the conditions on a_n and b_n, .

Now for all n we have the following lower-bound

Since \sum c_k < \infty it must be that \sum a_kz_k < \infty. Thus since both sums converge it must be that \lim_n z_n converges. Finally since \sum a_k = \infty and \sum a_kz_k < \infty it must be that \lim_n z_n =0. \square

The following proposition is a Martingale version of the above result.

Thrm 1 [Robbins-Siegmund Theorem] If

for positive adaptive RVs Z_n,a_n, b_n, c_n such that with probability 1,

then \lim_{n\rightarrow\infty} z_n = 0.

Proof. The results is some manipulations analogous to the Robbins-Monro proof and a bunch of nice reductions to Doob’s Martingale Convergence Theorem.

First note the result is equivalent to proving the result with b_n = 0 for all n. If we divide both sides of by \prod_{m=0}^n(1-b_m) we get

where a'_n = a_n / (1+b_n), c'_n = c_n / \prod_{m=0}^n(1+b_m) and Z'_n = Z_n / \prod_{m=0}^n(1+b_m) . Notice since \sum b_n converges then so does \prod (1+b_n). Thus a'_n, c'_n and Z'_n have the same convergence properties as those required for a_n, c_n and Z_n. Thus, we now assume b_n=0 for all n.

Now notice

is a super-martingale. We want to use Doob’s Martingale convergence theorem; however, we need to apply a localization argument to apply this. Specifically, let \tau_C = \inf \{ n \geq 0 : \sum_{k=1}^n c'_k > C \}. This is a stopping time. Notice

So Y_{n\wedge\tau_C} is a super-martingale and below by -C. Thus by Doob’s Martingale Convergnce Theorem, \lim_{n\rightarrow\infty} Y_{n\wedge\tau_C} exists for each C>0, and \tau_C = \infty for some C, since \sum c'_k <\infty. Thus \lim_{n\rightarrow\infty} Y_{n} exists.

Now notice

So like in the last proposition, since \lim Y_n and \sum c'_k exists, we see that \sum_{k=1}^\infty a'_k Z'_k converges. And thus Z'_{n+1} converges.

Finally since we assume \sum_k a'_k = \infty and we know that \sum_{k=1}^\infty a'_k Z'_k < \infty it must be that Z'_k converges to zero.


Stochastic Gradient Decent

Suppose that we have some function F : \mathbb R^p \rightarrow \mathbb R

that we wish to minimize. We suppose that the function f(X; \theta) is known and so is its gradient g(\theta;X), where \mathbb E [g(\theta;X) ] = G(\theta) is the gradient of F(\theta). The difficulty is that we do not have direct access to the distribution of X, but we can draw random samples X_1,X_2,\dots . We can use the Robbins-Monro Schema to optimize F(\theta). Specifically we takeScreenshot 2019-01-26 at 16.59.04.pngwhere g_n (\theta ) = g(\theta ; X_n) and \epsilon_n = G(\theta) - g_n (\theta ) . The above sequence is often referred to as Stochastic Gradient Descent. We chose the sequence \{\alpha_n\}_{n=0}^\infty so that

(Note here we may assume that \alpha_n is a function of previous parameters and observations \theta_1,...,\theta_{n-1} and X_1,...,X_{n-1}.) We let ||\cdot ||_2 be the Euclidean norm. We can prove that convergence \theta_n to the minimizer of F(\theta).

Thrm 2 [Stochastic Gradient Descent] Suppose that \theta_n, G(\cdot), and \epsilon_n in Stochastic Gradient Descent satisfy the following conditions

  1. \exists \theta^* such that \forall \theta, G(\theta) \cdot ( \theta - \theta^*) \geq \mu || \theta - \theta^* ||_2^2
  2. ||G(\theta_n)||_2^2 \leq A+ B ||\theta_n ||_2^2
  3. \mathbb E [ || \epsilon_n||_2^2 | \mathcal F_n ] \leq K

Then \lim_n \theta_n = \theta^* where \theta^*= \text{argmin}_\theta F(\theta) and assuming \alpha_n are deterministic then \lim \mathbb E [ || \theta_n - \theta^* ||_2^2 ] =0

Let’s quickly review the conditions above. First consider Condition 1. Note condition 1 implies moving in the direction of \theta^* always decreases the F(\theta), so \theta^* minimizes F. The statement ( G(\theta) - G(\phi) ) \cdot ( \theta - \phi) \geq \mu || \theta - \phi ||^2 is equivalent to F(\theta) being strongly convex. So this is enough to give Condition 1. Condition 2 can be interpreted as a gradient condition, or that the steps \theta_n do not grow unboundedly. Condition 3 is natural given our analysis so far.

Proof.

Taking expecations with respect to \mathbb E [ | \mathcal F_n] we get

Thus, rearranging

Thus by Thrm 1, \theta_{n+1} \rightarrow \theta^*. Further taking expectations on both sides above we have

We can apply Prop 1 (note that a_n = \alpha_n \mu + \alpha_n^2 B will be positive for suitably large n), to give that \mathbb E || \theta_{n+1} - \theta^* ||_2^2 ] \rightarrow 0 as n\rightarrow \infty as required. \square

Finally we remark that in the proof we analyzed || \theta_n - \theta^* ||_2 but equally we could have analyzed F(\theta_n) - F(\theta^*) instead.


Fixed Points and Asynchronous Update

We now consider Robbins-Monro from a slightly different perspective. Suppose we have a continuous function F: \mathbb R^p \rightarrow \mathbb R^p and we wish to find a fixed point x^* such that F(x^*) =x^*. We assume that F(\cdot) is a contraction namely that, for some \beta \in (0,1),

Here || x ||_\infty = \max_{i=1,...,p} | x_i | . (Note this contraction condition implies the existence of a fixed point). (Note the previous analysis was somewhat restricted to euclidean space.) If we suppose that we do not observe F(x) but instead some perturbed version whose mean is F(x), then we can perform the Robbins-Monro update for each component i=1,...p:

where \alpha_i(t) is a sequence such that for all i

for some constant C. Further we suppose that \epsilon_i(t-1) is measurable with respect to \mathcal F_{t}, the filtration generated by \{\alpha_i(s), x_i(s)\}_{s\leq t} measurable and

Note that in the above we let the step rule depend on i. For instance at each time t we could chose to update one component only at each step, e.g., to update component i only, we would set \alpha_j(t) = 0 for all j \neq i. Thus we can consider this step rule to be asynchronous.

We can analyze the convergence of this similarly

Thrm 3. Suppose that F(\cdot) is a contraction with respect to ||\cdot||_\infty , suppose the vector x(t) obeys the step rule with step sizes satisfying and further suppose the noise terms satisfy , then

where x^* is the fixed point F(x^*) = x^*.

We will prove the result under the assumption that x(t) is bounded in t, this is the proposition, Prop 3, below. We then prove that x(t) is bounded in t to complete the proof.


Prop 2. If we further assume that

with probability 1 then Thrm 3 holds.

We may assume with out loss of generality that x^* = 0, since the recursion above is equivalent to

Given the assumption above that || x(t)||_{\infty} \leq D_0 for all t, further define

Here we choose \epsilon so that (1+2 \epsilon) \beta <1 so that D_k \rightarrow 0. By induction, we will show that, given || x(t) ||_{\infty} < D_k for all t\geq \tau_k for some \tau_l, then there exists a \tau_{k+1} such that for all t \geq \tau_{k+1}

We use two recursions to bound the behavior of x_i(t):

for t\geq \tau_k, where W_i(\tau_k) = 0 and Y(\tau_k)=0. We use W_i(t) to summarize the effect of noise on the recursion for x_i(t) and we use Y_i(t) to bound the error arising from the function F_i(x) in the recursion for x_i(t). Specifically we show that

Screenshot 2019-01-26 at 17.04.34.png

in Lemma 1 below. Further we notice that is a Robbin-Monro recursion for W_i(t) to go to zero and Y_i(t) to go to \beta D_k.

Lemma 1. \forall t_0\geq \tau_k

Screenshot 2019-01-26 at 17.04.34.png

Proof. We prove the result by induction. The result is clearly true for t=\tau_k. In the inequality above with apply the induction hypothesis on x_i(t) and bounds of F_i. The second equality just applies the definitions of Y_i and W_i. Similar inequalities hold in the other direction and give the result. \square

Lemma 2.

Proof. Letting W(t)=W(t,0), we know

From the Robbins-Siegmund Theorem (Prop 1), we know that

\square

Lemma 3.  Screenshot 2019-01-26 at 17.06.38.png

Proof. Notice The result holds since \sum \alpha_i (t) = \infty. \square

We can now prove Prop [Tsit:Prop].

Proof of Prop 2. We know that || x(t)||_{\infty} \leq D_0 for all t and we assume || x(t)||_{\infty} \leq D_k for all t\geq \tau_k. By Prop 3 and then by Lemmas 2 and 3

as required. Thus these exists \tau_{k+1} such that \sup_{t\geq \tau_{k+1}} || x (t) ||_\infty \leq D_{k+1} . Thus by induction we see that \sup_{t\geq \tau_{k}} || x (t) ||_\infty decreases through sequence of levels D_k as k\rightarrow \infty, thus x(t) goes to zero as required.\square


Proving Boundedness of x(t)

We now prove that x(t) remains bounded

Prop 3.

To prove this proposition, we define a processes that bounds the max of ||x(t)||_\infty from above in increments of size (1+\epsilon). Specfically we let

and we define G(0) = \max \{ M(0), G_0 \} and let

where in the above k is the smallest integer such that M(t+1) \leq (1+\epsilon)^k G(t). Note that G(t) is adapted. Further note | F_i(x) | \leq \gamma \max \{ \max_j |x_j|,G_0 \} for some G_0 and \gamma < 1, since \beta y +c \leq \gamma y\vee G_0 for suitable choice of \gamma and G_0. We use G_0 in the definition of G(t) above and we choose \epsilon so that (1+\epsilon) \gamma <1.

Also we define \tilde W_i(t_0; t_0) = 0 and

Notice, like before, \tilde W is a Robbin-Monro recursion that goes to zero and x_i(t) is bounded by a recursion of this type.

Lemma 4. If G(t) = G(t_0) for t \geq t_0 then

Proof. The result is somewhat similar to Lemma [Tsit:bdd]. At t_0, we have that Now assuming that the bound is true at time t.

Above we bound x(t) knowing the bound holds at time t and we bound F using the fact that we know G(t) has not yet increased. We then use the fact we chose \epsilon so that \gamma (1+\epsilon) < 1. A similar bound holds on the other side. \square

Lemma 5.

Proof. Since G(t) is adapted and from our assumptions on w_i(t), we have that

We know that \lim_{t\rightarrow\infty} |\tilde W_i(t;0)| – the argument for this is identical to Lemma 2. Further notice for all t \geq t_0 we have

Thus,

As required both terms on the righthand-side go to zero. \square

Proof of Prop 3. To prove the proposition we will show that G(t) at some point must remain bounded. Suppose t_0 is a time just after G(t) increased. Note if x_i(t) grew unboundedly then we could chose t_0 as large as we like.

So we’d know by Lemma 5 that there exists a t_0 such that for all t \geq t_0, |\tilde W_i(t;t_0)| \leq \epsilon/2. Thus applying this to Lemma 4 we see that

Thus since M(t) < G(t) (1+\epsilon) for all t \geq t_0. So G(t) can not longer increase and thus we arrive at a contradiction. x_i(t) must be bounded in t. \square

Literature

The Robbin-Munro proceedure is due to Robbins and Munro and the Robbin Sigmund Theorem is due to Robbins and Sigmund. Good notes are given by Francis Bach. The Asynchronous implementation is due to Tsitsiklis, though we replace the a couple of results their with the Robbins-Sigmund theorem.

Robbins, Herbert; Monro, Sutton. A Stochastic Approximation Method. Ann. Math. Statist. 22 (1951), no. 3, 400–407. doi:10.1214/aoms/1177729586.

Robbins, Herbert, and David Siegmund. “A convergence theorem for non negative almost supermartingales and some applications.” Optimizing methods in statistics. Academic Press, 1971. 233-257.

Francis Bach. Optimisation et Apprentissage Statistique https://www.di.ens.fr/~fbach/orsay2019.html

Tsitsiklis, John N. “Asynchronous stochastic approximation and Q-learning.” Machine learning 16.3 (1994): 185-202.

 

 

One thought on “Robbins-Monro”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: