# Robbins-Monro

We review a method for finding fixed points then extend it to slightly more general, modern proofs. This is a much more developed version of an earlier post. We now cover the basic Robbin-Monro proof, Robbins-Siegmund Theorem, Stochastic Gradient Descent and Asynchronous update (as is required for Q-learning).

Often it is important to find a solution to the equation

by evaluating $g$ at a sequence of points. For instance Newton’s method would perform the updates $x_{n+1} = x_n - g(x_n)/g'(x_n)$. However, Robbins and Monro consider the setting where we cannot directly observe $g$ but we might observe some random variable whose mean is $g(x)$. Thus we observe

and hope solve for $g(x) = 0$. Notice in this setting, even if we can find $g'(x)$, Newton’s method may not converge. The key idea of Robbins and Monro is to use a schema where

where we chose the sequence $\{\alpha_n\}_{n=0}^\infty$ so that

Before proceeding here are a few different use cases:

• Quartiles. We want to find $x$ such that $P(X \leq x ) = p$ for some fixed $p$. But we can only sample the random variable $X$.
• Regression. We preform regression $g(x) = \beta_0 + \beta_1 x$, but rather than estimate $\beta$ we want to know where $g(x)=0$.
• Optimization. We want to optimize a convex function $f(x)$ whose gradient is $g(x)$. Assume that $f(x) = \sum_{k=1}^K f_k(x)$ for some large $K$. To find the optimum at $g(x)=0$ we randomly sample (uniformly) $f_k(x)$ whose gradient, $g_k(x)$, is an bias estimate of $g(x)$.

The following result contains the key elements of the Robbins-Monro proof

Prop 1. Suppose that $z_{n}$ is a positive sequence such that

where $a_n$ and $c_n$ are positive sequences such that

then $\lim_{n\rightarrow\infty} z_n = 0$.

Proof. We can assume that equality holds, i.e., $z_{n+1} = z_n (1-a_n) + c_n$. We can achieve this by increasing $a_n$ or decreasing $c_n$ in the inequality ; neither of which effect the conditions on $a_n$ and $b_n$, .

Now for all $n$ we have the following lower-bound

Since $\sum c_k < \infty$ it must be that $\sum a_kz_k < \infty$. Thus since both sums converge it must be that $\lim_n z_n$ converges. Finally since $\sum a_k = \infty$ and $\sum a_kz_k < \infty$ it must be that $\lim_n z_n =0$. $\square$

The following proposition is a Martingale version of the above result.

Thrm 1 [Robbins-Siegmund Theorem] If

for positive adaptive RVs $Z_n,a_n, b_n, c_n$ such that with probability 1,

then $\lim_{n\rightarrow\infty} z_n = 0$.

Proof. The results is some manipulations analogous to the Robbins-Monro proof and a bunch of nice reductions to Doob’s Martingale Convergence Theorem.

First note the result is equivalent to proving the result with $b_n = 0$ for all $n$. If we divide both sides of by $\prod_{m=0}^n(1-b_m)$ we get

where $a'_n = a_n / (1+b_n)$, $c'_n = c_n / \prod_{m=0}^n(1+b_m)$ and $Z'_n = Z_n / \prod_{m=0}^n(1+b_m)$. Notice since $\sum b_n$ converges then so does $\prod (1+b_n)$. Thus $a'_n$, $c'_n$ and $Z'_n$ have the same convergence properties as those required for $a_n$, $c_n$ and $Z_n$. Thus, we now assume $b_n=0$ for all $n$.

Now notice

is a super-martingale. We want to use Doob’s Martingale convergence theorem; however, we need to apply a localization argument to apply this. Specifically, let $\tau_C = \inf \{ n \geq 0 : \sum_{k=1}^n c'_k > C \}$. This is a stopping time. Notice

So $Y_{n\wedge\tau_C}$ is a super-martingale and below by $-C$. Thus by Doob’s Martingale Convergnce Theorem, $\lim_{n\rightarrow\infty} Y_{n\wedge\tau_C}$ exists for each $C>0$, and $\tau_C = \infty$ for some $C$, since $\sum c'_k <\infty$. Thus $\lim_{n\rightarrow\infty} Y_{n}$ exists.

Now notice

So like in the last proposition, since $\lim Y_n$ and $\sum c'_k$ exists, we see that $\sum_{k=1}^\infty a'_k Z'_k$ converges. And thus $Z'_{n+1}$ converges.

Finally since we assume $\sum_k a'_k = \infty$ and we know that $\sum_{k=1}^\infty a'_k Z'_k < \infty$ it must be that $Z'_k$ converges to zero.

Suppose that we have some function $F : \mathbb R^p \rightarrow \mathbb R$

that we wish to minimize. We suppose that the function $f(X; \theta)$ is known and so is its gradient $g(\theta;X)$, where $\mathbb E [g(\theta;X) ] = G(\theta)$ is the gradient of $F(\theta)$. The difficulty is that we do not have direct access to the distribution of $X$, but we can draw random samples $X_1,X_2,\dots$. We can use the Robbins-Monro Schema to optimize $F(\theta)$. Specifically we takewhere $g_n (\theta ) = g(\theta ; X_n)$ and $\epsilon_n = G(\theta) - g_n (\theta )$ . The above sequence is often referred to as Stochastic Gradient Descent. We chose the sequence $\{\alpha_n\}_{n=0}^\infty$ so that

(Note here we may assume that $\alpha_n$ is a function of previous parameters and observations $\theta_1,...,\theta_{n-1}$ and $X_1,...,X_{n-1}$.) We let $||\cdot ||_2$ be the Euclidean norm. We can prove that convergence $\theta_n$ to the minimizer of $F(\theta)$.

Thrm 2 [Stochastic Gradient Descent] Suppose that $\theta_n$, $G(\cdot)$, and $\epsilon_n$ in Stochastic Gradient Descent satisfy the following conditions

1. $\exists \theta^*$ such that $\forall \theta$, $G(\theta) \cdot ( \theta - \theta^*) \geq \mu || \theta - \theta^* ||_2^2$
2. $||G(\theta_n)||_2^2 \leq A+ B ||\theta_n ||_2^2$
3. $\mathbb E [ || \epsilon_n||_2^2 | \mathcal F_n ] \leq K$

Then $\lim_n \theta_n = \theta^*$ where $\theta^*= \text{argmin}_\theta F(\theta)$ and assuming $\alpha_n$ are deterministic then $\lim \mathbb E [ || \theta_n - \theta^* ||_2^2 ] =0$

Let’s quickly review the conditions above. First consider Condition 1. Note condition $1$ implies moving in the direction of $\theta^*$ always decreases the $F(\theta)$, so $\theta^*$ minimizes $F$. The statement $( G(\theta) - G(\phi) ) \cdot ( \theta - \phi) \geq \mu || \theta - \phi ||^2$ is equivalent to $F(\theta)$ being strongly convex. So this is enough to give Condition 1. Condition 2 can be interpreted as a gradient condition, or that the steps $\theta_n$ do not grow unboundedly. Condition 3 is natural given our analysis so far.

Proof.

Taking expecations with respect to $\mathbb E [ | \mathcal F_n]$ we get

Thus, rearranging

Thus by Thrm 1, $\theta_{n+1} \rightarrow \theta^*$. Further taking expectations on both sides above we have

We can apply Prop 1 (note that $a_n = \alpha_n \mu + \alpha_n^2 B$ will be positive for suitably large $n$), to give that $\mathbb E || \theta_{n+1} - \theta^* ||_2^2 ] \rightarrow 0$ as $n\rightarrow \infty$ as required. $\square$

Finally we remark that in the proof we analyzed $|| \theta_n - \theta^* ||_2$ but equally we could have analyzed $F(\theta_n) - F(\theta^*)$ instead.

## Fixed Points and Asynchronous Update

We now consider Robbins-Monro from a slightly different perspective. Suppose we have a continuous function $F: \mathbb R^p \rightarrow \mathbb R^p$ and we wish to find a fixed point $x^*$ such that $F(x^*) =x^*$. We assume that $F(\cdot)$ is a contraction namely that, for some $\beta \in (0,1)$,

Here $|| x ||_\infty = \max_{i=1,...,p} | x_i |$. (Note this contraction condition implies the existence of a fixed point). (Note the previous analysis was somewhat restricted to euclidean space.) If we suppose that we do not observe $F(x)$ but instead some perturbed version whose mean is $F(x)$, then we can perform the Robbins-Monro update for each component $i=1,...p$:

where $\alpha_i(t)$ is a sequence such that for all $i$

for some constant $C$. Further we suppose that $\epsilon_i(t-1)$ is measurable with respect to $\mathcal F_{t}$, the filtration generated by $\{\alpha_i(s), x_i(s)\}_{s\leq t}$ measurable and

Note that in the above we let the step rule depend on $i$. For instance at each time $t$ we could chose to update one component only at each step, e.g., to update component $i$ only, we would set $\alpha_j(t) = 0$ for all $j \neq i$. Thus we can consider this step rule to be asynchronous.

We can analyze the convergence of this similarly

Thrm 3. Suppose that $F(\cdot)$ is a contraction with respect to $||\cdot||_\infty$ , suppose the vector $x(t)$ obeys the step rule with step sizes satisfying and further suppose the noise terms satisfy , then

where $x^*$ is the fixed point $F(x^*) = x^*$.

We will prove the result under the assumption that $x(t)$ is bounded in $t$, this is the proposition, Prop 3, below. We then prove that $x(t)$ is bounded in $t$ to complete the proof.

Prop 2. If we further assume that

with probability 1 then Thrm 3 holds.

We may assume with out loss of generality that $x^* = 0$, since the recursion above is equivalent to

Given the assumption above that $|| x(t)||_{\infty} \leq D_0$ for all $t$, further define

Here we choose $\epsilon$ so that $(1+2 \epsilon) \beta <1$ so that $D_k \rightarrow 0$. By induction, we will show that, given $|| x(t) ||_{\infty} < D_k$ for all $t\geq \tau_k$ for some $\tau_l$, then there exists a $\tau_{k+1}$ such that for all $t \geq \tau_{k+1}$

We use two recursions to bound the behavior of $x_i(t)$:

for $t\geq \tau_k$, where $W_i(\tau_k) = 0$ and $Y(\tau_k)=0$. We use $W_i(t)$ to summarize the effect of noise on the recursion for $x_i(t)$ and we use $Y_i(t)$ to bound the error arising from the function $F_i(x)$ in the recursion for $x_i(t)$. Specifically we show that

in Lemma 1 below. Further we notice that is a Robbin-Monro recursion for $W_i(t)$ to go to zero and $Y_i(t)$ to go to $\beta D_k$.

Lemma 1. $\forall t_0\geq \tau_k$

Proof. We prove the result by induction. The result is clearly true for $t=\tau_k$. In the inequality above with apply the induction hypothesis on $x_i(t)$ and bounds of $F_i$. The second equality just applies the definitions of $Y_i$ and $W_i$. Similar inequalities hold in the other direction and give the result. $\square$

Lemma 2.

Proof. Letting $W(t)=W(t,0)$, we know

From the Robbins-Siegmund Theorem (Prop 1), we know that

$\square$

Lemma 3.

Proof. Notice The result holds since $\sum \alpha_i (t) = \infty$. $\square$

We can now prove Prop [Tsit:Prop].

Proof of Prop 2. We know that $|| x(t)||_{\infty} \leq D_0$ for all $t$ and we assume $|| x(t)||_{\infty} \leq D_k$ for all $t\geq \tau_k$. By Prop 3 and then by Lemmas 2 and 3

as required. Thus these exists $\tau_{k+1}$ such that $\sup_{t\geq \tau_{k+1}} || x (t) ||_\infty \leq D_{k+1}$. Thus by induction we see that $\sup_{t\geq \tau_{k}} || x (t) ||_\infty$ decreases through sequence of levels $D_k$ as $k\rightarrow \infty$, thus $x(t)$ goes to zero as required.$\square$

### Proving Boundedness of $x(t)$

We now prove that $x(t)$ remains bounded

Prop 3.

To prove this proposition, we define a processes that bounds the max of $||x(t)||_\infty$ from above in increments of size $(1+\epsilon)$. Specfically we let

and we define $G(0) = \max \{ M(0), G_0 \}$ and let

where in the above $k$ is the smallest integer such that $M(t+1) \leq (1+\epsilon)^k G(t)$. Note that $G(t)$ is adapted. Further note $| F_i(x) | \leq \gamma \max \{ \max_j |x_j|,G_0 \}$ for some $G_0$ and $\gamma < 1$, since $\beta y +c \leq \gamma y\vee G_0$ for suitable choice of $\gamma$ and $G_0$. We use $G_0$ in the definition of $G(t)$ above and we choose $\epsilon$ so that $(1+\epsilon) \gamma <1$.

Also we define $\tilde W_i(t_0; t_0) = 0$ and

Notice, like before, $\tilde W$ is a Robbin-Monro recursion that goes to zero and $x_i(t)$ is bounded by a recursion of this type.

Lemma 4. If $G(t) = G(t_0)$ for $t \geq t_0$ then

Proof. The result is somewhat similar to Lemma [Tsit:bdd]. At $t_0$, we have that Now assuming that the bound is true at time $t$.

Above we bound $x(t)$ knowing the bound holds at time $t$ and we bound $F$ using the fact that we know $G(t)$ has not yet increased. We then use the fact we chose $\epsilon$ so that $\gamma (1+\epsilon) < 1$. A similar bound holds on the other side. $\square$

Lemma 5.

Proof. Since $G(t)$ is adapted and from our assumptions on $w_i(t)$, we have that

We know that $\lim_{t\rightarrow\infty} |\tilde W_i(t;0)|$ – the argument for this is identical to Lemma 2. Further notice for all $t \geq t_0$ we have

Thus,

As required both terms on the righthand-side go to zero. $\square$

Proof of Prop 3. To prove the proposition we will show that $G(t)$ at some point must remain bounded. Suppose $t_0$ is a time just after $G(t)$ increased. Note if $x_i(t)$ grew unboundedly then we could chose $t_0$ as large as we like.

So we’d know by Lemma 5 that there exists a $t_0$ such that for all $t \geq t_0$, $|\tilde W_i(t;t_0)| \leq \epsilon/2$. Thus applying this to Lemma 4 we see that

Thus since $M(t) < G(t) (1+\epsilon)$ for all $t \geq t_0$. So $G(t)$ can not longer increase and thus we arrive at a contradiction. $x_i(t)$ must be bounded in $t$. $\square$

Literature

The Robbin-Munro proceedure is due to Robbins and Munro and the Robbin Sigmund Theorem is due to Robbins and Sigmund. Good notes are given by Francis Bach. The Asynchronous implementation is due to Tsitsiklis, though we replace the a couple of results their with the Robbins-Sigmund theorem.

Robbins, Herbert; Monro, Sutton. A Stochastic Approximation Method. Ann. Math. Statist. 22 (1951), no. 3, 400–407. doi:10.1214/aoms/1177729586.

Robbins, Herbert, and David Siegmund. “A convergence theorem for non negative almost supermartingales and some applications.” Optimizing methods in statistics. Academic Press, 1971. 233-257.

Francis Bach. Optimisation et Apprentissage Statistique https://www.di.ens.fr/~fbach/orsay2019.html

Tsitsiklis, John N. “Asynchronous stochastic approximation and Q-learning.” Machine learning 16.3 (1994): 185-202.