# Robbins-Monro

We review a method for finding fixed points then extend it to slightly more general, modern proofs. This is a much more developed version of an earlier post. We now cover the basic Robbin-Monro proof, Robbins-Siegmund Theorem, Stochastic Gradient Descent and Asynchronous update (as is required for Q-learning).

Often it is important to find a solution to the equation

by evaluating $g$ at a sequence of points. For instance Newton’s method would perform the updates $x_{n+1} = x_n - g(x_n)/g'(x_n)$. However, Robbins and Monro consider the setting where we cannot directly observe $g$ but we might observe some random variable whose mean is $g(x)$. Thus we observe

and hope solve for $g(x) = 0$. Notice in this setting, even if we can find $g'(x)$, Newton’s method may not converge. The key idea of Robbins and Monro is to use a schema where

where we chose the sequence $\{\alpha_n\}_{n=0}^\infty$ so that

Before proceeding here are a few different use cases:

• Quartiles. We want to find $x$ such that $P(X \leq x ) = p$ for some fixed $p$. But we can only sample the random variable $X$.
• Regression. We preform regression $g(x) = \beta_0 + \beta_1 x$, but rather than estimate $\beta$ we want to know where $g(x)=0$.
• Optimization. We want to optimize a convex function $f(x)$ whose gradient is $g(x)$. Assume that $f(x) = \sum_{k=1}^K f_k(x)$ for some large $K$. To find the optimum at $g(x)=0$ we randomly sample (uniformly) $f_k(x)$ whose gradient, $g_k(x)$, is an bias estimate of $g(x)$.

The following result contains the key elements of the Robbins-Monro proof

Prop 1. Suppose that $z_{n}$ is a positive sequence such that

where $a_n$ and $c_n$ are positive sequences such that

then $\lim_{n\rightarrow\infty} z_n = 0$.

Proof. We can assume that equality holds, i.e., $z_{n+1} = z_n (1-a_n) + c_n$. We can achieve this by increasing $a_n$ or decreasing $c_n$ in the inequality ; neither of which effect the conditions on $a_n$ and $b_n$, .

Now for all $n$ we have the following lower-bound

Since $\sum c_k < \infty$ it must be that $\sum a_kz_k < \infty$. Thus since both sums converge it must be that $\lim_n z_n$ converges. Finally since $\sum a_k = \infty$ and $\sum a_kz_k < \infty$ it must be that $\lim_n z_n =0$. $\square$

The following proposition is a Martingale version of the above result.

Thrm 1 [Robbins-Siegmund Theorem] If

for positive adaptive RVs $Z_n,a_n, b_n, c_n$ such that with probability 1,

then $\lim_{n\rightarrow\infty} z_n = 0$.

Proof. The results is some manipulations analogous to the Robbins-Monro proof and a bunch of nice reductions to Doob’s Martingale Convergence Theorem.

First note the result is equivalent to proving the result with $b_n = 0$ for all $n$. If we divide both sides of by $\prod_{m=0}^n(1-b_m)$ we get

where $a'_n = a_n / (1+b_n)$, $c'_n = c_n / \prod_{m=0}^n(1+b_m)$ and $Z'_n = Z_n / \prod_{m=0}^n(1+b_m)$. Notice since $\sum b_n$ converges then so does $\prod (1+b_n)$. Thus $a'_n$, $c'_n$ and $Z'_n$ have the same convergence properties as those required for $a_n$, $c_n$ and $Z_n$. Thus, we now assume $b_n=0$ for all $n$.

Now notice

is a super-martingale. We want to use Doob’s Martingale convergence theorem; however, we need to apply a localization argument to apply this. Specifically, let $\tau_C = \inf \{ n \geq 0 : \sum_{k=1}^n c'_k > C \}$. This is a stopping time. Notice

So $Y_{n\wedge\tau_C}$ is a super-martingale and below by $-C$. Thus by Doob’s Martingale Convergnce Theorem, $\lim_{n\rightarrow\infty} Y_{n\wedge\tau_C}$ exists for each $C>0$, and $\tau_C = \infty$ for some $C$, since $\sum c'_k <\infty$. Thus $\lim_{n\rightarrow\infty} Y_{n}$ exists.

Now notice

So like in the last proposition, since $\lim Y_n$ and $\sum c'_k$ exists, we see that $\sum_{k=1}^\infty a'_k Z'_k$ converges. And thus $Z'_{n+1}$ converges.

Finally since we assume $\sum_k a'_k = \infty$ and we know that $\sum_{k=1}^\infty a'_k Z'_k < \infty$ it must be that $Z'_k$ converges to zero.

Suppose that we have some function $F : \mathbb R^p \rightarrow \mathbb R$

that we wish to minimize. We suppose that the function $f(X; \theta)$ is known and so is its gradient $g(\theta;X)$, where $\mathbb E [g(\theta;X) ] = G(\theta)$ is the gradient of $F(\theta)$. The difficulty is that we do not have direct access to the distribution of $X$, but we can draw random samples $X_1,X_2,\dots$. We can use the Robbins-Monro Schema to optimize $F(\theta)$. Specifically we takewhere $g_n (\theta ) = g(\theta ; X_n)$ and $\epsilon_n = G(\theta) - g_n (\theta )$ . The above sequence is often referred to as Stochastic Gradient Descent. We chose the sequence $\{\alpha_n\}_{n=0}^\infty$ so that

(Note here we may assume that $\alpha_n$ is a function of previous parameters and observations $\theta_1,...,\theta_{n-1}$ and $X_1,...,X_{n-1}$.) We let $||\cdot ||_2$ be the Euclidean norm. We can prove that convergence $\theta_n$ to the minimizer of $F(\theta)$.

Thrm 2 [Stochastic Gradient Descent] Suppose that $\theta_n$, $G(\cdot)$, and $\epsilon_n$ in Stochastic Gradient Descent satisfy the following conditions

1. $\exists \theta^*$ such that $\forall \theta$, $G(\theta) \cdot ( \theta - \theta^*) \geq \mu || \theta - \theta^* ||_2^2$
2. $||G(\theta_n)||_2^2 \leq A+ B ||\theta_n ||_2^2$
3. $\mathbb E [ || \epsilon_n||_2^2 | \mathcal F_n ] \leq K$

Then $\lim_n \theta_n = \theta^*$ where $\theta^*= \text{argmin}_\theta F(\theta)$ and assuming $\alpha_n$ are deterministic then $\lim \mathbb E [ || \theta_n - \theta^* ||_2^2 ] =0$

Let’s quickly review the conditions above. First consider Condition 1. Note condition $1$ implies moving in the direction of $\theta^*$ always decreases the $F(\theta)$, so $\theta^*$ minimizes $F$. The statement $( G(\theta) - G(\phi) ) \cdot ( \theta - \phi) \geq \mu || \theta - \phi ||^2$ is equivalent to $F(\theta)$ being strongly convex. So this is enough to give Condition 1. Condition 2 can be interpreted as a gradient condition, or that the steps $\theta_n$ do not grow unboundedly. Condition 3 is natural given our analysis so far.

Proof.

Taking expecations with respect to $\mathbb E [ | \mathcal F_n]$ we get

Thus, rearranging

Thus by Thrm 1, $\theta_{n+1} \rightarrow \theta^*$. Further taking expectations on both sides above we have

We can apply Prop 1 (note that $a_n = \alpha_n \mu + \alpha_n^2 B$ will be positive for suitably large $n$), to give that $\mathbb E || \theta_{n+1} - \theta^* ||_2^2 ] \rightarrow 0$ as $n\rightarrow \infty$ as required. $\square$

Finally we remark that in the proof we analyzed $|| \theta_n - \theta^* ||_2$ but equally we could have analyzed $F(\theta_n) - F(\theta^*)$ instead.

## Fixed Points and Asynchronous Update

We now consider Robbins-Monro from a slightly different perspective. Suppose we have a continuous function $F: \mathbb R^p \rightarrow \mathbb R^p$ and we wish to find a fixed point $x^*$ such that $F(x^*) =x^*$. We assume that $F(\cdot)$ is a contraction namely that, for some $\beta \in (0,1)$,

Here $|| x ||_\infty = \max_{i=1,...,p} | x_i |$. (Note this contraction condition implies the existence of a fixed point). (Note the previous analysis was somewhat restricted to euclidean space.) If we suppose that we do not observe $F(x)$ but instead some perturbed version whose mean is $F(x)$, then we can perform the Robbins-Monro update for each component $i=1,...p$:

where $\alpha_i(t)$ is a sequence such that for all $i$

for some constant $C$. Further we suppose that $\epsilon_i(t-1)$ is measurable with respect to $\mathcal F_{t}$, the filtration generated by $\{\alpha_i(s), x_i(s)\}_{s\leq t}$ measurable and

Note that in the above we let the step rule depend on $i$. For instance at each time $t$ we could chose to update one component only at each step, e.g., to update component $i$ only, we would set $\alpha_j(t) = 0$ for all $j \neq i$. Thus we can consider this step rule to be asynchronous.

We can analyze the convergence of this similarly

Thrm 3. Suppose that $F(\cdot)$ is a contraction with respect to $||\cdot||_\infty$ , suppose the vector $x(t)$ obeys the step rule with step sizes satisfying and further suppose the noise terms satisfy , then

where $x^*$ is the fixed point $F(x^*) = x^*$.

We will prove the result under the assumption that $x(t)$ is bounded in $t$, this is the proposition, Prop 3, below. We then prove that $x(t)$ is bounded in $t$ to complete the proof.

Prop 2. If we further assume that

with probability 1 then Thrm 3 holds.

We may assume with out loss of generality that $x^* = 0$, since the recursion above is equivalent to

Given the assumption above that $|| x(t)||_{\infty} \leq D_0$ for all $t$, further define

Here we choose $\epsilon$ so that $(1+2 \epsilon) \beta <1$ so that $D_k \rightarrow 0$. By induction, we will show that, given $|| x(t) ||_{\infty} < D_k$ for all $t\geq \tau_k$ for some $\tau_l$, then there exists a $\tau_{k+1}$ such that for all $t \geq \tau_{k+1}$

We use two recursions to bound the behavior of $x_i(t)$:

for $t\geq \tau_k$, where $W_i(\tau_k) = 0$ and $Y(\tau_k)=0$. We use $W_i(t)$ to summarize the effect of noise on the recursion for $x_i(t)$ and we use $Y_i(t)$ to bound the error arising from the function $F_i(x)$ in the recursion for $x_i(t)$. Specifically we show that

in Lemma 1 below. Further we notice that is a Robbin-Monro recursion for $W_i(t)$ to go to zero and $Y_i(t)$ to go to $\beta D_k$.

Lemma 1. $\forall t_0\geq \tau_k$

Proof. We prove the result by induction. The result is clearly true for $t=\tau_k$. In the inequality above with apply the induction hypothesis on $x_i(t)$ and bounds of $F_i$. The second equality just applies the definitions of $Y_i$ and $W_i$. Similar inequalities hold in the other direction and give the result. $\square$

Lemma 2.

Proof. Letting $W(t)=W(t,0)$, we know

From the Robbins-Siegmund Theorem (Prop 1), we know that

$\square$

Lemma 3.

Proof. Notice The result holds since $\sum \alpha_i (t) = \infty$. $\square$

We can now prove Prop [Tsit:Prop].

Proof of Prop 2. We know that $|| x(t)||_{\infty} \leq D_0$ for all $t$ and we assume $|| x(t)||_{\infty} \leq D_k$ for all $t\geq \tau_k$. By Prop 3 and then by Lemmas 2 and 3

as required. Thus these exists $\tau_{k+1}$ such that $\sup_{t\geq \tau_{k+1}} || x (t) ||_\infty \leq D_{k+1}$. Thus by induction we see that $\sup_{t\geq \tau_{k}} || x (t) ||_\infty$ decreases through sequence of levels $D_k$ as $k\rightarrow \infty$, thus $x(t)$ goes to zero as required.$\square$

### Proving Boundedness of $x(t)$

We now prove that $x(t)$ remains bounded

Prop 3.

To prove this proposition, we define a processes that bounds the max of $||x(t)||_\infty$ from above in increments of size $(1+\epsilon)$. Specfically we let

and we define $G(0) = \max \{ M(0), G_0 \}$ and let

where in the above $k$ is the smallest integer such that $M(t+1) \leq (1+\epsilon)^k G(t)$. Note that $G(t)$ is adapted. Further note $| F_i(x) | \leq \gamma \max \{ \max_j |x_j|,G_0 \}$ for some $G_0$ and $\gamma < 1$, since $\beta y +c \leq \gamma y\vee G_0$ for suitable choice of $\gamma$ and $G_0$. We use $G_0$ in the definition of $G(t)$ above and we choose $\epsilon$ so that $(1+\epsilon) \gamma <1$.

Also we define $\tilde W_i(t_0; t_0) = 0$ and

Notice, like before, $\tilde W$ is a Robbin-Monro recursion that goes to zero and $x_i(t)$ is bounded by a recursion of this type.

Lemma 4. If $G(t) = G(t_0)$ for $t \geq t_0$ then

Proof. The result is somewhat similar to Lemma [Tsit:bdd]. At $t_0$, we have that Now assuming that the bound is true at time $t$.

Above we bound $x(t)$ knowing the bound holds at time $t$ and we bound $F$ using the fact that we know $G(t)$ has not yet increased. We then use the fact we chose $\epsilon$ so that $\gamma (1+\epsilon) < 1$. A similar bound holds on the other side. $\square$

Lemma 5.

Proof. Since $G(t)$ is adapted and from our assumptions on $w_i(t)$, we have that

We know that $\lim_{t\rightarrow\infty} |\tilde W_i(t;0)|$ – the argument for this is identical to Lemma 2. Further notice for all $t \geq t_0$ we have

Thus,

As required both terms on the righthand-side go to zero. $\square$

Proof of Prop 3. To prove the proposition we will show that $G(t)$ at some point must remain bounded. Suppose $t_0$ is a time just after $G(t)$ increased. Note if $x_i(t)$ grew unboundedly then we could chose $t_0$ as large as we like.

So we’d know by Lemma 5 that there exists a $t_0$ such that for all $t \geq t_0$, $|\tilde W_i(t;t_0)| \leq \epsilon/2$. Thus applying this to Lemma 4 we see that

Thus since $M(t) < G(t) (1+\epsilon)$ for all $t \geq t_0$. So $G(t)$ can not longer increase and thus we arrive at a contradiction. $x_i(t)$ must be bounded in $t$. $\square$