Robbins-Monro

We review a method for finding fixed points and then extend it to slightly more general, modern proofs. This is a much more developed version of an earlier post. We now cover the basic Robbins-Monro proof, the Robbins-Siegmund Theorem, Stochastic Gradient Descent and asynchronous updates (as required for Q-learning).

Often it is important to find a solution to the equation

$g(x) = 0$

by evaluating $g$ at a sequence of points. For instance Newton’s method would perform the updates $x_{n+1} = x_n - g(x_n)/g'(x_n)$. However, Robbins and Monro consider the setting where we cannot directly observe $g$ but we can observe a random variable whose mean is $g(x)$. Thus we observe

$\label{RM:Regress} y_n = g(x_n) + \epsilon_n$

and hope to solve $g(x) = 0$. Notice that in this setting, even if we can find $g'(x)$, Newton’s method may not converge. The key idea of Robbins and Monro is to use a schema where

$\label{RM:RM} x_{n+1} = x_n - \alpha_n y_n \tag{RM}$

where we choose the sequence $\{\alpha_n\}_{n=0}^\infty$ so that

$\sum_{n} \alpha_n = \infty, \qquad \sum_{n} \alpha_n^2 < \infty\, .$

Before proceeding here are a few different use cases:

Quantiles. We want to find $x$ such that $P(X \leq x ) = p$ for some fixed $p$. But we can only sample the random variable $X$.
Regression. We perform regression $g(x) = \beta_0 + \beta_1 x$, but rather than estimate $\beta$ we want to know where $g(x)=0$.
Optimization. We want to optimize a convex function $f(x)$ whose gradient is $g(x)$. Assume that $f(x) = \sum_{k=1}^K f_k(x)$ for some large $K$. To find the optimum at $g(x)=0$ we randomly sample (uniformly) $f_k(x)$, whose gradient $g_k(x)$ is, after scaling by $K$, an unbiased estimate of $g(x)$.
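As a rough illustration of the quantile use case, the sketch below runs the update (RM) with $y_n = \mathbb 1\{X_n \leq x_n\} - p$, which has mean $g(x_n) = P(X \leq x_n) - p$. The function name `robbins_monro_quantile`, the step size $\alpha_n = 1/(n+1)$ and the uniform test distribution are assumptions made for this example only.

```python
import random


def robbins_monro_quantile(sample, p, n_steps, x0=0.0):
    """Estimate the p-quantile of a distribution using only draws from it."""
    x = x0
    for n in range(n_steps):
        alpha = 1.0 / (n + 1)  # sum alpha_n = infinity, sum alpha_n^2 < infinity
        # Noisy observation of g(x) = P(X <= x) - p.
        y = (1.0 if sample() <= x else 0.0) - p
        x = x - alpha * y  # the update (RM)
    return x


# Illustrative run: X uniform on [0, 1], so the 0.3-quantile is 0.3.
rng = random.Random(0)
estimate = robbins_monro_quantile(rng.random, p=0.3, n_steps=20000)
```

Only the indicator $\mathbb 1\{X_n \leq x_n\}$ is used at each step; the distribution function itself is never evaluated.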

The following result contains the key elements of the Robbins-Monro proof

Prop 1. Suppose that $z_{n}$ is a positive sequence such that

$\label{RM:zInq} z_{n+1} \leq z_n (1-a_n) + c_n$

where $a_n$ and $c_n$ are positive sequences such that

$\label{RM:ab} \sum_n a_n =\infty,\qquad \text{ and }\qquad \sum_n c_n < \infty$

then $\lim_{n\rightarrow\infty} z_n = 0$ .

Proof. We can assume that equality holds, i.e., $z_{n+1} = z_n (1-a_n) + c_n$. We can achieve this by increasing $a_n$ or decreasing $c_n$ in the inequality; neither of which affects the summability conditions on $a_n$ and $c_n$.

Now for all $n$ we have the following lower-bound

$-z_0 \leq z_n -z_0 = \sum_{k=0}^{n-1} ( z_{k+1} - z_k ) = \sum_{k=0}^{n-1} c_k - \sum_{k=0}^{n-1} a_k z_k$

Since $\sum c_k < \infty$ it must be that $\sum a_kz_k < \infty$. Thus, since both sums converge, $z_n$ converges. Finally, if the limit were some $z > 0$ then $\sum a_k z_k \geq (z/2) \sum a_k = \infty$ over large $k$, a contradiction. So since $\sum a_k = \infty$ and $\sum a_kz_k < \infty$ it must be that $\lim_n z_n =0$. $\square$
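To see Prop 1 numerically, here is a small sketch that iterates the recursion with equality. The particular choices $a_n = 1/(n+2)$ and $c_n = 1/(n+2)^2$, and the function name `iterate_prop1`, are assumptions for illustration; they satisfy $\sum_n a_n = \infty$ and $\sum_n c_n < \infty$.

```python
def iterate_prop1(z0, n_steps):
    """Iterate z_{n+1} = z_n (1 - a_n) + c_n and return z after n_steps steps."""
    z = z0
    for n in range(n_steps):
        a_n = 1.0 / (n + 2)       # sum of a_n diverges
        c_n = 1.0 / (n + 2) ** 2  # sum of c_n converges
        z = z * (1.0 - a_n) + c_n
    return z


z_small = iterate_prop1(1.0, 1_000)
z_large = iterate_prop1(1.0, 100_000)
```

For this choice $(n+1) z_n$ is a harmonic sum, so $z_n$ decays like $\log(n)/n$: slowly, but to zero as the proposition states.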

The following proposition is a Martingale version of the above result.

Thrm 1 [Robbins-Siegmund Theorem] If

$\label{RM:RS} \mathbb E [Z_{n+1} | \mathcal F_n ] \leq (1-a_n+b_n) Z_n +c_n$

for positive adapted random variables $Z_n,a_n, b_n, c_n$ such that, with probability 1,

$\sum_n a_n = \infty,\qquad \sum_n b_n <\infty,\qquad \text{and}\qquad \sum_n c_n <\infty$

then $\lim_{n\rightarrow\infty} Z_n = 0$ with probability 1.

Proof. The result follows from manipulations analogous to the proof of Prop 1 and a couple of reductions to Doob’s Martingale Convergence Theorem.

First note that it suffices to prove the result with $b_n = 0$ for all $n$. If we divide both sides of the inequality by $\prod_{m=0}^n(1+b_m)$ and use $(1-a_n+b_n)/(1+b_n) = 1 - a_n/(1+b_n)$, we get

$\mathbb E [Z'_{n+1} | \mathcal F_n ] \leq (1-a'_n) Z'_n +c'_n\,,$

where $a'_n = a_n / (1+b_n)$, $c'_n = c_n / \prod_{m=0}^n(1+b_m)$ and $Z'_n = Z_n / \prod_{m=0}^{n-1}(1+b_m)$. Notice that since $\sum b_n$ converges, so does $\prod (1+b_n)$. Thus $a'_n$, $c'_n$ and $Z'_n$ have the same convergence properties as those required for $a_n$, $c_n$ and $Z_n$. Thus, we now assume $b_n=0$ for all $n$.

Now notice

$Y_n = Z'_n + \sum_{k=0}^{n-1} a'_k Z'_k - \sum_{k=0}^{n-1} c'_k$

is a super-martingale. We want to use Doob’s Martingale Convergence Theorem; however, we need a localization argument to apply it. Specifically, let $\tau_C = \inf \{ n \geq 0 : \sum_{k=0}^n c'_k > C \}$. This is a stopping time. Notice

$Y_{n\wedge\tau_C} \geq -\sum_{k=0}^{n\wedge\tau_C-1} c'_k \geq -C.$

So $Y_{n\wedge\tau_C}$ is a super-martingale bounded below by $-C$. Thus by Doob’s Martingale Convergence Theorem, $\lim_{n\rightarrow\infty} Y_{n\wedge\tau_C}$ exists for each $C>0$, and $\tau_C = \infty$ for some $C$, since $\sum c'_k <\infty$. Thus $\lim_{n\rightarrow\infty} Y_{n}$ exists.

Now notice

$\sum_{k=0}^n c'_k - \sum_{k=0}^n a'_k Z'_k = Z'_{n+1} - Y_{n+1} \geq - Y_{n+1}.$

So, as in the last proposition, since $\lim Y_n$ and $\sum c'_k$ exist, the increasing sums $\sum_{k=0}^n a'_k Z'_k$ are bounded and so $\sum_{k=0}^\infty a'_k Z'_k$ converges. And thus $Z'_{n+1}$ converges.

Finally since we assume $\sum_k a'_k = \infty$ and we know that $\sum_{k=0}^\infty a'_k Z'_k < \infty$ it must be that $Z'_n$ converges to zero, and hence so does $Z_n$. $\square$
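For completeness, the super-martingale property of $Y_n$ used in the proof can be checked directly from the definition of $Y_n$ and the inequality with $b_n = 0$. The following is a short worked step using only the quantities already defined:

$\begin{aligned} \mathbb E [ Y_{n+1} | \mathcal F_n ] & = \mathbb E [ Z'_{n+1} | \mathcal F_n ] + \sum_{k=0}^{n} a'_k Z'_k - \sum_{k=0}^{n} c'_k \\ & \leq (1-a'_n) Z'_n + c'_n + \sum_{k=0}^{n} a'_k Z'_k - \sum_{k=0}^{n} c'_k \\ & = Z'_n + \sum_{k=0}^{n-1} a'_k Z'_k - \sum_{k=0}^{n-1} c'_k = Y_n \, .\end{aligned}$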

Stochastic Gradient Descent

Suppose that we have some function $F : \mathbb R^p \rightarrow \mathbb R$

$F(\theta) = \mathbb E_X [ f(X ; \theta) ]$

that we wish to minimize. We suppose that the function $f(X; \theta)$ is known and so is its gradient $g(\theta;X)$, where $\mathbb E [g(\theta;X) ] = G(\theta)$ is the gradient of $F(\theta)$. The difficulty is that we do not have direct access to the distribution of $X$, but we can draw random samples $X_1,X_2,\dots$. We can use the Robbins-Monro schema to optimize $F(\theta)$. Specifically we take

$\theta_{n+1} = \theta_n - \alpha_n g_n(\theta_n) = \theta_n - \alpha_n G(\theta_n) + \alpha_n \epsilon_n$

where $g_n (\theta ) = g(\theta ; X_n)$ and $\epsilon_n = G(\theta_n) - g_n (\theta_n )$. The above sequence is often referred to as Stochastic Gradient Descent. We choose the sequence $\{\alpha_n\}_{n=0}^\infty$ so that

$\sum_{n} \alpha_n = \infty, \qquad \sum_{n} \alpha_n^2 < \infty\, .$

(Note here we may assume that $\alpha_n$ is a function of previous parameters and observations $\theta_1,...,\theta_{n-1}$ and $X_1,...,X_{n-1}$.) We let $||\cdot ||_2$ be the Euclidean norm. We can prove convergence of $\theta_n$ to the minimizer of $F(\theta)$.

Thrm 2 [Stochastic Gradient Descent] Suppose that $\theta_n$ , $G(\cdot)$ , and $\epsilon_n$ in Stochastic Gradient Descent satisfy the following conditions

1. $\exists \theta^*$ such that $\forall \theta$, $G(\theta) \cdot ( \theta - \theta^*) \geq \mu || \theta - \theta^* ||_2^2$
2. $||G(\theta_n)||_2^2 \leq A+ B ||\theta_n - \theta^* ||_2^2$
3. $\mathbb E [ || \epsilon_n||_2^2 | \mathcal F_n ] \leq K$

Then $\lim_n \theta_n = \theta^*$ with probability 1, where $\theta^*= \text{argmin}_\theta F(\theta)$, and, assuming the $\alpha_n$ are deterministic, $\lim_n \mathbb E [ || \theta_n - \theta^* ||_2^2 ] =0$.

Let’s quickly review the conditions above. First consider Condition 1. Note that Condition 1 implies that moving in the direction of $\theta^*$ always decreases $F(\theta)$, so $\theta^*$ minimizes $F$. The statement $( G(\theta) - G(\phi) ) \cdot ( \theta - \phi) \geq \mu || \theta - \phi ||^2$ is equivalent to $F(\theta)$ being strongly convex, so strong convexity is enough to give Condition 1. Condition 2 can be interpreted as a growth condition on the gradient, or as saying that the steps $\theta_n$ do not grow unboundedly. Condition 3 is natural given our analysis so far.

Proof.

$\begin{aligned} & || \theta_{n+1} - \theta^* ||_2^2 - || \theta_{n} - \theta^* ||_2^2 \\ = & - 2\alpha_n G(\theta_n) \cdot ( \theta_n -\theta^* ) + 2\alpha_n \epsilon_n \cdot ( \theta_n - \alpha_n G(\theta_n) -\theta^* ) + \alpha^2_n || \epsilon_n||_2^2 + \alpha^2_n ||G(\theta_n) ||_2^2\end{aligned}$

Taking conditional expectations given $\mathcal F_n$, the term that is linear in $\epsilon_n$ vanishes and we get

$\begin{aligned} & \mathbb E [ || \theta_{n+1} - \theta^* ||_2^2 - || \theta_{n} - \theta^* ||_2^2 | \mathcal F_n] \\ = & - 2\alpha_n G(\theta_n) \cdot ( \theta_n -\theta^* ) + \alpha^2_n \mathbb E [ || \epsilon_n||_2^2 | \mathcal F_n ] + \alpha^2_n ||G(\theta_n) ||_2^2 \\ \leq & - 2\alpha_n \mu || \theta_n -\theta^* ||_2^2 + \alpha^2_n K + \alpha^2_n (A+ B ||\theta_n -\theta^* ||_2^2)\end{aligned}$

Thus, rearranging

$\mathbb E [ || \theta_{n+1} - \theta^* ||_2^2 | \mathcal F_n] \leq (1- 2\alpha_n \mu + \alpha_n^2 B ) || \theta_{n} - \theta^* ||_2^2 + \alpha_n^2 (K+A)$

Thus by Thrm 1 (with $a_n = 2\alpha_n \mu$, $b_n = \alpha_n^2 B$ and $c_n = \alpha_n^2 (K+A)$), $\theta_{n} \rightarrow \theta^*$. Further, taking expectations on both sides above we have

$\mathbb E [ || \theta_{n+1} - \theta^* ||_2^2 ] \leq (1- 2\alpha_n \mu + \alpha_n^2 B ) \mathbb E [ || \theta_{n} - \theta^* ||_2^2 ] + \alpha_n^2 (K+A)$

We can apply Prop 1 (note that $a_n = 2\alpha_n \mu - \alpha_n^2 B$ will be positive for suitably large $n$), to give that $\mathbb E [ || \theta_{n+1} - \theta^* ||_2^2 ] \rightarrow 0$ as $n\rightarrow \infty$ as required. $\square$

Finally we remark that in the proof we analyzed $|| \theta_n - \theta^* ||_2$ but equally we could have analyzed $F(\theta_n) - F(\theta^*)$ instead.
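A minimal sketch of Stochastic Gradient Descent on a strongly convex example may help here. We assume, purely for illustration, that $f(X;\theta) = \tfrac12 ||\theta - X||_2^2$, so that $g(\theta;X) = \theta - X$, $\mu = 1$ and the minimizer is $\theta^* = \mathbb E[X]$. The function name `sgd_mean`, the Gaussian sampler and the step size $\alpha_n = 1/(n+1)$ are choices made for this example.

```python
import random


def sgd_mean(sample, dim, n_steps):
    """SGD for F(theta) = E[0.5 * ||theta - X||^2], which is minimized at E[X]."""
    theta = [0.0] * dim
    for n in range(n_steps):
        alpha = 1.0 / (n + 1)  # Robbins-Monro step sizes
        x = sample()           # one draw X_n
        grad = [t - xi for t, xi in zip(theta, x)]  # g(theta; X_n) = theta - X_n
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Illustrative run: X is Gaussian with mean (2, -1) and unit variance per coordinate.
rng = random.Random(1)
true_mean = [2.0, -1.0]


def draw():
    return [m + rng.gauss(0.0, 1.0) for m in true_mean]


theta_hat = sgd_mean(draw, dim=2, n_steps=20000)
```

With this particular step size the iterate $\theta_n$ is exactly the running average of the samples, which is one way to see why the conditions $\sum_n \alpha_n = \infty$ and $\sum_n \alpha_n^2 < \infty$ are natural.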

Fixed Points and Asynchronous Update

We now consider Robbins-Monro from a slightly different perspective. Suppose we have a continuous function $F: \mathbb R^p \rightarrow \mathbb R^p$ and we wish to find a fixed point $x^*$ such that $F(x^*) =x^*$ . We assume that $F(\cdot)$ is a contraction namely that, for some $\beta \in (0,1)$ ,

$\label{RM:contract} || F(x) - F(y) ||_{\infty} \leq \beta || x- y ||_{\infty} \, .$

Here $|| x ||_\infty = \max_{i=1,...,p} | x_i |$. (Note this contraction condition implies the existence of a fixed point.) (Note the previous analysis was somewhat restricted to Euclidean space.) If we suppose that we do not observe $F(x)$ but instead some perturbed version whose mean is $F(x)$, then we can perform the Robbins-Monro update for each component $i=1,...,p$:

$\label{RM:Fix} x_i(t+1) = x_i(t) + \alpha_i(t) ( F_i(x(t)) - x_i(t) + \epsilon_i(t)) \tag{RM-Async}$

where $\alpha_i(t)$ is a sequence such that for all $i$

$\label{RM:stepfix} \sum_{t} \alpha_i(t) = \infty, \qquad \sum_{t} \alpha_i^2(t) < C\, . \tag{RM step}$

for some constant $C$. Further we suppose that $\epsilon_i(t-1)$ is measurable with respect to $\mathcal F_{t}$, the filtration generated by $\{\alpha_i(s), x_i(s)\}_{s\leq t}$, and

$\mathbb E [ \epsilon_i(t) | \mathcal F_t ] = 0 \qquad \text{and} \qquad \mathbb E [ \epsilon^2_i(t) | \mathcal F_t ] \leq A + B \max_j | x_j(t) |^2\, . \label{RM:noise}$

Note that in the above we let the step rule depend on $i$. For instance at each time $t$ we could choose to update one component only, e.g., to update component $i$ only, we would set $\alpha_j(t) = 0$ for all $j \neq i$. Thus we can consider this step rule to be asynchronous.
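The asynchronous update (RM-Async) can be sketched in a few lines. The example assumes a hypothetical affine map $F(x) = \beta P x + c$ with $P$ row-stochastic, which is a contraction in $||\cdot||_\infty$ with modulus $\beta$. The matrix `P`, vector `C`, Gaussian noise and step size (number of visits to component $i$ raised to the power $-0.7$) are all illustrative choices, not taken from the text.

```python
import random

BETA = 0.5
P = [[0.5, 0.5], [0.2, 0.8]]  # row-stochastic, so F below is a sup-norm contraction
C = [1.0, 2.0]


def F(x):
    """Affine contraction F(x) = BETA * P x + C."""
    return [BETA * sum(P[i][j] * x[j] for j in range(len(x))) + C[i]
            for i in range(len(x))]


def async_robbins_monro(n_steps, noise_std=0.1, seed=0):
    """Update one randomly chosen component per step from a noisy evaluation of F."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    visits = [0, 0]
    for _ in range(n_steps):
        i = rng.randrange(len(x))  # alpha_j(t) = 0 for every j != i
        visits[i] += 1
        alpha = visits[i] ** -0.7  # sum alpha = infinity, sum alpha^2 < infinity
        noisy = F(x)[i] + rng.gauss(0.0, noise_std)  # F_i(x(t)) + epsilon_i(t)
        x[i] += alpha * (noisy - x[i])
    return x


def fixed_point(n_iter=200):
    """Reference fixed point obtained by iterating the noiseless map."""
    x = [0.0, 0.0]
    for _ in range(n_iter):
        x = F(x)
    return x


x_async = async_robbins_monro(40000)
x_star = fixed_point()
```

Keeping a separate visit count per component is one simple way to make each $\alpha_i(t)$ satisfy the step-size conditions along the times at which component $i$ is actually updated.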

We can analyze the convergence of this similarly

Thrm 3. Suppose that $F(\cdot)$ is a contraction with respect to $||\cdot||_\infty$, suppose the vector $x(t)$ obeys the step rule (RM-Async) with step sizes satisfying (RM step), and further suppose the noise terms satisfy the conditional mean and variance conditions above, then

$\lim_{t\rightarrow\infty} x(t) = x^*$

where $x^*$ is the fixed point $F(x^*) = x^*$ .

We will prove the result under the assumption that $x(t)$ is bounded in $t$; this is Prop 2 below. We then prove that $x(t)$ is bounded in $t$ (Prop 3) to complete the proof.

Prop 2. If we further assume that

$\sup_t ||x(t)||_{\infty} < \infty \,$

with probability 1 then Thrm 3 holds.

We may assume without loss of generality that $x^* = 0$, since the recursion above is equivalent to $x_i(t+1)-x^*_i = x_i(t)-x^*_i + \alpha_i(t) ( F_i(x(t))- F_i(x^*) - x_i(t)+x^*_i + \epsilon_i(t)) \, .$

Given the assumption above, there is a $D_0$ such that $|| x(t)||_{\infty} \leq D_0$ for all $t$. Further define

$D_{k+1} = \beta ( 1 + 2 \epsilon ) D_k$

Here we choose $\epsilon$ so that $(1+2 \epsilon) \beta <1$, so that $D_k \rightarrow 0$. By induction, we will show that, given $|| x(t) ||_{\infty} < D_k$ for all $t\geq \tau_k$ for some $\tau_k$, there exists a $\tau_{k+1}$ such that for all $t \geq \tau_{k+1}$

$|| x(t) ||_{\infty} < D_{k+1}$

We use two recursions to bound the behavior of $x_i(t)$ :

$\begin{aligned} W_i(t+1 ) & = (1 - \alpha_i(t) ) W_i(t) + \alpha_i(t) w_i(t) \\ Y_i(t+1) & = (1-\alpha_i(t) ) Y_i(t) + \alpha_i(t) \beta D_k\, .\end{aligned}$

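for $t\geq \tau_k$, where $W_i(\tau_k) = 0$ and $Y_i(\tau_k)=D_k$. Here $w_i(t)$ denotes the noise term $\epsilon_i(t)$ from (RM-Async), renamed to avoid a clash with the constant $\epsilon$ above. We use $W_i(t)$ to summarize the effect of noise on the recursion for $x_i(t)$ and we use $Y_i(t)$ to bound the error arising from the function $F_i(x)$ in the recursion for $x_i(t)$. Specifically we show that

$-Y_i(t) + W_i(t) \leq x_i(t) \leq Y_i(t) + W_i(t)$

in Lemma 1 below. Further we notice that the two recursions above are Robbins-Monro recursions under which $W_i(t)$ goes to zero and $Y_i(t)$ goes to $\beta D_k$.

Lemma 1. For all $t \geq \tau_k$, $-Y_i(t) + W_i(t) \leq x_i(t) \leq Y_i(t) + W_i(t)$.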

Proof. We prove the result by induction. The result is clearly true for $t=\tau_k$. For the inductive step,

$\begin{aligned} x_i(t+1) & = (1-\alpha_i(t) ) x_i(t) + \alpha_i(t) F_i(x(t)) +\alpha_i(t) w_i(t) \\ & \leq ( 1 - \alpha_i(t) ) ( Y_i(t) + W_i(t ) ) + \alpha_i(t) \beta D_k + \alpha_i(t) w_i(t) \\ & = Y_i(t+1) + W_i(t+1) \end{aligned}$

In the inequality above we apply the induction hypothesis to $x_i(t)$ and the bound $|F_i(x(t))| \leq \beta ||x(t)||_\infty \leq \beta D_k$, which follows from the contraction property and $F(0) = 0$. The final equality just applies the definitions of $Y_i$ and $W_i$. Similar inequalities hold in the other direction and give the result. $\square$

Lemma 2. $\lim_{t \rightarrow \infty} | W_i(t) | =0$

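Proof. Dropping the index $i$ and writing $W(t)$ for $W_i(t)$, we know

$\mathbb E [ W(t+1)^2 | \mathcal F_t ] \leq (1- 2\alpha(t) + \alpha(t)^2) W(t)^2 + \alpha(t)^2 \mathbb E [ w (t)^2 | \mathcal F_t ].$

Since $x(t)$ is assumed bounded, the noise condition bounds $\mathbb E [ w(t)^2 | \mathcal F_t ]$, so the last term is summable. From the Robbins-Siegmund Theorem (Thrm 1) applied to $W(t)^2$, we know that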

$\lim_{t\rightarrow\infty} W(t) =0.$

$\square$

Lemma 3. $\lim_{t \rightarrow \infty} Y_i(t) = \beta D_k$

Proof. Notice $Y_i(t+1) - \beta D_k = (1-\alpha_i(t) ) ( Y_i(t) - \beta D_k ) = ... = \left( \prod_{s=\tau_k}^t (1-\alpha_i(s) ) \right) ( Y_i(\tau_k) - \beta D_k )$. The product tends to zero since $\sum \alpha_i (t) = \infty$, which gives the result. $\square$

We can now prove Prop 2.

Proof of Prop 2. We know that $|| x(t)||_{\infty} \leq D_0$ for all $t$ and we assume $|| x(t)||_{\infty} \leq D_k$ for all $t\geq \tau_k$. By Lemma 1 and then by Lemmas 2 and 3

$|x_i(t) | \leq Y_i(t) + | W_i(t ) | \xrightarrow[t\rightarrow\infty]{} \beta D_k$

as required. Since $\beta D_k < D_{k+1}$, there exists $\tau_{k+1}$ such that $\sup_{t\geq \tau_{k+1}} || x (t) ||_\infty \leq D_{k+1}$. Thus by induction we see that $\sup_{t\geq \tau_{k}} || x (t) ||_\infty$ decreases through the sequence of levels $D_k$ as $k\rightarrow \infty$, and so $x(t)$ goes to zero as required. $\square$
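The behaviour of the two auxiliary recursions can be checked with a short simulation. The sketch below assumes Gaussian noise $w(t)$, the step size $\alpha(t) = 1/(t+2)$ and the starting value $Y = D_k$; these are choices for this example, and any starting value for $Y$ gives the same limit. The function name `auxiliary_recursions` is hypothetical.

```python
import random


def auxiliary_recursions(D_k, beta, n_steps, seed=0):
    """Run the W (noise) and Y (drift) recursions side by side for one component."""
    rng = random.Random(seed)
    W, Y = 0.0, D_k
    for t in range(n_steps):
        alpha = 1.0 / (t + 2)
        w = rng.gauss(0.0, 1.0)  # zero-mean noise; Gaussian is assumed for illustration
        W = (1 - alpha) * W + alpha * w           # averages the noise away
        Y = (1 - alpha) * Y + alpha * beta * D_k  # relaxes towards beta * D_k
    return W, Y


W_T, Y_T = auxiliary_recursions(D_k=1.0, beta=0.5, n_steps=10000)
```

In this run $W$ ends close to zero and $Y$ ends close to $\beta D_k = 0.5$, which is the content of Lemmas 2 and 3.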

Proving Boundedness of $x(t)$

We now prove that $x(t)$ remains bounded

Prop 3. $\sup_t ||x(t)||_{\infty} < \infty \,$

To prove this proposition, we define a process that bounds the max of $||x(t)||_\infty$ from above in increments of size $(1+\epsilon)$. Specifically we let

$M(t) := \max_{\tau \leq t } ||x(\tau) ||_\infty$

and we define $G(0) = \max \{ M(0), G_0 \}$ and let

$G(t+1) = \begin{cases} G(t) & \text{if } M(t+1) < (1+\epsilon) G(t), \\ (1+\epsilon)^k G(t) & \text{if } M(t+1) \geq (1+\epsilon) G(t)\, . \end{cases}$

where in the above $k$ is the smallest integer such that $M(t+1) \leq (1+\epsilon)^k G(t)$. Note that $G(t)$ is adapted. Further note $| F_i(x) | \leq \gamma \max \{ \max_j |x_j|,G_0 \}$ for some $G_0$ and $\gamma < 1$. This holds because the contraction property gives $|F_i(x)| \leq \beta y + c$ with $y = \max_j |x_j|$ and $c = ||F(0)||_\infty$, and $\beta y +c \leq \gamma (y\vee G_0)$ for a suitable choice of $\gamma \in (\beta, 1)$ and $G_0$. We use $G_0$ in the definition of $G(t)$ above and we choose $\epsilon$ so that $(1+\epsilon) \gamma <1$.
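The definition of $G(t)$ is easier to read as code. The sketch below computes $G(t)$ along a given path of values $||x(t)||_\infty$; the function names `next_level` and `level_path` and the sample path are hypothetical, chosen only to show the geometric jumps.

```python
def next_level(G, M_next, eps):
    """One step of the G(t) recursion: compute G(t+1) from G(t) and M(t+1)."""
    if M_next < (1 + eps) * G:
        return G  # the running maximum is still below the next level
    k = 1
    while M_next > (1 + eps) ** k * G:  # smallest k with M(t+1) <= (1+eps)^k G(t)
        k += 1
    return (1 + eps) ** k * G


def level_path(norms, G0, eps):
    """G(t) along a path of sup-norms ||x(0)||, ||x(1)||, ..."""
    M = norms[0]
    G = max(M, G0)
    path = [G]
    for value in norms[1:]:
        M = max(M, value)  # M(t) is the running maximum
        G = next_level(G, M, eps)
        path.append(G)
    return path


levels = level_path([0.5, 1.2, 1.6, 5.0, 3.0], G0=1.0, eps=0.5)
# levels == [1.0, 1.0, 2.25, 5.0625, 5.0625]
```

The level only moves when the running maximum reaches $(1+\epsilon) G(t)$, and then it jumps by a power of $(1+\epsilon)$, so $G(t)$ is piecewise constant and non-decreasing.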

Also we define $\tilde W_i(t_0; t_0) = 0$ and

$\tilde W_i(t+1 ; t_0 ) = (1 - \alpha_i(t) ) \tilde W_i(t; t_0) + \alpha_i(t) \tilde w_i(t)\, , \quad \text{where} \quad \tilde w_i (t) = \frac{ w_i(t) }{ G(t) }\, .$

Notice, like before, $\tilde W$ is a Robbins-Monro recursion that goes to zero and $x_i(t)$ is bounded by a recursion of this type.

Lemma 4. If $G(t) = G(t_0)$ for $t \geq t_0$ then $| x_i(t) | \leq G(t_0) + \tilde W_i(t; t_0) G(t_0)\, .$

Proof. The argument is similar to that of Lemma 1. At $t_0$, we have that $| x_i(t_0) | \leq M(t_0) \leq G(t_0)\, .$ Now assume that the bound is true at time $t$. Then

$\begin{aligned} x_i(t+1) & = (1-\alpha_i(t)) x_i(t) + \alpha_i(t) F_i(x(t)) + \alpha_i(t) w_i(t) \\ & \leq (1-\alpha_i(t)) \{ G(t_0) + \tilde W_i(t;t_0) G(t_0) \} + \alpha_i(t) \gamma G(t_0) (1+\epsilon) + \alpha_i(t) \tilde w_i(t) G(t_0) \\ & \leq G(t_0) + [ (1-\alpha_i(t)) \tilde W_i(t;t_0) + \alpha_i(t) \tilde w_i(t) ] G(t_0) \\ & = G(t_0) + \tilde W_i(t+1;t_0) G(t_0)\, .\end{aligned}$

Above we bound $x_i(t)$ using the bound at time $t$, and we bound $F_i$ using the fact that $G(t)$ has not yet increased, so that $||x(t)||_\infty < (1+\epsilon) G(t_0)$. We then use the fact that we chose $\epsilon$ so that $\gamma (1+\epsilon) < 1$. A similar bound holds on the other side. $\square$

Lemma 5. $\lim_{t_0 \rightarrow \infty } \sup_{t\geq t_0 } | \tilde{W}_i(t;t_0) | = 0$

Proof. Since $G(t)$ is adapted and from our assumptions on $w_i(t)$ , we have that

$\mathbb E [ \tilde{w}_i(t) | \mathcal F_t ] = 0 \quad \text{and} \quad \mathbb E [ \tilde w_i(t)^2 | \mathcal F_t ] \leq K\, .$

We know that $\lim_{t\rightarrow\infty} |\tilde W_i(t;0)| = 0$; the argument for this is identical to Lemma 2. Further notice for all $t > t_0$ we have

$\begin{aligned} \tilde W_i(t; 0 ) - \tilde W_i(t; t_0 ) & = (1 - \alpha_i(t-1) ) \Big[ \tilde W_i(t-1; 0 ) - \tilde W_i(t-1; t_0 ) \Big] \\ & =\dots = \prod_{s=t_0}^{t-1} (1-\alpha_i(s) ) \cdot \tilde W_i(t_0; 0)\, .\end{aligned}$

Thus, since the product lies between 0 and 1,

$| \tilde W_i(t;t_0) | \leq | \tilde W_i(t_0;0) | + | \tilde W_i(t;0)| \,.$

As required, both terms on the right-hand side go to zero as $t_0 \rightarrow \infty$, uniformly over $t \geq t_0$. $\square$

Proof of Prop 3. To prove the proposition we will show that $G(t)$ must eventually stop increasing. Suppose $t_0$ is a time just after $G(t)$ increased. Note that if $x_i(t)$ grew unboundedly then we could choose $t_0$ as large as we like.

So, by Lemma 5, there would exist such a $t_0$ with $|\tilde W_i(t;t_0)| \leq \epsilon/2$ for all $t \geq t_0$. Applying this to Lemma 4 we see that

$| x_i(t) | \leq G(t_0) + \tilde W_i(t; t_0) G(t_0) \leq G(t_0) (1+\epsilon/2) \, .$

Thus $M(t) < G(t_0) (1+\epsilon)$ for all $t \geq t_0$. So $G(t)$ can no longer increase and we arrive at a contradiction. Hence $x_i(t)$ must be bounded in $t$. $\square$

Literature

The Robbins-Monro procedure is due to Robbins and Monro and the Robbins-Siegmund Theorem is due to Robbins and Siegmund. Good notes are given by Francis Bach. The asynchronous implementation is due to Tsitsiklis, though we replace a couple of the results there with the Robbins-Siegmund Theorem.

Robbins, Herbert; Monro, Sutton. A Stochastic Approximation Method. Ann. Math. Statist. 22 (1951), no. 3, 400–407. doi:10.1214/aoms/1177729586.

Robbins, Herbert, and David Siegmund. “A convergence theorem for non negative almost supermartingales and some applications.” Optimizing methods in statistics. Academic Press, 1971. 233-257.

Francis Bach. Optimisation et Apprentissage Statistique https://www.di.ens.fr/~fbach/orsay2019.html

Tsitsiklis, John N. “Asynchronous stochastic approximation and Q-learning.” Machine learning 16.3 (1994): 185-202.
