We review a method for finding fixed points then extend it to slightly more general, modern proofs. This is a much more developed version of an earlier post. We now cover the basic Robbin-Monro proof, Robbins-Siegmund Theorem, Stochastic Gradient Descent and Asynchronous update (as is required for Q-learning).

Often it is important to find a solution to the equation

by evaluating at a sequence of points. For instance Newton’s method would perform the updates . However, Robbins and Monro consider the setting where we cannot directly observe but we might observe some random variable whose mean is . Thus we observe

and hope solve for . Notice in this setting, even if we can find , Newton’s method may not converge. The key idea of Robbins and Monro is to use a schema where

where we chose the sequence so that

Before proceeding here are a few different use cases:

**Quartiles.**We want to find such that for some fixed . But we can only sample the random variable .**Regression.**We preform regression , but rather than estimate we want to know where .**Optimization.**We want to optimize a convex function whose gradient is . Assume that for some large . To find the optimum at we randomly sample (uniformly) whose gradient, , is an bias estimate of .

The following result contains the key elements of the Robbins-Monro proof

**Prop 1.** Suppose that is a positive sequence such that

where and are positive sequences such that

then .

**Proof.** We can assume that equality holds, i.e., . We can achieve this by increasing or decreasing in the inequality ; neither of which effect the conditions on and , .

Now for all we have the following lower-bound

Since it must be that . Thus since both sums converge it must be that converges. Finally since and it must be that .

The following proposition is a Martingale version of the above result.

**Thrm 1** [Robbins-Siegmund Theorem] If

for positive adaptive RVs such that with probability 1,

then .

**Proof.** The results is some manipulations analogous to the Robbins-Monro proof and a bunch of nice reductions to Doob’s Martingale Convergence Theorem.

First note the result is equivalent to proving the result with for all . If we divide both sides of by we get

where , and . Notice since converges then so does . Thus , and have the same convergence properties as those required for , and . Thus, we now assume for all .

Now notice

is a super-martingale. We want to use Doob’s Martingale convergence theorem; however, we need to apply a localization argument to apply this. Specifically, let . This is a stopping time. Notice

So is a super-martingale and below by . Thus by Doob’s Martingale Convergnce Theorem, exists for each , and for some , since . Thus exists.

Now notice

So like in the last proposition, since and exists, we see that converges. And thus converges.

Finally since we assume and we know that it must be that converges to zero.

## Stochastic Gradient Decent

Suppose that we have some function

that we wish to minimize. We suppose that the function is known and so is its gradient , where is the gradient of . The difficulty is that we do not have direct access to the distribution of , but we can draw random samples . We can use the Robbins-Monro Schema to optimize . Specifically we takewhere and . The above sequence is often referred to as *Stochastic Gradient Descent*. We chose the sequence so that

(Note here we may assume that is a function of previous parameters and observations and .) We let be the Euclidean norm. We can prove that convergence to the minimizer of .

**Thrm 2** [Stochastic Gradient Descent] Suppose that , , and in Stochastic Gradient Descent satisfy the following conditions

- such that ,

Then where and assuming are deterministic then

Let’s quickly review the conditions above. First consider Condition 1. Note condition implies moving in the direction of always decreases the , so minimizes . The statement is equivalent to being strongly convex. So this is enough to give Condition 1. Condition 2 can be interpreted as a gradient condition, or that the steps do not grow unboundedly. Condition 3 is natural given our analysis so far.

**Proof.**

Taking expecations with respect to we get

Thus, rearranging

Thus by Thrm 1, . Further taking expectations on both sides above we have

We can apply Prop 1 (note that will be positive for suitably large ), to give that as as required.

Finally we remark that in the proof we analyzed but equally we could have analyzed instead.

## Fixed Points and Asynchronous Update

We now consider Robbins-Monro from a slightly different perspective. Suppose we have a continuous function and we wish to find a fixed point such that . We assume that is a contraction namely that, for some ,

Here . (Note this contraction condition implies the existence of a fixed point). (Note the previous analysis was somewhat restricted to euclidean space.) If we suppose that we do not observe but instead some perturbed version whose mean is , then we can perform the Robbins-Monro update for each component :

where is a sequence such that for all

for some constant . Further we suppose that is measurable with respect to , the filtration generated by measurable and

Note that in the above we let the step rule depend on . For instance at each time we could chose to update one component only at each step, e.g., to update component only, we would set for all . Thus we can consider this step rule to be asynchronous.

We can analyze the convergence of this similarly

**Thrm 3.** Suppose that is a contraction with respect to , suppose the vector obeys the step rule with step sizes satisfying and further suppose the noise terms satisfy , then

where is the fixed point .

We will prove the result under the assumption that is bounded in , this is the proposition, Prop 3, below. We then prove that is bounded in to complete the proof.

**Prop 2.** If we further assume that

with probability 1 then Thrm 3 holds.

We may assume with out loss of generality that , since the recursion above is equivalent to

Given the assumption above that for all , further define

Here we choose so that so that . By induction, we will show that, given for all for some , then there exists a such that for all

We use two recursions to bound the behavior of :

for , where and . We use to summarize the effect of noise on the recursion for and we use to bound the error arising from the function in the recursion for . Specifically we show that

in Lemma 1 below. Further we notice that is a Robbin-Monro recursion for to go to zero and to go to .

**Lemma 1.**

**Proof.** We prove the result by induction. The result is clearly true for . In the inequality above with apply the induction hypothesis on and bounds of . The second equality just applies the definitions of and . Similar inequalities hold in the other direction and give the result.

**Lemma 2. **

**Proof.** Letting , we know

From the Robbins-Siegmund Theorem (Prop 1), we know that

**Lemma 3.**

**Proof.** Notice The result holds since .

We can now prove Prop [Tsit:Prop].

**Proof of Prop 2.** We know that for all and we assume for all . By Prop 3 and then by Lemmas 2 and 3

as required. Thus these exists such that . Thus by induction we see that decreases through sequence of levels as , thus goes to zero as required.

### Proving Boundedness of

We now prove that remains bounded

**Prop 3.**

To prove this proposition, we define a processes that bounds the max of from above in increments of size . Specfically we let

and we define and let

where in the above is the smallest integer such that . Note that is adapted. Further note for some and , since for suitable choice of and . We use in the definition of above and we choose so that .

Also we define and

Notice, like before, is a Robbin-Monro recursion that goes to zero and is bounded by a recursion of this type.

**Lemma 4.** If for then

**Proof.** The result is somewhat similar to Lemma [Tsit:bdd]. At , we have that Now assuming that the bound is true at time .

Above we bound knowing the bound holds at time and we bound using the fact that we know has not yet increased. We then use the fact we chose so that . A similar bound holds on the other side.

**Lemma 5.**

**Proof.** Since is adapted and from our assumptions on , we have that

We know that – the argument for this is identical to Lemma 2. Further notice for all we have

Thus,

As required both terms on the righthand-side go to zero.

**Proof of Prop 3.** To prove the proposition we will show that at some point must remain bounded. Suppose is a time just after increased. Note if grew unboundedly then we could chose as large as we like.

So we’d know by Lemma 5 that there exists a such that for all , . Thus applying this to Lemma 4 we see that

Thus since for all . So can not longer increase and thus we arrive at a contradiction. must be bounded in .

**Literature**

The Robbin-Munro proceedure is due to Robbins and Munro and the Robbin Sigmund Theorem is due to Robbins and Sigmund. Good notes are given by Francis Bach. The Asynchronous implementation is due to Tsitsiklis, though we replace the a couple of results their with the Robbins-Sigmund theorem.

Robbins, Herbert; Monro, Sutton. A Stochastic Approximation Method. Ann. Math. Statist. 22 (1951), no. 3, 400–407. doi:10.1214/aoms/1177729586.

Robbins, Herbert, and David Siegmund. “A convergence theorem for non negative almost supermartingales and some applications.” *Optimizing methods in statistics*. Academic Press, 1971. 233-257.

Francis Bach. Optimisation et Apprentissage Statistique https://www.di.ens.fr/~fbach/orsay2019.html

Tsitsiklis, John N. “Asynchronous stochastic approximation and Q-learning.” *Machine learning* 16.3 (1994): 185-202.

## One thought on “Robbins-Monro”