ODE method for Stochastic Approximation

We consider the Robbins-Monro update

x_{n+1} = x_n + \alpha_n ( g(x_n) + \epsilon_n ), \qquad n = 1, 2, \dots,

and argue that this can be approximated by the o.d.e.:

\frac{dx}{dt} = g(x(t)).

The conditions for convergence used were

\sum_n \alpha_n = \infty \qquad \text{and} \qquad \sum_n \alpha_n^2 < \infty.
Put informally, the condition \sum_n \alpha_n = \infty is used to keep the process moving [albeit in decreasingly small increments] while the condition \sum_n \alpha_n^2 < \infty ensures that the noise from the process goes to zero. For instance, \alpha_n = 1/n satisfies both conditions.

Given that noise is suppressed and increments get small, it is natural to ask how close the above process is to the ordinary differential equation

\frac{dx}{dt} = g(x(t)).
Moreover, can Lyapunov stability results [that we applied earlier] be applied directly to prove the convergence of the corresponding stochastic approximation scheme? Often, the answer is yes. This has certain conceptual advantages: the Lyapunov conditions described earlier can be quite straightforward to establish, and we do not need to deal directly with the compounding of small stochastic errors from the sequence \epsilon_n.
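
To build intuition before the formal development, here is a minimal numerical sketch (ours, not from the original post) with the illustrative choices g(x) = -x, \alpha_n = 1/n and standard normal noise, comparing the Robbins-Monro iterates with the o.d.e. solution x(t) = x(0) e^{-t}:

import numpy as np

# Minimal sketch (illustrative choices: g(x) = -x, alpha_n = 1/n,
# i.i.d. standard normal noise) comparing the Robbins-Monro iterates
# with the o.d.e. dx/dt = -x, whose solution is x(t) = x(0) * exp(-t).
rng = np.random.default_rng(0)
N = 10_000
x, t = 5.0, 0.0                       # x_0 and accumulated "time" t_n
for n in range(1, N + 1):
    alpha = 1.0 / n
    x += alpha * (-x + rng.normal())  # Robbins-Monro step
    t += alpha
print(f"x_N = {x:.4f} vs o.d.e. value x(0)exp(-t_N) = {5.0 * np.exp(-t):.5f}")

Both outputs are close to the o.d.e. equilibrium at zero, which is the behaviour the results below make precise.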


The set up. The idea is to let

t_n = \sum_{m=1}^n \alpha_m .

Here t_n represents the amount of “time” that the process has been running for. We then let

x(t) = x_n \qquad \text{for } t \in [t_n, t_{n+1}).[1]

We let z_m be the solution of the above o.d.e. started at x_m at time t_m, that is,

z_m(t) = x_m + \int_{t_m}^t g(z_m(s)) \, ds, \qquad t \geq t_m.

We then compare x(t) and z_m(t) to see how much error has accumulated since time t_m. More specifically, for a fixed time horizon T > 0, we are interested in

\max_{t \in [t_m, t_m + T]} || x(t) - z_m(t) || .
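
These objects are easy to code up. The sketch below (again with the hypothetical choices g(x) = -x and \alpha_n = 1/n) builds t_n, the piecewise-constant interpolation x(t), and the restarted solution z_m(t), and evaluates the maximum gap over a window of length T:

import numpy as np

# Sketch of the set up (illustrative: g(x) = -x, alpha_n = 1/n).
# t_n is the cumulative step size, x(t) the piecewise-constant
# interpolation, and z_m(t) the o.d.e. solution restarted at (t_m, x_m);
# for g(x) = -x it is explicit: z_m(t) = x_m * exp(-(t - t_m)).
rng = np.random.default_rng(1)
N = 20_000
alpha = 1.0 / (1.0 + np.arange(N))
xs = np.empty(N + 1)
xs[0] = 2.0
for n in range(N):
    xs[n + 1] = xs[n] + alpha[n] * (-xs[n] + rng.normal())
t = np.concatenate(([0.0], np.cumsum(alpha)))   # t_0, t_1, ..., t_N

def x_interp(u):
    # x(u) = x_n for u in [t_n, t_{n+1})
    return xs[np.searchsorted(t, u, side="right") - 1]

def z(m, u):
    # restarted o.d.e. solution, explicit for g(x) = -x
    return xs[m] * np.exp(-(u - t[m]))

m, T = 1_000, 2.0
grid = np.linspace(t[m], t[m] + T, 500)
gap = max(abs(x_interp(u) - z(m, u)) for u in grid)
print(f"max gap over [t_m, t_m + T]: {gap:.4f}")   # small for large m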


Assumptions. In addition to the Robbins-Monro conditions above, we make the following assumptions.

  • g is Lipschitz continuous.
  • For \mathcal F_n = \sigma(x_m, \epsilon_{m-1} : m \leq n), it holds that \mathbb E[\epsilon_n | \mathcal F_n] = 0 and \sup_n \mathbb E [ || \epsilon_n ||^2 ]<\infty.

Main result. A key result that we will prove is the following proposition.

Proposition 1. For each fixed T > 0,

\max_{t \in [t_m, t_m + T]} || x(t) - z_m(t) || \xrightarrow[m \rightarrow \infty]{} 0,

where convergence holds with probability 1 and in L^2.[2]

Proof. Notice

x(t_n) = x_m + \sum_{k=m}^{n-1} \alpha_k g(x_k) + \sum_{k=m}^{n-1} \alpha_k \epsilon_k = x_m + \int_{t_m}^{t_n} g( x( \lfloor s \rfloor_\alpha ) ) \, ds + ( M_n - M_m ),

while

z_m(t_n) = x_m + \int_{t_m}^{t_n} g( z_m(s) ) \, ds,

where \lfloor u \rfloor_\alpha = \max\{t_k : t_k \leq u \} and M_n = \sum_{k=1}^n \alpha_k \epsilon_k. So, since g is Lipschitz with constant L and x( \lfloor s \rfloor_\alpha ) = x(s),

|| x(t_n) - z_m(t_n) || \leq L \int_{t_m}^{t_n} || x(s) - z_m(s) || \, ds + || M_n - M_m ||,

which implies by Gronwall’s Lemma

|| x(t_n) - z_m(t_n) || \leq e^{L(t_n - t_m)} \max_{m \leq k \leq n} || M_k - M_m || \leq e^{LT} \sup_{k \geq m} || M_k - M_m ||,

where the final inequality holds for t_n such that t_n - t_m \leq T. Notice that M_n, n\in\mathbb N, is a martingale and further

\mathbb E [ || M_n ||^2 ] = \sum_{k=1}^n \alpha_k^2 \, \mathbb E [ || \epsilon_k ||^2 ] \leq K \sum_{k=1}^\infty \alpha_k^2 < \infty,

where K = \sup_k \mathbb E [ || \epsilon_k ||^2 ]. Thus M_n is an L^2-bounded martingale and thus converges with probability 1 and in L^2. Consequently M_n is a Cauchy sequence, meaning

\sup_{k \geq m} || M_k - M_m || \xrightarrow[m \rightarrow \infty]{} 0

with probability 1 and in L^2. Applying this to the Gronwall bound above gives the required result.

QED.
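
The heart of the proof is that M_n is an L^2-bounded martingale and hence Cauchy. This is easy to check numerically; the sketch below uses the hypothetical choices \alpha_k = 1/k and standard normal noise:

import numpy as np

# Numerical check that M_n = sum_{k<=n} alpha_k * eps_k is Cauchy
# (illustrative choices: alpha_k = 1/k, i.i.d. standard normal eps_k).
rng = np.random.default_rng(2)
N = 100_000
alpha = 1.0 / (1.0 + np.arange(N))
M = np.cumsum(alpha * rng.normal(size=N))
for m in (10, 100, 1_000, 10_000):
    print(f"m = {m:>6}:  sup_(n>=m) |M_n - M_m| = {np.abs(M[m:] - M[m]).max():.4f}")
# The printed suprema decrease towards zero as m grows.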


Applying the o.d.e. limit.

Now that we have proved Proposition 1, it’s worth explaining how it can be applied. The main idea is to

  1. Check that the o.d.e. converges, by showing that its solution gets close to some desired set of points \mathcal X^\star in T time units for each initial condition x_n, n \in \mathbb N.
  2. Then apply Proposition 1 to show that the stochastic approximation is also close to the o.d.e. at time T. (A numerical illustration of this recipe is given after the proof of Theorem 1 below.)

This is sufficient to show that the stochastic approximation converges to \mathcal X^\star.

Here, we combine La Salle’s Invariance Principle, Theorem 2, with Proposition 1, though, in principle, we could have considered any of the o.d.e. results from Lyapunov Functions. We assume that the assumptions of Proposition 1 are satisfied. In addition, we assume:

  • Almost surely, \sup_n || x_n || < \infty.

Further we recall that for La Salle’s Invariance Principle, we assumed there was a Lyapunov function L:\mathbb R^d \rightarrow \mathbb R_+ such that

  • L is continuously differentiable.
  • g(x)\cdot \nabla L(x) \leq 0 for all x.
  • The sets \{ x : L(x) \leq l\} are compact for all l.

Also we defined \mathcal X^\star := \{ x : g(x)\cdot \nabla L(x) =0\}. Recall that La Salle’s Invariance Principle stated that for all solutions to the o.d.e. dz/dt = g(z(t)) with z(0) \in \{ x : L(x) \leq l\}, for every \epsilon > 0 there exists a T such that \min_{x^\star \in \mathcal X^\star} || z(t) - x^\star ||\leq \epsilon for all t \geq T.
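
For concreteness, here is a standard instance of these conditions (our example, not from the original post): take g(x) = -\nabla f(x) for a continuously differentiable f : \mathbb R^d \rightarrow \mathbb R_+ with Lipschitz gradient and compact level sets, and let L = f. Then

g(x) \cdot \nabla L(x) = - || \nabla f(x) ||^2 \leq 0,

and \mathcal X^\star is exactly the set of critical points of f, so stochastic gradient descent falls within this framework.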

Theorem 1. For the stochastic approximation scheme

x_{n+1} = x_n + \alpha_n ( g(x_n) + \epsilon_n )

described in Proposition 1 and given a Lyapunov function L as described above, it holds, with probability 1, that

\min_{x^\star \in \mathcal X^\star} || x_n - x^\star || \xrightarrow[n \rightarrow \infty]{} 0,

that is, x_n \rightarrow \mathcal X^\star.
Proof. First we check that the o.d.e. solutions z_m(t) considered in Proposition 1 satisfy the conditions of La Salle. In particular, La Salle’s result requires the o.d.e. solutions to all belong to some compact set. Notice, since we assume that C := \sup_n || x_n || < \infty, we can let l = \max \{ L(x) : ||x|| \leq C\} and take \mathcal X = \{ x : L(x) \leq l \}. Since L(z(t)) is decreasing for all solutions to the o.d.e., we see that z_m(t) \in \mathcal X for all m and all t\geq t_m.

Next we set up the bounds from the two results. From La Salle’s Invariance Principle, we know that \forall \epsilon>0 \exists T>0 such that \forall t \geq T and \forall m

\min_{x^\star \in \mathcal X^\star} || z_m(t_m + t) - x^\star || \leq \epsilon.

Taking this choice of T, we know from Proposition 1 (applied with time horizon 2T) that there exists m^\star s.t. \forall m \geq m^\star

\max_{t \in [t_m, t_m + 2T]} || x(t) - z_m(t) || \leq \epsilon.

Notice we can take m^\star suitably large so that t_m - t_{m-1} = \alpha_m \leq T for all m\geq m^\star. Notice this implies that consecutive windows overlap, so

\bigcup_{m \geq m^\star} [t_m + T, t_m + 2T] \supseteq [t_{m^\star} + T, \infty).

Now notice that if n is such that t_n \in [t_m+T,t_m+2T] for some m \geq m^\star, then combining the inequalities above gives that

\min_{x^\star \in \mathcal X^\star} || x_n - x^\star || \leq || x(t_n) - z_m(t_n) || + \min_{x^\star \in \mathcal X^\star} || z_m(t_n) - x^\star || \leq 2\epsilon.

Thus we see, with the union above, that for all t_n \geq t_{m^\star} +T it holds that \min_{x^\star \in \mathcal X^\star} || x_n - x^\star || \leq 2\epsilon. In other words, x_n \rightarrow \mathcal X^\star as required. QED.
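
To close the loop, here is the end-to-end numerical illustration of Theorem 1 promised above. The dynamics are an illustrative choice of ours: g(x) = Ax where A is minus the identity plus a skew-symmetric matrix, so that L(x) = ||x||^2/2 is a Lyapunov function with g(x)\cdot\nabla L(x) = x^T A x = -||x||^2 \leq 0 and \mathcal X^\star = \{0\}:

import numpy as np

# End-to-end illustration of Theorem 1 (illustrative dynamics, not from
# the text): g(x) = A x with A = -I + skew-symmetric part, Lyapunov
# function L(x) = ||x||^2 / 2, so g(x).grad(L)(x) = -||x||^2 <= 0 and
# X* = {0}. Theorem 1 then gives x_n -> X* with probability 1.
rng = np.random.default_rng(3)
A = np.array([[-1.0,  2.0],
              [-2.0, -1.0]])          # = -I plus a skew-symmetric matrix
x = np.array([3.0, -2.0])
for n in range(1, 200_001):
    x = x + (1.0 / n) * (A @ x + rng.normal(size=2))   # Robbins-Monro step
print(f"||x_n|| after 2e5 steps: {np.linalg.norm(x):.4f}")   # close to 0

The skew-symmetric part makes the o.d.e. spiral rather than descend radially, which is exactly the kind of dynamics where La Salle’s principle, rather than a naive monotonicity argument, is the natural tool.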


References

The o.d.e. approach to stochastic approximation was initiated by Ljung. Shortly after, it was extensively developed by Kushner; see below for two textbook accounts. The arguments above loosely follow the excellent text of Borkar. We shorten the proof in several ways and consider L^2 convergence. A further text with a general treatment is Benveniste et al.

A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.

V. S. Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.

H. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.

H. J. Kushner and D. S. Clark. Stochastic approximation methods for constrained and unconstrained systems, volume 26. Springer Science & Business Media, 1978.

L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551-575, 1977.

 


  1. We could also linearly interpolate between these terms to define x(t) for all t\in \mathbb R_+, but we choose not to do this as it provides no new insight and only serves to lengthen the proof.
  2. Convergence in L^2 occurs assuming ||\cdot|| is the usual Euclidean distance.
