Online Convex Optimization

We consider the setting of sequentially minimizing the aggregate loss of a sequence of convex functions, so-called online convex optimization.

For {\mathcal C} a convex set and discrete time t=1,2,3,..., we consider the setting where, at each round t:

  1. An unknown convex loss function l_t is chosen.
  2. The algorithm picks a point x_t\in{\mathcal C} and incurs the loss l_t(x_t).
  3. The gradient \nabla l_t(x_t) is revealed.

The quantity that we are interested in minimizing is the regret: the difference in aggregate loss between a sequence of choices \{x_t\}_{t=1}^T and a fixed choice x\in{\mathcal C}:

\text{Regret}_T(x) = \sum_{t=1}^T l_t(x_t) - \sum_{t=1}^T l_t(x).

We now define a specific selection rule for x_t, which we call online mirror descent. Each mirror descent algorithm is defined by a convex function F:{\mathcal C}\rightarrow {\mathbb R}.

Mirror Descent

Initialize by setting \theta_1=(0,...,0), then for t=1,2,...

  1. Set

     x_t = \mathrm{argmax}_{x\in{\mathcal C}} \big\{ \langle \theta_t, x \rangle - F(x) \big\}.

  2. As described above, the algorithm then incurs the loss l_t(x_t) and observes \nabla l_t(x_t).
  3. The algorithm then selects

     \theta_{t+1} = \theta_t - \nabla l_t(x_t).
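The three steps above can be sketched in code. The following is a minimal sketch, assuming the Euclidean regularizer F(x) = (\beta/2)\|x\|^2 on {\mathcal C}={\mathbb R}^n (an illustrative choice, not the only one), for which the argmax step has the closed form x_t = \theta_t/\beta:

```python
import numpy as np

def online_mirror_descent(grad_fns, dim, beta=1.0):
    """Lazy online mirror descent with the (assumed) Euclidean regularizer
    F(x) = (beta/2)||x||^2 on C = R^dim, for which
    x_t = argmax_x { <theta_t, x> - F(x) } = theta_t / beta = grad F*(theta_t).

    grad_fns[t](x) should return the gradient of the loss l_t at x.
    """
    theta = np.zeros(dim)    # theta_1 = (0, ..., 0)
    points = []
    for grad in grad_fns:
        x = theta / beta     # step 1: x_t maximizes <theta_t, x> - F(x)
        points.append(x)
        g = grad(x)          # step 2: incur l_t(x_t), observe g_t
        theta = theta - g    # step 3: theta_{t+1} = theta_t - g_t
    return points
```

For example, with the (hypothetical) quadratic losses l_t(x) = \frac{1}{2}\|x - z_t\|^2 one would pass gradient functions x \mapsto x - z_t.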

  • Notice, by Fenchel duality, that x_t=\nabla F^*(\theta_t), where F^* is the Legendre-Fenchel transform of F. In other words, we don't need to solve the optimization for x_t; in principle we can just choose \nabla F^* for an appropriate convex function F^*.
  • We will say that F is \beta-strongly convex if

    F(y) \geq F(x) + \langle \nabla F(x), y - x \rangle + \frac{\beta}{2}\|y-x\|^2 \quad \text{for all } x,y \in {\mathcal C}.
  • We will say that G is \gamma-smooth if it is differentiable and its gradient is Lipschitz continuous with constant \gamma, i.e.

    \|\nabla G(x) - \nabla G(y)\| \leq \gamma \|x - y\|,

    which in turn implies G(y) \leq G(x) + \langle \nabla G(x), y-x\rangle + \frac{\gamma}{2}\|x-y\|^2.
  • It is left as an exercise to show that if F is \beta-strongly convex then its Legendre-Fenchel transform F^* is \beta^{-1}-smooth.
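As a quick sanity check of this duality (a standard example, not a solution to the exercise), take the Euclidean regularizer:

```latex
F(x) = \frac{\beta}{2}\|x\|^2
\;\Longrightarrow\;
F^*(\theta) = \sup_{x}\Big\{\langle\theta,x\rangle - \frac{\beta}{2}\|x\|^2\Big\}
            = \frac{1}{2\beta}\|\theta\|^2,
\qquad
\nabla F^*(\theta) = \frac{\theta}{\beta}.
```

Here F is \beta-strongly convex, and \nabla F^* is Lipschitz with constant \beta^{-1}, so F^* is \beta^{-1}-smooth, as the exercise claims.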

Thrm: If F is \beta-strongly convex, then for any x\in{\mathcal C},

\sum_{t=1}^T l_t(x_t) - \sum_{t=1}^T l_t(x) \leq F(x) - F(x_1) + \frac{1}{2\beta}\sum_{t=1}^T \|\nabla l_t(x_t)\|^2.

Proof: Notice that, by the convexity of l_t, we can upper-bound the regret as follows:

\sum_{t=1}^T l_t(x_t) - l_t(x) \leq \sum_{t=1}^T \langle \nabla l_t(x_t), x_t - x \rangle = \sum_{t=1}^T \langle g_t, x_t - x \rangle,

where in the equality we just introduced the notation g_t = \nabla l_t(x_t).
Now notice that, given our definition of \theta_t, the point x_t selected by the algorithm is given by

x_t = \mathrm{argmax}_{x \in {\mathcal C}} \Big\{ -\sum_{s=1}^{t-1} \langle g_s, x \rangle - F(x) \Big\}.

In some sense, our choice of x_t attempts to optimize the bound on the right-hand side of the regret inequality above.

Now, since F is \beta-strongly convex, F^* is \beta^{-1}-smooth, and so

F^*(\theta_{t+1}) \leq F^*(\theta_t) + \langle \nabla F^*(\theta_t), \theta_{t+1} - \theta_t \rangle + \frac{1}{2\beta}\|\theta_{t+1} - \theta_t\|^2 = F^*(\theta_t) - \langle g_t, x_t \rangle + \frac{1}{2\beta}\|g_t\|^2,

where we note that, by definition, x_t=\nabla F^*(\theta_t) and \theta_{t+1}-\theta_t = -g_t. Further, by the Fenchel-Young Inequality, we have that

F^*(\theta_{T+1}) \geq \langle \theta_{T+1}, x \rangle - F(x) = -\sum_{t=1}^T \langle g_t, x \rangle - F(x),

where in the last equality we note that, by definition, \theta_{T+1} is the sum of -g_1,...,-g_T. Summing the telescoping terms in the smoothness bound and applying the above inequality, we have that

\sum_{t=1}^T \langle g_t, x_t \rangle \leq F^*(\theta_1) - F^*(\theta_{T+1}) + \frac{1}{2\beta}\sum_{t=1}^T \|g_t\|^2 \leq -F(x_1) + F(x) + \sum_{t=1}^T \langle g_t, x \rangle + \frac{1}{2\beta}\sum_{t=1}^T \|g_t\|^2.

In the first term, we apply the Fenchel-Young Inequality once more: since x_1 maximizes \langle \theta_1, x \rangle - F(x) with \theta_1 = 0, it holds with equality, giving F^*(\theta_1) = -F(x_1). Now, rearranging the above inequality and applying our first bound on the regret, we have that

\sum_{t=1}^T l_t(x_t) - l_t(x) \leq \sum_{t=1}^T \langle g_t, x_t - x \rangle \leq F(x) - F(x_1) + \frac{1}{2\beta}\sum_{t=1}^T \|g_t\|^2,

as required.
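The theorem can be checked numerically. The following is a quick sanity check under assumed choices not fixed by the text: linear losses l_t(x) = \langle g_t, x \rangle with random gradients, and the Euclidean regularizer F(x) = \frac{\beta}{2}\|x\|^2 (so x_1 = 0 and F(x_1) = 0):

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim, beta = 200, 3, 1.0

# Toy linear losses l_t(x) = <g_t, x> with randomly drawn gradients
# (an assumed setup for illustration; any convex losses would do).
gs = rng.normal(size=(T, dim))

# Mirror descent with F(x) = (beta/2)||x||^2, so x_t = theta_t / beta.
theta = np.zeros(dim)
xs = np.empty((T, dim))
for t in range(T):
    xs[t] = theta / beta
    theta = theta - gs[t]

def regret(x):
    # sum_t l_t(x_t) - l_t(x) = sum_t <g_t, x_t - x>
    return float(np.sum(gs * xs) - gs.sum(axis=0) @ x)

def bound(x):
    # F(x) - F(x_1) + (1/(2 beta)) sum_t ||g_t||^2, with x_1 = 0.
    return float(beta / 2 * x @ x + (gs ** 2).sum() / (2 * beta))

for _ in range(5):
    x = rng.normal(size=dim)  # arbitrary fixed comparator
    assert regret(x) <= bound(x) + 1e-9
```

The regret bound holds for every fixed comparator x simultaneously, which is why the check draws several comparators at random rather than testing a single one.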
