Online Convex Optimization

We consider the setting of sequentially minimizing the aggregate loss of a sequence of convex functions, so-called online convex optimization.

For {\mathcal C} a convex set and discrete time t=1,2,3,..., we consider the setting where, at each round t:

  1. An unknown convex loss function l_t is chosen.
  2. The algorithm picks a point x_t\in{\mathcal C} and incurs the loss l_t(x_t).
  3. The gradient \nabla l_t(x_t) is revealed.

The quantity that we are interested in minimizing is the regret: the difference in aggregate loss between a sequence of choices \{x_t\}_{t=1}^T and a fixed choice x\in{\mathcal C}:

\text{Regret}_T(x) = \sum_{t=1}^T l_t(x_t) - \sum_{t=1}^T l_t(x).

We now define a specific selection rule for x_t, which we call online mirror descent. Each mirror descent algorithm is defined by a convex function F:{\mathcal C}\rightarrow {\mathbb R}.

Mirror Descent

Initialize by setting \theta_1=(0,...,0), then for t=1,2,...

  1. Set

     x_t = \mathrm{argmax}_{x\in{\mathcal C}} \big\{ \langle \theta_t, x \rangle - F(x) \big\}.

  2. As described above, the algorithm then incurs the loss l_t(x_t) and observes \nabla l_t(x_t).
  3. The algorithm then selects

     \theta_{t+1} = \theta_t - \nabla l_t(x_t).
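The three steps above can be sketched in code. The following is a minimal sketch, assuming the Euclidean regularizer F(x) = (\beta/2)\|x\|^2 on {\mathcal C}={\mathbb R}^n (an illustrative choice, not the only one), for which the argmax step has the closed form x_t = \theta_t/\beta:

```python
import numpy as np

def online_mirror_descent(grad_fns, dim, beta=1.0):
    """Lazy online mirror descent with the (assumed) Euclidean regularizer
    F(x) = (beta/2)||x||^2 on C = R^dim, for which
    x_t = argmax_x { <theta_t, x> - F(x) } = theta_t / beta = grad F*(theta_t).

    grad_fns[t](x) should return the gradient of the loss l_t at x.
    """
    theta = np.zeros(dim)    # theta_1 = (0, ..., 0)
    points = []
    for grad in grad_fns:
        x = theta / beta     # step 1: x_t maximizes <theta_t, x> - F(x)
        points.append(x)
        g = grad(x)          # step 2: incur l_t(x_t), observe g_t
        theta = theta - g    # step 3: theta_{t+1} = theta_t - g_t
    return points
```

For example, with the (hypothetical) quadratic losses l_t(x) = \frac{1}{2}\|x - z_t\|^2 one would pass gradient functions x \mapsto x - z_t.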

  • Notice, by Fenchel duality, that x_t=\nabla F^*(\theta_t), where F^* is the Legendre-Fenchel transform of F. In other words, we don't need to solve the optimization for x_t; in principle we can just choose \nabla F^* for an appropriate convex function F^*.
  • We will say that F is \beta-strongly convex if

    F(y) \geq F(x) + \langle \nabla F(x), y - x \rangle + \frac{\beta}{2}\|y-x\|^2 \quad \text{for all } x,y \in {\mathcal C}.
  • We will say that G is \gamma-smooth if it is differentiable and its gradient is Lipschitz continuous with constant \gamma, i.e.

    \|\nabla G(x) - \nabla G(y)\| \leq \gamma \|x - y\|,

    which in turn implies G(y) \leq G(x) + \langle \nabla G(x), y-x\rangle + \frac{\gamma}{2}\|x-y\|^2.
  • It is left as an exercise to show that if F is \beta-strongly convex then its Legendre-Fenchel transform F^* is \beta^{-1}-smooth.
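As a quick sanity check of this duality (a standard example, not a solution to the exercise), take the Euclidean regularizer:

```latex
F(x) = \frac{\beta}{2}\|x\|^2
\;\Longrightarrow\;
F^*(\theta) = \sup_{x}\Big\{\langle\theta,x\rangle - \frac{\beta}{2}\|x\|^2\Big\}
            = \frac{1}{2\beta}\|\theta\|^2,
\qquad
\nabla F^*(\theta) = \frac{\theta}{\beta}.
```

Here F is \beta-strongly convex, and \nabla F^* is Lipschitz with constant \beta^{-1}, so F^* is \beta^{-1}-smooth, as the exercise claims.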

Thrm: If F is \beta-strongly convex, then for any x\in{\mathcal C},

\sum_{t=1}^T l_t(x_t) - \sum_{t=1}^T l_t(x) \leq F(x) - F(x_1) + \frac{1}{2\beta}\sum_{t=1}^T \|\nabla l_t(x_t)\|^2.

Proof: Notice that, by the convexity of l_t, we can upper-bound the regret as follows:

\sum_{t=1}^T l_t(x_t) - l_t(x) \leq \sum_{t=1}^T \langle \nabla l_t(x_t), x_t - x \rangle = \sum_{t=1}^T \langle g_t, x_t - x \rangle,

where in the equality we just introduced the notation g_t = \nabla l_t(x_t).
Now notice that, given our definition of \theta_t, the point x_t selected by the algorithm is given by

x_t = \mathrm{argmax}_{x \in {\mathcal C}} \Big\{ -\sum_{s=1}^{t-1} \langle g_s, x \rangle - F(x) \Big\}.

In some sense, our choice of x_t attempts to optimize the bound on the right-hand side of the regret inequality above.

Now, since F is \beta-strongly convex, F^* is \beta^{-1}-smooth, and so

F^*(\theta_{t+1}) \leq F^*(\theta_t) + \langle \nabla F^*(\theta_t), \theta_{t+1} - \theta_t \rangle + \frac{1}{2\beta}\|\theta_{t+1} - \theta_t\|^2 = F^*(\theta_t) - \langle g_t, x_t \rangle + \frac{1}{2\beta}\|g_t\|^2,

where we note that, by definition, x_t=\nabla F^*(\theta_t) and \theta_{t+1}-\theta_t = -g_t. Further, by the Fenchel-Young Inequality, we have that

F^*(\theta_{T+1}) \geq \langle \theta_{T+1}, x \rangle - F(x) = -\sum_{t=1}^T \langle g_t, x \rangle - F(x),

where in the last equality we note that, by definition, \theta_{T+1} is the sum of -g_1,...,-g_T. Summing the telescoping terms in the smoothness bound and applying the above inequality, we have that

\sum_{t=1}^T \langle g_t, x_t \rangle \leq F^*(\theta_1) - F^*(\theta_{T+1}) + \frac{1}{2\beta}\sum_{t=1}^T \|g_t\|^2 \leq -F(x_1) + F(x) + \sum_{t=1}^T \langle g_t, x \rangle + \frac{1}{2\beta}\sum_{t=1}^T \|g_t\|^2.

In the first term, we apply the Fenchel-Young Inequality once more: since x_1 maximizes \langle \theta_1, x \rangle - F(x) with \theta_1 = 0, it holds with equality, giving F^*(\theta_1) = -F(x_1). Now, rearranging the above inequality and applying our first bound on the regret, we have that

\sum_{t=1}^T l_t(x_t) - l_t(x) \leq \sum_{t=1}^T \langle g_t, x_t - x \rangle \leq F(x) - F(x_1) + \frac{1}{2\beta}\sum_{t=1}^T \|g_t\|^2,

as required.
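The theorem can be checked numerically. The following is a quick sanity check under assumed choices not fixed by the text: linear losses l_t(x) = \langle g_t, x \rangle with random gradients, and the Euclidean regularizer F(x) = \frac{\beta}{2}\|x\|^2 (so x_1 = 0 and F(x_1) = 0):

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim, beta = 200, 3, 1.0

# Toy linear losses l_t(x) = <g_t, x> with randomly drawn gradients
# (an assumed setup for illustration; any convex losses would do).
gs = rng.normal(size=(T, dim))

# Mirror descent with F(x) = (beta/2)||x||^2, so x_t = theta_t / beta.
theta = np.zeros(dim)
xs = np.empty((T, dim))
for t in range(T):
    xs[t] = theta / beta
    theta = theta - gs[t]

def regret(x):
    # sum_t l_t(x_t) - l_t(x) = sum_t <g_t, x_t - x>
    return float(np.sum(gs * xs) - gs.sum(axis=0) @ x)

def bound(x):
    # F(x) - F(x_1) + (1/(2 beta)) sum_t ||g_t||^2, with x_1 = 0.
    return float(beta / 2 * x @ x + (gs ** 2).sum() / (2 * beta))

for _ in range(5):
    x = rng.normal(size=dim)  # arbitrary fixed comparator
    assert regret(x) <= bound(x) + 1e-9
```

The regret bound holds for every fixed comparator x simultaneously, which is why the check draws several comparators at random rather than testing a single one.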
