Gradient Descent

We consider one of the simplest iterative procedures for solving the (unconstrainted) optimization

where f is a convex function twice differentiable function. For fixed x,v\in{\mathbb R}^p

If ||v||=1, the gradient is most negative when v\propto -\nabla f(x). Thus, starting from x_0 define the gradient descent algorithm:

  • We assume H(x) the Hessian matrix of f at x satisfies b ||v||^2\leq \frac{1}{2} v^T H(x)v\leq B ||v||^2, for some b,B>0.So, by a Taylor expansion

    v\in{\mathbb R}^p

  • We assume 0< \eta < B^{-1}. So the right-hand inequality above guarantees that f(x_{t+1})<f(x_t) for all t.
  • We assume there is a unique finite solution to our optimization.1

We now see our algorithm converges exponentially fast to its solution, so-called linear convergence.

Given the assumptions above there exists \kappa\in (0,1) and K>0 such that

Rewriting , we have

Using lower bound on the Hessian, we bound ||\nabla f(x_t)||^2 with the following Taylor expansion

C

ombining the last two inequalities, we have as required,

Finally for , we reapply our Hessian lower-bound assumption and use the above bound for f(x_{t}) - f(x^*):

Projected Gradient Descent

We consider a simple iterative procedure for solving a constrained optimization

where f is a convex function twice differentiable function and where {\mathcal X}\subset {\mathbb R}^p is some non-empty closed convex set, eg. \{ x \geq 0 : Ax = b\}. Like in Section [GD]we want to follow the steepest descent direction x_{t+1}=x_t - \eta\nabla f(x_t). However, such points need not belong to {\mathcal X}. Thus we consider the projection of x\in{\mathbb R}^p on to {\mathcal X}:

and then, from x_0, we define the projected gradient descent by

  • After projecting it need not be true that f(x_{t+1})<f(x_t). Thus we adjusted the step-size \eta_t>0 and our proof will study the distance between x_t and the optimal solution x^* rather than the gap f(x_t)-f(x^*).

• The key observation is that making a projection cannot increase distances

 

[PGD:proj]

For all x'\in {\mathcal X}, we must have (x'- P_{\mathcal X}(x))\cdot (x-P_{\mathcal X}(x)) \leq 0, i.e. the plane {\mathcal H}= \{z: (z- P_{\mathcal X}(x))\cdot (x-P_{\mathcal X}(x))=0\} separates x from {\mathcal X}. If this were not true, then we would have a contradiction, in particular, there would be a point on the line joining x' and P_{\mathcal X}(x) that is closer to x. We thus have

Adding together and then applying Cauchy-Schwartz implies

as required.

If the gradient of iterates are bounded, K=\max_{t\in{\mathbb N}}\{||\nabla f(x_t)||^2 \} < \infty then

Hence if \sum_{t=0}^\infty \eta_t =\infty, \sum_{t=0}^\infty \eta_t^2 <\infty and \max_{0\leq t \leq T} ||\nabla f(x_t)||^2<\infty then \lim_{T\rightarrow\infty} f(x^*) - \min_{0\leq t \leq T} f(x_t)=0.

 

Recurring the above expression yields Rearranging the above gives the required result.


  1. This assumption is not really needed, but it saves space.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s