We consider one of the simplest iterative procedures for solving the (unconstrained) optimization

$$\min_{x\in{\mathbb R}^p} f(x),$$

where $f$ is a convex, twice differentiable function. For fixed $x,v\in{\mathbb R}^p$, a first-order Taylor expansion gives

$$f(x+\eta v) = f(x) + \eta\, v^T \nabla f(x) + o(\eta).$$

If $||v||=1$, the directional derivative $v^T \nabla f(x)$ is most negative when $v\propto -\nabla f(x)$. Thus, starting from $x_0$, define the gradient descent algorithm:

$$x_{t+1} = x_t - \eta \nabla f(x_t), \qquad t = 0,1,2,\ldots$$
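The gradient descent update can be sketched as follows; the quadratic objective, step size, and iteration count are illustrative choices, not from the text.

```python
import numpy as np

def gradient_descent(grad, x0, eta, n_steps):
    """Run the update x_{t+1} = x_t - eta * grad(x_t) for n_steps iterations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad(x)
    return x

# Example: f(x) = ||x||^2 / 2 has grad f(x) = x and minimizer x* = 0.
x_final = gradient_descent(lambda x: x, x0=[4.0, -2.0], eta=0.1, n_steps=200)
```

With this step size each iteration shrinks the error by a constant factor, so `x_final` lands very close to the minimizer.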

• We assume the Hessian matrix $H(x)$ of $f$ at $x$ satisfies $b ||v||^2\leq \frac{1}{2} v^T H(x)v\leq B ||v||^2$ for some $b,B>0$. So, by a Taylor expansion,

$$f(x) + v^T\nabla f(x) + b||v||^2 \;\leq\; f(x+v) \;\leq\; f(x) + v^T\nabla f(x) + B||v||^2, \qquad v\in{\mathbb R}^p.$$

• We assume $0< \eta < B^{-1}$. So the right-hand inequality above, applied with $v=-\eta\nabla f(x_t)$, guarantees that $f(x_{t+1}) \leq f(x_t) - \eta(1-\eta B)\,||\nabla f(x_t)||^2 \leq f(x_t)$ for all $t$.
• We assume there is a unique finite solution $x^*$ to our optimization.1

We now show that our algorithm converges exponentially fast to its solution, so-called linear convergence.

Given the assumptions above, there exist $\kappa\in (0,1)$ and $K>0$ such that

$$f(x_t) - f(x^*) \;\leq\; \kappa^t \big( f(x_0) - f(x^*) \big) \qquad\text{and}\qquad ||x_t - x^*||^2 \;\leq\; K \kappa^t.$$

Taking $v=-\eta\nabla f(x_t)$ in the upper Taylor bound and subtracting $f(x^*)$ from both sides, we have

$$f(x_{t+1}) - f(x^*) \;\leq\; f(x_t) - f(x^*) - \eta(1-\eta B)\,||\nabla f(x_t)||^2.$$

Using the lower bound on the Hessian, we bound $||\nabla f(x_t)||^2$ with the following Taylor expansion: minimizing over $v$,

$$f(x^*) \;=\; \min_v f(x_t+v) \;\geq\; \min_v \big\{ f(x_t) + v^T\nabla f(x_t) + b||v||^2 \big\} \;=\; f(x_t) - \frac{||\nabla f(x_t)||^2}{4b},$$

so $||\nabla f(x_t)||^2 \geq 4b\big( f(x_t) - f(x^*) \big)$.

Combining the last two inequalities, we have, as required,

$$f(x_{t+1}) - f(x^*) \;\leq\; \big(1 - 4b\eta(1-\eta B)\big)\big( f(x_t) - f(x^*) \big) \;=\; \kappa\big( f(x_t) - f(x^*) \big),$$

where $\kappa = 1 - 4b\eta(1-\eta B) < 1$ since $\eta(1-\eta B)>0$.

Finally, for the second bound, we reapply our Hessian lower-bound assumption at $x^*$ (where $\nabla f(x^*)=0$) and use the above bound for $f(x_{t}) - f(x^*)$:

$$b\,||x_t - x^*||^2 \;\leq\; f(x_t) - f(x^*) \;\leq\; \kappa^t \big( f(x_0) - f(x^*) \big),$$

so the result holds with $K = \big( f(x_0) - f(x^*) \big)/b$.
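The geometric decay of the gap $f(x_t)-f(x^*)$ can be checked numerically. A minimal sketch on a quadratic, where the matrix $A$, start point, and step count are illustrative assumptions (here the Hessian is $A$, so $B=\lambda_{\max}(A)/2$ in the notation above):

```python
import numpy as np

# Illustrative quadratic f(x) = x^T A x / 2, minimized at x* = 0 with f(x*) = 0.
A = np.diag([1.0, 4.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

B = np.linalg.eigvalsh(A).max() / 2.0
eta = 0.5 / B                        # satisfies 0 < eta < 1/B
x = np.array([3.0, -1.0])
gaps = []
for _ in range(50):
    gaps.append(f(x))                # current optimality gap f(x_t) - f(x*)
    x = x - eta * grad(x)

# Successive gap ratios stay below one: geometric (linear) convergence.
ratios = [gaps[t + 1] / gaps[t] for t in range(len(gaps) - 1)]
```

Every ratio is bounded away from one, and after 50 iterations the gap is vanishingly small, matching the $\kappa^t$ rate.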

We consider a simple iterative procedure for solving the constrained optimization

$$\min_{x\in{\mathcal X}} f(x),$$

where $f$ is a convex, twice differentiable function and where ${\mathcal X}\subset {\mathbb R}^p$ is some non-empty closed convex set, e.g. $\{ x \geq 0 : Ax = b\}$. As in Section [GD], we want to follow the steepest descent direction, $x_{t+1}=x_t - \eta\nabla f(x_t)$. However, such points need not belong to ${\mathcal X}$. Thus we consider the projection of $x\in{\mathbb R}^p$ onto ${\mathcal X}$:

$$P_{\mathcal X}(x) = \underset{y\in{\mathcal X}}{\operatorname{argmin}}\; ||x - y||,$$

and then, from $x_0$, we define projected gradient descent by

$$x_{t+1} = P_{\mathcal X}\big( x_t - \eta_t \nabla f(x_t) \big).$$
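A sketch of the projected update, assuming for illustration the nonnegative orthant ${\mathcal X} = \{x \geq 0\}$ (whose projection is a coordinatewise clip) and a simple quadratic objective; neither choice comes from the text.

```python
import numpy as np

def projected_gd(grad, project, x0, etas):
    """Run the update x_{t+1} = P_X(x_t - eta_t * grad(x_t))."""
    x = np.asarray(x0, dtype=float)
    for eta in etas:
        x = project(x - eta * grad(x))
    return x

# Illustrative: minimize f(x) = ||x - c||^2 / 2 over X = {x >= 0}.
c = np.array([1.0, -2.0])
grad = lambda x: x - c
project = lambda x: np.maximum(x, 0.0)   # P_X for the orthant
x_final = projected_gd(grad, project, x0=[5.0, 5.0], etas=[0.5] * 100)
# The constrained minimizer is c with its negative part clipped, i.e. [1, 0].
```

The clip keeps every iterate feasible, while the gradient step still pulls the unconstrained coordinates toward the optimum.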

• After projecting, it need not be true that $f(x_{t+1}) \leq f(x_t)$. Thus we let the step size $\eta_t>0$ vary with $t$, and our proof will study the distance between $x_t$ and the optimal solution $x^*$ rather than the gap $f(x_t)-f(x^*)$.

• The key observation is that taking a projection cannot increase distances:

$$||P_{\mathcal X}(x) - P_{\mathcal X}(y)|| \;\leq\; ||x - y||, \qquad x,y\in{\mathbb R}^p.$$

[PGD:proj]

For all $x'\in {\mathcal X}$, we must have $(x'- P_{\mathcal X}(x))\cdot (x-P_{\mathcal X}(x)) \leq 0$, i.e. the plane ${\mathcal H}= \{z: (z- P_{\mathcal X}(x))\cdot (x-P_{\mathcal X}(x))=0\}$ separates $x$ from ${\mathcal X}$. If this were not true, we would have a contradiction: there would be a point on the line segment joining $x'$ and $P_{\mathcal X}(x)$ that is closer to $x$ than $P_{\mathcal X}(x)$ is. Applying this with $x'=P_{\mathcal X}(y)$, and again with the roles of $x$ and $y$ exchanged, we thus have

$$\big(P_{\mathcal X}(y) - P_{\mathcal X}(x)\big)\cdot \big(x-P_{\mathcal X}(x)\big) \leq 0 \qquad\text{and}\qquad \big(P_{\mathcal X}(x) - P_{\mathcal X}(y)\big)\cdot \big(y-P_{\mathcal X}(y)\big) \leq 0.$$

Adding these together and then applying Cauchy–Schwarz implies

$$||P_{\mathcal X}(x) - P_{\mathcal X}(y)||^2 \;\leq\; \big(P_{\mathcal X}(x) - P_{\mathcal X}(y)\big)\cdot (x - y) \;\leq\; ||P_{\mathcal X}(x) - P_{\mathcal X}(y)||\,||x - y||,$$

as required.
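The non-expansiveness of the projection can be spot-checked numerically. As an illustrative choice (not from the text), we use the closed unit ball, whose projection has the closed form below:

```python
import numpy as np

def project_ball(x):
    """Projection onto the closed unit ball {z : ||z|| <= 1}."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

# Check ||P(x) - P(y)|| <= ||x - y|| on random pairs of points.
rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    d_proj = np.linalg.norm(project_ball(x) - project_ball(y))
    ok = ok and d_proj <= np.linalg.norm(x - y) + 1e-12
```

The same check works for any convex set with a computable projection; the small tolerance only guards against floating-point round-off.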

If the gradients of the iterates are bounded, $K=\max_{t\in{\mathbb N}}\{||\nabla f(x_t)||^2 \} < \infty$, then

$$\min_{0\leq t \leq T} f(x_t) - f(x^*) \;\leq\; \frac{||x_0 - x^*||^2 + K\sum_{t=0}^T \eta_t^2}{2\sum_{t=0}^T \eta_t}.$$

Hence if $\sum_{t=0}^\infty \eta_t =\infty$, $\sum_{t=0}^\infty \eta_t^2 <\infty$ and $\max_{0\leq t \leq T} ||\nabla f(x_t)||^2<\infty$, then $\lim_{T\rightarrow\infty}\, \min_{0\leq t \leq T} f(x_t) - f(x^*)=0$.

By the lemma above and convexity (which gives $\nabla f(x_t)\cdot(x_t - x^*) \geq f(x_t) - f(x^*)$),

$$||x_{t+1} - x^*||^2 \;\leq\; ||x_t - \eta_t\nabla f(x_t) - x^*||^2 \;\leq\; ||x_t - x^*||^2 - 2\eta_t\big( f(x_t) - f(x^*) \big) + \eta_t^2 K.$$

Recurring the above expression yields

$$0 \;\leq\; ||x_{T+1} - x^*||^2 \;\leq\; ||x_0 - x^*||^2 - 2\sum_{t=0}^T \eta_t\big( f(x_t) - f(x^*) \big) + K\sum_{t=0}^T \eta_t^2.$$

Rearranging the above, and bounding each $f(x_t) - f(x^*)$ below by $\min_{0\leq s \leq T} f(x_s) - f(x^*)$, gives the required result.
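A schedule satisfying both step-size conditions is $\eta_t \propto 1/(t+1)$; a minimal demo, where the objective, constraint set, and the constant $0.5$ in the schedule are illustrative assumptions:

```python
import numpy as np

# Illustrative setup: f(x) = ||x - c||^2 / 2 over X = {x >= 0}, with the
# diminishing steps eta_t = 0.5/(t+1), for which sum(eta_t) diverges while
# sum(eta_t^2) converges.
c = np.array([2.0, -1.0])
f = lambda x: 0.5 * np.dot(x - c, x - c)
grad = lambda x: x - c
project = lambda x: np.maximum(x, 0.0)   # projection onto the orthant

x = np.array([10.0, 10.0])
best = f(x)                              # track min_{0<=t<=T} f(x_t)
for t in range(2000):
    x = project(x - (0.5 / (t + 1)) * grad(x))
    best = min(best, f(x))

f_star = f(np.maximum(c, 0.0))           # constrained optimum is at [2, 0]
gap = best - f_star
```

As the corollary predicts, the running minimum of $f$ approaches the constrained optimal value, though at a slower, non-geometric rate than in the unconstrained strongly convex case.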

1. This assumption is not really needed, but it saves space.