Continuous Time Dynamic Programming

Discrete time Dynamic Programming was given in the post Dynamic Programming. We now consider the continuous time analogue.

Time is continuous, $t\in\mathbb{R}_+$; $x_t\in \mathcal{X}$ is the state at time $t$; $a_t\in \mathcal{A}$ is the action at time $t$. Given a function $f: \mathbb{R}_+\times\mathcal{X}\times \mathcal{A} \rightarrow \mathcal{X}$, the state evolves according to the differential equation

$\label{cDP:Plant} \frac{dx_{t}}{dt}= f_t(x_t, a_t).$

This is called the Plant Equation. A policy $\pi$ chooses an action $\pi_t$ at each time $t$. The (instantaneous) cost for taking action $a$ in state $x$ at time $t$ is $c_t(x,a)$, and $c_T(x)$ is the cost for terminating in state $x$ at time $T$.

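Def [Dynamic Program] Given initial state $x_0$, a dynamic program is the optimization

$\begin{aligned} L_0(x_0) := \min_{{\bf a}} \quad & C({\bf a}) := \int_0^T e^{-\alpha t} c_t(x_t,a_t)\, dt + e^{-\alpha T} c_T(x_T) \\ \text{subject to} \quad & \frac{dx_t}{dt} = f_t(x_t,a_t), \quad a_t\in\mathcal{A}, \quad t\in[0,T]. \end{aligned}$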

Further, let $C_\tau({\bf a})$ (resp. $L_\tau(x_\tau)$) be the objective (resp. optimal objective) when the integral is started from $t=\tau$, rather than $t=0$.

When the minimization problem, where we minimize loss given the costs incurred, is replaced with a maximization problem, where we maximize winnings given the rewards received, the functions $L$, $C$ and $c$ are replaced with the notation $W$, $R$ and $r$.

Def [Hamilton-Jacobi-Bellman Equation] For a continuous-time dynamic program, the equation

$\label{cDP:HJB}\tag{HJB} 0= \min_{a\in\mathcal{A}} \left\{ c_t(x,a)+ \partial_t L_t(x) + f_t(x,a)\partial_x L_t(x) - \alpha L_t(x) \right\}$

is called the Hamilton-Jacobi-Bellman equation. It is the continuous time analogue of the Bellman equation.

A Heuristic Derivation of the HJB Equation

We now argue why the Hamilton-Jacobi-Bellman equation is a good candidate for the Bellman equation in continuous time.

A good approximation to the plant equation is

$\label{cDP:xaprox} x_{t+\delta} -x_{t} = \delta f_t(x_t,a_t)$

for $\delta>0$ small, and a good approximation for the above objective is

$\begin{aligned} \label{cDP:Ob} C({\bf a}) := \sum_{t\in\{0,\delta,...,(T-\delta)\} } (1-\alpha \delta)^{ {t}/{\delta}}c_t(x_t,a_t) \delta + (1-\alpha \delta)^{ {T}/{\delta}}c_T(x_T). \end{aligned}$

This follows from the definition of the Riemann integral, together with the fact that $(1-\alpha \delta)^{t/\delta}\rightarrow e^{-\alpha t}$ as $\delta\rightarrow 0$.
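The convergence of the discrete discount factor can be checked numerically. The following is a minimal sketch; the function name `discrete_discount` and the values of `alpha` and `t` are illustrative assumptions, not part of the text.

```python
import math

def discrete_discount(alpha: float, t: float, delta: float) -> float:
    """Discount factor (1 - alpha*delta)^(t/delta) from the discretized objective."""
    return (1.0 - alpha * delta) ** (t / delta)

alpha, t = 0.5, 2.0  # illustrative values (assumed, not from the text)
exact = math.exp(-alpha * t)
for delta in (0.1, 0.01, 0.001):
    approx = discrete_discount(alpha, t, delta)
    # The gap to exp(-alpha*t) should shrink roughly in proportion to delta.
    print(f"delta={delta:<6} approx={approx:.6f} exact={exact:.6f}")
```

The gap shrinks roughly linearly in $\delta$, which is consistent with the expansion $\log(1-\alpha\delta) = -\alpha\delta + O(\delta^2)$.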

The Bellman equation for the discrete time dynamic program with the above objective and plant equation is

$L_t(x) = \min_{a\in \mathcal{A}}\left\{ c_t(x,a)\delta + (1-\alpha \delta) L_{t+\delta}(x+\delta f_t(x,a)) \right\}.$

If we subtract $L_t(x)$ from each side of this Bellman equation, divide by $\delta$ and let $\delta\rightarrow 0$, we get

$0= \min_{a\in\mathcal{A}} \left\{ c_t(x,a)+ \partial_t L_t(x) + f_t(x,a)\partial_x L_t(x) - \alpha L_t(x) \right\},$

where here we note that, by the Chain rule,

$\frac{(1-\alpha \delta) L_{t+\delta}(x+\delta f) - L_t(x)}{\delta} \xrightarrow[\delta \rightarrow 0]{ }\partial_t L_t(x) + f_t(x,a)\partial_x L_t(x) - \alpha L_t(x).$

Thus we derive the HJB equation as described above.
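As a sanity check on this limiting argument, the sketch below runs the discrete time Bellman recursion on a toy problem and compares it with the solution of the corresponding HJB equation. The toy problem is an assumption chosen for illustration: scalar state, $f_t(x,a)=a$, $c_t(x,a)=x^2+a^2$, $\alpha=0$ and zero terminal cost. For it, the value function stays quadratic, $L_t(x)=\lambda_t x^2$, and the HJB equation reduces to $\dot\lambda = \lambda^2 - 1$ with $\lambda(T)=0$, so $\lambda(t)=\tanh(T-t)$.

```python
import math

def discrete_bellman_lambda0(T: float, n: int) -> float:
    """Run the discrete time Bellman recursion backward from t=T to t=0.

    Toy problem (an assumption for illustration): f = a, c = x^2 + a^2,
    alpha = 0, terminal cost 0, so that L_t(x) = lam_t * x^2 at every step.
    Returns lam_0, the coefficient of the value function at time 0.
    """
    delta = T / n
    lam = 0.0  # terminal condition: L_T(x) = 0
    for _ in range(n):
        # Minimizing (x^2 + a^2)*delta + lam*(x + delta*a)^2 over a gives a = -k*x.
        k = lam / (1.0 + lam * delta)
        # Substitute the minimizer back in to get the coefficient one step earlier.
        lam = delta * (1.0 + k * k) + lam * (1.0 - delta * k) ** 2
    return lam

T = 1.0
for n in (10, 100, 1000):
    print(f"delta={T / n:<6} lam_0={discrete_bellman_lambda0(T, n):.6f}")
print(f"HJB solution tanh(T)={math.tanh(T):.6f}")
```

As $\delta$ shrinks, the discrete coefficient approaches $\tanh(T)$, which is what the heuristic derivation suggests should happen.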

The following result shows that if we solve the HJB equation then we have an optimal policy.

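Thrm 1 [Optimality of HJB] Suppose that a policy $\Pi$ has a value function $C_t(x,\Pi)$ that satisfies the HJB equation for all $t$ and $x$. Then $\Pi$ is an optimal policy.

Proof. Let $\tilde{\Pi}$ be any other policy, with actions $\tilde{\pi}_t$ and resulting states $\tilde{x}_t$ started from $\tilde{x}_0=x_0$. Using the shorthand $C=C_t(\tilde{x}_t,\Pi)$, the chain rule and the plant equation give

$\begin{aligned} -\frac{d}{dt} \left(e^{-\alpha t} C_t(\tilde{x}_t,\Pi)\right)&=e^{-\alpha t} \left\{ c_t(\tilde{x}_t,\tilde{\pi}_t) - \left[ c_t(\tilde{x}_t,\tilde{\pi}_t) - \alpha C + f_t(\tilde{x}_t,\tilde{\pi}_t) \partial_x C + \partial_t C\right]\right\}\\ &\leq e^{-\alpha t} c_t(\tilde{x}_t,\tilde{\pi}_t). \end{aligned}$

The inequality holds since the term in the square brackets is the objective of the HJB equation evaluated at $a=\tilde{\pi}_t$. Its minimum over $a$ is zero, so it is nonnegative for the action $\tilde{\pi}_t$, which need not be the minimizer. Integrating from $0$ to $T$ and using $C_T(x,\Pi)=c_T(x)$ gives

$C_0(x_0,\Pi) \leq \int_0^T e^{-\alpha t} c_t(\tilde{x}_t,\tilde{\pi}_t)\, dt + e^{-\alpha T} c_T(\tilde{x}_T) = C_0(x_0,\tilde{\Pi}),$

so $\Pi$ is optimal. $\square$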

Linear Quadratic Regulation

Def. [LQ problem] We consider a dynamic program of the form

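$\begin{aligned} L_0(x_0) := \min_{{\bf a}} \quad & \int_0^T \left[ x_t^\top Q x_t + a_t^\top R a_t \right] dt + x_T^\top Q_T x_T \\ \text{subject to} \quad & \frac{dx_t}{dt} = A x_t + B a_t, \quad a_t\in\mathbb{R}^m, \quad t\in[0,T]. \end{aligned}$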

Here $x_t \in\mathbb{R}^n$ and $a_t\in\mathbb{R}^m$, $A\in\mathbb{R}^{n\times n}$ and $B\in\mathbb{R}^{n\times m}$ are matrices, and $Q$ and $R$ are symmetric positive definite matrices. This is a Linear-Quadratic problem (LQ problem).

Def [Riccati Equation] The differential equation

$\label{cDP:Riccarti}\tag{RicEq} \dot{\Lambda}(t) = -Q-\Lambda(t)A - A^\top \Lambda(t) + \Lambda(t) B R^{-1} B^{\top} \Lambda(t)\quad \text{with} \quad \Lambda(T)=Q_T$

is called the Riccati equation.

Thrm 2. For each time $t$ , the optimal action for the LQ problem is

$a_t = - R^{-1} B^\top \Lambda(t) x_t \, ,$

where $\Lambda(t)$ is the solution to the Riccati equation.

Proof. The HJB equation for an LQ problem is

$0= \min_{a\in\mathbb{R}^m}\left\{ x^\top Qx + a^\top R a + \partial_t L_t(x) + (Ax + Ba)^\top \partial_x L_t(x) \right\}.$

We now “guess” that the solution to the above HJB equation is of the form $L_t(x)=x^\top \Lambda(t) x$ for some symmetric matrix $\Lambda(t)$. Therefore

$\partial_x L_t(x) = 2 \Lambda(t) x \quad \text{and} \quad \partial_t L_t(x) = x^\top \dot{\Lambda}(t) x$

Substituting into the HJB equation gives

$0 = \min_{a\in \mathbb R^m} \left\{ x^\top Q x + a^\top R a + x^\top \dot{\Lambda}(t) x + 2 x^\top \Lambda(t) ( A x + B a) \right\}\, .$

Differentiating with respect to $a$ gives the optimality condition

$2R a + 2 B^\top \Lambda(t) x =0,$

which implies

$a= - R^{-1} B^\top \Lambda(t) x\, .$

Finally, substituting this action back into the HJB equation above gives

$0=x^\top \left[Q + \dot{\Lambda}(t)+ \Lambda(t)A + A^\top \Lambda(t) - \Lambda(t) B R^{-1} B^{\top} \Lambda(t) \right]x\,,$

where we used $2x^\top \Lambda(t) A x = x^\top(\Lambda(t) A + A^\top \Lambda(t))x$. This holds for all $x$ when $\Lambda(t)$ satisfies the Riccati equation. Thus, when $\Lambda(t)$ solves the Riccati equation, the cost function $L_t(x)=x^\top\Lambda(t)x$ solves the HJB equation, and so by Theorem 1 the policy is optimal. $\square$
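To see Theorem 2 at work numerically, the sketch below treats a scalar LQ problem ($n=m=1$). It integrates the Riccati equation backward from $\Lambda(T)=Q_T$ with Euler steps, then simulates the closed loop and compares the realized cost with $\Lambda(0)x_0^2$. The parameter values, the step count and the helper names (`riccati_rhs`, `solve_riccati`, `closed_loop_cost`, `gain_scale`) are all assumptions made for illustration, and Euler integration is just one simple choice of scheme.

```python
def riccati_rhs(lam: float, A: float, B: float, Q: float, R: float) -> float:
    """Right-hand side of the scalar Riccati equation, d(lam)/dt."""
    return -Q - 2.0 * A * lam + (B * B / R) * lam * lam

def solve_riccati(A, B, Q, R, Q_T, T, n):
    """Integrate backward from lam(T) = Q_T; returns lam on the grid t_k = k*T/n."""
    dt = T / n
    lam = [0.0] * (n + 1)
    lam[n] = Q_T
    for k in range(n, 0, -1):
        # Stepping backward in time: lam(t - dt) is roughly lam(t) - dt * lam'(t).
        lam[k - 1] = lam[k] - dt * riccati_rhs(lam[k], A, B, Q, R)
    return lam

def closed_loop_cost(lam, A, B, Q, R, Q_T, T, x0, gain_scale=1.0):
    """Simulate dx/dt = A x + B a under a = -gain_scale * (B/R) * lam(t) * x.

    gain_scale = 1 is the feedback law of Theorem 2; other values give
    deliberately suboptimal policies for comparison.
    """
    n = len(lam) - 1
    dt = T / n
    x, cost = x0, 0.0
    for k in range(n):
        a = -gain_scale * (B / R) * lam[k] * x
        cost += (Q * x * x + R * a * a) * dt  # running cost
        x += dt * (A * x + B * a)             # Euler step of the plant equation
    return cost + Q_T * x * x                 # add the terminal cost

# Illustrative parameters (assumed, not from the text).
A, B, Q, R, Q_T, T, x0 = 1.0, 1.0, 1.0, 1.0, 0.5, 2.0, 1.0
lam = solve_riccati(A, B, Q, R, Q_T, T, n=20000)
cost_opt = closed_loop_cost(lam, A, B, Q, R, Q_T, T, x0)
cost_half = closed_loop_cost(lam, A, B, Q, R, Q_T, T, x0, gain_scale=0.5)
cost_zero = closed_loop_cost(lam, A, B, Q, R, Q_T, T, x0, gain_scale=0.0)
print(f"Lambda(0)*x0^2 = {lam[0] * x0 * x0:.4f}")
print(f"cost, optimal feedback = {cost_opt:.4f}")
print(f"cost, half gain = {cost_half:.4f}")
print(f"cost, no control = {cost_zero:.4f}")
```

Up to discretization error, the cost under the feedback law should match $\Lambda(0)x_0^2$, and the two perturbed policies should cost more, in line with Theorem 1.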
