Continuous Time Dynamic Programming

Discrete time Dynamic Programming was given in the post Dynamic Programming. We now consider the continuous time analogue.

Time is continuous t\in\mathbb{R}_+; x_t\in \mathcal{X} is the state at time t; a_t\in \mathcal{A} is the action at time t; Given function f: \mathbb{R}_+\times\mathcal{X}\times \mathcal{A}_t \rightarrow \mathcal{X}, the state evolves according to a differential equation

This is called the Plant Equation. A policy \pi chooses an action \pi_t at each time t. The (instantaneous) reward for taking action a in state x at time t is r_t(a,x) and r_T(x) is the reward for terminating in state x at time T.

Def [Dynamic Program] Given initial state x_0, a dynamic program is the optimization

Screenshot 2019-01-26 at 15.41.18.png

Further, let C_\tau({\bf a}) (Resp. L_\tau(x_\tau)) be the objective (Resp. optimal objective) for when the summation is started from t=\tau, rather than t=0.

When a minimization problem where we minimize loss given the costs incurred is replaced with a maximization problem where we maximize winnings given the rewards received. The functions L, C and c are replaced with notation W, R and r.

Def [Hamilton-Jacobi-Bellman Equation] For a continuous-time dynamic program , the equation

is called the Hamilton-Jacobi-Bellman equation. It is the continuous time analogoue of the Bellman equation.

A Heuristic Derivation of the HJB Equation

We now argue why the Hamiliton-Jacobi-Bellman equation is a good candidate for the Bellman equation in continuous time.

A good approximation to the plant equation is

for \delta>0 small, and a good approximation for the above objective is

This follows from the definition of the Riemann Integral and we further use the fact that (1-\alpha \delta)^{t/\delta}\rightarrow e^{-\alpha t} as \delta\rightarrow 0.

The Bellman equation for the discrete time dynamic program with objective and plant equation is

If we minus L_t(x) from each side in this Bellman equation and then divide by \delta and let \delta\rightarrow 0 we get that

where here we note that, by the Chain rule,

Thus we derive the HJB equation as described above.

The following result shows that if we solve the HJB equation then we have an optimal policy.

Thrm 1 [Optimality of HJB] Suppose that a policy \Pi has a value function C_t(x,\Pi) that satisfies the HJB-equation for all t and x then, \Pi is an optimal policy.

Proof. Using shorthand C=C_t(\tilde{x}_t,\Pi): The inequality holds since the term in the square brackets is the objective of the HJB equation, which is not maximized by \tilde{\pi}_t. \square

Linear Quadratic Regularization

Def. [LQ problem] We consider a dynamic program of the form

Screenshot 2019-01-26 at 15.46.08.png

Here x_t \in\mathbb{R}^n and a_t\in\mathbb{R}^m. A and B are matrices. Q and R symmetric positive definite matrices. This an Linear-Quadratic problem (LQ problem).

Def [Riccarti Equation] The differential equation with

is called the Riccarti equation.

Thrm 2. For each time t, the optimal action for the LQ problem is

where \Lambda(t) is the solution to the Riccarti equation.

Proof. The HJB equation for an LQ problem is

We now “guess” that the solution to above HJB equation is of the form L_t(x)=x^\top \Lambda(t) x for some symmetric matrix \Lambda(t). Therefore

Substituting into the Bellman equation gives

Differentiating with respect to a gives the optimality condition

 which implies

Finally substituting into the Bellman equation, above, gives the expression

Thus the solution to the Riccarti equation has a cost function that solves the Bellman equation and thus by Theorem 1 the policy is optimal. \square

One thought on “Continuous Time Dynamic Programming”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: