We consider a continuous time analogue of Markov Decision Processes.
Time is continuous, $t \in \mathbb{R}_+$; $X_t \in \mathbb{R}^n$ is the state at time $t$; $a_t \in \mathcal{A}$ is the action at time $t$.
Def [Plant Equation] Given functions $\mu(t,x,a)$ and $\sigma(t,x,a)$, the state evolves according to a stochastic differential equation
$$
dX_t = \mu(t, X_t, a_t)\, dt + \sigma(t, X_t, a_t)\, dB_t,
$$
where $(B_t)_{t \geq 0}$ is an $m$-dimensional Brownian motion. This is called the Plant Equation.
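To make the Plant Equation concrete, here is a minimal Euler-Maruyama simulation sketch. Everything named below (the drift `mu`, volatility `sigma`, and `policy` functions, and the example dynamics) is an illustrative assumption, not part of the definition above.

```python
import numpy as np

def simulate_plant(mu, sigma, policy, x0, T=1.0, delta=1e-3, rng=None):
    """Euler-Maruyama discretization of dX_t = mu dt + sigma dB_t under a policy."""
    rng = rng or np.random.default_rng()
    n = int(T / delta)
    X = np.empty(n + 1)
    X[0] = x0
    for k in range(n):
        t = k * delta
        a = policy(t, X[k])                   # action a_t chosen by the policy
        dB = rng.normal(0.0, np.sqrt(delta))  # Brownian increment ~ N(0, delta)
        X[k + 1] = X[k] + mu(t, X[k], a) * delta + sigma(t, X[k], a) * dB
    return X

# Purely illustrative example: mean-reverting dynamics with a constant action.
path = simulate_plant(
    mu=lambda t, x, a: -a * x,    # drift mu(t, x, a)
    sigma=lambda t, x, a: 0.2,    # volatility sigma(t, x, a)
    policy=lambda t, x: 1.0,
    x0=1.0,
)
```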
A policy $\pi$ chooses an action $a_t$ at each time $t$. (We assume that $(a_t)_{t \geq 0}$ is adapted and previsible.) Let $\mathcal{P}$ be the set of policies. The (instantaneous) cost for taking action $a$ in state $x$ at time $t$ is $c(t, x, a)$, and $C(x)$ is the cost for terminating in state $x$ at time $T$.
Def [Diffusion Control Problem] Given initial state $x_0$, a diffusion control problem is the optimization
$$
V(0, x_0) = \min_{\pi \in \mathcal{P}} \mathbb{E}\left[ \int_0^T c(t, X_t, a_t)\, dt + C(X_T) \right]
$$
subject to the Plant Equation. Further, let $C_\pi(t, x)$ (resp. $V(t, x)$) be the objective (resp. optimal objective) for when the integral is started from time $t$ with $X_t = x$, rather than from time $0$ with $X_0 = x_0$.
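For a fixed policy, the objective $C_\pi(0, x_0)$ can be estimated by Monte Carlo, replacing the integral with a Riemann sum along simulated paths. This sketch reuses the hypothetical `simulate_plant` helper from above; the quadratic costs are illustrative stand-ins for $c$ and $C$.

```python
def estimate_objective(mu, sigma, policy, c, C, x0, T=1.0, delta=1e-3,
                       n_paths=1000, rng=None):
    """Monte Carlo estimate of E[ int_0^T c(t, X_t, a_t) dt + C(X_T) ]."""
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(n_paths):
        X = simulate_plant(mu, sigma, policy, x0, T, delta, rng)
        running = 0.0
        for k in range(len(X) - 1):           # Riemann sum for the running cost
            t = k * delta
            running += c(t, X[k], policy(t, X[k])) * delta
        total += running + C(X[-1])
    return total / n_paths

# Illustrative quadratic costs c(t, x, a) = x^2 + a^2 and C(x) = x^2.
value = estimate_objective(
    mu=lambda t, x, a: -a * x, sigma=lambda t, x, a: 0.2,
    policy=lambda t, x: 1.0,
    c=lambda t, x, a: x**2 + a**2, C=lambda x: x**2, x0=1.0,
)
```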
Def [Hamilton-Jacobi-Bellman Equation] For a Diffusion Control Problem, the equation
$$
0 = \min_{a} \left\{ \frac{\partial V}{\partial t} + c(t, x, a) + \mu(t, x, a) \cdot \frac{\partial V}{\partial x} + \frac{1}{2}\, \sigma\sigma^\top(t, x, a) : \frac{\partial^2 V}{\partial x^2} \right\}, \qquad V(T, x) = C(x),
$$
is called the Hamilton-Jacobi-Bellman equation.¹ It is the continuous time analogue of the Bellman equation [[DP:Bellman]].
Heuristic Derivation of the HJB equation
We heuristically derive a Bellman equation for stochastic differential equations, using our knowledge of the Bellman equation for Markov decision processes and our heuristic understanding of stochastic integration. The argument is analogous to the derivation for deterministic continuous time control.
Perhaps the main thing to remember is that (informally) the HJB equation is
$$
0 = \min_a \Big\{ \mathbb{E}\big[ dV(t, X_t) \big] + c(t, X_t, a)\, dt \Big\}.
$$
Here Itô's formula is applied to the optimal value function at time $t$, $V(t, X_t)$. This is much easier to remember (assuming you know Itô's formula).
We suppose (for simplicity) that $X_t$ belongs to $\mathbb{R}$ and is driven by a one-dimensional Brownian motion. The plant equation in Def [DCP:Plant] is approximated by
$$
X_{t+\delta} \approx X_t + \mu(t, X_t, a_t)\, \delta + \sigma(t, X_t, a_t)\, (B_{t+\delta} - B_t)
$$
for small $\delta$ (recall $B_{t+\delta} - B_t \sim \mathcal{N}(0, \delta)$). Similarly the cost function can be approximated by
$$
\sum_{k=0}^{T/\delta - 1} c(k\delta, X_{k\delta}, a_{k\delta})\, \delta + C(X_T).
$$
This follows from the definition of a Riemann integral, since $\int_0^T c(t, X_t, a_t)\, dt \approx \sum_k c(k\delta, X_{k\delta}, a_{k\delta})\, \delta$. The Bellman equation for this objective function and plant equation is that $V(t, x)$ satisfies
$$
V(t, x) = \min_a \Big\{ c(t, x, a)\, \delta + \mathbb{E}\big[ V(t+\delta, X_{t+\delta}) \,\big|\, X_t = x,\, a_t = a \big] \Big\},
$$
or, equivalently,
$$
0 = \min_a \Big\{ c(t, x, a)\, \delta + \mathbb{E}\big[ V(t+\delta, X_{t+\delta}) - V(t, X_t) \,\big|\, X_t = x,\, a_t = a \big] \Big\}.
$$
Now by Itô's formula, $V(t+\delta, X_{t+\delta}) - V(t, X_t)$ can be approximated by
$$
V(t+\delta, X_{t+\delta}) - V(t, X_t) \approx \left[ \frac{\partial V}{\partial t} + \mu \frac{\partial V}{\partial x} + \frac{1}{2} \sigma^2 \frac{\partial^2 V}{\partial x^2} \right] \delta + \sigma \frac{\partial V}{\partial x}\, (B_{t+\delta} - B_t).
$$
Thus, since the Brownian increment has mean zero,
$$
\mathbb{E}\big[ V(t+\delta, X_{t+\delta}) - V(t, X_t) \,\big|\, X_t = x,\, a_t = a \big] \approx \left[ \frac{\partial V}{\partial t} + \mu \frac{\partial V}{\partial x} + \frac{1}{2} \sigma^2 \frac{\partial^2 V}{\partial x^2} \right] \delta.
$$
Substituting this into the above Bellman equation, dividing by $\delta$ and letting $\delta \rightarrow 0$, we get, as required,
$$
0 = \min_a \left\{ \frac{\partial V}{\partial t} + c(t, x, a) + \mu(t, x, a) \frac{\partial V}{\partial x} + \frac{1}{2} \sigma^2(t, x, a) \frac{\partial^2 V}{\partial x^2} \right\}.
$$
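The discretized Bellman equation can also be solved numerically by backward induction on a state grid, which is one way to approximate $V$ (and hence the HJB solution) for small $\delta$. The sketch below assumes a one-dimensional state and a small finite action set, and approximates the Gaussian expectation with a three-point Gauss-Hermite rule; all dynamics and costs are the illustrative ones from earlier.

```python
import numpy as np

def bellman_backup(mu, sigma, c, C, actions, xs, T=1.0, delta=0.01):
    """Backward induction for V(t, x) = min_a { c*delta + E[V(t+delta, X')] }."""
    n_steps = int(T / delta)
    V = np.array([C(x) for x in xs], dtype=float)      # terminal condition V(T, x) = C(x)
    zs = np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])  # 3-point Gauss-Hermite nodes
    ws = np.array([1.0 / 6, 2.0 / 3, 1.0 / 6])         # matching weights
    for k in reversed(range(n_steps)):
        t = k * delta
        V_next, V = V, np.empty_like(V)
        for i, x in enumerate(xs):
            best = np.inf
            for a in actions:
                # Candidate next states X' = x + mu*delta + sigma*sqrt(delta)*Z.
                xp = x + mu(t, x, a) * delta + sigma(t, x, a) * np.sqrt(delta) * zs
                EV = ws @ np.interp(xp, xs, V_next)    # E[V(t+delta, X')] by quadrature
                best = min(best, c(t, x, a) * delta + EV)
            V[i] = best
    return V    # approximates V(0, x) on the grid xs

# Illustrative use with the running example's dynamics and costs.
xs = np.linspace(-2.0, 2.0, 81)
V0 = bellman_backup(
    mu=lambda t, x, a: -a * x, sigma=lambda t, x, a: 0.2,
    c=lambda t, x, a: x**2 + a**2, C=lambda x: x**2,
    actions=[0.0, 0.5, 1.0], xs=xs,
)
```

Note that `np.interp` extrapolates as a constant beyond the grid, which is a crude truncation of the state space, so this is a rough sketch rather than a convergent scheme.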
The following gives a rigorous proof that the HJB equation is the right object to consider for a diffusion control problem.
Thrm [Davis-Varaiya Martingale Principle of Optimality] Suppose that there exists a function $V(t, x)$ with $V(T, x) = C(x)$ and such that for any policy $\pi$ with states $(X_t)_{t \geq 0}$,
$$
M_t := \int_0^t c(s, X_s, a_s)\, ds + V(t, X_t)
$$
is a sub-martingale and, moreover, that for some policy $\pi^*$, $(M_t)_{t \geq 0}$ is a martingale. Then $\pi^*$ is optimal and
$$
V(0, x_0) = \min_{\pi \in \mathcal{P}} C_\pi(0, x_0).
$$
Since $(M_t)_{t \geq 0}$ is a sub-martingale for all $\pi \in \mathcal{P}$, we have
$$
V(0, x_0) = \mathbb{E}[M_0] \leq \mathbb{E}[M_T] = \mathbb{E}\left[ \int_0^T c(t, X_t, a_t)\, dt + C(X_T) \right] = C_\pi(0, x_0).
$$
Therefore $V(0, x_0) \leq C_\pi(0, x_0)$ for all policies $\pi$. If $(M_t)_{t \geq 0}$ is a martingale for policy $\pi^*$, then by the same argument $V(0, x_0) = C_{\pi^*}(0, x_0)$. Thus $C_{\pi^*}(0, x_0) \leq C_\pi(0, x_0)$ for all policies and so $\pi^*$ is optimal, and it holds that
$$
V(0, x_0) = \min_{\pi \in \mathcal{P}} C_\pi(0, x_0). \qquad \square
$$
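To see why this principle singles out the HJB equation, one can apply Itô's formula to $M_t$ (a heuristic sketch, under the same smoothness assumptions as in the derivation above):
$$
dM_t = c(t, X_t, a_t)\, dt + dV(t, X_t) = \left[ c + \frac{\partial V}{\partial t} + \mu \cdot \frac{\partial V}{\partial x} + \frac{1}{2}\, \sigma\sigma^\top : \frac{\partial^2 V}{\partial x^2} \right] dt + \frac{\partial V}{\partial x} \cdot \sigma\, dB_t.
$$
The sub-martingale condition for every policy asks that the drift term be non-negative for every action, and the martingale condition for $\pi^*$ asks that it vanish along the optimal actions; together these say that the minimum of the drift over $a$ is zero, which is exactly the HJB equation.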
- Here $\sigma\sigma^\top : \frac{\partial^2 V}{\partial x^2}$ is the dot-product of the Hessian matrix $\frac{\partial^2 V}{\partial x^2}$ with $\sigma\sigma^\top$. I.e. we multiply component-wise and sum up terms: $A : B = \sum_{i,j} A_{ij} B_{ij}$.↩