# Markov Decision Processes

• Markov Decision Problems; Bellman’s Equation; Two examples

A Markov Decision Process is a Dynamic Program where the state evolves in a random/Markovian way.

As in the post on Dynamic Programming, we consider discrete times $t=0,1,...,T$, states $x\in\mathcal{X}$, actions $a\in\mathcal{A}_t$ and rewards $r_t(a_t,x_t)$. However, the plant equation and definition of a policy are slightly different.

Def 1 [Plant Equation] The state evolves according to functions $f_t: \mathcal{X}\times \mathcal{A}_t \times [0,1] \rightarrow \mathcal{X}$. Here

$$x_{t+1} = f_t(x_t, a_t, U_t),$$

where $(U_t)_{t\geq 0}$ are IIDRVs uniform on $[0,1]$. This is called the Plant Equation. As noted in the equivalence above, we will often suppress dependence on $U_t$ and write $x_{t+1} = f_t(x_t, a_t)$. Also, we will use the notation

$$P_t(\hat{x}\,|\,x,a) := \mathbb{P}\left( f_t(x, a, U_t) = \hat{x} \right)$$

for the induced transition probabilities.
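To make the plant equation concrete, here is a small sketch (the two-state chain, the "repair effort" action, and all probabilities are made-up illustrations, not from the post): a Markov chain written as $x_{t+1} = f_t(x_t, a_t, U_t)$ with $U_t$ uniform on $[0,1]$.

```python
import random

# Hypothetical example: state 0 = "working", state 1 = "broken"; the action
# a in [0, 1] is a made-up repair effort that lowers the chance of breaking.

def f(x, a, u):
    """Plant equation: next state as a deterministic function of (x, a, u)."""
    if x == 0:
        return 1 if u < 0.5 * (1 - a) else 0   # break with prob 0.5 * (1 - a)
    return 0 if u < 0.3 else 1                 # get repaired with prob 0.3

def simulate(x0, policy, T, seed=0):
    """Roll the plant equation forward with IID uniform draws U_t."""
    rng = random.Random(seed)
    xs = [x0]
    for t in range(T):
        xs.append(f(xs[-1], policy(xs[-1]), rng.random()))
    return xs

# Full effort (a = 1) makes state 0 absorbing, so the path never leaves 0.
path = simulate(0, lambda x: 1.0, T=10)
```

Writing randomness through an explicit uniform variable, as in Def 1, is what lets the same $U_t$ notation cover both simulation and the transition-probability view.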

Def 2 [Policy] A policy $\pi$ chooses an action $\pi_t$ at each time $t$ as a function of past states $x_0,...,x_t$ and past actions $\pi_0,...,\pi_{t-1}$. We let $\mathcal{P}$ be the set of policies.

A policy, a plant equation, and the resulting sequence of states and rewards describe a Markov Decision Process. The objective is to find a policy that optimizes the following objective.

Def 3 [Markov Decision Problem] Given initial state $x_0$, a Markov Decision Problem is the following optimization:

$$\max_{\Pi\in\mathcal{P}} \; R_0(x_0,\Pi), \qquad \text{where} \quad R_0(x_0,\Pi) = \mathbb{E}\left[ \sum_{t=0}^{T-1} r_t(a_t, x_t) + r_T(x_T) \right].$$

Further, let $R_\tau(x_\tau, \Pi)$ (resp. $W_\tau(x_\tau)$) be the objective (resp. optimal objective) when the summation starts from time $t=\tau$ and state $x_\tau$, rather than $t=0$ and $x_0$.

Ex 1 [the Bellman Equation] Setting $W_T(x)=r_T(x)$, show that for $t=T-1,T-2,...,0$

$$W_t(x) = \max_{a\in\mathcal{A}_t} \Big\{ r_t(a,x) + \mathbb{E}\big[ W_{t+1}(f_t(x,a,U_t)) \big] \Big\}.$$

The above equation is Bellman’s equation for a Markov Decision Process.
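Bellman’s equation above can be solved by backward induction, computing $W_T, W_{T-1}, \ldots, W_0$ in turn. The sketch below is illustrative, not from the post: the transition kernel `P`, the rewards `r` and `r_T`, and the toy two-state example are all made up.

```python
def backward_induction(states, actions, P, r, r_T, T):
    """W[t][x] = optimal reward-to-go W_t(x); policy[t][x] = a maximising action.

    P[(x, a)] is a dict {next state y: probability}, r(t, a, x) is the stage
    reward and r_T the terminal reward (all hypothetical names).
    """
    W = [dict() for _ in range(T + 1)]
    policy = [dict() for _ in range(T)]
    for x in states:
        W[T][x] = r_T(x)                                  # W_T(x) = r_T(x)
    for t in reversed(range(T)):
        for x in states:
            # W_t(x) = max_a { r_t(a, x) + E[ W_{t+1}(x_{t+1}) ] }
            vals = {a: r(t, a, x) + sum(q * W[t + 1][y]
                                        for y, q in P[(x, a)].items())
                    for a in actions}
            policy[t][x] = max(vals, key=vals.get)
            W[t][x] = vals[policy[t][x]]
    return W, policy

# Toy example: from state 0 the action "go" pays 1 and moves to the
# absorbing state 1, which carries terminal reward 1.
P = {(0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
     (1, "stay"): {1: 1.0}, (1, "go"): {1: 1.0}}
W, policy = backward_induction(
    states=[0, 1], actions=["stay", "go"], P=P,
    r=lambda t, a, x: 1.0 if (x == 0 and a == "go") else 0.0,
    r_T=lambda x: float(x), T=2)
```

Note the recursion only ever looks one step ahead, which is exactly the reduction from optimizing over all of $\mathcal{P}$ to a per-state maximization over $\mathcal{A}_t$.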

Ex 2 You need to sell a car. At every time $t=0,...,T-1$, you set a price $p_t$ and a customer then views the car. The probability that the customer buys the car at price $p$ is $D(p)$. If the car isn’t sold by time $T$ then it is sold for a fixed price $W_T$, $W_T<1$. Maximize the reward from selling the car and find the recursion for the optimal reward when $D(p) = (1-p)_+$.

Ex 3 You own a call option with strike price $p$. Here you can buy a share at price $p$, making profit $x_t-p$ where $x_t$ is the price of the share at time $t$. The option must be exercised by time $T$. The price of the share satisfies

$$x_{t+1} = x_t + \epsilon_t$$

for $(\epsilon_t)_{t\geq 0}$ IIDRVs with finite expectation. Show that there exists a decreasing sequence $\{a_t\}_{0\leq t \leq T}$ such that it is optimal to exercise whenever $x_s \geq a_s$ occurs.

Ans 1 $\mathcal{P}_t$, the set of policies that can be implemented from time $t$ to $T$, is the product of the set of actions at time $t$ and the set of policies from time $t+1$ onward. That is $\mathcal{P}_t = \{ (\pi_t,\Pi): \Pi \in\mathcal{P}_{t+1}, \pi_t: \mathcal{X}^t \times \prod_{\tau = 0}^{t-1}\mathcal{A}_\tau \rightarrow \mathcal{A}_t \}$.

Proceeding by backward induction,

$$
\begin{aligned}
W_t(x_t) &= \sup_{(\pi_t,\Pi)\in\mathcal{P}_t} \mathbb{E}\big[ r_t(\pi_t, x_t) + R_{t+1}(x_{t+1}, \Pi) \big] \\
&= \sup_{\pi_t} \Big\{ r_t(\pi_t, x_t) + \sup_{\Pi\in\mathcal{P}_{t+1}} \mathbb{E}\big[\, \mathbb{E}\big[ R_{t+1}(x_{t+1},\Pi) \,\big|\, x_{t+1} \big] \,\big] \Big\} \\
&= \max_{a\in\mathcal{A}_t} \Big\{ r_t(a, x_t) + \mathbb{E}\big[ W_{t+1}(f_t(x_t, a, U_t)) \big] \Big\}.
\end{aligned}
$$

The 2nd equality uses the structure of $\mathcal{P}_t$, takes the $r_t$ term out and then takes conditional expectations. The 3rd equality takes the supremum over $\mathcal{P}_{t+1}$, which does not depend on $\pi_t$, inside the expectation and notes that the supremum over $\pi_t$ is attained at a fixed action $a\in\mathcal{A}_t$ (i.e. the past information did not help us).

Ans 2 The optimal reward from time $t$ onward (assuming you still have the car) satisfies

$$W_t = \max_{p} \big\{ p D(p) + (1-D(p)) W_{t+1} \big\},$$

with $W_T$ the fixed salvage price.

If $D(p)=(1-p)_+$ then the optimal price is $p_t = (1+W_{t+1})/2$ and $W_t= (1+W_{t+1})^2/4$. (Note the recursion is monotone and $W_0$ converges to a fixed point as $T\rightarrow \infty$.)

Ans 3 The Bellman equation here is $W_t(x) = \max\big\{ x-p,\; \mathbb{E}\left[ W_{t+1}(x+\epsilon_t) \right] \big\}$ with $W_T(x)=\max\{x-p,0\}$. Taking the Bellman equation and subtracting $x$ gives

$$W_t(x) - x = \max\big\{ -p,\; \mathbb{E}\left[ W_{t+1}(x+\epsilon_t) \right] - x \big\},$$

so it is optimal to exercise at time $t$ exactly when $\mathbb{E}\left[ W_{t+1}(x+\epsilon_t) \right] - x \leq -p$. We now need two facts.

1) $W_t(x)\geq W_{t+1}(x)$, because the policies that can be implemented from time $t+1$ form a subset of those available from time $t$.

2) $W_t(x)-x$ is decreasing and continuous in $x$, because $W_T(x)-x= \max \{-p, -x\}$ is and, writing $\mathbb{E}[W_{t+1}(x+\epsilon_t)]-x = \mathbb{E}[W_{t+1}(x+\epsilon_t)-(x+\epsilon_t)]+\mathbb{E}[\epsilon_t]$, expectation and maximum preserve these properties, so the claim follows by backward induction on $t$.

Thus the point $a_t$ where $a_t - \mathbb{E}\left[W_{t+1}(a_t+\epsilon_t)\right] = p$ must decrease in $t$: by 1) the continuation value $\mathbb{E}[W_{t+1}(x+\epsilon_t)]-x$ decreases pointwise as $t$ increases, and by 2) it is decreasing in $x$, so its crossing point with $-p$ decreases. By 2) again, exercising is optimal exactly when $x \geq a_t$.
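A small numerical sketch of this answer (not from the post): the strike $p$, the noise distribution and the price grid below are made-up choices, with $\epsilon_t$ given negative mean so that early exercise can be optimal.

```python
# Grid-based backward induction for W_t(x) = max{ x - p, E[W_{t+1}(x + eps)] },
# checking that the exercise thresholds a_t form a decreasing sequence in t.

p = 5
eps = [-2, 1]                      # equally likely steps; E[eps] = -0.5
grid = list(range(0, 61))          # integer price grid, clamped at the ends

def cont(W_next, x):
    """Continuation value E[W_{t+1}(x + eps)], clamping x + eps to the grid."""
    return sum(W_next[min(max(x + e, 0), 60)] for e in eps) / len(eps)

def bellman_step(W_next):
    """One backward step of the option's Bellman equation on the grid."""
    return [max(x - p, cont(W_next, x)) for x in grid]

def exercise_threshold(W_next):
    """Smallest x at which exercising is at least as good as continuing."""
    for x in grid:
        if x - p >= cont(W_next, x):
            return x
    return None

W = [max(x - p, 0) for x in grid]  # W_T
thresholds = []                    # thresholds[k] = a_{T-1-k}
for _ in range(10):
    thresholds.append(exercise_threshold(W))
    W = bellman_step(W)
```

Since `thresholds` runs backward in time, the claim "$a_t$ decreases in $t$" shows up as the list being nondecreasing in its index.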