# Markov Decision Processes

Markov decision processes are essentially the randomized equivalent of a dynamic program.

### A Random Example

Let’s first consider how to randomize the tree example introduced earlier.

Below is a tree with a root node and four leaf nodes colored grey. At the root node you choose to go left or right. This incurs costs $4$ and $2$, respectively. Further, after making this decision there is a probability of reaching each leaf node. Namely, after going left the probabilities are $0.5$ & $0.5$, and after turning right, the probabilities are $0.25$ & $0.75$. Each leaf node carries a cost, namely, $2, 3, 6,$ and $1$.

Given you only know the probabilities (and not what happens when you choose left or right), you want to take the decision with the lowest expected cost. The expected cost for left is $4 + 0.5 \times 2 + 0.5 \times 3 = 6.5$ and for right is $2 + 0.25 \times 6 + 0.75 \times 1 = 4.25$. So go right.

Below we now replace the numbers above with symbols. At the root node you can choose the action to go left or right. These decisions incur costs $l_{\text{left}}$ and $l_{\text{right}}$, respectively. After choosing left, you will move to state $A$ with probability $p_{\text{left},A}$ or to state $B$ with probability $p_{\text{left},B}$; similarly, after choosing right, states $C$ & $D$ can be reached with probabilities $p_{\text{right},C}$ & $p_{\text{right},D}$. After reaching node $A$ (resp. $B$, $C$, $D$) the total expected cost thereafter is $L_A$ (resp. $L_B$, $L_C$, $L_D$).

Show that the optimal expected cost from the root node, $L_R$, satisfies

$$
L_R = \min_{a \in \{\text{left},\, \text{right}\}} \big\{\, l_a + \mathbb{E}\left[ L_{X_a} \right] \big\},
$$

where here the random variable $X_a$ denotes the state in $\{A,B,C,D\}$ reached after action $a$ is taken. The cost from choosing “left“ is:

$$
l_{\text{left}} + p_{\text{left},A}\, L_A + p_{\text{left},B}\, L_B ,
$$

and the cost for choosing “right” is:

$$
l_{\text{right}} + p_{\text{right},C}\, L_C + p_{\text{right},D}\, L_D .
$$

The optimal cost is the minimum of these two:

$$
L_R = \min\big\{\, l_{\text{left}} + p_{\text{left},A} L_A + p_{\text{left},B} L_B ,\;\; l_{\text{right}} + p_{\text{right},C} L_C + p_{\text{right},D} L_D \,\big\} = \min_{a} \big\{\, l_a + \mathbb{E}\left[ L_{X_a} \right] \big\},
$$

where the random variable $X_a$ denotes the state in $\{A,B,C,D\}$ reached after action $a$ is taken. Notice how we abstracted the future behaviour after arriving at $A$, $B$, $C$, $D$ into a single cost for each state: $L_A$, $L_B$, $L_C$, $L_D$. We can then propagate these costs back to get the cost at the root state $R$. That is, we can apply essentially the same principle as in dynamic programming.
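In code, the recursion $L_R = \min_a \{ l_a + \mathbb{E}[L_{X_a}] \}$ for this tree can be sketched as follows, using the numbers from the numeric example (with the left-branch probabilities taken to be $0.5$ each):

```python
# Expected-cost recursion L_R = min_a { l_a + E[L_{X_a}] } for the two-level tree.
# Numbers follow the example above; left-branch probabilities assumed 0.5 each.

# immediate cost of each action at the root
action_cost = {"left": 4.0, "right": 2.0}

# transition probabilities to leaf states after each action
transition = {
    "left": {"A": 0.5, "B": 0.5},
    "right": {"C": 0.25, "D": 0.75},
}

# cost-to-go at each leaf state
leaf_cost = {"A": 2.0, "B": 3.0, "C": 6.0, "D": 1.0}

def expected_cost(action):
    """l_a + E[L_{X_a}]: immediate cost plus expected leaf cost."""
    return action_cost[action] + sum(
        p * leaf_cost[s] for s, p in transition[action].items()
    )

L_R = min(expected_cost(a) for a in action_cost)   # optimal cost at the root
best = min(action_cost, key=expected_cost)         # optimal action
```

Here `best` comes out as `"right"` with $L_R = 4.25$, matching the hand calculation.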

## Definitions

A Markov Decision Process (MDP) is a Dynamic Program where the state evolves in a random (Markovian) way.

Def [Markov Decision Process] Like with a dynamic program, we consider discrete times $t=0,1,...,T$, states $x\in\mathcal{X}$, actions $a\in\mathcal{A}_t$ and rewards $r_t(a,x)$. However, the plant equation and definition of a policy are slightly different.

Like with a Markov chain, the state evolves as a random function of the current state and action, $f: \mathcal{X}\times \mathcal{A}_t \times [0,1] \rightarrow \mathcal{X}$. Here

$$
X_{t+1} = f(X_t, a_t, U_{t+1}),
$$

where $(U_t)_{t\geq 0}$ are IIDRVs uniform on $[0,1]$. This is called the Plant Equation.

A policy $\pi$ chooses an action $\pi_t$ at each time $t$ as a function of past states $x_0,...,x_t$ and past actions $\pi_0,...,\pi_{t-1}$. We let $\mathcal{P}$ be the set of policies. A policy, a plant equation, and the resulting sequence of states and rewards describe a Markov Decision Process.

As noted in the equivalence above, we will usually suppress dependence on $U_t$. Also, we will use the notation

$$
P(\hat{x} \,|\, x, a) := \mathbb{P}\big( X_{t+1} = \hat{x} \,\big|\, X_t = x,\, a_t = a \big) = \mathbb{P}\big( f(x, a, U) = \hat{x} \big)
$$

and

$$
\mathbb{E}\big[ V(\hat{X}) \,\big|\, x, a \big] := \sum_{\hat{x} \in \mathcal{X}} V(\hat{x})\, P(\hat{x} \,|\, x, a) = \mathbb{E}\big[ V\big( f(x, a, U) \big) \big],
$$

where here and hereafter we use $\hat{X}$ to denote the next state (after taking action $a$ in state $x$). Notice that in both equalities above, the term on the right depends on only one random variable, $U$.
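Both identities can be checked numerically. Below is a minimal sketch with a two-state chain, where the plant equation is implemented by inverse transform of a uniform $U$; the transition parameters and value function are hypothetical:

```python
import random

# Toy plant equation f(x, a, u) for states {0, 1}: action a sets the
# probability q[a] of moving to state 1 (independent of x in this toy).
q = {"stay": 0.2, "go": 0.7}           # hypothetical transition parameters

def f(x, a, u):
    return 1 if u < q[a] else 0

def V(x):                              # an arbitrary next-state value function
    return 10.0 if x == 1 else 3.0

random.seed(0)
x, a = 0, "go"
samples = [f(x, a, random.random()) for _ in range(200_000)]

# P(xhat = 1 | x, a) = P(f(x, a, U) = 1): Monte Carlo vs the exact value
p_hat = sum(samples) / len(samples)
assert abs(p_hat - q[a]) < 0.01

# E[V(Xhat) | x, a] = E[V(f(x, a, U))]: Monte Carlo vs the exact sum
ev_mc = sum(V(s) for s in samples) / len(samples)
ev_exact = q[a] * V(1) + (1 - q[a]) * V(0)
assert abs(ev_mc - ev_exact) < 0.1
```

The point of the sketch is that both the transition probability and the conditional expectation reduce to a plain expectation over the single uniform random variable $U$.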

The objective is to find a policy that optimizes the expected reward.

Def [Markov Decision Problem] Given initial state $x_0$, a Markov Decision Problem is the following optimization:

$$
\max_{\Pi \in \mathcal{P}} \; \mathbb{E}\left[ \sum_{t=0}^{T-1} r_t(X_t, \pi_t) + r_T(X_T) \right].
$$

Further, let $R_\tau(x_\tau, \Pi)$ (resp. $V_\tau(x_\tau)$) be the objective (resp. optimal objective) when the summation is started from time $t=\tau$ and state $X_\tau=x_\tau$, rather than $t=0$ and $X_0=x_0$. We often call $V$ the value function of the MDP.

The next result shows that the Bellman equation follows essentially as before, but now we must account for the expected value of the next state.

Thrm [Bellman Equation] Setting $V_T(x)=r_T(x)$, for $t=T-1,T-2,...,0$

$$
V_t(x) = \max_{a \in \mathcal{A}_t} \Big\{ r_t(x, a) + \mathbb{E}\big[ V_{t+1}(\hat{X}) \,\big|\, x, a \big] \Big\}.
$$

The above equation is Bellman’s equation for a Markov Decision Process.

Let $\mathcal{P}_t$ be the set of policies that can be implemented from time $t$ to $T$. Notice it is the product of the actions at time $t$ and the set of policies from time $t+1$ onward. That is, $\mathcal{P}_t = \{ (\pi_t,\Pi): \Pi \in\mathcal{P}_{t+1}, \pi_t: \mathcal{X}^t \times \prod_{\tau = 0}^{t-1}\mathcal{A}_\tau \rightarrow \mathcal{A}_t \}$. Thus

$$
\begin{aligned}
V_t(x) &= \sup_{\Pi \in \mathcal{P}_t} \mathbb{E}\big[ R_t(x, \Pi) \big] \\
&= \sup_{(\pi_t, \Pi) \in \mathcal{P}_t} \Big\{ r_t(x, \pi_t) + \mathbb{E}\Big[ \, \mathbb{E}\big[ R_{t+1}(\hat{X}, \Pi) \,\big|\, \hat{X} \big] \Big] \Big\} \\
&= \max_{a \in \mathcal{A}_t} \Big\{ r_t(x, a) + \mathbb{E}\big[ V_{t+1}(\hat{X}) \,\big|\, x, a \big] \Big\}.
\end{aligned}
$$

The 2nd equality uses the structure of $\mathcal{P}_t$, takes the $r_t$ term out, and then takes conditional expectations. The 3rd equality takes the supremum over $\mathcal{P}_{t+1}$, which does not depend on $\pi_t$, inside the expectation, and notes that the supremum over $\pi_t$ is attained at a fixed action $a\in\mathcal{A}_t$ (i.e. the past information did not help us).
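The backward recursion of the Bellman equation translates directly into code. Here is a minimal sketch for a finite-state, finite-action MDP with time-independent rewards and transitions (the two-state instance at the bottom is hypothetical):

```python
def backward_induction(states, actions, r, r_T, P, T):
    """Solve V_t(x) = max_a { r(x,a) + sum_xh P(xh|x,a) V_{t+1}(xh) },
    with terminal condition V_T(x) = r_T(x).
    P[x][a] is a dict {next_state: probability}."""
    V = {x: r_T(x) for x in states}            # V_T
    policy = []
    for t in range(T - 1, -1, -1):
        newV, pi_t = {}, {}
        for x in states:
            vals = {
                a: r(x, a) + sum(p * V[xh] for xh, p in P[x][a].items())
                for a in actions
            }
            pi_t[x] = max(vals, key=vals.get)  # maximizing action at (t, x)
            newV[x] = vals[pi_t[x]]
        V = newV
        policy.append(pi_t)
    return V, list(reversed(policy))           # V_0 and rules for t = 0..T-1

# Hypothetical two-state example: reward 1 for playing "a" in state 1.
states, actions = [0, 1], ["a", "b"]
P = {
    0: {"a": {0: 0.9, 1: 0.1}, "b": {0: 0.2, 1: 0.8}},
    1: {"a": {1: 1.0}, "b": {0: 0.5, 1: 0.5}},
}
r = lambda x, a: 1.0 if (x == 1 and a == "a") else 0.0
V0, policy = backward_induction(states, actions, r, lambda x: 0.0, P, T=5)
```

From state 1 the best plan is to play `"a"` forever, collecting reward 1 per step, so $V_0(1) = 5$ over a horizon of $T=5$.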

Ex. You need to sell a car. At every time $t=0,...,T-1$, you set a price $p_t$ and a customer then views the car. The probability that the customer buys the car at price $p$ is $D(p)$. If the car isn’t sold by time $T$ then it is sold for fixed price $V_T$, $V_T<1$. Maximize the reward from selling the car and find the recursion for the optimal reward when $D(p) = (1-p)_+$.
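As a numerical sketch for $D(p)=(1-p)_+$: the recursion is $V_t = \max_p \{ D(p)\,p + (1-D(p))\,V_{t+1} \}$, and maximizing the quadratic suggests $p_t^* = (1+V_{t+1})/2$ with $V_t = ((1+V_{t+1})/2)^2$. The snippet below checks this closed form against a grid search (horizon and scrap value are hypothetical):

```python
# Backward recursion V_t = max_p { (1-p) * p + p * V_{t+1} }, V_T given.
# Candidate closed form: p* = (1 + V_{t+1}) / 2, V_t = ((1 + V_{t+1}) / 2)**2.

T, V_T = 5, 0.3          # hypothetical horizon and scrap value V_T < 1

def step_grid(v_next, n=100_001):
    """max over a fine grid of prices p in [0, 1]."""
    return max((1 - k / (n - 1)) * (k / (n - 1)) + (k / (n - 1)) * v_next
               for k in range(n))

v = V_T
for t in range(T - 1, -1, -1):
    closed = ((1 + v) / 2) ** 2
    assert abs(step_grid(v) - closed) < 1e-6   # grid search agrees
    v = closed                                 # v is now V_t

# v is now V_0, the optimal expected reward from time 0
```

Note how the value increases toward 1 as the horizon grows, since $v \mapsto ((1+v)/2)^2$ has fixed point 1.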

Ex. You own a call option with strike price $p$. At any time you can exercise the option, buying a share at price $p$ and making profit $X_t-p$, where $X_t$ is the price of the share at time $t$. The option must be exercised by time $T$. The price of the stock satisfies

$$
X_{t+1} = X_t + \epsilon_t ,
$$

for $\epsilon_t$ IIDRVs with finite expectation. Show that there exists a decreasing sequence $\{a_t\}_{0\leq t \leq T}$ such that it is optimal to exercise whenever $X_s \geq a_s$ occurs.
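The threshold structure can be seen numerically. A sketch on a bounded integer price grid, taking $\epsilon_t \in \{-1,+1\}$ with negative drift (both the grid and the distribution are assumptions for illustration):

```python
# Thresholds a_t for the call option, on an integer price grid 0..N with
# epsilon = -1 w.p. 0.6 and +1 w.p. 0.4 (assumed), prices clamped to the grid.

p, T, N = 10, 8, 40                            # strike, horizon, grid size

V = [max(x - p, 0) for x in range(N + 1)]      # V_T: value at the deadline
thresholds = []
for t in range(T - 1, -1, -1):
    # continuation value E[V_{t+1}(x + epsilon)]
    cont = [0.6 * V[max(x - 1, 0)] + 0.4 * V[min(x + 1, N)]
            for x in range(N + 1)]
    V = [max(x - p, cont[x]) for x in range(N + 1)]
    # a_t: the smallest price at which immediate exercise is (weakly) optimal
    thresholds.append(next(x for x in range(N + 1) if x - p >= cont[x]))

thresholds.reverse()                           # thresholds[t] = a_t, t = 0..T-1
# the sequence decreases in t: early on, only a high price justifies exercise
assert all(a >= b for a, b in zip(thresholds, thresholds[1:]))
```

The monotonicity reflects that $V_t \geq V_{t+1}$ pointwise (more time to wait cannot hurt), so the exercise region only grows as the deadline approaches.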

Ex. You own an expensive fish. Each day you are offered a price for the fish according to a distribution with density $f(x)$. You may accept or reject this offer. With probability $1-p$ the fish dies that day. Find the policy that maximizes the expected profit from selling the fish.
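A sketch with Uniform$[0,1]$ offers (an assumed distribution): if $V$ is the value of holding the fish, rejecting keeps it alive with probability $p$, so $V$ solves the fixed-point equation $V = \mathbb{E}[\max(X, pV)]$ and the policy accepts any offer of at least $pV$:

```python
# Fixed-point iteration for the fish problem with offers X ~ Uniform[0, 1]
# (an assumed offer distribution). Rejecting keeps the fish alive with
# probability p, so V = E[max(X, p*V)]; accept any offer >= p*V.
p = 0.9                            # hypothetical daily survival probability

V = 0.0
for _ in range(200):
    c = p * V                      # continuation value of rejecting today
    # E[max(X, c)] for X ~ U[0,1]:  c*c + integral_c^1 x dx = (1 + c**2) / 2
    V = (1 + c * c) / 2

threshold = p * V                  # accept any offer >= threshold
assert abs(V - (1 + (p * V) ** 2) / 2) < 1e-9   # fixed point reached
```

The iteration converges because the map $V \mapsto (1+(pV)^2)/2$ is a contraction on $[0,1]$.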

Ex. Consider an MDP where rewards are now random, i.e. after we have specified the state and action, the reward $r(x,a)$ is still an independent random variable. Argue that this is the same as an MDP with non-random rewards given by

$$
\bar{r}(x, a) = \mathbb{E}\big[ r(x, a) \big].
$$

Ex. Indiana Jones is trapped in a room in a temple. There are $n$ passages that he can try to escape from. If he attempts to escape from passage $i\in \{1,...,n\}$ then either: he escapes with probability $p_i$; he dies with probability $q_i$; or, with probability $r_i=1-p_i-q_i$, the passage is a dead end and he returns to the room from which he started. Determine the order of passages which Indiana Jones must try in order to maximize his probability of escape.
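A brute-force check on a small hypothetical instance; an interchange argument suggests trying the passages in decreasing order of $p_i/q_i$, and the enumeration below agrees for these numbers:

```python
from itertools import permutations

# Small hypothetical instance of the escape problem.
p = [0.3, 0.2, 0.4]                          # escape probabilities
q = [0.2, 0.05, 0.3]                         # death probabilities
r = [1 - pi - qi for pi, qi in zip(p, q)]    # dead-end (return) probabilities

def escape_prob(order):
    """P(escape) = sum_k (prod of r's of earlier passages) * p_k."""
    prob, alive = 0.0, 1.0
    for i in order:
        prob += alive * p[i]
        alive *= r[i]
    return prob

best = max(permutations(range(3)), key=escape_prob)
by_ratio = tuple(sorted(range(3), key=lambda i: p[i] / q[i], reverse=True))
assert best == by_ratio                      # decreasing p_i / q_i wins here
```

Comparing two adjacent passages $i, j$ shows trying $i$ first is better exactly when $p_i q_j \geq p_j q_i$, which is the $p_i/q_i$ ordering.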

Ex. You invest in properties. The total value of these properties is $x_t$ in year $t=1,...,T$.
Each year $t$, you gain rent of $rx_t$ and you choose to consume a proportion $a_t\in [0,1]$ of this rent. The remaining proportion is reinvested in buying new property.
Further you pay mortgage payments of $m x_t$ which are deducted from your consumed wealth. Here $m.
Your objective is to maximize the wealth consumed over $T$ years.

Briefly explaining why, show that we can express this problem as a finite-time dynamic program with

$$
x_{t+1} = x_t \big( 1 + r (1 - a_t) \big) \quad \text{and} \quad r_t(x_t, a_t) = (a_t r - m)\, x_t ,
$$

prove that if $W_{T-s}(x) = x \rho_{s}$ for some constant $\rho_s$ then

$$
W_{T-s-1}(x) = x \rho_{s+1}, \qquad \text{where} \quad \rho_{s+1} = \rho_s - m + r \max\{ 1, \rho_s \}.
$$
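The claimed linearity can be checked numerically. A sketch with hypothetical rates, assuming dynamics $x_{t+1} = x_t(1 + r(1-a_t))$, per-year consumption $(a_t r - m)x_t$, and the recursion $\rho_{s+1} = \rho_s - m + r\max\{1,\rho_s\}$ (all of these are assumptions read off the exercise, not given values):

```python
# Numerical check that W_{T-s}(x) = rho_s * x under the assumed dynamics
# x' = x * (1 + r * (1 - a)) with yearly consumption (a * r - m) * x.

r, m, T = 0.10, 0.04, 6                     # hypothetical rent / mortgage rates
grid = (0.0, 0.25, 0.5, 0.75, 1.0)          # action grid; the objective is
                                            # linear in a, so an endpoint wins

def W(x, t):
    """Brute-force backward induction over the action grid."""
    if t == T:
        return 0.0                          # nothing is consumed after year T
    return max((a * r - m) * x + W(x * (1 + r * (1 - a)), t + 1) for a in grid)

# rho recursion: rho_{s+1} = rho_s - m + r * max(1, rho_s), with rho_0 = 0
rho = 0.0
for _ in range(T):
    rho = rho - m + r * max(1.0, rho)

# the brute-force value is linear in x with slope rho, as claimed
for x in (1.0, 5.0, 13.0):
    assert abs(W(x, 0) - rho * x) < 1e-9 * max(1.0, x)
```

With these rates $\rho_s < 1$ throughout, so consuming all rent ($a_t = 1$) is optimal and $\rho_s = (r-m)s$.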