Algorithms for MDPs

For infinite time horizon MDPs, we cannot apply induction on Bellman's equation from some terminal time, as we could for finite time horizon MDPs. So we need algorithms to solve these MDPs.

At a high level, an algorithm for solving a Markov Decision Process (where the transitions P^{a}_{xy} are known) involves two steps:

  • (Policy Improvement) Here you take your current policy \pi_0 and find a new improved policy \pi, for instance by solving Bellman's equation:

    \pi(x) \in \arg\max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y,\pi_0) \Big\}.

  • (Policy Evaluation) Here you find the value of your policy, for instance by finding the reward function for policy \pi:

    R(x,\pi) = r(x,\pi(x)) + \beta \sum_{y} P^{\pi(x)}_{xy} R(y,\pi).

Value iteration

Value iteration provides an important practical scheme for approximating the solution of an infinite time horizon Markov decision process.

Def. Take V_0(x)=0 \forall x and recursively calculate

\pi_{s+1}(x) \in \arg\max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} V_s(y) \Big\},

V_{s+1}(x) = r(x,\pi_{s+1}(x)) + \beta \sum_{y} P^{\pi_{s+1}(x)}_{xy} V_s(y),

for s=0,1,2,\dots This is called value iteration.

We can think of the two display equations above, respectively, as the policy improvement and policy evaluation steps. Notice that we don't really need to carry out the policy improvement step separately at each iteration: the two steps combine into the single update V_{s+1}(x) = \max_{a} \big\{ r(x,a) + \beta \sum_y P^a_{xy} V_s(y) \big\}. Notice also that the policy evaluation step evaluates just one action under the new policy \pi; afterwards the value is given by V_s(\hat{X}).
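As a concrete sketch, the combined update above can be run on a small tabular MDP. Everything below (the two-state, two-action transition matrices, rewards and discount factor) is made up purely for illustration:

```python
import numpy as np

# A made-up two-state, two-action MDP: P[a] is the transition matrix
# under action a, and r[x, a] is the reward for action a in state x.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
r = np.array([[1.0, 0.5],
              [0.0, 2.0]])
beta = 0.9

V = np.zeros(2)  # V_0(x) = 0 for all x
for s in range(1000):
    # V_{s+1}(x) = max_a { r(x,a) + beta * sum_y P^a_{xy} V_s(y) }
    Q = r + beta * np.einsum('axy,y->xa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # stop once the iterates settle
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy extracted from the final values
print(V, pi)
```

The stopping rule uses the sup-norm difference of successive iterates; since the update is a \beta-contraction, a small successive difference also bounds the distance to the fixed point.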

The following result shows that value iteration converges to the optimal value function.

Thrm 1. For positive programming, i.e. where all rewards are positive and the discount factor \beta belongs to the interval (0,1],

V_s(x) \nearrow V(x) \qquad \text{as } s\rightarrow\infty.

Here V(x) is the optimal value function.

The following lemma is the key property behind value iteration's convergence, as well as that of a number of other algorithms.

Lemma 1. For a reward function R(x), define

\mathcal L R(x) = \max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y) \Big\}.

If R(x) \geq \tilde{R}(x) for all x\in \mathcal X, then \mathcal L R(x) \geq \mathcal L \tilde{R}(x) for all x\in \mathcal X.

Proof. Clearly,

r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y) \geq r(x,a) + \beta \sum_{y} P^{a}_{xy} \tilde{R}(y).

Now maximize both sides over a\in\mathcal A. \square
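The monotonicity in Lemma 1 is easy to check numerically. The sketch below (with an invented two-state, two-action MDP; all numbers are assumptions for illustration) applies the operator \mathcal L to two reward functions ordered pointwise:

```python
import numpy as np

# Invented two-state, two-action MDP for illustration only.
P = np.array([[[0.7, 0.3],
               [0.4, 0.6]],
              [[0.1, 0.9],
               [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
beta = 0.95

def bellman(R):
    """(L R)(x) = max_a { r(x,a) + beta * sum_y P^a_{xy} R(y) }."""
    return (r + beta * np.einsum('axy,y->xa', P, R)).max(axis=1)

rng = np.random.default_rng(0)
R2 = rng.random(2)        # values in [0, 1)
R1 = R2 + rng.random(2)   # >= R2 pointwise by construction
print(bellman(R1) >= bellman(R2))  # monotonicity holds pointwise
```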

Proof of Thrm 1. Note that V_1(x) = \max_a r(x,a) \geq 0 = V_0(x). Now, since V_{s+1}(x) = \mathcal L V_s (x), repeatedly applying Lemma 1 to the inequality V_1(x) \geq V_0(x) gives that

V_{s+1}(x) \geq V_s(x) \qquad \text{for all } s.

Since V_s(x) is increasing, V_s(x) \nearrow V_\infty(x) for some function V_\infty. We must show that V_{\infty} is the optimal value function of the MDP.

Next note that V_s(\cdot) is the optimal value function for the finite time horizon MDP with rewards r(x,a) and duration s. So V(x)\geq V_s(x) and thus V(x) \geq V_{\infty}(x). Further, for any policy \Pi,

V_s(x) \geq \mathbb E_x\Big[ \sum_{t=0}^{s-1} \beta^t r(X_t, \Pi_t(X_t)) \Big].

Now take limits: V_{\infty}(x) \geq R(x,\Pi). Maximizing over \Pi shows that V_\infty(x) \geq V(x). So V_\infty(x) = V(x), as required. \square


Ex. A robot is placed on the following grid.

[Figure: the robot's grid world, with walls colored black and rewards marked at the end states]

The robot can choose to move left, right, up or down, provided it does not hit a wall; if it does, it stays in the same position. (Walls are colored black.) With probability 0.8, the robot does not follow its chosen action and instead makes a random action. The rewards for the different end states are colored above. Write a program that uses value iteration to find the optimal policy for the robot.

Ans. Notice that the robot does not just take the shortest route (i.e. some forward planning is required).

[Figure: the optimal policy found by value iteration]
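The exact grid in the figure is not recoverable from the text, so the sketch below substitutes the classic 4x3 grid world (one wall, a +1 and a -1 terminal state) as a stand-in; the layout, terminal rewards and the 0.8 noise model are all assumptions made for illustration:

```python
# Assumed stand-in grid: 3 rows x 4 columns, a wall at (1, 1),
# terminal rewards +1 at (0, 3) and -1 at (1, 3).
ROWS, COLS = 3, 4
WALLS = {(1, 1)}
TERMINAL = {(0, 3): 1.0, (1, 3): -1.0}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
NOISE = 0.8    # probability the robot makes a random action instead
beta = 0.9

def step(s, a):
    """Deterministic move; hitting a wall or the edge leaves the robot in place."""
    nxt = (s[0] + a[0], s[1] + a[1])
    if nxt in WALLS or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return s
    return nxt

def expected_value(V, s, a):
    """E[V(next state)]: intended move w.p. 1 - NOISE, uniform random otherwise."""
    ev = (1 - NOISE) * V[step(s, a)]
    ev += NOISE * sum(V[step(s, b)] for b in ACTIONS) / len(ACTIONS)
    return ev

states = [(i, j) for i in range(ROWS) for j in range(COLS) if (i, j) not in WALLS]
V = {s: 0.0 for s in states}
for _ in range(200):  # value iteration sweeps
    V = {s: TERMINAL[s] if s in TERMINAL
         else max(beta * expected_value(V, s, a) for a in ACTIONS)
         for s in states}

policy = {s: max(ACTIONS, key=lambda a: expected_value(V, s, a))
          for s in states if s not in TERMINAL}
print(policy)
```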

Policy Iteration

We consider a discounted program with rewards r(x,a) and discount factor \beta \in (0,1).

Def [Policy Iteration] Given a stationary policy \Pi, we may define a new (improved) stationary policy, {\mathcal I}\Pi, by choosing for each x the action {\mathcal I}\Pi (x) that solves the following maximization:

\max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y,\Pi) \Big\},

where R(x,\Pi) is the value function for policy \Pi. We then calculate R(x,\mathcal I \Pi). Recall that, by Thrm 2 in Markov Chains: A Quick Review, this solves the equation

R(x,\mathcal I \Pi) = r(x, \mathcal I \Pi(x)) + \beta \sum_{y} P^{\mathcal I \Pi(x)}_{xy} R(y, \mathcal I \Pi).

Policy iteration is the algorithm that takes

\Pi_{t+1} = \mathcal I \Pi_t, \qquad t=0,1,2,\dots,

starting from a stationary policy \Pi_0.
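A minimal sketch of this scheme, again on a made-up two-state, two-action MDP with known P and r; the evaluation step computes R(\cdot,\Pi) exactly by solving the linear system (I - \beta P_\Pi) R = r_\Pi:

```python
import numpy as np

# Invented two-state, two-action MDP for illustration.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
r = np.array([[1.0, 0.5],
              [0.0, 2.0]])
beta = 0.9
n = 2

pi = np.zeros(n, dtype=int)  # initial stationary policy Pi_0
while True:
    # Policy evaluation: R = r_pi + beta * P_pi R, solved exactly.
    P_pi = P[pi, np.arange(n), :]          # row x is P^{pi(x)}_{x, .}
    r_pi = r[np.arange(n), pi]
    R = np.linalg.solve(np.eye(n) - beta * P_pi, r_pi)
    # Policy improvement: argmax_a { r(x,a) + beta * sum_y P^a_{xy} R(y) }
    pi_new = (r + beta * np.einsum('axy,y->xa', P, R)).argmax(axis=1)
    if (pi_new == pi).all():               # fixed point: pi is optimal
        break
    pi = pi_new

print(pi, R)
```

Unlike value iteration, each loop evaluates the current policy exactly, so the algorithm terminates after finitely many improvement steps.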

Thrm 2. Under policy iteration,

R(x,\Pi_t) \leq R(x,\Pi_{t+1}) \qquad \text{for all } x \text{ and } t,

and, for bounded programming,

\lim_{t\rightarrow\infty} R(x,\Pi_t) = V(x).

Proof. By the optimality of \mathcal I \Pi with respect to \Pi, we have

r(x, \mathcal I \Pi(x)) + \beta \sum_{y} P^{\mathcal I \Pi(x)}_{xy} R(y,\Pi) \geq r(x, \Pi(x)) + \beta \sum_{y} P^{\Pi(x)}_{xy} R(y,\Pi) = R(x,\Pi).

Thus, from the last part of Thrm 2 in Markov Chains: A Quick Review, we know that R(x,\Pi) \leq R(x,{\mathcal I}\Pi ). This shows that policy iteration improves solutions. Now we must show that it improves towards the optimal solution.

First note that, combining the above inequality with R(y,\mathcal I \Pi) \geq R(y,\Pi),

R(x, \mathcal I \Pi) \geq r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y,\Pi) \qquad \text{for all } a\in\mathcal A.

We can use the above inequality to show that the following process is a supermartingale:

M_t = \sum_{s=0}^{t-1} \beta^s r(X_s, \pi^*(X_s)) + \beta^t R(X_t, \Pi_{T-t}), \qquad t=0,1,\dots,T,

where \pi^*(x) is the optimal policy.1 To see this, take expectations with respect to the optimal policy \pi^*:

\mathbb E[ M_{t+1} \,|\, \mathcal F_t ] - M_t = \beta^t \Big( r(X_t, \pi^*(X_t)) + \beta \sum_{y} P^{\pi^*(X_t)}_{X_t y} R(y, \Pi_{T-t-1}) - R(X_t, \Pi_{T-t}) \Big) \leq 0.

Since M_t is a supermartingale,

R(x,\Pi_T) = M_0 \geq \mathbb E M_T = \mathbb E\Big[ \sum_{t=0}^{T-1} \beta^t r(X_t, \pi^*(X_t)) \Big] + \beta^T\, \mathbb E\, R(X_T, \Pi_0),

and, for bounded programming, the last term vanishes as T\rightarrow\infty while the first converges to R(x,\pi^*) = V(x).

Therefore, as required, \lim_{T\rightarrow \infty} R(x,\Pi_T) \geq V(x); since R(x,\Pi_T) \leq V(x) always holds, the limit equals V(x). \square


Ex. Write a program that uses policy iteration to find the optimal policy for the robot below:

[Figure: the robot's grid world, as in the value iteration exercise]

Ans. [Figure: the optimal policy found by policy iteration]

  1. Note we are implicitly assuming an optimal stationary policy exists. We can remove this assumption by considering an \epsilon-optimal (non-stationary) policy. However, the proof is a little cleaner under our assumption.

