Algorithms for MDPs

For infinite time horizon MDPs, we cannot apply induction on Bellman's equation from some terminal time, as we could for finite time horizon MDPs. So we need algorithms to solve these MDPs.

At a high level, an algorithm for solving a Markov Decision Process (where the transitions P^{a}_{xy} are known) involves two steps:

  • (Policy Improvement) Here you take your current policy \pi_0 and find a new improved policy \pi, for instance by solving Bellman's equation:

    \pi(x) \in \arg\max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y,\pi_0) \Big\}.

  • (Policy Evaluation) Here you find the value of your policy, for instance by finding the reward function for policy \pi:

    R(x,\pi) = r(x,\pi(x)) + \beta \sum_{y} P^{\pi(x)}_{xy} R(y,\pi).

Value iteration

Value iteration provides an important practical scheme for approximating the solution of an infinite time horizon Markov decision process.

Def. Take V_0(x)=0 \forall x and recursively calculate

\pi_{s+1}(x) \in \arg\max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} V_s(y) \Big\},

V_{s+1}(x) = r(x,\pi_{s+1}(x)) + \beta \sum_{y} P^{\pi_{s+1}(x)}_{xy} V_s(y),

for s=0,1,2,\dots This is called value iteration.

We can think of the two display equations above, respectively, as the policy improvement and policy evaluation steps. Notice that we don't really need to carry out the policy improvement step separately at each iteration: the two steps combine into the single update V_{s+1}(x) = \max_{a} \big\{ r(x,a) + \beta \sum_y P^a_{xy} V_s(y) \big\}. Notice also that the policy evaluation step evaluates just one action under the new policy \pi; afterwards the value is given by V_s(\hat{X}).
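As a concrete sketch, the combined update above can be run on a small tabular MDP. Everything below (the two-state, two-action transition matrices, rewards and discount factor) is made up purely for illustration:

```python
import numpy as np

# A made-up two-state, two-action MDP: P[a] is the transition matrix
# under action a, and r[x, a] is the reward for action a in state x.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
r = np.array([[1.0, 0.5],
              [0.0, 2.0]])
beta = 0.9

V = np.zeros(2)  # V_0(x) = 0 for all x
for s in range(1000):
    # V_{s+1}(x) = max_a { r(x,a) + beta * sum_y P^a_{xy} V_s(y) }
    Q = r + beta * np.einsum('axy,y->xa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # stop once the iterates settle
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy extracted from the final values
print(V, pi)
```

The stopping rule uses the sup-norm difference of successive iterates; since the update is a \beta-contraction, a small successive difference also bounds the distance to the fixed point.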

The following result shows that value iteration converges to the optimal value function.

Thrm 1. For positive programming, i.e. where all rewards are positive and the discount factor \beta belongs to the interval (0,1],

V_s(x) \nearrow V(x) \qquad \text{as } s\rightarrow\infty.

Here V(x) is the optimal value function.

The following lemma is the key property behind value iteration's convergence, as well as that of a number of other algorithms.

Lemma 1. For a reward function R(x), define

\mathcal L R(x) = \max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y) \Big\}.

If R(x) \geq \tilde{R}(x) for all x\in \mathcal X, then \mathcal L R(x) \geq \mathcal L \tilde{R}(x) for all x\in \mathcal X.

Proof. Clearly,

r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y) \geq r(x,a) + \beta \sum_{y} P^{a}_{xy} \tilde{R}(y).

Now maximize both sides over a\in\mathcal A. \square
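The monotonicity in Lemma 1 is easy to check numerically. The sketch below (with an invented two-state, two-action MDP; all numbers are assumptions for illustration) applies the operator \mathcal L to two reward functions ordered pointwise:

```python
import numpy as np

# Invented two-state, two-action MDP for illustration only.
P = np.array([[[0.7, 0.3],
               [0.4, 0.6]],
              [[0.1, 0.9],
               [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
beta = 0.95

def bellman(R):
    """(L R)(x) = max_a { r(x,a) + beta * sum_y P^a_{xy} R(y) }."""
    return (r + beta * np.einsum('axy,y->xa', P, R)).max(axis=1)

rng = np.random.default_rng(0)
R2 = rng.random(2)        # values in [0, 1)
R1 = R2 + rng.random(2)   # >= R2 pointwise by construction
print(bellman(R1) >= bellman(R2))  # monotonicity holds pointwise
```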

Proof of Thrm 1. Note that V_1(x) = \max_a r(x,a) \geq 0 = V_0(x). Now, since V_{s+1}(x) = \mathcal L V_s (x), repeatedly applying Lemma 1 to the inequality V_1(x) \geq V_0(x) gives that

V_{s+1}(x) \geq V_s(x) \qquad \text{for all } s.

Since V_s(x) is increasing, V_s(x) \nearrow V_\infty(x) for some function V_\infty. We must show that V_{\infty} is the optimal value function of the MDP.

Next note that V_s(\cdot) is the optimal value function for the finite time horizon MDP with rewards r(x,a) and duration s. So V(x)\geq V_s(x) and thus V(x) \geq V_{\infty}(x). Further, for any policy \Pi,

V_s(x) \geq \mathbb E_x\Big[ \sum_{t=0}^{s-1} \beta^t r(X_t, \Pi_t(X_t)) \Big].

Now take limits: V_{\infty}(x) \geq R(x,\Pi). Maximizing over \Pi shows that V_\infty(x) \geq V(x). So V_\infty(x) = V(x), as required. \square


Ex. A robot is placed on the following grid.

[Figure: the robot's grid world, with walls colored black and rewards marked at the end states]

The robot can choose to move left, right, up or down, provided it does not hit a wall; if it does, it stays in the same position. (Walls are colored black.) With probability 0.8, the robot does not follow its chosen action and instead makes a random action. The rewards for the different end states are colored above. Write a program that uses value iteration to find the optimal policy for the robot.

Ans. Notice that the robot does not just take the shortest route (i.e. some forward planning is required).

[Figure: the optimal policy found by value iteration]
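The exact grid in the figure is not recoverable from the text, so the sketch below substitutes the classic 4x3 grid world (one wall, a +1 and a -1 terminal state) as a stand-in; the layout, terminal rewards and the 0.8 noise model are all assumptions made for illustration:

```python
# Assumed stand-in grid: 3 rows x 4 columns, a wall at (1, 1),
# terminal rewards +1 at (0, 3) and -1 at (1, 3).
ROWS, COLS = 3, 4
WALLS = {(1, 1)}
TERMINAL = {(0, 3): 1.0, (1, 3): -1.0}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
NOISE = 0.8    # probability the robot makes a random action instead
beta = 0.9

def step(s, a):
    """Deterministic move; hitting a wall or the edge leaves the robot in place."""
    nxt = (s[0] + a[0], s[1] + a[1])
    if nxt in WALLS or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return s
    return nxt

def expected_value(V, s, a):
    """E[V(next state)]: intended move w.p. 1 - NOISE, uniform random otherwise."""
    ev = (1 - NOISE) * V[step(s, a)]
    ev += NOISE * sum(V[step(s, b)] for b in ACTIONS) / len(ACTIONS)
    return ev

states = [(i, j) for i in range(ROWS) for j in range(COLS) if (i, j) not in WALLS]
V = {s: 0.0 for s in states}
for _ in range(200):  # value iteration sweeps
    V = {s: TERMINAL[s] if s in TERMINAL
         else max(beta * expected_value(V, s, a) for a in ACTIONS)
         for s in states}

policy = {s: max(ACTIONS, key=lambda a: expected_value(V, s, a))
          for s in states if s not in TERMINAL}
print(policy)
```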

Policy Iteration

We consider a discounted program with rewards r(x,a) and discount factor \beta \in (0,1).

Def [Policy Iteration] Given a stationary policy \Pi, we may define a new (improved) stationary policy, {\mathcal I}\Pi, by choosing for each x the action {\mathcal I}\Pi (x) that solves the following maximization:

\max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y,\Pi) \Big\},

where R(x,\Pi) is the value function for policy \Pi. We then calculate R(x,\mathcal I \Pi). Recall that, by Thrm 2 in Markov Chains: A Quick Review, this solves the equation

R(x,\mathcal I \Pi) = r(x, \mathcal I \Pi(x)) + \beta \sum_{y} P^{\mathcal I \Pi(x)}_{xy} R(y, \mathcal I \Pi).

Policy iteration is the algorithm that takes

\Pi_{t+1} = \mathcal I \Pi_t, \qquad t=0,1,2,\dots,

starting from a stationary policy \Pi_0.
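A minimal sketch of this scheme, again on a made-up two-state, two-action MDP with known P and r; the evaluation step computes R(\cdot,\Pi) exactly by solving the linear system (I - \beta P_\Pi) R = r_\Pi:

```python
import numpy as np

# Invented two-state, two-action MDP for illustration.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
r = np.array([[1.0, 0.5],
              [0.0, 2.0]])
beta = 0.9
n = 2

pi = np.zeros(n, dtype=int)  # initial stationary policy Pi_0
while True:
    # Policy evaluation: R = r_pi + beta * P_pi R, solved exactly.
    P_pi = P[pi, np.arange(n), :]          # row x is P^{pi(x)}_{x, .}
    r_pi = r[np.arange(n), pi]
    R = np.linalg.solve(np.eye(n) - beta * P_pi, r_pi)
    # Policy improvement: argmax_a { r(x,a) + beta * sum_y P^a_{xy} R(y) }
    pi_new = (r + beta * np.einsum('axy,y->xa', P, R)).argmax(axis=1)
    if (pi_new == pi).all():               # fixed point: pi is optimal
        break
    pi = pi_new

print(pi, R)
```

Unlike value iteration, each loop evaluates the current policy exactly, so the algorithm terminates after finitely many improvement steps.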

Thrm 2. Under policy iteration,

R(x,\Pi_t) \leq R(x,\Pi_{t+1}) \qquad \text{for all } x \text{ and } t,

and, for bounded programming,

\lim_{t\rightarrow\infty} R(x,\Pi_t) = V(x).

Proof. By the optimality of \mathcal I \Pi with respect to \Pi, we have

r(x, \mathcal I \Pi(x)) + \beta \sum_{y} P^{\mathcal I \Pi(x)}_{xy} R(y,\Pi) \geq r(x, \Pi(x)) + \beta \sum_{y} P^{\Pi(x)}_{xy} R(y,\Pi) = R(x,\Pi).

Thus, from the last part of Thrm 2 in Markov Chains: A Quick Review, we know that R(x,\Pi) \leq R(x,{\mathcal I}\Pi ). This shows that policy iteration improves solutions. Now we must show that it improves towards the optimal solution.

First note that, combining the above inequality with R(y,\mathcal I \Pi) \geq R(y,\Pi),

R(x, \mathcal I \Pi) \geq r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y,\Pi) \qquad \text{for all } a\in\mathcal A.

We can use the above inequality to show that the following process is a supermartingale:

M_t = \sum_{s=0}^{t-1} \beta^s r(X_s, \pi^*(X_s)) + \beta^t R(X_t, \Pi_{T-t}), \qquad t=0,1,\dots,T,

where \pi^*(x) is the optimal policy.1 To see this, take expectations with respect to the optimal policy \pi^*:

\mathbb E[ M_{t+1} \,|\, \mathcal F_t ] - M_t = \beta^t \Big( r(X_t, \pi^*(X_t)) + \beta \sum_{y} P^{\pi^*(X_t)}_{X_t y} R(y, \Pi_{T-t-1}) - R(X_t, \Pi_{T-t}) \Big) \leq 0.

Since M_t is a supermartingale,

R(x,\Pi_T) = M_0 \geq \mathbb E M_T = \mathbb E\Big[ \sum_{t=0}^{T-1} \beta^t r(X_t, \pi^*(X_t)) \Big] + \beta^T\, \mathbb E\, R(X_T, \Pi_0),

and, for bounded programming, the last term vanishes as T\rightarrow\infty while the first converges to R(x,\pi^*) = V(x).

Therefore, as required, \lim_{T\rightarrow \infty} R(x,\Pi_T) \geq V(x); since R(x,\Pi_T) \leq V(x) always holds, the limit equals V(x). \square


Ex. Write a program that uses policy iteration to find the optimal policy for the robot below:

[Figure: the robot's grid world, as in the value iteration exercise]

Ans. [Figure: the optimal policy found by policy iteration]

  1. Note we are implicitly assuming an optimal stationary policy exists. We can remove this assumption by considering an \epsilon-optimal (non-stationary) policy. However, the proof is a little cleaner under our assumption.

