For infinite time MDPs, we cannot apply to induction on Bellman’s equation from some initial state – like we could for finite time MDP. So we need some algorithms to solve MDPs.

At a high level, for a Markov Decision Processes (where the transitions are known), an algorithm solving a Markov Decision Process involves two steps:

- (Policy Improvement) Here you take your initial policy and find a new improved new policy , for instance by solving Bellman’s equation:
- (Policy Evaluation) Here you find the value of your policy. For instance by finding the reward function for policy :

## Value iteration

Value iteration provides an important practical scheme for approximating the solution of an infinite time horizon Markov decision process.

**Def.** Take and recursively calculate

for this is called *value iteration*.

We can think of the two display equations above, respectively, as the policy improvement and policy evaluation steps. Notice, that we don’t really need to do the policy improvement step to do each iteration. Notice the policy evaluation step evalutes one action under the new policy afterwards the value is .

The following result shows that Value Iteration converges to the optimal policy.

**Thrm 1.** For positive programming, i.e. where all rewards are positive and the discount factor belongs to the interval , then

Here is the optimal value function.

The following lemma is the key property for value iterations convergence, as well as a number of other algorithms.

**Lemma 1.** For reward function define

Show that if for all then for all

**Proof.** Clearly,

Now maximize both sides over .

**Proof of Thrm 1.** Note that Now, since , repeatedly applying Lemma [IDP:Cont_0] to the inequality gives that

Since is increasing for some function . We must show that is the optimal value function from the MDP.

Next note that is the optimal value function for the finite time MDP with rewards and duration . So and thus . Further, for any policy ,

Now take limits . Now maximize over to see that . So as required.

**Ex.** A robot is placed on the following grid.

The robot can chose the action to move left, right, up or down provided it does not hit a wall, in this case it stays in the same position. (Walls are colored black.) With probability 0.8, the robot does not follow its chosen action and instead makes a random action. The rewards for the different end states are colored above. Write a program that uses, Value Iteration to find the optimal policy for the robot.

**Ans.** Notice that the robot does not just take the shortest root. (I.e. some forward planning is required)

## Policy Iteration

We consider a discounted program with rewards and discount factor .

**Def** [Policy Iteration] Given the stationary policy , we may define a new (improved) stationary policy, , by choosing for each the action that solves the following maximization

where is the value function for policy . We then calculate . Recall that, by Thrm 2 in Markov Chains: A Quick Review, this solves the equation

Policy iteration is the algorithm that takes

Starting from a stationary policy .

**Thrm 2.** Under Policy Iteration

and, for bounded programming,

By the optimality of with respect to we have

Thus from the last part of Thrm 2 in Markov Chains: A Quick Review, we know that . This show that Policy iteration improves solutions. Now we must show it improves to the optimal solution.

First note that

We can use the above inequality to show that the following process is a supermartingale

where is the optimal policy.^{1} To see taking expectations with respect to the optimal policy gives

Since is a supermartingale:

Therefore, as required,

**Ex.** Write a program that uses, Policy iteration to find the optimal policy for the robot below:

**Ans.**

- Note we are implicity assuming an optimal stationary policy exists. We can remove this assumption by considering a -optimal (non-stationary) policy. However, the proof is a little cleaner under our assumption.↩