Algorithms for MDPs

For infinite time horizon MDPs, we cannot apply induction on Bellman's equation from some terminal time, as we could for finite time horizon MDPs. So we need algorithms to solve these MDPs.

At a high level, an algorithm for solving a Markov Decision Process (where the transitions P^{a}_{xy} are known) involves two steps:

  • (Policy Improvement) Here you take your current policy \pi_0 and find a new improved policy \pi, for instance by solving Bellman's equation:

    \pi(x) \in \arg\max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y,\pi_0) \Big\}.

  • (Policy Evaluation) Here you find the value of your policy, for instance by finding the reward function for policy \pi:

    R(x,\pi) = r(x,\pi(x)) + \beta \sum_{y} P^{\pi(x)}_{xy} R(y,\pi).

Value iteration

Value iteration provides an important practical scheme for approximating the solution of an infinite time horizon Markov decision process.

Def. Take V_0(x)=0 \forall x and recursively calculate

\pi_{s+1}(x) \in \arg\max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} V_s(y) \Big\},

V_{s+1}(x) = r(x,\pi_{s+1}(x)) + \beta \sum_{y} P^{\pi_{s+1}(x)}_{xy} V_s(y),

for s=0,1,2,\dots This is called value iteration.

We can think of the two display equations above, respectively, as the policy improvement and policy evaluation steps. Notice that we don't really need to carry out the policy improvement step separately at each iteration: the two steps combine into the single update V_{s+1}(x) = \max_{a} \big\{ r(x,a) + \beta \sum_y P^a_{xy} V_s(y) \big\}. Notice also that the policy evaluation step evaluates just one action under the new policy \pi; afterwards the value is given by V_s(\hat{X}).
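As a concrete sketch, the combined update above can be run on a small tabular MDP. Everything below (the two-state, two-action transition matrices, rewards and discount factor) is made up purely for illustration:

```python
import numpy as np

# A made-up two-state, two-action MDP: P[a] is the transition matrix
# under action a, and r[x, a] is the reward for action a in state x.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
r = np.array([[1.0, 0.5],
              [0.0, 2.0]])
beta = 0.9

V = np.zeros(2)  # V_0(x) = 0 for all x
for s in range(1000):
    # V_{s+1}(x) = max_a { r(x,a) + beta * sum_y P^a_{xy} V_s(y) }
    Q = r + beta * np.einsum('axy,y->xa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # stop once the iterates settle
        V = V_new
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy extracted from the final values
print(V, pi)
```

The stopping rule uses the sup-norm difference of successive iterates; since the update is a \beta-contraction, a small successive difference also bounds the distance to the fixed point.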

The following result shows that value iteration converges to the optimal value function.

Thrm 1. For positive programming, i.e. where all rewards are positive and the discount factor \beta belongs to the interval (0,1],

V_s(x) \nearrow V(x) \qquad \text{as } s\rightarrow\infty.

Here V(x) is the optimal value function.

The following lemma is the key property behind value iteration's convergence, as well as that of a number of other algorithms.

Lemma 1. For a reward function R(x), define

\mathcal L R(x) = \max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y) \Big\}.

If R(x) \geq \tilde{R}(x) for all x\in \mathcal X, then \mathcal L R(x) \geq \mathcal L \tilde{R}(x) for all x\in \mathcal X.

Proof. Clearly,

r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y) \geq r(x,a) + \beta \sum_{y} P^{a}_{xy} \tilde{R}(y).

Now maximize both sides over a\in\mathcal A. \square
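The monotonicity in Lemma 1 is easy to check numerically. The sketch below (with an invented two-state, two-action MDP; all numbers are assumptions for illustration) applies the operator \mathcal L to two reward functions ordered pointwise:

```python
import numpy as np

# Invented two-state, two-action MDP for illustration only.
P = np.array([[[0.7, 0.3],
               [0.4, 0.6]],
              [[0.1, 0.9],
               [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
beta = 0.95

def bellman(R):
    """(L R)(x) = max_a { r(x,a) + beta * sum_y P^a_{xy} R(y) }."""
    return (r + beta * np.einsum('axy,y->xa', P, R)).max(axis=1)

rng = np.random.default_rng(0)
R2 = rng.random(2)        # values in [0, 1)
R1 = R2 + rng.random(2)   # >= R2 pointwise by construction
print(bellman(R1) >= bellman(R2))  # monotonicity holds pointwise
```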

Proof of Thrm 1. Note that V_1(x) = \max_a r(x,a) \geq 0 = V_0(x). Now, since V_{s+1}(x) = \mathcal L V_s (x), repeatedly applying Lemma 1 to the inequality V_1(x) \geq V_0(x) gives that

V_{s+1}(x) \geq V_s(x) \qquad \text{for all } s.

Since V_s(x) is increasing, V_s(x) \nearrow V_\infty(x) for some function V_\infty. We must show that V_{\infty} is the optimal value function of the MDP.

Next note that V_s(\cdot) is the optimal value function for the finite time horizon MDP with rewards r(x,a) and duration s. So V(x)\geq V_s(x) and thus V(x) \geq V_{\infty}(x). Further, for any policy \Pi,

V_s(x) \geq \mathbb E_x\Big[ \sum_{t=0}^{s-1} \beta^t r(X_t, \Pi_t(X_t)) \Big].

Now take limits: V_{\infty}(x) \geq R(x,\Pi). Maximizing over \Pi shows that V_\infty(x) \geq V(x). So V_\infty(x) = V(x), as required. \square


Ex. A robot is placed on the following grid.

[Figure: the robot's grid world, with walls colored black and rewards marked at the end states]

The robot can choose to move left, right, up or down, provided it does not hit a wall; if it does, it stays in the same position. (Walls are colored black.) With probability 0.8, the robot does not follow its chosen action and instead makes a random action. The rewards for the different end states are colored above. Write a program that uses value iteration to find the optimal policy for the robot.

Ans. Notice that the robot does not just take the shortest route (i.e. some forward planning is required).

[Figure: the optimal policy found by value iteration]
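The exact grid in the figure is not recoverable from the text, so the sketch below substitutes the classic 4x3 grid world (one wall, a +1 and a -1 terminal state) as a stand-in; the layout, terminal rewards and the 0.8 noise model are all assumptions made for illustration:

```python
# Assumed stand-in grid: 3 rows x 4 columns, a wall at (1, 1),
# terminal rewards +1 at (0, 3) and -1 at (1, 3).
ROWS, COLS = 3, 4
WALLS = {(1, 1)}
TERMINAL = {(0, 3): 1.0, (1, 3): -1.0}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
NOISE = 0.8    # probability the robot makes a random action instead
beta = 0.9

def step(s, a):
    """Deterministic move; hitting a wall or the edge leaves the robot in place."""
    nxt = (s[0] + a[0], s[1] + a[1])
    if nxt in WALLS or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return s
    return nxt

def expected_value(V, s, a):
    """E[V(next state)]: intended move w.p. 1 - NOISE, uniform random otherwise."""
    ev = (1 - NOISE) * V[step(s, a)]
    ev += NOISE * sum(V[step(s, b)] for b in ACTIONS) / len(ACTIONS)
    return ev

states = [(i, j) for i in range(ROWS) for j in range(COLS) if (i, j) not in WALLS]
V = {s: 0.0 for s in states}
for _ in range(200):  # value iteration sweeps
    V = {s: TERMINAL[s] if s in TERMINAL
         else max(beta * expected_value(V, s, a) for a in ACTIONS)
         for s in states}

policy = {s: max(ACTIONS, key=lambda a: expected_value(V, s, a))
          for s in states if s not in TERMINAL}
print(policy)
```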

Policy Iteration

We consider a discounted program with rewards r(x,a) and discount factor \beta \in (0,1).

Def [Policy Iteration] Given a stationary policy \Pi, we may define a new (improved) stationary policy, {\mathcal I}\Pi, by choosing for each x the action {\mathcal I}\Pi (x) that solves the following maximization:

\max_{a\in\mathcal A} \Big\{ r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y,\Pi) \Big\},

where R(x,\Pi) is the value function for policy \Pi. We then calculate R(x,\mathcal I \Pi). Recall that, by Thrm 2 in Markov Chains: A Quick Review, this solves the equation

R(x,\mathcal I \Pi) = r(x, \mathcal I \Pi(x)) + \beta \sum_{y} P^{\mathcal I \Pi(x)}_{xy} R(y, \mathcal I \Pi).

Policy iteration is the algorithm that takes

\Pi_{t+1} = \mathcal I \Pi_t, \qquad t=0,1,2,\dots,

starting from a stationary policy \Pi_0.
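A minimal sketch of this scheme, again on a made-up two-state, two-action MDP with known P and r; the evaluation step computes R(\cdot,\Pi) exactly by solving the linear system (I - \beta P_\Pi) R = r_\Pi:

```python
import numpy as np

# Invented two-state, two-action MDP for illustration.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
r = np.array([[1.0, 0.5],
              [0.0, 2.0]])
beta = 0.9
n = 2

pi = np.zeros(n, dtype=int)  # initial stationary policy Pi_0
while True:
    # Policy evaluation: R = r_pi + beta * P_pi R, solved exactly.
    P_pi = P[pi, np.arange(n), :]          # row x is P^{pi(x)}_{x, .}
    r_pi = r[np.arange(n), pi]
    R = np.linalg.solve(np.eye(n) - beta * P_pi, r_pi)
    # Policy improvement: argmax_a { r(x,a) + beta * sum_y P^a_{xy} R(y) }
    pi_new = (r + beta * np.einsum('axy,y->xa', P, R)).argmax(axis=1)
    if (pi_new == pi).all():               # fixed point: pi is optimal
        break
    pi = pi_new

print(pi, R)
```

Unlike value iteration, each loop evaluates the current policy exactly, so the algorithm terminates after finitely many improvement steps.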

Thrm 2. Under policy iteration,

R(x,\Pi_t) \leq R(x,\Pi_{t+1}) \qquad \text{for all } x \text{ and } t,

and, for bounded programming,

\lim_{t\rightarrow\infty} R(x,\Pi_t) = V(x).

Proof. By the optimality of \mathcal I \Pi with respect to \Pi, we have

r(x, \mathcal I \Pi(x)) + \beta \sum_{y} P^{\mathcal I \Pi(x)}_{xy} R(y,\Pi) \geq r(x, \Pi(x)) + \beta \sum_{y} P^{\Pi(x)}_{xy} R(y,\Pi) = R(x,\Pi).

Thus, from the last part of Thrm 2 in Markov Chains: A Quick Review, we know that R(x,\Pi) \leq R(x,{\mathcal I}\Pi ). This shows that policy iteration improves solutions. Now we must show that it improves towards the optimal solution.

First note that, combining the above inequality with R(y,\mathcal I \Pi) \geq R(y,\Pi),

R(x, \mathcal I \Pi) \geq r(x,a) + \beta \sum_{y} P^{a}_{xy} R(y,\Pi) \qquad \text{for all } a\in\mathcal A.

We can use the above inequality to show that the following process is a supermartingale:

M_t = \sum_{s=0}^{t-1} \beta^s r(X_s, \pi^*(X_s)) + \beta^t R(X_t, \Pi_{T-t}), \qquad t=0,1,\dots,T,

where \pi^*(x) is the optimal policy.1 To see this, take expectations with respect to the optimal policy \pi^*:

\mathbb E[ M_{t+1} \,|\, \mathcal F_t ] - M_t = \beta^t \Big( r(X_t, \pi^*(X_t)) + \beta \sum_{y} P^{\pi^*(X_t)}_{X_t y} R(y, \Pi_{T-t-1}) - R(X_t, \Pi_{T-t}) \Big) \leq 0.

Since M_t is a supermartingale,

R(x,\Pi_T) = M_0 \geq \mathbb E M_T = \mathbb E\Big[ \sum_{t=0}^{T-1} \beta^t r(X_t, \pi^*(X_t)) \Big] + \beta^T\, \mathbb E\, R(X_T, \Pi_0),

and, for bounded programming, the last term vanishes as T\rightarrow\infty while the first converges to R(x,\pi^*) = V(x).

Therefore, as required, \lim_{T\rightarrow \infty} R(x,\Pi_T) \geq V(x); since R(x,\Pi_T) \leq V(x) always holds, the limit equals V(x). \square


Ex. Write a program that uses policy iteration to find the optimal policy for the robot below:

[Figure: the robot's grid world, as in the value iteration exercise]

Ans. [Figure: the optimal policy found by policy iteration]

  1. Note we are implicitly assuming an optimal stationary policy exists. We can remove this assumption by considering an \epsilon-optimal (non-stationary) policy. However, the proof is a little cleaner under our assumption.

