- High level idea: Policy Improvement and Policy Evaluation.
- Value Iteration; Policy Iteration.
- Temporal Differences; Q-factors.
For infinite time MDPs, we cannot apply induction to Bellman's equation from some initial state, as we could for a finite time MDP. So we need algorithms to solve these MDPs.
At a high level, an algorithm for solving a Markov decision process (where the transition probabilities are known) involves two steps:
- (Policy Improvement) Here you take your current policy $\pi$ and find a new, improved policy $\tilde\pi$, for instance by solving Bellman's equation:
$$\tilde\pi(x) \in \arg\max_{a} \Big\{ r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a) R(\hat x, \pi) \Big\}.$$
- (Policy Evaluation) Here you find the value of your policy, for instance by finding the reward function for policy $\tilde\pi$:
$$R(x,\tilde\pi) = r(x,\tilde\pi(x)) + \beta \sum_{\hat x} P_{x\hat x}(\tilde\pi(x))\, R(\hat x, \tilde\pi).$$

Throughout, $r(x,a)$ denotes the reward for choosing action $a$ in state $x$, $P_{x\hat x}(a)$ the probability of moving from $x$ to $\hat x$ under action $a$, $\beta$ the discount factor, $R(x,\pi)$ the reward function of policy $\pi$ started from $x$, and $V(x)$ the optimal value function.
Value iteration
Value iteration provides an important practical scheme for approximating the solution of an infinite time horizon Markov decision process.
Def. [Value iteration] Take $V_0(x) = 0$ and recursively calculate
$$\pi_{n+1}(x) \in \arg\max_{a} \Big\{ r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a) V_n(\hat x) \Big\},$$
$$V_{n+1}(x) = r(x,\pi_{n+1}(x)) + \beta \sum_{\hat x} P_{x\hat x}(\pi_{n+1}(x))\, V_n(\hat x),$$
for $n = 0, 1, 2, \dots$; this is called value iteration.
We can think of the two display equations above, respectively, as the policy improvement and policy evaluation steps. Notice that we don't really need the policy improvement step to carry out each iteration: the two displays combine into the single update
$$V_{n+1}(x) = \max_{a} \Big\{ r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a) V_n(\hat x) \Big\}.$$
Notice also that the policy evaluation step evaluates only one step under the new policy; afterwards the value used is $V_n$.
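As a quick illustration, here is a minimal sketch of this update for a generic finite MDP. The array names `P[x, a, x_hat]`, `r[x, a]` and the random example below are our own illustration, not notation from the notes, and the discount $\beta = 0.9$ is an arbitrary choice:

```python
import numpy as np

def value_iteration(P, r, beta=0.9, tol=1e-8):
    """Iterate V_{n+1}(x) = max_a { r(x,a) + beta * sum_x' P[x,a,x'] V_n(x') }."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                      # V_0 = 0
    while True:
        QV = r + beta * P @ V                   # one-step lookahead for every (x, a)
        V_new = QV.max(axis=1)                  # evaluate one step of the greedy action
        if np.max(np.abs(V_new - V)) < tol:     # stop when the update is negligible
            break
        V = V_new
    policy = QV.argmax(axis=1)                  # policy improvement: greedy actions
    return V_new, policy

# Tiny random example.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))      # P[x, a, :] is a distribution over next states
r = rng.uniform(size=(4, 2))
V, policy = value_iteration(P, r)
print(V, policy)
```

The combined update is applied until successive value functions are (numerically) indistinguishable; the greedy policy is then read off from the final lookahead values.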
Similarly, we can define value iteration for a minimization problem with costs $c(x,a)$:
$$C_{n+1}(x) = \min_{a} \Big\{ c(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a) C_n(\hat x) \Big\}, \qquad C_0(x) = 0.$$
Each iteration of value iteration improves the solution: under positive programming $V_{n+1}(x) \ge V_n(x)$, while under negative programming $V_{n+1}(x) \le V_n(x)$, as the following exercises show.
Ex 1. For reward function $r$ define the operator
$$\mathcal{L}V(x) := \max_{a} \Big\{ r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a) V(\hat x) \Big\}.$$
Show that if $V(x) \le W(x)$ for all $x$, then $\mathcal{L}V(x) \le \mathcal{L}W(x)$ for all $x$.
Ans 1. For any action $a$,
$$r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a) V(\hat x) \le r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a) W(\hat x).$$
Now maximize both sides over $a$.
Ex 2. [Continued] Show that for value iteration with positive programming, $V_{n+1}(x) \ge V_n(x)$ for all $x$ and all $n$.
Ans 2. Since rewards are non-negative and $V_0 \equiv 0$, we have $V_1(x) \ge V_0(x)$ for all $x$. Now repeatedly apply [1].
Ex 3. [Continued][IDP:Cont_2] Show that for value iteration with negative programming, $V_{n+1}(x) \le V_n(x)$ for all $x$ and all $n$.
Ans 3. The idea is identical to [1-2]: now rewards are non-positive, so $V_1(x) \le V_0(x) = 0$, and repeatedly applying [1] gives the result.
So we know that value iteration improves the value function at each step. We now need to argue that value iteration reaches the optimal solution.
Ex 4. Show that for discounted programming, $V_n(x) \to V(x)$ as $n \to \infty$, for all $x$.
Ans 4. Note that $V_n$ is the optimal solution to the finite time MDP with $n$ steps; recall Def [MDP:Def]. Thus, since under any policy the rewards received after time $n$ contribute at most $\beta^n \sup_{x,a}|r(x,a)|/(1-\beta)$ in absolute value,
$$\big| V(x) - V_n(x) \big| \le \frac{\beta^n}{1-\beta} \sup_{x,a} |r(x,a)|.$$
Now let $n \to \infty$.
Ex 5. Show that for positive programming, $V_n(x) \nearrow V(x)$ as $n \to \infty$.
Ans 5. Take any policy $\pi$. Then, since $V_n$ is the optimal value of the $n$-step problem,
$$V_n(x) \ge \mathbb{E}_\pi\Big[ \sum_{t=0}^{n-1} \beta^t r(x_t, a_t) \,\Big|\, x_0 = x \Big].$$
Now take limits $n \to \infty$; by monotone convergence (rewards are non-negative) the right-hand side converges to $R(x,\pi)$. Now maximize over $\pi$ to get $\lim_n V_n(x) \ge V(x)$; since also $V_n(x) \le V(x)$ for every $n$, the limit equals $V(x)$.
Ex 6. Show that for negative programming with a finite number of actions, $V_n(x) \searrow V(x)$ as $n \to \infty$.
Ans 6. Same idea as [5].
Ex 7. [GridWorld] A robot is placed on the following grid.
The robot can choose to move left, right, up or down, provided it does not hit a wall, in which case it stays in the same position. (Walls are colored black.) With probability 0.8, the robot does not follow its chosen action and instead makes a random action. The rewards for the different end states are colored above. Write a program that uses value iteration to find the optimal policy for the robot.
Ans 7. Notice that the robot does not just take the shortest route (i.e. some forward planning is required).
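A minimal value-iteration sketch for this problem. Since the grid figure is not reproduced here, the layout below (a 4x3 grid with +1 and -1 end states and one wall) is only an assumed example; the discount $\beta = 0.9$ and the reading of "random action" as uniform over the four moves are also our assumptions. End states pay their reward once on entry and are then absorbing.

```python
# Assumed layout (the figure is not reproduced): '#' = wall,
# '+' / '-' = end states with reward +1 / -1, '.' = free cell.
GRID = ["...+",
        ".#.-",
        "...."]
BETA, NOISE = 0.9, 0.8                        # assumed discount; prob. of a random action
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

rows, cols = len(GRID), len(GRID[0])
cells = [(i, j) for i in range(rows) for j in range(cols) if GRID[i][j] != "#"]
reward = {c: {"+": 1.0, "-": -1.0}[GRID[c[0]][c[1]]]
          for c in cells if GRID[c[0]][c[1]] in "+-"}

def step(c, a):
    """Deterministic move; hitting a wall or the edge leaves the robot in place."""
    i, j = c[0] + ACTIONS[a][0], c[1] + ACTIONS[a][1]
    return (i, j) if 0 <= i < rows and 0 <= j < cols and GRID[i][j] != "#" else c

def next_dist(c, a):
    """Chosen action with prob. 1 - NOISE, otherwise a uniformly random action."""
    dist = {}
    for b, w in [(a, 1 - NOISE)] + [(b, NOISE / len(ACTIONS)) for b in ACTIONS]:
        n = step(c, b)
        dist[n] = dist.get(n, 0.0) + w
    return dist

def lookahead(c, a, V):
    """Expected reward on entering the next cell plus beta times its value."""
    return sum(w * (reward.get(n, 0.0) + BETA * V[n]) for n, w in next_dist(c, a).items())

# Value iteration: V_{k+1}(x) = max_a E[ r + beta * V_k(next) ]; end states are absorbing.
V = {c: 0.0 for c in cells}
for _ in range(200):
    V = {c: 0.0 if c in reward else max(lookahead(c, a, V) for a in ACTIONS) for c in cells}

policy = {c: max(ACTIONS, key=lambda a: lookahead(c, a, V)) for c in cells if c not in reward}
for i in range(rows):
    print("".join(policy.get((i, j), GRID[i][j]) for j in range(cols)))
```

The printout shows the greedy action in each free cell; with the high noise level, the chosen route trades path length against the risk of being pushed into the bad end state.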
Policy Iteration
We consider a discounted program with rewards $r(x,a)$ and discount factor $\beta \in (0,1)$.
Def 2. [Policy Iteration] Given a stationary policy $\pi$, we may define a new (improved) stationary policy, $\tilde\pi$, by choosing for each $x$ the action $\tilde\pi(x)$ that solves the following maximization
$$\tilde\pi(x) \in \arg\max_{a} \Big\{ r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a)\, R(\hat x, \pi) \Big\},$$
where $R(x,\pi)$ is the value function for policy $\pi$. We then calculate $R(x,\tilde\pi)$. Recall that, for each $x$, this solves the equations
$$R(x,\tilde\pi) = r(x,\tilde\pi(x)) + \beta \sum_{\hat x} P_{x\hat x}(\tilde\pi(x))\, R(\hat x, \tilde\pi).$$
Policy iteration is the algorithm that iterates this map,
$$\pi_0 \longrightarrow \pi_1 \longrightarrow \pi_2 \longrightarrow \cdots, \qquad \pi_{k+1} := \tilde\pi_k,$$
starting from some arbitrary stationary policy $\pi_0$.
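Since the policy-evaluation equations are linear in $R(\cdot,\tilde\pi)$, they can be solved directly. A minimal sketch, assuming the transitions and rewards are stored as arrays `P[x, a, x_hat]` and `r[x, a]` (our own notation, not the notes'):

```python
import numpy as np

def evaluate_policy(P, r, policy, beta=0.9):
    """Solve R(x) = r(x, pi(x)) + beta * sum_x' P[x, pi(x), x'] R(x'),
    i.e. the linear system (I - beta * P_pi) R = r_pi."""
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]      # n x n transition matrix under the policy
    r_pi = r[np.arange(n), policy]      # one-step rewards under the policy
    return np.linalg.solve(np.eye(n) - beta * P_pi, r_pi)
```

Policy improvement then replaces `policy[x]` by the maximizing action of $r(x,a) + \beta\sum_{\hat x}P_{x\hat x}(a)R(\hat x,\pi)$, e.g. `(r + beta * P @ R).argmax(axis=1)`.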
Thrm 1. Under policy iteration,
$$R(x, \pi_{k+1}) \ge R(x, \pi_k) \quad \text{for all } x \text{ and } k,$$
and, for bounded programming,
$$\lim_{k\to\infty} R(x, \pi_k) = V(x).$$
Ex 8. [Proof of Thrm 1] Show that, under policy iteration,
$$R(x, \tilde\pi) \ge R(x, \pi) \quad \text{for all } x.$$
(Hint: this requires Ex 8 & Ex 10 from Markov Chains.)
Ans 8. By Ex 8 from Markov Chains, and the optimality of $\tilde\pi(x)$ with respect to the maximization defining it, we have
$$R(x,\pi) = r(x,\pi(x)) + \beta \sum_{\hat x} P_{x\hat x}(\pi(x))\, R(\hat x,\pi) \le r(x,\tilde\pi(x)) + \beta \sum_{\hat x} P_{x\hat x}(\tilde\pi(x))\, R(\hat x,\pi).$$
Applying Ex 10 from Markov Chains gives the result.
Ex 9. [Continued] Show that the limit $R_\infty(x) := \lim_{k\to\infty} R(x,\pi_k)$ exists and satisfies, for every state $x$ and action $a$,
$$R_\infty(x) \ge r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a)\, R_\infty(\hat x).$$
Ans 9. The limit exists by [8], since $R(x,\pi_k)$ is non-decreasing in $k$ and, for bounded programming, bounded above. By [8] and the definition of $\pi_{k+1}$, for every action $a$,
$$R(x,\pi_{k+1}) \ge r(x,\pi_{k+1}(x)) + \beta \sum_{\hat x} P_{x\hat x}(\pi_{k+1}(x))\, R(\hat x,\pi_k) \ge r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a)\, R(\hat x,\pi_k).$$
Now let $k \to \infty$.
We now proceed via a martingale argument. Define
$$M_n := \sum_{t=0}^{n-1} \beta^t r(x_t, \pi^*(x_t)) + \beta^n R_\infty(x_n), \qquad n \ge 0,$$
where $\pi^*$ is the optimal policy,1 $(x_t : t \ge 0)$ is the Markov chain started from $x_0 = x$ and run under $\pi^*$, and $R_\infty$ is the limit from [9].
Ex 10. [Continued] Show that $(M_n)_{n\ge 0}$ is a supermartingale with respect to the Markov chain $(x_t : t \ge 0)$ run under $\pi^*$.
Ans 10. Taking expectations with respect to the optimal policy,
$$\mathbb{E}\big[ M_{n+1} \,\big|\, x_0,\dots,x_n \big] - M_n = \beta^n \Big( r(x_n, \pi^*(x_n)) + \beta \sum_{\hat x} P_{x_n \hat x}(\pi^*(x_n))\, R_\infty(\hat x) - R_\infty(x_n) \Big) \le 0.$$
The inequality at the end follows by [9].
Ex 11. [Continued] Show that, for bounded programming, $R_\infty(x) = V(x)$, completing the proof of Thrm 1.
Ans 11. Since $M_n$ is a supermartingale:
$$R_\infty(x) = M_0 \ge \mathbb{E}\big[ M_n \big] = \mathbb{E}\Big[ \sum_{t=0}^{n-1} \beta^t r(x_t, \pi^*(x_t)) \Big] + \beta^n\, \mathbb{E}\big[ R_\infty(x_n) \big].$$
Letting $n \to \infty$ (rewards are bounded and $\beta < 1$), the right-hand side converges to $R(x,\pi^*) = V(x)$. Therefore, as required,
$$R_\infty(x) \ge V(x),$$
and since $R(x,\pi_k) \le V(x)$ for every $k$, we conclude $R_\infty(x) = V(x)$.
Ex 12. [GridWorld, again] Write a program that uses policy iteration to find the optimal policy for the robot in [7].
Ans 12.
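A minimal policy-iteration sketch, reusing the assumed layout, discount and noise model from the sketch in [7] (the actual grid from the figure is not shown here). It alternates an exact policy-evaluation solve with a greedy improvement step and stops when the policy no longer changes; end-state rewards are collected on entry and end states are absorbing.

```python
import numpy as np

GRID = ["...+",
        ".#.-",
        "...."]                                       # assumed layout, as in the sketch for [7]
BETA, NOISE = 0.9, 0.8
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right

rows, cols = len(GRID), len(GRID[0])
cells = [(i, j) for i in range(rows) for j in range(cols) if GRID[i][j] != "#"]
idx = {c: k for k, c in enumerate(cells)}
n, A = len(cells), len(ACTIONS)

def step(c, a):
    i, j = c[0] + a[0], c[1] + a[1]
    return (i, j) if 0 <= i < rows and 0 <= j < cols and GRID[i][j] != "#" else c

# Build P[x, a, x'] and r[x, a]; end states are absorbing and pay their reward once.
P, r = np.zeros((n, A, n)), np.zeros((n, A))
for c in cells:
    x = idx[c]
    if GRID[c[0]][c[1]] in "+-":
        P[x, :, x] = 1.0                              # absorbing, no further reward
        continue
    for a, act in enumerate(ACTIONS):
        for b, w in [(act, 1 - NOISE)] + [(other, NOISE / A) for other in ACTIONS]:
            nxt = step(c, b)
            P[x, a, idx[nxt]] += w
            r[x, a] += w * {"+": 1.0, "-": -1.0}.get(GRID[nxt[0]][nxt[1]], 0.0)

policy = np.zeros(n, dtype=int)                       # arbitrary initial stationary policy
while True:
    P_pi, r_pi = P[np.arange(n), policy], r[np.arange(n), policy]
    R = np.linalg.solve(np.eye(n) - BETA * P_pi, r_pi)        # policy evaluation
    new_policy = (r + BETA * P @ R).argmax(axis=1)            # policy improvement
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

arrows = ["U", "D", "L", "R"]
for i in range(rows):
    print("".join(GRID[i][j] if GRID[i][j] in "#+-" else arrows[policy[idx[(i, j)]]]
                  for j in range(cols)))
```

Under this bookkeeping the resulting greedy policy should match the one found by value iteration in [7], and the loop typically terminates after only a few improvement steps.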
Temporal Difference Iteration
We now discuss how temporal differences, defined below, relate to value iteration and policy iteration. From these we define a parameterized family of updates that has value iteration and policy iteration as special cases.
Def 3. [Temporal Differences] For an MDP, a value function $V$ (for instance the reward function of some stationary policy), and an action $a$, the temporal differences are
$$d_V(x,a) := r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a)\, V(\hat x) - V(x).$$
The following exercises show that the Bellman equation and the reward function of a stationary policy can be phrased in terms of temporal differences.
Ex 13. [Bellman Equation with Temporal Differences] Show that if $V$ is not the optimal value function then, for some state $x$,
$$\max_{a}\, d_V(x,a) \ne 0.$$
Ex 14. [Continued] Show that if $V$ is the optimal value function then, for all states $x$,
$$\max_{a}\, d_V(x,a) = 0.$$
Ex 15. [Stationary rewards] Show that the reward function of a stationary policy $\pi$ satisfies, for all $x$,
$$d_V(x, \pi(x)) = 0, \qquad \text{where } V = R(\cdot,\pi).$$
We now consider how temporal differences are used to update our value function in value iteration and policy iteration.
Ex 16. [Policy improvement] Show that the policy improvement step for both value iteration and policy iteration is given by
$$\tilde\pi(x) \in \arg\max_{a}\, d_V(x,a),$$
where $V$ is the value function from the previous iteration ($V = V_n$ for value iteration, $V = R(\cdot,\pi)$ for policy iteration).
Ex 17. [Continued, Value iteration] Show that policy evaluation under value iteration updates the value function to $V_{n+1}$ according to
$$V_{n+1}(x) = V_n(x) + d_{V_n}(x, \pi_{n+1}(x)),$$
where $\pi_{n+1}$ is as given in the policy improvement step.
(I.e. the temporal difference gives the change in the value function.)
Ex 18. [Continued, Policy Iteration] Show that policy evaluation under policy iteration updates the value function to $R(\cdot,\tilde\pi)$ according to
$$R(x,\tilde\pi) = V(x) + \mathbb{E}_{\tilde\pi}\Big[ \sum_{t=0}^{\infty} \beta^t\, d_{V}(x_t, \tilde\pi(x_t)) \,\Big|\, x_0 = x \Big],$$
where $V = R(\cdot,\pi)$ is the value function of the previous policy.
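These identities are easy to check numerically. Below is a small sketch on a random MDP (our own construction, with array names `P[x, a, x_hat]` and `r[x, a]`), using the expected temporal differences of Def 3 and the fact that the expected discounted sum of temporal differences under $\tilde\pi$ equals $(I - \beta P_{\tilde\pi})^{-1} d_{\tilde\pi}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, A, beta = 5, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(n, A))     # P[x, a, :] transition distribution
r = rng.uniform(size=(n, A))                   # rewards r(x, a)

def policy_value(pi):
    """R(., pi) solves (I - beta * P_pi) R = r_pi."""
    P_pi, r_pi = P[np.arange(n), pi], r[np.arange(n), pi]
    return np.linalg.solve(np.eye(n) - beta * P_pi, r_pi)

pi = rng.integers(A, size=n)                   # current policy
V = policy_value(pi)                           # its value function V = R(., pi)
d = r + beta * P @ V - V[:, None]              # temporal differences d_V(x, a)
pi_new = d.argmax(axis=1)                      # policy improvement (Ex 16)

# Ex 17: value iteration's evaluation adds one temporal difference.
V_vi = (r + beta * P @ V).max(axis=1)
assert np.allclose(V_vi, V + d[np.arange(n), pi_new])

# Ex 18: policy iteration's evaluation adds the expected discounted sum of
# temporal differences, i.e. (I - beta * P_pi_new)^{-1} d_pi_new.
P_new, d_new = P[np.arange(n), pi_new], d[np.arange(n), pi_new]
assert np.allclose(policy_value(pi_new),
                   V + np.linalg.solve(np.eye(n) - beta * P_new, d_new))
print("Ex 17 and Ex 18 identities verified on a random MDP.")
```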
Lambda-Policy Improvement
Notice that both value iteration and policy iteration have exactly the same policy improvement step:
$$\tilde\pi(x) \in \arg\max_{a} \Big\{ r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a)\, V(\hat x) \Big\},$$
where $V$ is the value attributed to the previous policy. However, they differ in how they evaluate the new policy. In particular, value iteration considers one step under the new policy:
$$V_{\text{new}}(x) = r(x,\tilde\pi(x)) + \beta \sum_{\hat x} P_{x\hat x}(\tilde\pi(x))\, V(\hat x),$$
whereas policy iteration takes an infinite number of steps:
$$V_{\text{new}}(x) = \mathbb{E}_{\tilde\pi}\Big[ \sum_{t=0}^{\infty} \beta^t r(x_t, \tilde\pi(x_t)) \,\Big|\, x_0 = x \Big].$$
The temporal difference method takes a geometrically distributed2 number of steps:
$$V_{\text{new}}(x) = \mathbb{E}_{\tilde\pi}\Big[ \sum_{t=0}^{N-1} \beta^t r(x_t, \tilde\pi(x_t)) + \beta^N V(x_N) \,\Big|\, x_0 = x \Big],$$
where $N$ is a geometrically distributed random variable on $\{1,2,3,\dots\}$ with probability of success $(1-\lambda)$.
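To make the interpolation concrete: when $V = R(\cdot,\pi)$, the geometric-time evaluation above is exactly the value of the switching policy studied in [19]-[20] below, and (by the recursion in [20]) it solves the linear system $R_\lambda = (I - \beta\lambda P_{\tilde\pi})^{-1}\big(r_{\tilde\pi} + \beta(1-\lambda) P_{\tilde\pi} R(\cdot,\pi)\big)$. The sketch below, on a random MDP of our own construction, checks that $\lambda = 0$ recovers the one-step (value-iteration) evaluation and $\lambda = 1$ recovers the full policy evaluation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, A, beta = 4, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(n, A))       # P[x, a, :]
r = rng.uniform(size=(n, A))

def value(pi):
    P_pi, r_pi = P[np.arange(n), pi], r[np.arange(n), pi]
    return np.linalg.solve(np.eye(n) - beta * P_pi, r_pi)

pi = np.zeros(n, dtype=int)                      # previous policy
R_pi = value(pi)                                 # its value function V = R(., pi)
pi_new = (r + beta * P @ R_pi).argmax(axis=1)    # common policy improvement step

def lam_eval(lam):
    """R_lambda = (I - beta*lam*P_new)^{-1} (r_new + beta*(1-lam)*P_new R_pi)."""
    P_new, r_new = P[np.arange(n), pi_new], r[np.arange(n), pi_new]
    return np.linalg.solve(np.eye(n) - beta * lam * P_new,
                           r_new + beta * (1 - lam) * P_new @ R_pi)

one_step = (r + beta * P @ R_pi)[np.arange(n), pi_new]   # value-iteration style evaluation
full = value(pi_new)                                     # policy-iteration style evaluation
assert np.allclose(lam_eval(0.0), one_step)              # lambda = 0: one step
assert np.allclose(lam_eval(1.0), full)                  # lambda = 1: infinitely many steps
```

Intermediate values of $\lambda$ interpolate between the two evaluation schemes.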
Ex 19. [Value function from switching policies] Let $R(x,\pi)$ be the value function of a stationary policy $\pi$ and let $R(x,\tilde\pi)$ be the value function of a stationary policy $\tilde\pi$. Let $\pi_\lambda$ be the policy that follows $\tilde\pi$ for a geometrically distributed time with success probability $(1-\lambda)$ and thereafter follows policy $\pi$. Show that the value function of $\pi_\lambda$ satisfies
$$R(x,\pi_\lambda) = \mathbb{E}_{\tilde\pi}\Big[ \sum_{t=0}^{N-1} \beta^t r(x_t, \tilde\pi(x_t)) + \beta^N R(x_N, \pi) \,\Big|\, x_0 = x \Big],$$
where $N$ is the geometric switching time, independent of the trajectory.
Ans 19. Should be obvious from discussion above.
Ex 20. [Continued, Bellman Equation] Argue that $R_\lambda(x) := R(x,\pi_\lambda)$ also satisfies the recursion
$$R_\lambda(x) = r(x,\tilde\pi(x)) + \beta \sum_{\hat x} P_{x\hat x}(\tilde\pi(x)) \big[ \lambda R_\lambda(\hat x) + (1-\lambda) R(\hat x, \pi) \big].$$
Ans 20. [ADP:TD_1] This is the recursion for the value function of the associated Markov chain; see [8] from Markov Chains. At each jump, with probability $\lambda$, the policy $\pi_\lambda$ continues to follow $\tilde\pi$ and, with probability $(1-\lambda)$, it stops and takes value $R(\hat x, \pi)$.
Ex 21. [Continued, relation with Temporal Differences] Show that
$$R_\lambda(x) = R(x,\pi) + \mathbb{E}_{\tilde\pi}\Big[ \sum_{t=0}^{N-1} \beta^t d_V(x_t, \tilde\pi(x_t)) \,\Big|\, x_0 = x \Big],$$
where the temporal differences are taken with respect to $V = R(\cdot,\pi)$.
Ans 21. We use the shorthand $V = R(\cdot,\pi)$ & $d_t = r(x_t,\tilde\pi(x_t)) + \beta V(x_{t+1}) - V(x_t)$. From [19],
$$R_\lambda(x) - V(x) = \mathbb{E}_{\tilde\pi}\Big[ \sum_{t=0}^{N-1} \beta^t r(x_t,\tilde\pi(x_t)) + \beta^N V(x_N) - V(x_0) \Big] = \mathbb{E}_{\tilde\pi}\Big[ \sum_{t=0}^{N-1} \beta^t d_t \Big],$$
where the last equality follows by telescoping the sum.
Q-factors
Def 4. [Q-Factor] The $Q$-factor of a stationary policy $\pi$ is the reward from taking action $a$ and then following policy $\pi$ thereafter, that is
$$Q_\pi(x,a) := r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a)\, R(\hat x, \pi).$$
The $Q$-factor (of the optimal policy) is given by
$$Q(x,a) := r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a)\, V(\hat x).$$
Ex 22. Show that stationary $Q$-factors satisfy the recursion
$$Q_\pi(x,a) = r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a)\, Q_\pi(\hat x, \pi(\hat x)).$$
Ex 23. Show that Bellman’s Equation can be re-expressed in terms of $Q$-factors as follows
$$Q(x,a) = r(x,a) + \beta \sum_{\hat x} P_{x\hat x}(a) \max_{a'} Q(\hat x, a').$$
Ex 24. Show that the optimal value function satisfies
$$V(x) = \max_{a} Q(x,a).$$
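Since Bellman's equation in $Q$-factor form involves the maximum of $Q$ itself, value iteration can be run directly on $Q$-factors; one convenience is that the greedy action can then be read off from $Q$ without a further lookahead through the transition probabilities. A minimal sketch on a random MDP (our own construction, with array names `P[x, a, x_hat]` and `r[x, a]`):

```python
import numpy as np

def q_value_iteration(P, r, beta=0.9, tol=1e-8):
    """Iterate Q_{n+1}(x,a) = r(x,a) + beta * sum_x' P[x,a,x'] * max_a' Q_n(x',a')  (Ex 23)."""
    Q = np.zeros_like(r)
    while True:
        Q_new = r + beta * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

# Tiny random example.
rng = np.random.default_rng(3)
n_states, n_actions, beta = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[x, a, :]
r = rng.uniform(size=(n_states, n_actions))                        # r(x, a)

Q = q_value_iteration(P, r, beta)
V = Q.max(axis=1)            # Ex 24: the optimal value function is max_a Q(x, a)
policy = Q.argmax(axis=1)    # greedy action, read off directly from Q
assert np.allclose(V, (r + beta * P @ V).max(axis=1))   # V satisfies Bellman's equation
```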
- Note we are implicitly assuming an optimal stationary policy exists. We can remove this assumption by considering an $\epsilon$-optimal (non-stationary) policy. However, the proof is a little cleaner under our assumption.↩
- By keeping things geometrically distributed we preserve the Markov property. This would not hold for other distributions.↩