- High level idea: Policy Improvement and Policy Evaluation.
- Value Iteration; Policy Iteration.
- Temporal Differences; Q-factors.

For infinite time MDPs, we cannot apply to induction on Bellman’s equation from some initial state – like we could for finite time MDP. So we need some algorithms to solve MDPs.

At a high level, for a Markov Decision Processes (where the transitions are known), an algorithm solving a Markov Decision Process involves two steps:

- (Policy Improvement) Here you take your initial policy and find a new improved new policy , for instance by solving Bellman’s equation:

- (Policy Evaluation) Here you find the value of your policy. For instance by finding the reward function for policy :

## Value iteration

Value iteration provides an important practical scheme for approximating the solution of an infinite time horizon Markov decision process.

**Def.** [Value iteration] Take and recursively calculate

for this is called *value iteration*.

We can think of the two display equations above, respectively, as the policy improvement and policy evaluation steps. Notice, that we don’t really need to do the policy improvement step to do each iteration. Notice the policy evaluation step evalutes one action under the new policy afterwards the value is .

Similarly we can define value iteration for minimization problem:

Each iteration of value iteration improves the solution:

**Ex 1.** For reward function define

Show that if for all then for all .

**Ans 1.**

Now maximize both sides over .

**Ex 2. **[Continued] Show that for value iteration with positive programming

**Ans 2.** Now repeatedly apply [1].

**Ex 3. **[Continued][IDP:Cont_2] Show that for value iteration with negative programming

**Ans 3.** Identical idea as [1-2].

So we know Value Iteration improves the value function on each step. We now need to argue the value iteration gets to the optimal solution.

**Ex 4.** Show that for discounted programming

**Ans 4.** Note that is the optimal solution to the finite time MDP with steps, recall in Def [MDP:Def]. Thus,

Now let .

**Ex 5.** Show that for positive programming

**Ans 5.** Take any policy . Then

Now take limits . Now maximize over .

**Ex 6.** Show that for negative programming with a finite number of actions

**Ans 6.** Same idea as [5].

**Ex 7.** [GridWorld] A robot is placed on the following grid.

The robot can chose the action to move left, right, up or down provided it does not hit a wall, in this case it stays in the same position. (Walls are colored black.) With probability 0.8, the robot does not follow its chosen action and instead makes a random action. The rewards for the different end states are colored above. Write a program that uses, Value Iteration to find the optimal policy for the robot.

**Ans 7.** Notice that the robot does not just take the shortest root. (I.e. some forward planning is required)

## Policy Iteration

We consider a discounted program with rewards and discount factor .

**Def 2.** [Policy Iteration] Given the stationary policy , we may define a new (improved) stationary policy, , by choosing for each the action that solves the following maximization

where is the value function for policy . We then calculate . Recall that for each this solves the equations

Policy iteration is the algorithm that takes

Starting from some of the tree stationary policy .

**Thrm 1.** Under Policy Iteration

and, for bounded programming,

Ex 8 [Proof of Thrm 1] Under the policy iteration

(Hint: requires Ex 8 & Ex 10 from Markov Chains)

**Ans 8.** By Ex 8 from Markov Chains, and the optimality of with respect to we have

Applying Ex 10 from Markov Chains gives the result.

**Ex 9.** [Continued]

**Ans 9.**

We now proceed via a Martingale argument. Define

where is the optimal policy.^{1}

**Ex 10.** [Continued] Show that is a supermartingale with respect to .

**Ans 10.** Taking expectations with respect to the optimal policy

The inequality at the end follows by [9].

**Ex 11.** [Continued]

**Ans 11.** Since is a supermartingale:

Therefore, as required,

**Ex 12.** [GridWorld, again] Write a program that uses, Policy iteration to find the optimal policy for the robot in [7].

**Ans 12.**

## Temporal Difference Iteration

We now discuss how temporal differences, defined below, relate to value iteration and policy iteration. From define a parameterized set of policies that have value iteration and policy improvement as special cases.

**Def 3 **[Temporal Differences] For a MDP, reward function , and an action , the temporal differences are

The following exercises show that the Bellman equation and the definition of a stationary policy can be phrased in terms of temporal differences.

**Ex 13.** [Bellman Equation with Temporal Differences] Show that if is not optimal value function then

**Ex 14.** [Continued] Show that if is optimal value function then

**Ex 15.** [Stationary rewards] Show that the reward function for a stationary policy satisfies,

We now consider how temporal differences are used to update our value function in value iteration and policy iteration.

**Ex 16.** [Policy improvement] Show that policy improvement for value iteration and policy iteration are given by

**Ex 17.** [Continued, Value iteration] Show that policy evaluation under value iteration, updates the value function to according to

where is as given in the policy improvement step.

(i.e. the temporal difference gives the change in the value function.)

**Ex 18.** [Continued, Policy Iteration] Show that policy evaluation under policy improvement, update that value function to according to

### Lambda-Policy Improvement

Notice that both value iteration and policy improvement have exactly the same policy improvement step:

where is the value attributed to the previous policy. However, they differ in how the evaluate the policy. In particular, value iteration considers one step under the new policy:

where as policy iteration takes an infinite number of steps:

The Temporal Difference method takes a geometrically distributed^{2} number of steps:

where is a geometrically distributed random variable on with probability of success ).

**Ex 19.** [Value function from switching policies] Let be value function of a stationary policy and let be value function of a stationary policy .

Let be the policy that follows for a geometrically distributed time with success probability and there after follows policy . Show that the value function of satisfies

**Ans 19.** Should be obvious from discussion above.

**Ex 20.** [Continued, Bellman Equation] Argue that also satisfies the recursion

**Ans 20.** [ADP:TD_1] This is recursion for the value function of this Markov chain, see [8] from Markov Chains. At each jump, with probability , continues and, with probability , stops and takes value .

**Ex 21.** [Continued, relation with Temporal Differences] Show that

**Ans 21.** We use the shorthand & . From [19],

## Q-factors

**Def 4.** [Q-Factor] The -factor of a stationary policy is the from taking action and then following policy there after, that is

The Q-factor (of the optimal policy) is given by

**Ex 22. **Show that stationary -factors satisfy the recursion

**Ex 23.** Show that Bellman’s Equation can be re-expressed in terms of -factors as follows

**Ex 24.** Show that the optimal value function satisfies

- Note we are implicity assuming an optimal stationary policy exists. We can remove this assumption by considering a -optimal (non-stationary) policy. However, the proof is a little cleaner under our assumption.↩
- By keeping things geometrically distributed we preserve the Markov property. This would not hold for other distributions↩