# Algorithms for MDPs

• High level idea: Policy Improvement and Policy Evaluation.
• Value Iteration; Policy Iteration.
• Temporal Differences; Q-factors.

For infinite time MDPs, we cannot apply to induction on Bellman’s equation from some initial state – like we could for finite time MDP. So we need some algorithms to solve MDPs.

At a high level, for a Markov Decision Processes (where the transitions $P^{a}_{xy}$ are known), an algorithm solving a Markov Decision Process involves two steps:

• (Policy Improvement) Here you take your initial policy $\pi_0$ and find a new improved new policy $\pi$, for instance by solving Bellman’s equation:

• (Policy Evaluation) Here you find the value of your policy. For instance by finding the reward function for policy $\pi$:

## Value iteration

Value iteration provides an important practical scheme for approximating the solution of an infinite time horizon Markov decision process.

Def. [Value iteration] Take $V_0(x)=0$ $\forall x$ and recursively calculate

for $s=1,2,..$ this is called value iteration.

We can think of the two display equations above, respectively, as the policy improvement and policy evaluation steps. Notice, that we don’t really need to do the policy improvement step to do each iteration. Notice the policy evaluation step evalutes one action under the new policy $\pi$ afterwards the value is $V_s(\hat{X})$.

Similarly we can define value iteration for minimization problem:

Each iteration of value iteration improves the solution:

Ex 1. For reward function $R(x)$ define

Show that if $R(x) \geq \tilde{R}(x)$ for all $x\in \mathcal X$ then $\mathcal L R(x) \geq \mathcal L \tilde{R}(x)$ for all $x\in \mathcal X$.

Ans 1.

Now maximize both sides over $a\in\mathcal A$.

Ex 2. [Continued] Show that for value iteration with positive programming

Ans 2. $V_1(x) = \max_a r(x,a) \geq 0 = V_0(x).$ Now repeatedly apply [1].

Ex 3. [Continued][IDP:Cont_2] Show that for value iteration with negative programming

Ans 3. Identical idea as [1-2].

So we know Value Iteration improves the value function on each step. We now need to argue the value iteration gets to the optimal solution.

Ex 4. Show that for discounted programming

Ans 4. Note that $V_{s+1}$ is the optimal solution to the finite time MDP with $s+1$ steps, recall in Def [MDP:Def]. Thus,

Now let $s\rightarrow\infty$.

Ex 5. Show that for positive programming

Ans 5. Take any policy $\Pi$. Then

Now take limits $V_{\infty}(x) \geq R(x,\Pi)$. Now maximize over $\Pi$.

Ex 6. Show that for negative programming with a finite number of actions

Ans 6.  Same idea as [5].

Ex 7. [GridWorld] A robot is placed on the following grid.

The robot can chose the action to move left, right, up or down provided it does not hit a wall, in this case it stays in the same position. (Walls are colored black.) With probability 0.8, the robot does not follow its chosen action and instead makes a random action. The rewards for the different end states are colored above. Write a program that uses, Value Iteration to find the optimal policy for the robot.

Ans 7. Notice that the robot does not just take the shortest root. (I.e. some forward planning is required)

## Policy Iteration

We consider a discounted program with rewards $r(x,a)$ and discount factor $\beta \in (0,1)$.

Def 2. [Policy Iteration] Given the stationary policy $\Pi$, we may define a new (improved) stationary policy, ${\mathcal I}\Pi$, by choosing for each $x$ the action ${\mathcal I}\Pi (x)$ that solves the following maximization

where $R(x,\Pi)$ is the value function for policy $\Pi$. We then calculate $R(x,\mathcal I \Pi)$. Recall that for each $x$ this solves the equations

Policy iteration is the algorithm that takes

Starting from some of the tree stationary policy $\Pi_0$.

Thrm 1. Under Policy Iteration

and, for bounded programming,

Ex 8 [Proof of Thrm 1] Under the policy iteration

(Hint: requires Ex 8 & Ex 10 from Markov Chains)

Ans 8. By Ex 8 from Markov Chains, and the optimality of $\mathcal I \Pi$ with respect to $\Pi$ we have

Applying Ex 10 from Markov Chains gives the result.

Ex 9. [Continued]

Ans 9.

We now proceed via a Martingale argument. Define

where $\pi^*(x)$ is the optimal policy.1

Ex 10. [Continued] Show that $M_t$ is a supermartingale with respect to $\Pi^*$.

Ans 10. Taking expectations with respect to the optimal policy $\Pi^*$

The inequality at the end follows by [9].

Ex 11. [Continued]

Ans 11. Since $M_t$ is a supermartingale:

Therefore, as required,

Ex 12. [GridWorld, again] Write a program that uses, Policy iteration to find the optimal policy for the robot in [7].

Ans 12.

## Temporal Difference Iteration

We now discuss how temporal differences, defined below, relate to value iteration and policy iteration. From define a parameterized set of policies that have value iteration and policy improvement as special cases.

Def 3 [Temporal Differences] For a MDP, reward function $R$, and an action $a$, the temporal differences are

The following exercises show that the Bellman equation and the definition of a stationary policy can be phrased in terms of temporal differences.

Ex 13. [Bellman Equation with Temporal Differences] Show that if $R$ is not optimal value function then

Ex 14. [Continued] Show that if $R$ is optimal value function then

Ex 15. [Stationary rewards] Show that the reward function for a stationary policy $\pi$ satisfies,

We now consider how temporal differences are used to update our value function in value iteration and policy iteration.

Ex 16. [Policy improvement] Show that policy improvement for value iteration and policy iteration are given by

Ex 17. [Continued, Value iteration] Show that policy evaluation under value iteration, updates the value function $R=V_0$ to $V$ according to

where $\pi(x)$ is as given in the policy improvement step.

(i.e. the temporal difference gives the change in the value function.)

Ex 18. [Continued, Policy Iteration] Show that policy evaluation under policy improvement, update that value function $R=V_0$ to $V$ according to

### Lambda-Policy Improvement

Notice that both value iteration and policy improvement have exactly the same policy improvement step:

where $V_0(x)$ is the value attributed to the previous policy. However, they differ in how the evaluate the policy. In particular, value iteration considers one step under the new policy:

where as policy iteration takes an infinite number of steps:

The Temporal Difference method takes a geometrically distributed2 number of steps:

where $\tau_\lambda$ is a geometrically distributed random variable on $\mathbb Z_+$ with probability of success $(1-\lambda$).

Ex 19. [Value function from switching policies] Let $V^{(0)}(x)$ be value function of a stationary policy $\Pi^{(0)}$ and let $V^{(1)}(x)$ be value function of a stationary policy $\Pi^{(1)}$.

Let $\Pi^{(\lambda)}$ be the policy that follows $\Pi^{(1)}$ for a geometrically distributed time with success probability $1-\lambda$ and there after follows policy $\Pi^{(0)}$. Show that the value function of $\Pi^{(\lambda)}$ satisfies

Ans 19. Should be obvious from discussion above.

Ex 20. [Continued, Bellman Equation] Argue that $V^{(\lambda)}$ also satisfies the recursion

Ans 20. [ADP:TD_1] This is recursion for the value function of this Markov chain, see [8] from Markov Chains. At each jump, with probability $\lambda$, $X^{(1)}$ continues and, with probability $1-\lambda$, $X^{(1)}$ stops and takes value $V^{(0)}(\cdot)$.

Ex 21. [Continued, relation with Temporal Differences] Show that

Ans 21. We use the shorthand $r_t=r(X^{(1)}_t,\pi^{(1)}(X^{(1)}_t))$ & $V_t^{(0)}=V^{(0)}(X^{(0)}_t)$. From [19],

## Q-factors

Def 4. [Q-Factor] The $Q$-factor of a stationary policy $\pi$ is the from taking action $a$ and then following policy $\pi$ there after, that is

The Q-factor (of the optimal policy) is given by

Ex 22. Show that stationary $Q$-factors satisfy the recursion

Ex 23. Show that Bellman’s Equation can be re-expressed in terms of $Q$-factors as follows

Ex 24. Show that the optimal value function satisfies

1. Note we are implicity assuming an optimal stationary policy exists. We can remove this assumption by considering a $\epsilon$-optimal (non-stationary) policy. However, the proof is a little cleaner under our assumption.
2. By keeping things geometrically distributed we preserve the Markov property. This would not hold for other distributions