Thus far we have considered finite time Markov decision processes. We now want to solve MDPs of the form
We can generalize Bellman’s equation to infinite time, a correct guess at the form of the equation would, for instance, be
Previously we solved Markov Decisions Processes inductively with Bellman’s equation. In infinite time, we can not directly apply induction; however, we see that Bellman’s equation still holds and we can use this to solve our MDP. For now we will focus on the case of discounted programming: here we assume that
We will cover cases where later.
At this point it is useful define the concept of a -factor. A Q-factor of a policy is the reward that arises when we take action from state and then follow policy .
Def. [Q-Factor] The -factor of reward function is the value for taking action in state
Similarly the -factor for a policy , denoted by , is given by the above expression with . The Q-factor of the optimal policy is given by
The following result shows that if we have solved the Bellman equation then the solution and its associated policy is optimal.
Thrm. For a discounted program, the optimal policy satisfies
Moreover, if we find a function such that
then , i.e. the solution to the Bellman equation is unique, and we find a function such that
Then is optimal and the optimal value function.
Proof. We know that Applying limits as on both sides and bounded convergence theorem gives that
For the inequality, above, we maximize over . Now maximizing the left hand side over gives
At this point we have that the Bellman equation but with an inequality. We need to prove the inequality in the other direction. For this, we let be the policy that chooses action and then, from the next state , follows a policy which satisfies
We have that
The first inequality holds by the sub-optimality of and the second holds by the assumption on . Maximizing over , and taking gives
Thus we now have that
So at this point we know that the optimal value function satisfies the Bellman equation. For the next part of the result we need to show that the solution to this recursion is unique.
Suppose that is another solution to the Bellman equation. From the definition of a -factor and the Bellman recursion, and . Thus note that
Since the only solution to this inequality is and thus
So solutions to the Bellman equation are unique for discounted programming. Finally we must show that if we can find a policy that solves the Bellman equation, then it is optimal.
If we find a function and a function such that
then note that the MDP induced by is a Markov chain (with transition matrix ). Both and solve the equation . So by Prop 2 in Markov Chains: A Quick Review, .
It is worth collating together a similar result for -factors. Given the facts accrued about value function and Bellman’s equation. The following Proposition should not be too great a surprise (and can be skipped on first reading).
Prop. a) Stationary -factors satisfy the recursion
b) Bellman’s Equation can be re-expressed in terms of -factors as follows
The optimal value function satisfies
c) The operation
is a contraction with respect to the supremum norm, that is,
Proof. a) We can think of extending the state space of our MDP to include states as well as . In this new MDP we can assume that initially the MDP starts in state then moves to the state according to the transition probabilities . There after it remains in moving according to policy . Thus by Prop 2 in Markov Chains: A Quick Review
where is the reward function of policy . Further since is the value from taking instead of following policy to should also be clear that
Thus, as required,
b) Further it should be clear that the optimal value function for the extended MDP discussed has a Bellman equation of the form
Comparing the first equation above with the second, it should be clear that and substituting this back into the first equation gives as required
c) The proof of this part is already embedded in the previous Theorem. Note that