- Positive Programming, Negative Programming & Discounted Programming.
- Optimality Conditions.
Thus far we have considered finite time Markov decision processes. We now want to solve MDPs of the form
(Notice rewards no longer depend on time.)
Def 1. [Positive, Negative, and Bounded programming] [IDP:PosNegDis]
Maximizing positive rewards, is called positive programming.
Minimizing positive losses, , is called negative programming.
Maximizing bounded discounted rewards, & is called discounted programming.
Def 2. [Minimizing Losses]So far we have considered the maximization of rewards; however often we want to minimize losses or costs. When we do so we will use the following notation:
Ex 1. Show that, for positive programming,
Ex 2. [Continued] Show that, for discounted programming,
Ex 3. [Continued] Show that
for a negative program.
Note that negative programming is not simply multiplying positive programming by minus one, because from  we see that the terms go up and over the optimal loss, while for positive programming each iteration moves towards the optimal objective .
Bellman’s Equation for positive programming
We now discuss Bellman’s equation in the infinite time horizon setting. Previously we solved Markov Decisions Processes inductively with Bellman’s equation. In infinite time, we can not directly apply induction; however, we see that Bellman’s equation still holds and we can use this to solve our MDP.
Thrm 1. Consider a positive program or a discounted program. Given the limit is well defined for each policy , the optimal policy satisfies
Moreover, if we find a function such that
and we find a function such that
Then is optimal and the optimal value function.
The above theorem covers the cases of positive and discounted programming. The case of negative programming is a little more subtle. We, now, prove Thrm 1. On first reading, you may wish to take this theorem as a given and skip to .
Ex 4. [Proof of Thrm 1] Show that
where is the policy after by policy after taking action from state .
Ex 5. [Continued] Show that
In the following exercise, we let be the policy that chooses action and then, from the next state , follows a policy which satisfies
Ex 6. [Continued] Show that
Ex 7. [Continued]
At this point we have shown the first part of Thrm 1. Now we need to show that a policy that satisfies the Bellman Equation is also optimal.
Ex 8. [Thrm 1, 2nd part] Show that if if we find a function and a function such that
Ex 9. For positive and discounted programming, suppose that is a policy satisfying
where here is the random variable representing the state reached by policy after steps.
Ex 10. [Continued] Now show that
and thus show that the policy is optimal.
The analogous result to Thrm [IDP:Bellman] for Negative programming is much weaker. This can be see from [IDP:CostLimit] when compared with [[IDP:RewardLimit]]. Essentially interations of negative programming over shoot the optimal value function.
Def 3. A policy is called a stationary policy if its action only depends on the current state (and is non-random and does not depend on time).
Ex 11. Consider a MDP with a finite number of actions and assume the Bellman equation has a solution. Show that there is a stationary policy solving the Bellman equation.
Thrm 2. Consider a negative program. Given the limit is well defined for each policy , the optimal policy satisfies
Moreover, any stationary policy that solves the Bellman equation:
So the Bellman equation is still correct, but as the above result suggests, simply finding a solution to the Bellman equation is not sufficient. We need to find the optimal solution first and then we need to solve with a stationary policy.
Ex 12. Show that the optimal value function satisfies
Ex 13. Argue that the stationary policy , described in Thrm 2 satisfies
Ex 14. [Continued] Show that is optimal i.e.
Ex 15. A gambler has pounds and wants to win pounds. The gambler can bet any less than or equal to . The gambler wins with probability and loses with probability and wins or loses . The game and when either or is reached. Assuming that , argue that it is always optimal for the gambler to gamble Gamble 1 pound.
Ex 16. [Geometric stopping or discount factor] Consider a positive program and suppose that after an independent geometrically distributed parameter number of steps the MDP enters an exit state with zero reward. Here a policy has reward function Argue that this has the same rewards as a discounted program with discount factor .
Ans 1. Apply the Monotone Convergence Theorem.
Ans 2. Apply the Bounded Convergence Theorem.
Ans 3. Again apply Monotone Convergence Theorem.
Ans 4. We know that
Applying limits as on both sides and monotone convergence theorem gives the results.
Ans 5. By  and the optimality of
Now maximize the left hand side of the above inequality.
Ans 6. We have that
The first inequality holds by the sub-optimality of and the second holds by the assumption on .
Ans 7. Take the inequality in , maximize over , and take gives
The above inequality and  give the result.
Ans 8. It is clear that
Thus by Ex 8 in Markov chains, we have that .
Ans 9. Suppose that policy follows actions over states . Then applying our initial inequality in , we have that
In the second inequality we again apply our initial inequality (this time to the term . We repeat this times. We, then, note that the terms involving are the total reward of policy up until time .
Ans 10. By  we have that
Since is an arbitrary policy and has higher reward, must be optimal.
Ans 11. As actions are finite, for each there exists a solving
Ans 12. Convince yourself that [4-7] apply in this case.
So the policy has lower cost, thus is optimal.
Ans 15. The Bellman equation for this problem is
Let be the reward for the policy that always bets pound. You can see that solves the recursion
You can check that solves the Bellman Equation above: I.e. check
where the maximum above is maximized at . (Hint: verify by differentiation, that the function is decreasing.)
The solves the Bellman Equation and so by Thrm [IDP:Bellman] it is optimal to always bet pound.