- Positive Programming, Negative Programming & Discounted Programming.
- Optimality Conditions.

Thus far we have considered finite time Markov decision processes. We now want to solve MDPs of the form

(Notice rewards no longer depend on time.)

**Def 1.** [Positive, Negative, and Bounded programming] [IDP:PosNegDis]

Maximizing positive rewards, is called *positive programming*.

Minimizing positive losses, , is called *negative programming*.

Maximizing bounded discounted rewards, & is called *discounted programming*.

**Def 2.** [Minimizing Losses]So far we have considered the maximization of rewards; however often we want to minimize losses or costs. When we do so we will use the following notation:

**Ex 1.** Show that, for positive programming,

**Ex 2. **[Continued] Show that, for discounted programming,

**Ex 3.** [Continued] Show that

for a negative program.

Note that negative programming is not simply multiplying positive programming by minus one, because from [3] we see that the terms go up and over the optimal loss, while for positive programming each iteration moves towards the optimal objective [1].

## Bellman’s Equation for positive programming

We now discuss Bellman’s equation in the infinite time horizon setting. Previously we solved Markov Decisions Processes inductively with Bellman’s equation. In infinite time, we can not directly apply induction; however, we see that Bellman’s equation still holds and we can use this to solve our MDP.

**Thrm 1. **Consider a positive program or a discounted program. Given the limit is well defined for each policy , the optimal policy satisfies

Moreover, if we find a function such that

and we find a function such that

Then is optimal and the optimal value function.

The above theorem covers the cases of positive and discounted programming. The case of negative programming is a little more subtle. We, now, prove Thrm 1. On first reading, you may wish to take this theorem as a given and skip to [15].

**Ex 4. **[Proof of Thrm 1] Show that

where is the policy after by policy after taking action from state .

**Ex 5.** [Continued] Show that

In the following exercise, we let be the policy that chooses action and then, from the next state , follows a policy which satisfies

**Ex 6.** [Continued] Show that

**Ex 7.** [Continued]

At this point we have shown the first part of Thrm 1. Now we need to show that a policy that satisfies the Bellman Equation is also optimal.

**Ex 8.** [Thrm 1, 2nd part] Show that if if we find a function and a function such that

then .

**Ex 9.** For positive and discounted programming, suppose that is a policy satisfying

show that

where here is the random variable representing the state reached by policy after steps.

**Ex 10.** [Continued] Now show that

and thus show that the policy is optimal.

## Negative Programming

The analogous result to Thrm [IDP:Bellman] for Negative programming is much weaker. This can be see from [IDP:CostLimit] when compared with [[IDP:RewardLimit]]. Essentially interations of negative programming over shoot the optimal value function.

**Def 3.** A policy is called a stationary policy if its action only depends on the current state (and is non-random and does not depend on time).

**Ex 11.** Consider a MDP with a finite number of actions and assume the Bellman equation has a solution. Show that there is a stationary policy solving the Bellman equation.

**Thrm 2**. Consider a negative program. Given the limit is well defined for each policy , the optimal policy satisfies

Moreover, any stationary policy that solves the Bellman equation:

is optimal.

So the Bellman equation is still correct, but as the above result suggests, simply finding a solution to the Bellman equation is not sufficient. We need to find the optimal solution first and then we need to solve with a stationary policy.

**Ex 12. **Show that the optimal value function satisfies

**Ex 13. **Argue that the stationary policy , described in Thrm 2 satisfies

**Ex 14. **[Continued] Show that is optimal i.e.

**Ex 15. **A gambler has pounds and wants to win pounds. The gambler can bet any less than or equal to . The gambler wins with probability and loses with probability and wins or loses . The game and when either or is reached. Assuming that , argue that it is always optimal for the gambler to gamble Gamble 1 pound.

**Ex 16. **[Geometric stopping or discount factor] Consider a positive program and suppose that after an independent geometrically distributed parameter number of steps the MDP enters an exit state with zero reward. Here a policy has reward function Argue that this has the same rewards as a discounted program with discount factor .

# Answers

**Ans 1.** Apply the Monotone Convergence Theorem.

**Ans 2. **Apply the Bounded Convergence Theorem.

**Ans 3. **Again apply Monotone Convergence Theorem.

**Ans 4.** We know that

Applying limits as on both sides and monotone convergence theorem gives the results.

**Ans 5.** By [4] and the optimality of

Now maximize the left hand side of the above inequality.

**Ans 6. **We have that

The first inequality holds by the sub-optimality of and the second holds by the assumption on .

**Ans 7.** Take the inequality in [6], maximize over , and take gives

The above inequality and [5] give the result.

**Ans 8. ** It is clear that

Thus by Ex 8 in Markov chains, we have that .

**Ans 9.** Suppose that policy follows actions over states . Then applying our initial inequality in [9], we have that

In the second inequality we again apply our initial inequality (this time to the term . We repeat this times. We, then, note that the terms involving are the total reward of policy up until time .

**Ans 10. **By [9] we have that

Since is an arbitrary policy and has higher reward, must be optimal.

**Ans 11. **As actions are finite, for each there exists a solving

**Ans 12.** Convince yourself that [4-7] apply in this case.

**Ans 13. **

**Ans 14. **

So the policy has lower cost, thus is optimal.

**Ans 15. **The Bellman equation for this problem is

Let be the reward for the policy that always bets pound. You can see that solves the recursion

You can check that solves the Bellman Equation above: I.e. check

where the maximum above is maximized at . (Hint: verify by differentiation, that the function is decreasing.)

The solves the Bellman Equation and so by Thrm [IDP:Bellman] it is optimal to always bet pound.

**Ans 16. **