Infinite Time Horizon, MDP

  • Positive Programming, Negative Programming & Discounted Programming.
  • Optimality Conditions.

Thus far we have considered finite time Markov decision processes. We now want to solve MDPs of the form

(Notice rewards no longer depend on time.)
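A standard way of writing such an objective (a sketch, in the notation of the finite-horizon sections) is

R(x,\Pi) \;=\; \mathbb{E}\Big[\, \sum_{t=0}^{\infty} \beta^t\, r(X_t,A_t) \;\Big|\; X_0 = x \Big], \qquad V(x) \;=\; \sup_{\Pi} R(x,\Pi),

with discount factor \beta \in (0,1] (and with rewards r replaced by losses l, and the supremum by an infimum, when minimizing costs).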

Def 1. [Positive, Negative, and Discounted programming]

Maximizing positive rewards, r(x,a)\geq 0, is called positive programming.

Minimizing positive losses, l(x,a)\geq 0, is called negative programming.

Maximizing bounded discounted rewards, |r(x,a)|\leq B with discount factor \beta \in (0,1), is called discounted programming.
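The boundedness assumption and the discount factor together make the objective well defined: assuming the discounted objective above,

\Big|\sum_{t=0}^{\infty}\beta^t\, r(X_t,A_t)\Big| \;\le\; \sum_{t=0}^{\infty}\beta^t B \;=\; \frac{B}{1-\beta} \;<\;\infty,

so R(x,\Pi) is finite for every policy \Pi.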

Def 2. [Minimizing Losses] So far we have considered the maximization of rewards; however, often we want to minimize losses or costs. When we do so, we will use the following notation:

Ex 1. Show that, for positive programming,

Ex 2. [Continued] Show that, for discounted programming,

Ex 3. [Continued] Show that

for a negative program.

Note that negative programming is not simply positive programming multiplied by minus one: from Ex 3 we see that the terms C_T(x,\Pi) increase up to and past the optimal loss, while for positive programming each iteration moves towards the optimal objective from below (Ex 1).
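A toy numerical illustration of this remark (the two chains below are made-up examples, not from the text): for a positive program the finite-horizon rewards R_T(x,\Pi) increase towards R(x,\Pi) and never exceed the optimal value, while for a negative program the finite-horizon costs C_T(x,\Pi) of a suboptimal policy increase and pass over the optimal loss.

def finite_horizon_value(per_step, stay_prob, T):
    # Expected amount collected over T steps in a 2-state chain that stays
    # "alive" with probability stay_prob each step and collects per_step while alive.
    return sum(per_step * stay_prob**t for t in range(T))

# Positive program: reward 1 per step while alive, survival probability 1/2.
# R_T increases to R = 2; the optimal value here is also 2 and is never exceeded.
print([round(finite_horizon_value(1.0, 0.5, T), 4) for T in range(1, 8)])

# Negative program: a "stall" policy pays loss 0.3 per step while alive (survival 1/2),
# whereas quitting immediately costs 0, so the optimal loss is 0.
# C_T increases to 0.6 and overshoots the optimal loss from the first step.
print([round(finite_horizon_value(0.3, 0.5, T), 4) for T in range(1, 8)])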


Bellman’s Equation for positive programming

We now discuss Bellman’s equation in the infinite time horizon setting. Previously we solved Markov decision processes inductively with Bellman’s equation. In infinite time, we cannot directly apply induction; however, Bellman’s equation still holds and we can use it to solve our MDP.

Thrm 1. Consider a positive program or a discounted program. Given that the limit R(x,\Pi) is well defined for each policy \Pi, the optimal value function V(x) satisfies

Moreover, if we find a function R(x) such that

and we find a function \pi(x) such that

then \pi is optimal and R(x)=R(x,\pi)=V(x), the optimal value function.
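In the notation above, these conditions are the usual Bellman equation; a standard form (a sketch, assuming transition probabilities P(y\mid x,a)) is

V(x) \;=\; \sup_{a\in\mathcal{A}}\Big\{\, r(x,a) + \beta \sum_{y} P(y\mid x,a)\, V(y) \Big\},

with \beta=1 for a positive program; the second part asks for a function R(x) satisfying the same fixed-point relation together with a \pi(x) attaining the supremum.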

The above theorem covers the cases of positive and discounted programming. The case of negative programming is a little more subtle. We now prove Thrm 1. On first reading, you may wish to take this theorem as given and skip to Ex 15.
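Before the proof, a minimal sketch of how the theorem is used in practice: value iteration (a standard method, not described in the text) produces a function satisfying the Bellman equation, and the greedy policy \pi(x) attaining the maximum is then optimal. The 2-state, 2-action discounted MDP below uses made-up numbers purely for illustration.

import numpy as np

beta = 0.9
# r[x, a]: rewards; P[x, a, y]: transition probabilities (hypothetical numbers).
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])

V = np.zeros(2)
for _ in range(1000):
    Q = r + beta * (P @ V)        # Q[x, a] = r(x,a) + beta * sum_y P(y|x,a) V(y)
    V_new = Q.max(axis=1)         # Bellman operator applied to V
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

pi = Q.argmax(axis=1)             # greedy stationary policy
print("V =", V)
print("pi =", pi, " Bellman residual =", float(np.max(np.abs(V - Q.max(axis=1)))))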

Ex 4. [Proof of Thrm 1] Show that

where \hat{\Pi} is the policy followed by \Pi after taking action \pi_0 from state x.

Ex 5. [Continued] Show that

In the following exercise, we let \Pi_\epsilon be the policy that chooses action a and then, from the next state \hat{X}, follows a policy \hat{\Pi}_\epsilon which satisfies

Ex 6. [Continued] Show that

Ex 7. [Continued]

At this point we have shown the first part of Thrm 1. Now we need to show that a policy that satisfies the Bellman equation is also optimal.

Ex 8. [Thrm 1, 2nd part] Show that if we find a function R(x) and a function \pi(x) such that

then R(x,\pi) = R(x).

Ex 9. For positive and discounted programming, suppose that \Pi is a policy satisfying

show that

where \tilde{X}_t is the random variable representing the state reached by policy \tilde{\Pi} after t steps.

Ex 10. [Continued] Now show that

and thus show that the policy \Pi is optimal.

 


Negative Programming

The analogous result to Thrm 1 for negative programming is much weaker. This can be seen by comparing Ex 3 with Ex 1: essentially, the iterations of negative programming overshoot the optimal value function.

Def 3. A policy \Pi is called a stationary policy if its action depends only on the current state (and is non-random and does not depend on time).

Ex 11. Consider an MDP with a finite number of actions and assume the Bellman equation has a solution. Show that there is a stationary policy solving the Bellman equation.

Thrm 2. Consider a negative program. Given that the limit of C_T(x,\Pi) is well defined for each policy \Pi, the optimal loss L(x) satisfies

Moreover, any stationary policy \Pi that solves the Bellman equation:

is optimal.
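In the same notation, the Bellman equation for a negative program takes the minimizing form (again a sketch, assuming transition probabilities P(y\mid x,a)):

L(x) \;=\; \inf_{a\in\mathcal{A}}\Big\{\, l(x,a) + \sum_{y} P(y\mid x,a)\, L(y) \Big\},

and the theorem asserts that a stationary policy attaining this infimum at every state, applied to the optimal loss L, is optimal.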

So the Bellman equation is still correct but, as the above result suggests, simply finding a solution to the Bellman equation is not sufficient: we must first find the optimal loss L(x) and then take a stationary policy that attains the minimum in its Bellman equation.

Ex 12. Show that the optimal value function satisfies

Ex 13. Argue that the stationary policy \pi described in Thrm 2 satisfies

Ex 14. [Continued] Show that \Pi is optimal, i.e.

Ex 15. A gambler has i pounds and wants to win N pounds. The gambler can bet any amount j less than or equal to i; with probability p the gambler wins j and with probability q = 1-p loses j. The game ends when either 0 or N is reached. Assuming that p>1/2, argue that it is always optimal for the gambler to bet 1 pound.
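A numerical check of this claim (a sketch; the values p=0.6, N=10 and the cap on bets so the fortune never passes N are assumptions, not from the text). We run value iteration for the positive program "maximise the probability of reaching N" and compare the optimal value with the value of the timid policy that always bets 1 pound.

p, q, N = 0.6, 0.4, 10

# Optimal value by value iteration; V[0] = 0 and V[N] = 1 are held fixed.
V = [0.0] * (N + 1)
V[N] = 1.0
for _ in range(10000):
    for i in range(1, N):
        # Bets j = 1, ..., min(i, N - i), so the fortune stays in {0, ..., N}.
        V[i] = max(p * V[i + j] + q * V[i - j] for j in range(1, min(i, N - i) + 1))

# Value of the always-bet-1 policy, R(i) = p R(i+1) + q R(i-1), solved explicitly
# as the classical gambler's-ruin probability of reaching N before 0.
R = [(1 - (q / p) ** i) / (1 - (q / p) ** N) for i in range(N + 1)]

print(max(abs(V[i] - R[i]) for i in range(N + 1)))   # close to 0: betting 1 pound is optimal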

 

Ex 16. [Geometric stopping or discount factor] Consider a positive program and suppose that, after an independent geometrically distributed (parameter \beta) number of steps, the MDP enters an exit state with zero reward. Here a policy \Pi has reward function

Argue that this has the same rewards as a discounted program with discount factor \beta.


Answers

Ans 1. Apply the Monotone Convergence Theorem.

Ans 2. Apply the Bounded Convergence Theorem.

Ans 3. Again apply the Monotone Convergence Theorem.

Ans 4. We know that

Applying limits as t\rightarrow \infty on both sides and using the Monotone Convergence Theorem gives the result.

Ans 5. By Ex 4 and the optimality of V(x),

Now maximize the left-hand side of the above inequality.

Ans 6. We have that

The first inequality holds by the sub-optimality of \Pi_\epsilon and the second holds by the assumption on \hat{\Pi}_\epsilon.

Ans 7. Taking the inequality in Ex 6, maximizing over a\in \mathcal{A}, and letting \epsilon\rightarrow 0 gives

The above inequality and Ex 5 give the result.

Ans 8.  It is clear that

Thus, by Ex 8 of the Markov chains section, we have that R(x,\pi) = R(x).

Ans 9. Suppose that policy \tilde{\Pi} follows actions \tilde{A}_0,...,\tilde{A}_{t-1} over states \tilde{X}_0=x, \tilde{X}_1,...,\tilde{X}_t. Then, applying our initial inequality in Ex 9, we have that

In the second inequality we again apply our initial inequality (this time to the term R(\tilde{X}_1,\Pi)). We repeat this t-1 times. We then note that the terms involving r(\cdot,\cdot) are the total reward of policy \tilde{\Pi} up until time t.

Ans 10. By Ex 9 we have that

Since \tilde \Pi is an arbitrary policy and \Pi has higher reward, \Pi must be optimal.

Ans 11. As the set of actions is finite, for each x there exists a \pi(x) solving

Ans 12. Convince yourself that Ex 4-7 apply in this case.

Ans 13. 

Ans 14. 

So the policy has lower cost and thus is optimal.

Ans 15. The Bellman equation for this problem is

Let R(i) be the reward for the policy that always bets 1 pound. You can see that R(i) solves the recursion

You can check that R(i) solves the Bellman equation above, i.e. check

where the maximum above is attained at j=1. (Hint: verify by differentiation that the function f(x) = -p(q/p)^x - q(q/p)^{-x} is decreasing.)

Thus R(i) solves the Bellman equation and so, by Thrm 1, it is optimal to always bet 1 pound.

Ans 16. 
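One possible sketch, assuming the exit time T is independent of the MDP and satisfies \mathbb{P}(T>t)=\beta^t (i.e. the process survives each step with probability \beta): since the rewards are non-negative we may exchange expectation and summation, so

\mathbb{E}\Big[\sum_{t=0}^{T-1} r(X_t,A_t)\Big] \;=\; \sum_{t=0}^{\infty} \mathbb{P}(T>t)\,\mathbb{E}\big[r(X_t,A_t)\big] \;=\; \sum_{t=0}^{\infty} \beta^t\,\mathbb{E}\big[r(X_t,A_t)\big],

which is the reward of \Pi in a discounted program with discount factor \beta.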
