- Markov Decision Problems; Bellman’s Equation; Two examples
A Markov Decision Process is a Dynamic Program where the state evolves in a random/Markovian way.
As in the post on Dynamic Programming, we consider discrete times $t = 0, 1, \dots, T$, states $x_t \in \mathcal{X}$, actions $a_t \in \mathcal{A}$ and rewards $r_t(x_t, a_t)$ for $t < T$, with a terminal reward $r_T(x_T)$. However, the plant equation and definition of a policy are slightly different.
Def 1 [Plant Equation] The state evolves according to functions $f_t : \mathcal{X} \times \mathcal{A} \times [0,1] \rightarrow \mathcal{X}$. Here

$$x_{t+1} = f_t(x_t, a_t, U_{t+1}),$$

where $(U_t)_{t \geq 1}$ are IIDRVs uniform on $[0,1]$. This is called the Plant Equation. As noted in the equivalence above, we will often suppress dependence on $U_{t+1}$. Also, we will use the notation

$$P_t(y \,|\, x, a) := \mathbb{P}\big( f_t(x, a, U_{t+1}) = y \big)$$

for the induced Markov transition probabilities.
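To make the plant equation concrete, here is a minimal sketch (the two-state chain and its transition probabilities are invented for illustration) of how a Markov transition can be realised as a deterministic function of the current state, the action and an independent uniform random variable:

```python
import random

def f(x, a, u):
    """Plant equation for a hypothetical two-state chain: from state x,
    action a in {0, 1} sets the probability of moving to state 1."""
    p_up = 0.9 if a == 1 else 0.2  # invented transition probabilities
    return 1 if u < p_up else 0

# Simulate x_{t+1} = f(x_t, a_t, U_{t+1}) with the U_t IID uniform on [0, 1].
rng = random.Random(0)
x = 0
trajectory = [x]
for t in range(10):
    a = 1             # a fixed policy for illustration: always play action 1
    u = rng.random()  # U_{t+1}
    x = f(x, a, u)
    trajectory.append(x)
print(trajectory)
```

Any finite-state transition kernel can be encoded this way (inverse-CDF sampling), which is the equivalence the definition alludes to.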
Def 2 [Policy] A policy $\pi$ chooses an action $a_t$ at each time $t$ as a function of the past states $x_0, \dots, x_t$ and past actions $a_0, \dots, a_{t-1}$. We let $\mathcal{P}$ be the set of policies.
A policy, a plant equation, and the resulting sequence of states and rewards describe a Markov Decision Process. The objective is to find a policy that optimizes the following objective.
Def 3 [Markov Decision Problem] Given initial state $x_0$, a Markov Decision Problem is the following optimization:

$$V(x_0) = \sup_{\pi \in \mathcal{P}} \mathbb{E}_\pi \Big[ \sum_{t=0}^{T-1} r_t(x_t, a_t) + r_T(x_T) \Big].$$
Further, let $R_\tau(\pi, x)$ (resp. $V_\tau(x)$) be the objective (resp. optimal objective) when the summation is started from time $\tau$ in state $x_\tau = x$, rather than from time $0$ in state $x_0$.
Ex 1 [The Bellman Equation] Setting $V_T(x) = r_T(x)$, show that for $\tau = T-1, \dots, 1, 0$,

$$V_\tau(x) = \sup_{a \in \mathcal{A}} \Big\{ r_\tau(x, a) + \mathbb{E}\big[ V_{\tau+1}(x_{\tau+1}) \,\big|\, x_\tau = x,\, a_\tau = a \big] \Big\}.$$

The above equation is Bellman’s equation for a Markov Decision Process.
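Bellman’s equation turns the optimization over policies into a backward recursion in time. Here is a minimal sketch of that backward induction (the states, actions, rewards and transition matrix are invented purely for illustration):

```python
# Backward induction: V_T(x) = r_T(x), then
# V_t(x) = max_a { r(x, a) + E[ V_{t+1}(x_{t+1}) | x, a ] },
# on a hypothetical MDP with 3 states and 2 actions.
T = 4
states, actions = range(3), range(2)
# P[a][x][y] = probability of moving x -> y under action a (invented numbers)
P = [
    [[0.5, 0.5, 0.0], [0.1, 0.8, 0.1], [0.0, 0.5, 0.5]],  # action 0
    [[0.9, 0.1, 0.0], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]],  # action 1
]
r = lambda x, a: float(x) - 0.3 * a   # running reward (invented)
r_T = lambda x: float(x)              # terminal reward (invented)

V = [r_T(x) for x in states]          # V_T
policy = []                           # optimal actions, filled backwards in time
for t in range(T - 1, -1, -1):
    Q = [[r(x, a) + sum(P[a][x][y] * V[y] for y in states) for a in actions]
         for x in states]
    policy.append([max(actions, key=lambda a: Q[x][a]) for x in states])
    V = [max(Q[x]) for x in states]
print(V)  # V_0(x) for each state x
```

One sweep per time step suffices because the supremum at time $\tau$ only needs $V_{\tau+1}$, exactly as the equation states.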
Ex 2 You need to sell a car. At every time $t = 0, 1, \dots, T-1$, you set a price $p_t$ and a customer then views the car. The probability that the customer buys the car at price $p$ is $D(p)$. If the car isn’t sold by time $T$ then it is sold for a fixed price $B$, $B < 1$. Maximize the reward from selling the car and find the recursion for the optimal reward when $D(p) = (1-p)^+$.
Ex 3 You own a call option with strike price $p$. Here you can buy a share at price $p$, making profit $x_t - p$ where $x_t$ is the price of the share at time $t$. The option must be exercised by time $T$. The price of the share satisfies $x_{t+1} = x_t + \epsilon_{t+1}$ for IIDRVs $(\epsilon_t)_{t \geq 1}$ with finite expectation. Show that there exists a decreasing sequence $(a_t)_{t \leq T}$ such that it is optimal to exercise whenever $x_t \geq a_t$ occurs.
Ans 1 $\mathcal{P}_\tau$, the set of policies that can be implemented from time $\tau$ to $T$, is the product of the set of actions at time $\tau$ and the set of policies from time $\tau+1$ onward. That is, $\mathcal{P}_\tau = \mathcal{A} \times \mathcal{P}_{\tau+1}$. Therefore

$$
\begin{aligned}
V_\tau(x) &= \sup_{\pi \in \mathcal{P}_\tau} \mathbb{E}_\pi \Big[ \sum_{t=\tau}^{T-1} r_t(x_t, a_t) + r_T(x_T) \,\Big|\, x_\tau = x \Big] \\
&= \sup_{a \in \mathcal{A}} \sup_{\pi \in \mathcal{P}_{\tau+1}} \Big\{ r_\tau(x, a) + \mathbb{E}\Big[ \mathbb{E}_\pi \Big[ \sum_{t=\tau+1}^{T-1} r_t(x_t, a_t) + r_T(x_T) \,\Big|\, x_{\tau+1} \Big] \,\Big|\, x_\tau = x,\, a_\tau = a \Big] \Big\} \\
&= \sup_{a \in \mathcal{A}} \Big\{ r_\tau(x, a) + \mathbb{E}\big[ V_{\tau+1}(x_{\tau+1}) \,\big|\, x_\tau = x,\, a_\tau = a \big] \Big\}.
\end{aligned}
$$

The 2nd equality uses the product structure of $\mathcal{P}_\tau$, takes the $r_\tau(x, a)$ term out and then takes conditional expectations. The 3rd equality takes the supremum over $\mathcal{P}_{\tau+1}$, which does not depend on $a_\tau$, inside the expectation, and notes that the supremum over $\mathcal{A}$ is optimized at a fixed action (i.e. the past information did not help us).
Ans 2 The optimal reward $V_t$ from time $t$ onward (assuming you still have the car) satisfies:

$$V_t = \sup_{p} \big\{ p D(p) + (1 - D(p)) V_{t+1} \big\}, \qquad V_T = B.$$

If $D(p) = (1-p)^+$ then $V_t = \big( \tfrac{1 + V_{t+1}}{2} \big)^2$. (Note the relation is monotone and converges to the fixed point $V = 1$ if $B < 1$.)
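Numerically, the recursion in Ans 2 can be run backwards from the scrap value $V_T = B$ by maximising over a price grid. The sketch below takes the linear demand $D(p) = (1-p)^+$ and $B = 0.5$ as illustrative choices:

```python
# Backward recursion V_t = max_p { p D(p) + (1 - D(p)) V_{t+1} }, V_T = B,
# maximised over a grid of prices. D and B are illustrative assumptions.
D = lambda p: max(1.0 - p, 0.0)            # assumed linear demand
B = 0.5                                    # assumed scrap value, B < 1
T = 20
prices = [i / 1000 for i in range(1001)]   # price grid on [0, 1]

V = [0.0] * (T + 1)
V[T] = B
for t in range(T - 1, -1, -1):
    V[t] = max(p * D(p) + (1 - D(p)) * V[t + 1] for p in prices)
print(V[0])
```

With this demand the one-step maximiser is $p^* = (1 + V_{t+1})/2$, and the computed values increase monotonically in the remaining horizon while staying below the fixed point $1$.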
Ans 3 The Bellman equation is

$$V_t(x) = \max\big\{ x - p,\; \mathbb{E}\big[ V_{t+1}(x + \epsilon_{t+1}) \big] \big\}, \qquad V_T(x) = (x - p)^+.$$

Taking the Bellman equation and subtracting $(x - p)$ gives

$$V_t(x) - (x - p) = \max\big\{ 0,\; \mathbb{E}\big[ V_{t+1}(x + \epsilon_{t+1}) \big] - (x - p) \big\}.$$

1) $V_t(x) \geq V_{t+1}(x)$, because all policies implemented from time $t+1$ are a subset of the policies from time $t$.
2) $V_t(x) - x$ is decreasing and cts in $x$, because $V_T(x) - x = \max\{-x, -p\}$ is and, since expectation and maximum preserve these properties, so is $V_t(x) - x$ by induction on $t$.
Thus the point $a_t := \inf\{ x : V_t(x) = x - p \}$, above which it is optimal to exercise (by 2), the exercise region is an upper interval), must decrease in $t$: by 1), $V_t(x) = x - p$ forces $V_{t+1}(x) = x - p$, so the exercise region grows as $t$ increases.
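A quick numerical check of Ans 3 by backward induction on an integer price grid. The two-point step distribution, strike and horizon below are invented for illustration (the step has negative mean so that early exercise actually appears):

```python
# Backward induction: V_t(x) = max{ x - K, E[V_{t+1}(x + eps)] },
# V_T(x) = max(x - K, 0), with an invented step eps in {-2, +1} each w.p. 1/2
# (mean -1/2), strike K = 10 and horizon T = 6.
K, T = 10, 6
steps = [(-2, 0.5), (1, 0.5)]
lo, hi = -40, 60
V = {x: max(x - K, 0) for x in range(lo, hi + 1)}   # V_T on an integer grid
thresholds = []
for t in range(T - 1, -1, -1):
    lo, hi = lo + 2, hi - 1        # shrink the grid so every lookup stays valid
    newV, a_t = {}, None
    for x in range(lo, hi + 1):
        cont = sum(q * V[x + e] for e, q in steps)  # E[ V_{t+1}(x + eps) ]
        newV[x] = max(x - K, cont)
        if a_t is None and x - K >= cont:
            a_t = x                # smallest x at which exercising is optimal
    thresholds.append(a_t)         # a_t for t = T-1, T-2, ..., 0
    V = newV
print(thresholds)
```

Since the list is built from $t = T-1$ down to $t = 0$, the decreasing-in-$t$ thresholds of the exercise proof appear as a non-decreasing list here.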