- Markov Decision Problems; Bellman’s Equation; Two examples

A Markov Decision Process is a Dynamic Program where the state evolves in a random/Markovian way.

As in the post on Dynamic Programming, we consider discrete times $t = 0, 1, \dots, T$, states $x_t \in \mathcal{X}$, actions $a_t \in \mathcal{A}$ and rewards $r_t(x_t, a_t)$. However, the plant equation and the definition of a policy are slightly different.

**Def 1** [Plant Equation] The state evolves according to functions $f_t : \mathcal{X} \times \mathcal{A} \times [0,1] \rightarrow \mathcal{X}$. Here

$$x_{t+1} = f_t(x_t, a_t, \omega_{t+1})$$

where $\omega_1, \omega_2, \dots$ are IIDRVs uniform on $[0,1]$. This is called the Plant Equation. As noted in the equivalence above, we will often suppress dependence on $\omega_t$. Also, we will use the notation

$$P(\hat{x} \,|\, x, a) = \mathbb{P}(x_{t+1} = \hat{x} \,|\, x_t = x, a_t = a)$$

for the transition probabilities of the resulting Markov chain.
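As a concrete sketch of the plant equation (with a toy chain of my own, not from the post), the dynamics can be simulated by drawing the uniforms $\omega_{t+1}$ and feeding them through $f$:

```python
import random

# Toy example (illustrative only): a two-state chain on X = {0, 1} where
# the action a in {0, 1} sets how "sticky" the current state is. The plant
# equation x_{t+1} = f(x_t, a_t, w_{t+1}), driven by IID Uniform[0,1]
# draws w_{t+1}, realises the transition probabilities.

STAY_PROB = {0: 0.9, 1: 0.5}  # hypothetical stay-probability per action

def f(x, a, w):
    """Plant equation: stay in state x when the uniform draw is below STAY_PROB[a]."""
    return x if w < STAY_PROB[a] else 1 - x

random.seed(0)
x, path = 0, [0]
for t in range(10):
    a = 0                         # a constant policy, just to drive the simulation
    x = f(x, a, random.random())  # w_{t+1} ~ Uniform[0, 1], IID
    path.append(x)
print(path)
```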

**Def 2** [Policy] A policy $\pi$ chooses an action $\pi_t$ at each time $t$ as a function of past states $x_0, \dots, x_t$ and past actions $a_0, \dots, a_{t-1}$. We let $\mathcal{P}$ be the set of policies.

A policy, a plant equation, and the resulting sequence of states and rewards describe a Markov Decision Process. The objective is to find a policy that optimizes the following objective.

**Def 3** [Markov Decision Problem] Given initial state $x_0$, a Markov Decision Problem is the following optimization

$$\max_{\pi \in \mathcal{P}} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{T-1} r_t(x_t, a_t) + r_T(x_T) \right].$$

Further, let $R_\tau(\pi, x)$ (Resp. $V_\tau(x)$) be the objective (Resp. optimal objective) for when the summation is started from time $t = \tau$ and state $x_\tau = x$, rather than $t = 0$ and $x_0$.
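The objective in Def 3 can be estimated for a fixed policy by Monte-Carlo simulation; the two-state chain, rewards and policy below are my own illustrative choices:

```python
import random

# Monte-Carlo sketch of the objective in Def 3: estimate
# E_pi[ sum_{t<T} r_t(x_t, a_t) + r_T(x_T) ] for one fixed policy.
# The chain, rewards and policy are toy choices, not from the post.

random.seed(42)
T = 5

def f(x, a, w):                 # plant equation for the toy two-state chain
    stay = 0.9 if a == 0 else 0.5
    return x if w < stay else 1 - x

def r(x, a):                    # running reward: being in state 1 pays 1
    return 1.0 if x == 1 else 0.0

def policy(t, x):               # try to leave state 0, then sit still in state 1
    return 1 if x == 0 else 0

def episode(x0):
    x, total = x0, 0.0
    for t in range(T):
        a = policy(t, x)
        total += r(x, a)
        x = f(x, a, random.random())
    return total + float(x)     # terminal reward r_T(x) = x

n = 5000
estimate = sum(episode(0) for _ in range(n)) / n
print(round(estimate, 2))
```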

**Ex 1** [the Bellman Equation] Setting $V_T(x) = r_T(x)$, show that for $\tau = T-1, T-2, \dots, 0$

$$V_\tau(x) = \max_{a \in \mathcal{A}} \left\{ r_\tau(x, a) + \mathbb{E}\left[ V_{\tau+1}(x_{\tau+1}) \,\middle|\, x_\tau = x, a_\tau = a \right] \right\}.$$

The above equation is Bellman’s equation for a Markov Decision Process.
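The backward induction behind Bellman’s equation can be sketched for a small finite MDP; all the numbers below (states, rewards, transition kernel) are my own toy inputs:

```python
# Backward induction for a finite-horizon MDP:
#   V[T][x] = r_T(x),   V[t][x] = max_a { r(x, a) + sum_y P[x][a][y] * V[t+1][y] }.
# States, actions, rewards and P are illustrative toy choices.

X = [0, 1]                      # states
A = [0, 1]                      # actions
T = 5                           # horizon

def r(x, a):                    # running reward (hypothetical)
    return 1.0 if (x == 1 and a == 1) else 0.0

def r_T(x):                     # terminal reward
    return float(x)

P = {  # P[x][a][y]: hypothetical transition probabilities
    0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}},
}

V = {T: {x: r_T(x) for x in X}}
policy = {}
for t in range(T - 1, -1, -1):
    V[t], policy[t] = {}, {}
    for x in X:
        q = {a: r(x, a) + sum(P[x][a][y] * V[t + 1][y] for y in X) for a in A}
        policy[t][x] = max(q, key=q.get)   # argmax over actions
        V[t][x] = q[policy[t][x]]

print(V[0], policy[0])
```

Sweeping $t$ backwards from $T$ is exactly the recursion of Ex 1: each stage only needs the value function one step ahead.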

**Ex 2** You need to sell a car. At every time $t = 0, 1, \dots, T-1$, you set a price $p_t$ and a customer then views the car. The probability that the customer buys the car at price $p$ is $D(p)$. If the car isn’t sold by time $T$ then it is sold for a fixed price $p_T$, $p_T \ge 0$. Maximize the reward from selling the car and find the recursion for the optimal reward.

**Ex 3** You own a call option with strike price $p$. Here you can buy a share at price $p$, making profit $x_t - p$ where $x_t$ is the price of the share at time $t$. The option must be exercised by time $T$. The price of the share satisfies $x_{t+1} = x_t + \epsilon_{t+1}$ for IIDRVs $\epsilon_1, \epsilon_2, \dots$ with finite expectation. Show that there exists a decreasing sequence $(a_t)_{t \le T}$ such that it is optimal to exercise whenever $x_t \ge a_t$ occurs.

## Answers

**Ans 1** $\mathcal{P}_\tau$, the set of policies that can be implemented from time $\tau$ to $T$, is the product of the set of actions at time $\tau$ and the set of policies from time $\tau+1$ onward. That is $\mathcal{P}_\tau = \mathcal{A} \times \mathcal{P}_{\tau+1}$. Thus

$$
\begin{aligned}
V_\tau(x) &= \sup_{\pi \in \mathcal{P}_\tau} \mathbb{E}_\pi \left[ \sum_{t=\tau}^{T-1} r_t(x_t, a_t) + r_T(x_T) \,\middle|\, x_\tau = x \right] \\
&= \sup_{a \in \mathcal{A}} \left\{ r_\tau(x, a) + \sup_{\pi \in \mathcal{P}_{\tau+1}} \mathbb{E} \left[ \mathbb{E}_\pi \left[ \sum_{t=\tau+1}^{T-1} r_t(x_t, a_t) + r_T(x_T) \,\middle|\, x_{\tau+1} \right] \middle|\, x_\tau = x, a_\tau = a \right] \right\} \\
&= \sup_{a \in \mathcal{A}} \left\{ r_\tau(x, a) + \mathbb{E} \left[ V_{\tau+1}(x_{\tau+1}) \,\middle|\, x_\tau = x, a_\tau = a \right] \right\}.
\end{aligned}
$$

The 2nd equality uses the product structure of $\mathcal{P}_\tau$, takes the $r_\tau$ term out of the sum and then takes conditional expectations. The 3rd equality takes the supremum over $\mathcal{P}_{\tau+1}$, which does not depend on $a_\tau$, inside the expectation and notes that the supremum over the first action is optimized at a fixed action (i.e. the past information did not help us).

**Ans 2** The optimal reward $V_t$ from time $t$ onward (assuming you still have the car) satisfies:

$$V_t = \max_{p} \left\{ p D(p) + (1 - D(p)) V_{t+1} \right\}, \qquad V_T = p_T.$$

For instance, if $D(p) = 1 - p$ for $p \in [0,1]$ then maximizing over $p$ gives $p = (1 + V_{t+1})/2$ and $V_t = \left( \tfrac{1 + V_{t+1}}{2} \right)^2$. (Note the relation is monotone and converges to the fixed point $V = 1$ if $p_T \le 1$.)
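A quick numerical check of the recursion for the illustrative demand curve $D(p) = 1 - p$ (an assumption, giving $V_t = ((1 + V_{t+1})/2)^2$): the values climb monotonically towards the fixed point $V = 1$.

```python
# Backward recursion for the car-sale value with the assumed demand
# D(p) = 1 - p, so V_t = ((1 + V_{t+1}) / 2)^2 and V_T = p_T.

def car_values(T, p_T):
    """Return [V_0, ..., V_T] for the recursion V_t = ((1 + V_{t+1}) / 2)**2."""
    V = [0.0] * (T + 1)
    V[T] = p_T
    for t in range(T - 1, -1, -1):
        V[t] = ((1.0 + V[t + 1]) / 2.0) ** 2
    return V

V = car_values(T=50, p_T=0.0)
print(round(V[0], 3))   # approaches the fixed point V = 1 for a long horizon
```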

**Ans 3** The optimal reward from time $t$ satisfies the Bellman equation

$$V_t(x) = \max\left\{ x - p, \; \mathbb{E}\left[ V_{t+1}(x + \epsilon_{t+1}) \right] \right\}, \qquad V_T(x) = \max\{ x - p, 0 \}.$$

Taking the Bellman equation and subtracting $(x - p)$ gives

$$V_t(x) - (x - p) = \max\left\{ 0, \; \mathbb{E}\left[ V_{t+1}(x + \epsilon_{t+1}) \right] - (x - p) \right\},$$

so it is optimal to exercise at time $t$ exactly when $V_t(x) - (x - p) = 0$. Now note:

1) $V_t(x) \ge V_{t+1}(x)$, because all policies that can be implemented from time $t+1$ are a subset of the policies that can be implemented from time $t$.

2) $V_t(x) - x$ is decreasing and continuous in $x$ because $V_T(x) - x = \max\{ -p, -x \}$ is and, since expectation and maximum preserve these properties, so is $V_t(x) - x$ by induction on $t$.

Thus $V_t(x) - (x - p)$ is decreasing in $x$ and, by 1), pointwise larger at earlier times; so the point $a_t = \inf\{ x : V_t(x) = x - p \}$ where it first hits zero must decrease in $t$, and it is optimal to exercise whenever $x_t \ge a_t$.
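The threshold structure can be checked numerically by backward induction on a price grid; the increment law (chosen with negative mean so that early exercise can be optimal), the strike and the horizon are all illustrative assumptions:

```python
# Numerical sketch of the option example: backward induction on an integer
# price grid, extracting the exercise threshold a_t at each time.

STRIKE = 5
T = 10
INC = [-2, -1, 0, 1]            # IID increments eps with E[eps] = -0.5 (assumed)
LO, HI = 0, 60                  # price grid, wide enough to ignore edge effects

def clamp(x):
    return max(LO, min(HI, x))

# Terminal value: exercise at T if profitable, otherwise let the option lapse.
V = {x: max(x - STRIKE, 0) for x in range(LO, HI + 1)}

thresholds = {}
for t in range(T - 1, -1, -1):
    cont = {x: sum(V[clamp(x + e)] for e in INC) / len(INC) for x in V}
    # a_t: smallest price at which exercising is (weakly) optimal
    thresholds[t] = min(x for x in V if x - STRIKE >= cont[x])
    V = {x: max(x - STRIKE, cont[x]) for x in V}

a = [thresholds[t] for t in range(T)]
print(a)                        # the thresholds decrease with t
```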