- Markov Decision Problems; Bellman’s Equation; Two examples

A Markov Decision Process is a Dynamic Program where the state evolves in a random/Markovian way.

As in the post on Dynamic Programming, we consider discrete times $t = 0, 1, \dots, T$, states $x_t \in \mathcal{X}$, actions $a_t \in \mathcal{A}$ and rewards $r_t(x_t, a_t)$. However, the plant equation and the definition of a policy are slightly different.

**Def 1** [Plant Equation] The state evolves according to functions $f_t : \mathcal{X} \times \mathcal{A} \times [0,1] \rightarrow \mathcal{X}$. Here

$$x_{t+1} = f_t(x_t, a_t, \omega_{t+1})$$

where $\omega_1, \omega_2, \dots$ are IIDRVs uniform on $[0,1]$. This is called the Plant Equation. As noted in the equivalence above, we will often suppress dependence on $\omega_t$. Also, we will use the notation

$$P(\hat{x} \,|\, x, a) = \mathbb{P}(x_{t+1} = \hat{x} \,|\, x_t = x, a_t = a)$$

for the transition probabilities of the resulting Markov chain.
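As a concrete sketch of the plant equation (with a toy chain of my own, not from the post), the dynamics can be simulated by drawing the uniforms $\omega_{t+1}$ and feeding them through $f$:

```python
import random

# Toy example (illustrative only): a two-state chain on X = {0, 1} where
# the action a in {0, 1} sets how "sticky" the current state is. The plant
# equation x_{t+1} = f(x_t, a_t, w_{t+1}), driven by IID Uniform[0,1]
# draws w_{t+1}, realises the transition probabilities.

STAY_PROB = {0: 0.9, 1: 0.5}  # hypothetical stay-probability per action

def f(x, a, w):
    """Plant equation: stay in state x when the uniform draw is below STAY_PROB[a]."""
    return x if w < STAY_PROB[a] else 1 - x

random.seed(0)
x, path = 0, [0]
for t in range(10):
    a = 0                         # a constant policy, just to drive the simulation
    x = f(x, a, random.random())  # w_{t+1} ~ Uniform[0, 1], IID
    path.append(x)
print(path)
```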

**Def 2** [Policy] A policy $\pi$ chooses an action $\pi_t$ at each time $t$ as a function of past states $x_0, \dots, x_t$ and past actions $a_0, \dots, a_{t-1}$. We let $\mathcal{P}$ be the set of policies.

A policy, a plant equation, and the resulting sequence of states and rewards describe a Markov Decision Process. The objective is to find a policy that optimizes the following objective.

**Def 3** [Markov Decision Problem] Given initial state $x_0$, a Markov Decision Problem is the following optimization

$$\max_{\pi \in \mathcal{P}} \; \mathbb{E}_\pi \left[ \sum_{t=0}^{T-1} r_t(x_t, a_t) + r_T(x_T) \right].$$

Further, let $R_\tau(\pi, x)$ (Resp. $V_\tau(x)$) be the objective (Resp. optimal objective) for when the summation is started from time $t = \tau$ and state $x_\tau = x$, rather than $t = 0$ and $x_0$.
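The objective in Def 3 can be estimated for a fixed policy by Monte-Carlo simulation; the two-state chain, rewards and policy below are my own illustrative choices:

```python
import random

# Monte-Carlo sketch of the objective in Def 3: estimate
# E_pi[ sum_{t<T} r_t(x_t, a_t) + r_T(x_T) ] for one fixed policy.
# The chain, rewards and policy are toy choices, not from the post.

random.seed(42)
T = 5

def f(x, a, w):                 # plant equation for the toy two-state chain
    stay = 0.9 if a == 0 else 0.5
    return x if w < stay else 1 - x

def r(x, a):                    # running reward: being in state 1 pays 1
    return 1.0 if x == 1 else 0.0

def policy(t, x):               # try to leave state 0, then sit still in state 1
    return 1 if x == 0 else 0

def episode(x0):
    x, total = x0, 0.0
    for t in range(T):
        a = policy(t, x)
        total += r(x, a)
        x = f(x, a, random.random())
    return total + float(x)     # terminal reward r_T(x) = x

n = 5000
estimate = sum(episode(0) for _ in range(n)) / n
print(round(estimate, 2))
```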

**Ex 1** [the Bellman Equation] Setting $V_T(x) = r_T(x)$, show that for $\tau = T-1, T-2, \dots, 0$

$$V_\tau(x) = \max_{a \in \mathcal{A}} \left\{ r_\tau(x, a) + \mathbb{E}\left[ V_{\tau+1}(x_{\tau+1}) \,\middle|\, x_\tau = x, a_\tau = a \right] \right\}.$$

The above equation is Bellman’s equation for a Markov Decision Process.
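The backward induction behind Bellman’s equation can be sketched for a small finite MDP; all the numbers below (states, rewards, transition kernel) are my own toy inputs:

```python
# Backward induction for a finite-horizon MDP:
#   V[T][x] = r_T(x),   V[t][x] = max_a { r(x, a) + sum_y P[x][a][y] * V[t+1][y] }.
# States, actions, rewards and P are illustrative toy choices.

X = [0, 1]                      # states
A = [0, 1]                      # actions
T = 5                           # horizon

def r(x, a):                    # running reward (hypothetical)
    return 1.0 if (x == 1 and a == 1) else 0.0

def r_T(x):                     # terminal reward
    return float(x)

P = {  # P[x][a][y]: hypothetical transition probabilities
    0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}},
}

V = {T: {x: r_T(x) for x in X}}
policy = {}
for t in range(T - 1, -1, -1):
    V[t], policy[t] = {}, {}
    for x in X:
        q = {a: r(x, a) + sum(P[x][a][y] * V[t + 1][y] for y in X) for a in A}
        policy[t][x] = max(q, key=q.get)   # argmax over actions
        V[t][x] = q[policy[t][x]]

print(V[0], policy[0])
```

Sweeping $t$ backwards from $T$ is exactly the recursion of Ex 1: each stage only needs the value function one step ahead.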

**Ex 2** You need to sell a car. At every time $t = 0, 1, \dots, T-1$, you set a price $p_t$ and a customer then views the car. The probability that the customer buys the car at price $p$ is $D(p)$. If the car isn’t sold by time $T$ then it is sold for a fixed price $p_T$, $p_T \ge 0$. Maximize the reward from selling the car and find the recursion for the optimal reward.

**Ex 3** You own a call option with strike price $p$. Here you can buy a share at price $p$, making profit $x_t - p$ where $x_t$ is the price of the share at time $t$. The option must be exercised by time $T$. The price of the share satisfies $x_{t+1} = x_t + \epsilon_{t+1}$ for IIDRVs $\epsilon_1, \epsilon_2, \dots$ with finite expectation. Show that there exists a decreasing sequence $(a_t)_{t \le T}$ such that it is optimal to exercise whenever $x_t \ge a_t$ occurs.

## Answers

**Ans 1** $\mathcal{P}_\tau$, the set of policies that can be implemented from time $\tau$ to $T$, is the product of the set of actions at time $\tau$ and the set of policies from time $\tau+1$ onward. That is $\mathcal{P}_\tau = \mathcal{A} \times \mathcal{P}_{\tau+1}$. Thus

$$
\begin{aligned}
V_\tau(x) &= \sup_{\pi \in \mathcal{P}_\tau} \mathbb{E}_\pi \left[ \sum_{t=\tau}^{T-1} r_t(x_t, a_t) + r_T(x_T) \,\middle|\, x_\tau = x \right] \\
&= \sup_{a \in \mathcal{A}} \left\{ r_\tau(x, a) + \sup_{\pi \in \mathcal{P}_{\tau+1}} \mathbb{E} \left[ \mathbb{E}_\pi \left[ \sum_{t=\tau+1}^{T-1} r_t(x_t, a_t) + r_T(x_T) \,\middle|\, x_{\tau+1} \right] \middle|\, x_\tau = x, a_\tau = a \right] \right\} \\
&= \sup_{a \in \mathcal{A}} \left\{ r_\tau(x, a) + \mathbb{E} \left[ V_{\tau+1}(x_{\tau+1}) \,\middle|\, x_\tau = x, a_\tau = a \right] \right\}.
\end{aligned}
$$

The 2nd equality uses the product structure of $\mathcal{P}_\tau$, takes the $r_\tau$ term out of the sum and then takes conditional expectations. The 3rd equality takes the supremum over $\mathcal{P}_{\tau+1}$, which does not depend on $a_\tau$, inside the expectation and notes that the supremum over the first action is optimized at a fixed action (i.e. the past information did not help us).

**Ans 2** The optimal reward $V_t$ from time $t$ onward (assuming you still have the car) satisfies:

$$V_t = \max_{p} \left\{ p D(p) + (1 - D(p)) V_{t+1} \right\}, \qquad V_T = p_T.$$

For instance, if $D(p) = 1 - p$ for $p \in [0,1]$ then maximizing over $p$ gives $p = (1 + V_{t+1})/2$ and $V_t = \left( \tfrac{1 + V_{t+1}}{2} \right)^2$. (Note the relation is monotone and converges to the fixed point $V = 1$ if $p_T \le 1$.)
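A quick numerical check of the recursion for the illustrative demand curve $D(p) = 1 - p$ (an assumption, giving $V_t = ((1 + V_{t+1})/2)^2$): the values climb monotonically towards the fixed point $V = 1$.

```python
# Backward recursion for the car-sale value with the assumed demand
# D(p) = 1 - p, so V_t = ((1 + V_{t+1}) / 2)^2 and V_T = p_T.

def car_values(T, p_T):
    """Return [V_0, ..., V_T] for the recursion V_t = ((1 + V_{t+1}) / 2)**2."""
    V = [0.0] * (T + 1)
    V[T] = p_T
    for t in range(T - 1, -1, -1):
        V[t] = ((1.0 + V[t + 1]) / 2.0) ** 2
    return V

V = car_values(T=50, p_T=0.0)
print(round(V[0], 3))   # approaches the fixed point V = 1 for a long horizon
```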

**Ans 3** The optimal reward from time $t$ satisfies the Bellman equation

$$V_t(x) = \max\left\{ x - p, \; \mathbb{E}\left[ V_{t+1}(x + \epsilon_{t+1}) \right] \right\}, \qquad V_T(x) = \max\{ x - p, 0 \}.$$

Taking the Bellman equation and subtracting $(x - p)$ gives

$$V_t(x) - (x - p) = \max\left\{ 0, \; \mathbb{E}\left[ V_{t+1}(x + \epsilon_{t+1}) \right] - (x - p) \right\},$$

so it is optimal to exercise at time $t$ exactly when $V_t(x) - (x - p) = 0$. Now note:

1) $V_t(x) \ge V_{t+1}(x)$, because all policies that can be implemented from time $t+1$ are a subset of the policies that can be implemented from time $t$.

2) $V_t(x) - x$ is decreasing and continuous in $x$ because $V_T(x) - x = \max\{ -p, -x \}$ is and, since expectation and maximum preserve these properties, so is $V_t(x) - x$ by induction on $t$.

Thus $V_t(x) - (x - p)$ is decreasing in $x$ and, by 1), pointwise larger at earlier times; so the point $a_t = \inf\{ x : V_t(x) = x - p \}$ where it first hits zero must decrease in $t$, and it is optimal to exercise whenever $x_t \ge a_t$.
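The threshold structure can be checked numerically by backward induction on a price grid; the increment law (chosen with negative mean so that early exercise can be optimal), the strike and the horizon are all illustrative assumptions:

```python
# Numerical sketch of the option example: backward induction on an integer
# price grid, extracting the exercise threshold a_t at each time.

STRIKE = 5
T = 10
INC = [-2, -1, 0, 1]            # IID increments eps with E[eps] = -0.5 (assumed)
LO, HI = 0, 60                  # price grid, wide enough to ignore edge effects

def clamp(x):
    return max(LO, min(HI, x))

# Terminal value: exercise at T if profitable, otherwise let the option lapse.
V = {x: max(x - STRIKE, 0) for x in range(LO, HI + 1)}

thresholds = {}
for t in range(T - 1, -1, -1):
    cont = {x: sum(V[clamp(x + e)] for e in INC) / len(INC) for x in V}
    # a_t: smallest price at which exercising is (weakly) optimal
    thresholds[t] = min(x for x in V if x - STRIKE >= cont[x])
    V = {x: max(x - STRIKE, cont[x]) for x in V}

a = [thresholds[t] for t in range(T)]
print(a)                        # the thresholds decrease with t
```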