Optimal Stopping


  • Optimal Stopping Problems; One-Step-Look-Ahead Rule
  • The Secretary Problem
  • Infinite Time Stopping


An Optimal Stopping Problem is a Markov Decision Process where there are two actions: a=0, meaning stop, and a=1, meaning continue. Here there are two types of costs: a cost k(x) for stopping when in state x, and a cost c(x) for continuing from state x.

This defines a stopping problem.

Assuming that time is finite, the Bellman equation is

C_s(x) = \min\Big\{ k(x),\; c(x) + \mathbb E\big[ C_{s-1}(\hat{X}) \,\big|\, X = x \big] \Big\}

for s\in \mathbb N and C_0(x)=k(x). Here \hat{X} denotes the state reached one step after state x, and C_s(x) is the minimal expected cost when at most s steps remain.

Def 1. [OSLA rule] In the one step lookahead (OSLA) rule we stop whenever x\in S where

S = \big\{ x : k(x) \leq c(x) + \mathbb E\big[ k(\hat{X}) \,\big|\, X = x \big] \big\}.

We call S the stopping set. In words, you stop whenever it is better to stop now rather than continue one step further and then stop.

Def 2. [Closed Stopping Set] We say the set S\subset {\mathcal X} is closed if, once inside that set, you cannot leave, i.e.

\mathbb P\big( \hat{X} \in S \,\big|\, X = x \big) = 1 \quad \text{for all } x \in S.

Ex 1. Given that the set S is closed, argue that if C_{s-1}(x)=k(x) for all x\in S then C_s(x)=k(x) for all x\in S.

Ans 1. If x\in S then, since S is closed, \hat{X} \in S. In other words, C_{s-1}(\hat{X}) =k(\hat{X}). Therefore, in this case, Bellman’s equation becomes

C_s(x) = \min\big\{ k(x),\; c(x) + \mathbb E\big[ C_{s-1}(\hat{X}) \big] \big\} = \min\big\{ k(x),\; c(x) + \mathbb E\big[ k(\hat{X}) \big] \big\} = k(x).

The last equality above follows by the definition of x\in S, which says that k(x) \leq c(x) + \mathbb E[ k(\hat{X}) ].

Ex 2. [OS:Finite] If, for the finite time stopping problem, the set S given by the one step lookahead rule is closed then the one step lookahead rule is an optimal policy. (Hint: Induction on s.)

Ans 2.  The OSLA rule is optimal for s=1 steps, since OSLA is exactly the optimal policy for one step.

Suppose that the result is true for up to s-1 steps. Now consider the Optimal Stopping Problem with s steps.

If x\in S then, by the induction hypothesis, C_{s-1}(x)=k(x) for all x\in S, and so, since S is closed, Ex 1 gives C_s(x)=k(x); that is, stopping is optimal. If x\notin S, then k(x) > c(x) + \mathbb E[k(\hat{X})] \geq c(x) + \mathbb E[C_{s-1}(\hat{X})], so clearly it is better to continue.
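
To make the rule concrete, here is a minimal numerical sketch (an illustration added here, not part of the original notes; the function name and inputs are hypothetical) of computing the OSLA stopping set for a finite state space with stopping cost k, continuation cost c and transition matrix P.

    import numpy as np

    def osla_stopping_set(k, c, P):
        """One-step-lookahead set S = {x : k(x) <= c(x) + E[k(X_next) | X = x]}."""
        k, c, P = np.asarray(k, float), np.asarray(c, float), np.asarray(P, float)
        return np.flatnonzero(k <= c + P @ k)

    # If the returned set is closed under P, then by Ex 2 stopping exactly on this
    # set is an optimal policy for the finite time problem.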

Ex 3. [The Secretary Problem] There are N candidates for a secretary job. You interview candidates sequentially. After each interview, you must either accept or reject the candidate. We assume the candidates have ranks 1,2,...,N and arrive for interview in a uniformly random order. Find the policy that maximises the probability that you hire the best candidate.

Ans 3. All that matters at each time is whether the current candidate is the best so far. At time t let

x_t = \begin{cases} 1 & \text{if the } t\text{th candidate is the best of the first } t \text{ candidates,} \\ 0 & \text{otherwise.} \end{cases}

Since the arrival order is uniformly random,

\mathbb P( x_t = 1 ) = \frac{1}{t}.

Under the chosen policy, we let

r_t(x_t) = \begin{cases} t/N & \text{if } x_t = 1 \text{ and we stop at time } t, \\ 0 & \text{otherwise,} \end{cases}

be our reward function. Now r_t(1) = t/N because, given the t-th candidate is the best of the first t candidates, the probability that they are the best of all N candidates is t/N. Let R_t denote the optimal probability of hiring the best candidate given that the first t-1 candidates have been rejected. Thus the Bellman equation for the above problem is

R_t = \frac{t-1}{t} R_{t+1} + \frac{1}{t} \max\Big\{ R_{t+1},\; \frac{t}{N} \Big\}, \qquad R_{N+1} = 0.

Notice that R_t \geq R_{t+1}, while t/N is increasing in t, so if it is optimal to stop (on a best-so-far candidate) at time t then it is also optimal to stop at all later times. Let t^* be the largest t such that R_{t+1} > \frac{t}{N}, so that it is optimal to continue at times t\leq t^* and to stop from time t^*+1 onwards. Starting from R_{N+1}=0, note that so long as R_{t+1} < \frac{t}{N}, i.e. the second case in the maximum above is active, we have that

R_t = \frac{1}{N} + \frac{t-1}{t} R_{t+1} .

Thus, iterating down from R_{N+1}=0,

R_{t+1} = \frac{t}{N} \sum_{s=t}^{N-1} \frac{1}{s} .

Thus our condition for the optimal t^* is to take the largest t^* such that

\sum_{s=t^*}^{N-1} \frac{1}{s} > 1 .

In other words, the optimal policy is to interview the first t^* candidates and then accept the next candidate who is the best seen so far.

Ex 4. [The Secretary Problem, continued] Argue that as N\rightarrow \infty, the optimal policy is to interview a proportion e^{-1} of the candidates and then to accept the next best candidate.

Ans 4. From the solution to Ex 3, the optimal t^* is the largest t satisfying

\sum_{s=t}^{N-1} \frac{1}{s} > 1 .

We know that as N\rightarrow \infty

\sum_{s=t}^{N-1} \frac{1}{s} \approx \log N - \log t .

Thus the condition \log N - \log t^* \approx 1 determines the threshold, i.e. t^*/N \rightarrow e^{-1}: interview roughly the first N/e candidates and then accept the next candidate who is the best seen so far.
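
As a quick numerical check (an illustration added here, not from the original notes), the threshold is easy to compute directly from the condition above; for N=100 it gives t^*=37, close to N/e \approx 36.8.

    # Illustration: largest t with sum_{s=t}^{N-1} 1/s > 1 (the condition derived above).
    def secretary_threshold(N):
        t_star = 1
        for t in range(1, N):
            if sum(1.0 / s for s in range(t, N)) > 1.0:
                t_star = t
        return t_star

    print(secretary_threshold(100))   # prints 37, and 100/e is roughly 36.8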

Ex 5. You look for a parking space on a street; each space is free with probability p=1-q, independently. You cannot tell whether a space is free until you reach it. Once at a space, you must decide whether to stop or continue. From position s (s spaces from your destination), the cost of stopping is s. The cost of passing your destination without parking is D. Find the optimal parking policy.

Ans 5. Let s denote the number of spaces you are from your destination, so that the state decreases by one each time you pass a space.

Here stopping means take the next free parking space, while continuing (which is free) moves you one space closer. The Bellman equation is

C(s) = \min\big\{ k(s),\; C(s-1) \big\},

where C(s) is the minimal expected cost from position s.

Consider the stopping set

S = \{ s : k(s) \leq k(s-1) \},

where k(s) is the cost of taking the next available space from position s. Note that

k(s) = p\, s + q\, k(s-1), \qquad k(0) = qD.

Recursions of this form have solution

k(s) = \Big( D + \frac{1}{p} \Big) q^{s+1} + s - \frac{q}{p}.

Therefore

k(s) - k(s-1) = 1 - (Dp+1)\, q^{s}, \quad \text{so} \quad S = \{ s : (Dp+1) q^s \geq 1 \}.

Since s decreases over time (and hence q^s increases), this set is clearly closed. Therefore the optimal policy is to take the next available space once (Dp+1) q^s \geq 1 holds.
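
For illustration (numbers chosen here, not from the original notes): if p=0.1 (so q=0.9) and D=50, then (Dp+1)q^s = 6\times(0.9)^s \geq 1 exactly when s \leq \log 6/\log(10/9) \approx 17, so you should take the first free space within 17 spaces of your destination.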

Ex 6. In a game show a contestant is asked a series of 10 questions. For each question q=1,...,10 there is a reward r_q for answering the question correctly. With probability p_q the contestant answers the question correctly. After correctly answering a question, the contestant can choose to stop and take their total winnings home, or they can continue to the next question q+1. However, if the contestant answers a question incorrectly then they lose all of their winnings. The probability of winning each round is decreasing, and is such that the expected reward from each round, p_q r_q, is constant.

i) Write down the Bellman equation for this problem.

ii) Using the One-Step-Look-Ahead rule, or otherwise, find the optimal policy of the contestant.

Ans 6. Let x_q = r_1 + \dots + r_q denote the winnings after correctly answering question q. The Bellman equation for this problem is

V_q(x) = \max\big\{ x,\; p_{q+1} V_{q+1}(x + r_{q+1}) \big\}, \qquad V_{10}(x) = x.

The stopping set for this problem is

S = \big\{ (q,x) : x \geq p_{q+1}( x + r_{q+1} ) \big\} = \Big\{ (q,x) : x \geq \frac{p_{q+1} r_{q+1}}{1 - p_{q+1}} \Big\}.

Since by assumption r_q p_q = c is constant and p_q \searrow (therefore c/(1-p_q) is decreasing in q), while the winnings x_q increase so long as the contestant keeps answering correctly, the set S is closed. Thus the OSLA rule is optimal for this problem: the contestant should stop as soon as their winnings x satisfy x \geq c/(1-p_{q+1}).
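
For illustration (hypothetical numbers, not from the original notes): if p_q r_q = c = 1000 for every question and the next question has p_{q+1}=0.8, the contestant should stop once their winnings reach c/(1-p_{q+1}) = 5000; if instead p_{q+1}=0.5, the threshold drops to 2000.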

Ex 7. [Whittle’s Burglar] A burglar robs houses over N nights. On any night the burglar may choose to retire and take home his total earnings. On the t-th night, the house he robs has a reward r_t, where the r_t are i.i.d. random variables with mean \bar{r}. Each night the probability that he is caught is p, and if caught he loses all his money. Find the optimal policy for the burglar’s retirement. (Hint: OSLA.)
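
A brief sketch of the intended OSLA argument (added here as a hint, not part of the original notes): if x denotes the burglar’s current loot, the OSLA rule says retire whenever

x \geq (1-p)(x+\bar{r}), \quad \text{i.e.} \quad x \geq \frac{(1-p)\bar{r}}{p}.

Since the loot can only increase while the burglar remains uncaught, this set is closed, so the OSLA rule is optimal: retire once the total loot exceeds (1-p)\bar{r}/p.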

Ex 8. [Bruss’ Odds Algorithm] You sequentially treat patients t=1,...,T with a new trial treatment. The probability of success for patient t is p_t=1-q_t. We must minimize the number of unsuccessful treatments while treating all patients for whom the trial will be successful (i.e. if we label 1 for a success and 0 for a failure, we want to stop on the last 1). Argue, using the One-Step-Look-Ahead rule, that the optimal policy is to stop at the first successful treatment from time t^* onwards, where t^* is the largest integer such that

\sum_{t=t^*}^{T} \frac{p_t}{q_t} \geq 1.

This procedure is called Bruss’ Odds Algorithm.
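
Computationally the rule is very simple. Below is a minimal sketch (an illustration with a hypothetical function name, not code from the original notes): sum the odds p_t/q_t backwards from the final patient until the running total first reaches 1, then stop at the first success at or after that index.

    def bruss_threshold(p):
        """Given success probabilities p[0], ..., p[T-1], return the (1-indexed) t*
        such that the optimal rule is: stop at the first success at or after t*."""
        total = 0.0
        for t in range(len(p), 0, -1):          # sum the odds backwards from the end
            total += p[t - 1] / (1.0 - p[t - 1])
            if total >= 1.0:
                return t
        return 1                                # odds never reach 1: stop at the first success

    # Example: bruss_threshold([0.9, 0.5, 0.3, 0.2, 0.1]) returns 2.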


Optimal stopping in infinite time

We now give conditions for the one step look ahead rule to be optimal for infinite time stopping problems.

Ex 9. [OS:Converge] If the following two conditions hold

  • K = \max_{x} k(x) < \infty
  • C= \min_{x} c(x) > 0

then

L_s(x) \rightarrow L(x), \quad \text{as } s \rightarrow \infty,

where L_s(x) is the minimal expected cost for the stopping problem with time horizon s and L(x) is the minimal expected cost for the infinite time problem.

Ans 9. An optimal policy \pi exists by Thrm [IDP:NegBellman]. Suppose that the optimal policy \pi stops at time \tau. Then, since costs are non-negative, each continuation step costs at least C, and stopping immediately costs at most K,

C\, \mathbb E \tau \leq \mathbb E\Big[ \sum_{t<\tau} c(X_t) + k(X_\tau) \Big] = L(x) \leq K ,

so \mathbb E \tau \leq K/C < \infty. Therefore, if we follow the optimal policy \pi but for the s time horizon problem, stopping at time s if \tau \geq s, then

L(x) \leq L_s(x) \leq L(x) + K\, \mathbb P(\tau > s) \leq L(x) + \frac{K\, \mathbb E \tau}{s} .

Thus L_s(x) \rightarrow L(x) as required.

Ex 10. [continued] Suppose that the stopping set S given by the one step lookahead rule,

S = \big\{ x : k(x) \leq c(x) + \mathbb E\big[ k(\hat{X}) \,\big|\, X = x \big] \big\},

is closed. Then it is optimal to stop if and only if x\in S.

Ans 10. As before (for the finite time problem), it is not optimal to stop if x\notin S, and for the finite time problem L_s(x) = k(x) for all x\in S. Therefore, by [OS:Converge], since L_s(x) \rightarrow L(x), we have that L(x)=k(x) for all x\in S, and therefore it is optimal to stop for x\in S.

Ex 11. You own a “toxic” asset whose value, x_t at time t, belongs to \{1,2,3,...\}. The daily cost of holding the asset is x_t. Every day the value moves up to x+1 with probability 1/2 or otherwise remains at x. Further, the cost of terminating the asset after holding it for t days is C(1-\alpha)^t. Find the optimal policy for terminating the asset.
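
A brief sketch of how the OSLA rule applies here (added as a hint, not from the original notes): terminating now costs C(1-\alpha)^t, while holding one more day and then terminating costs x_t + C(1-\alpha)^{t+1}. The OSLA stopping set is therefore

\{ (x,t) : C(1-\alpha)^t \leq x + C(1-\alpha)^{t+1} \} = \{ (x,t) : x \geq \alpha C (1-\alpha)^{t} \}.

Since x_t is non-decreasing while \alpha C(1-\alpha)^t decreases with t, this set is closed, so (given the convergence conditions above) the optimal policy is to terminate as soon as x_t \geq \alpha C(1-\alpha)^t.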


The one step lookahead rule is not always the correct solution to an optimal stopping problem.

Def 3. [Concave Majorant] For a function r:\{0,...,N\}\rightarrow \mathbb R, a concave majorant is a function G:\{0,...,N\}\rightarrow \mathbb R such that

  • G(x)\geq \frac{1}{2} G(x-1) + \frac{1}{2} G(x+1) for all x\in\{1,...,N-1\},
  • G(x) \geq r(x) for all x\in\{0,...,N\}.
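
For example (an illustration added here): on \{0,1,2,3,4\} with r=(0,1,0,3,0), the function G=(0,1,2,3,0) is a concave majorant of r, and in fact the smallest one; any linear function lying above r is also a concave majorant, but a larger one.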

Ex 12. [Stopping a Random Walk] Let X_t be a symmetric random walk on \{0,...,N\} where the process is automatically stopped at 0 and N. For each x\in \{ 0,...,N\}, there is a positive reward of r(x) for stopping. We are asked to maximize

\mathbb E_x \big[ r(X_T) \big]

where T is our chosen stopping time. Show that the optimal value function is a concave majorant of r.

Ans 12. The Bellman equation is

R(x) = \max\Big\{ r(x),\; \frac{1}{2} R(x-1) + \frac{1}{2} R(x+1) \Big\}, \qquad 0 < x < N,

with R(0)=r(0) and R(N)=r(N). In particular R(x)\geq r(x) everywhere and R(x) \geq \frac{1}{2} R(x-1) + \frac{1}{2} R(x+1) at every interior point. Thus the optimal value function is a concave majorant.

Ex 13. Suppose that R_{s}(x) is the optimal reward starting from x when we must stop within s steps (here R_0(x)=0). Show that R_s(x) \leq G(x) for any concave majorant G of r.

Ans 13. We will show that the optimal value function is the minimal concave majorant of r(x). We do so by, essentially, applying induction along the value iteration.

First, R_0(x)=0 \leq G(x) for any concave majorant G of r(x), since r is positive. Now suppose that R_{s-1}, the function reached after s-1 value iterations, satisfies R_{s-1}(x)\leq G(x) for all x. Then, for 0<x<N,

R_s(x) = \max\Big\{ r(x),\; \frac{1}{2} R_{s-1}(x-1) + \frac{1}{2} R_{s-1}(x+1) \Big\} \leq \max\Big\{ r(x),\; \frac{1}{2} G(x-1) + \frac{1}{2} G(x+1) \Big\} \leq \max\big\{ r(x),\; G(x) \big\} = G(x),

while the boundary cases x=0,N are immediate.

Ex 14. Show that the optimal value function V(x) is the minimal concave majorant of r, and that it is optimal to stop whenever V(x)=r(x).

Ans 14. Since value iteration converges, R_s(x) \nearrow V(x), and by Ex 13 we have R_s(x) \leq G(x) for every concave majorant G, so V(x) \leq G(x). By Ex 12, V is itself a concave majorant of r, so V is the minimal concave majorant. Finally, by the Bellman equation, it is optimal to stop whenever the maximum is achieved by the first term, i.e. whenever V(x)=r(x), which is where the minimal concave majorant touches r.
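
To make the construction concrete, here is a minimal value iteration sketch (an illustration, not code from the original notes) for stopping the random walk; the iterates increase to the minimal concave majorant V, and the optimal rule stops wherever V(x)=r(x).

    import numpy as np

    def stopping_value(r, iters=5000):
        """Value iteration for stopping a symmetric random walk on {0,...,N},
        absorbed at the endpoints, with stopping reward r(x)."""
        r = np.asarray(r, float)
        V = np.zeros_like(r)
        for _ in range(iters):
            cont = V.copy()
            cont[1:-1] = 0.5 * (V[:-2] + V[2:])   # expected value of continuing
            cont[0], cont[-1] = r[0], r[-1]       # the walk is stopped at 0 and N
            V = np.maximum(r, cont)               # Bellman: stop or continue
        return V

    # Example (the illustration after Def 3): r = (0,1,0,3,0) gives V = (0,1,2,3,0),
    # the minimal concave majorant, so it is optimal to stop at x in {0,1,3,4}.
    print(stopping_value([0, 1, 0, 3, 0]))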
