# Infinite Time Horizon

Thus far we have considered finite-time Markov decision processes. We now want to solve MDPs of the form

$$V(x) = \max_{\Pi}\; \mathbb E^{\Pi}_x\left[ \sum_{t=0}^{\infty} \beta^t\, r(X_t,\pi_t) \right].$$

We can generalize Bellman's equation to infinite time; a correct guess at the form of the equation would, for instance, be

$$V(x) = \max_{a\in\mathcal A}\left\{ r(x,a) + \beta\, \mathbb E_{x,a}\big[ V(\hat X) \big] \right\}.$$

Previously we solved Markov decision processes inductively with Bellman's equation. In infinite time, we cannot directly apply induction; however, we will see that Bellman's equation still holds and we can use this to solve our MDP. For now we will focus on the case of discounted programming: here we assume that

$$0 < \beta < 1 \qquad\text{and}\qquad |r(x,a)| \le B < \infty \;\;\text{for all } x\in\mathcal X,\, a\in\mathcal A.$$

We will cover cases where $\beta=1$ later.
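Since $0<\beta<1$, the discounted reward of a fixed policy can be computed exactly by solving a finite linear system rather than summing the infinite series. A minimal numerical sketch (the chain, rewards, and discount below are made-up illustration data):

```python
import numpy as np

# A toy example (made-up numbers): a fixed policy induces a Markov chain
# with transition matrix P and per-state reward vector r; beta is the discount.
beta = 0.9
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([1.0, 2.0])

# The discounted reward R(x) = E_x[ sum_t beta^t r(X_t) ] solves the
# linear system R = r + beta P R, i.e. (I - beta P) R = r.
R = np.linalg.solve(np.eye(2) - beta * P, r)

# Cross-check against the truncated series  sum_t (beta P)^t r.
R_trunc = np.zeros(2)
M = np.eye(2)
for _ in range(500):
    R_trunc += M @ r
    M = beta * P @ M
assert np.allclose(R, R_trunc)
```

The matrix $(I-\beta P)$ is invertible precisely because $\beta<1$, which is one concrete payoff of the discounted-programming assumption.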

At this point it is useful to define the concept of a $Q$-factor. The $Q$-factor of a policy $\pi$ is the reward that arises when we take action $a$ from state $x$ and then follow policy $\pi$.

Def. [Q-Factor] The $Q$-factor of reward function $R(\cdot)$ is the value for taking action $a$ in state $x$:

$$Q_R(x,a) = r(x,a) + \beta\, \mathbb E_{x,a}\big[ R(\hat X) \big].$$

Similarly, the $Q$-factor for a policy $\pi$, denoted by $Q_{\pi}(x,a)$, is given by the above expression with $R(x) = R(x,\pi)$. The $Q$-factor of the optimal policy is given by

$$Q^*(x,a) = r(x,a) + \beta\, \mathbb E_{x,a}\big[ V(\hat X) \big].$$

The following result shows that if we have solved the Bellman equation then the solution and its associated policy are optimal.

Thrm. For a discounted program, the optimal value function $V(x)$ satisfies

$$V(x) = \max_{a\in\mathcal A}\left\{ r(x,a) + \beta\, \mathbb E_{x,a}\big[ V(\hat X) \big] \right\}.$$

Moreover, if we find a function $R(x)$ such that

$$R(x) = \max_{a\in\mathcal A}\left\{ r(x,a) + \beta\, \mathbb E_{x,a}\big[ R(\hat X) \big] \right\},$$

then $R(x) = V(x)$, i.e. the solution to the Bellman equation is unique. And if we find a function $\pi(x)$ such that

$$R(x) = r(x,\pi(x)) + \beta\, \mathbb E_{x,\pi(x)}\big[ R(\hat X) \big],$$

then $\pi$ is optimal and $R(x,\pi) = R(x) = V(x)$, the optimal value function.

Proof. We know that $R_t(x,\Pi) = r(x,\pi_0) + \beta \mathbb E [ R_{t-1}(\hat{X},\hat{ \Pi} ) ].$ Applying limits as $t\rightarrow \infty$ on both sides and the bounded convergence theorem gives that

$$R(x,\Pi) = r(x,\pi_0) + \beta\, \mathbb E\big[ R(\hat X,\hat\Pi) \big] \le r(x,\pi_0) + \beta\, \mathbb E\big[ V(\hat X) \big].$$

For the inequality above, we maximize $R(\hat{X},\hat{\Pi})$ over $\hat{\Pi}$. Now maximizing the left-hand side over $\Pi$ gives

$$V(x) \le \max_{a\in\mathcal A}\left\{ r(x,a) + \beta\, \mathbb E_{x,a}\big[ V(\hat X) \big] \right\}.$$

At this point we have the Bellman equation but with an inequality. We need to prove the inequality in the other direction. For this, we let $\Pi_\epsilon$ be the policy that chooses action $a$ and then, from the next state $\hat{X}$, follows a policy $\hat{\Pi}_\epsilon$ which satisfies

$$R(\hat X, \hat\Pi_\epsilon) \ge V(\hat X) - \epsilon.$$

We have that

$$V(x) \ge R(x,\Pi_\epsilon) = r(x,a) + \beta\, \mathbb E\big[ R(\hat X,\hat\Pi_\epsilon) \big] \ge r(x,a) + \beta\, \mathbb E\big[ V(\hat X) \big] - \beta\epsilon.$$

The first inequality holds by the sub-optimality of $\Pi_\epsilon$ and the second holds by the assumption on $\hat{\Pi}_\epsilon$. Maximizing over $a\in \mathcal A$, and taking $\epsilon\rightarrow 0$, gives

$$V(x) \ge \max_{a\in\mathcal A}\left\{ r(x,a) + \beta\, \mathbb E_{x,a}\big[ V(\hat X) \big] \right\}.$$

Thus we now have that

$$V(x) = \max_{a\in\mathcal A}\left\{ r(x,a) + \beta\, \mathbb E_{x,a}\big[ V(\hat X) \big] \right\}.$$

So at this point we know that the optimal value function satisfies the Bellman equation. For the next part of the result we need to show that the solution to this recursion is unique.

Suppose that $R(x)$ is another solution to the Bellman equation. From the definition of a $Q$-factor and the Bellman recursion, $R(x) = \max_a Q_R(x,a)$ and $V(x) = \max_a Q_V(x,a)$. Thus note that

$$\big| Q_V(x,a) - Q_R(x,a) \big| = \beta\, \Big| \mathbb E_{x,a}\Big[ \max_{a'} Q_V(\hat X,a') - \max_{a'} Q_R(\hat X,a') \Big] \Big| \le \beta\, \| Q_V - Q_R \|_\infty.$$

Thus

$$\| Q_V - Q_R \|_\infty \le \beta\, \| Q_V - Q_R \|_\infty.$$

Since $0<\beta < 1$, the only solution to this inequality is $Q_V = Q_R$, and thus

$$V(x) = \max_a Q_V(x,a) = \max_a Q_R(x,a) = R(x).$$

So solutions to the Bellman equation are unique for discounted programming. Finally we must show that if we can find a policy that solves the Bellman equation, then it is optimal.

If we find a function $R(x)$ and a function $\pi(x)$ such that

$$R(x) = r(x,\pi(x)) + \beta\, \mathbb E_{x,\pi(x)}\big[ R(\hat X) \big],$$

then note that the MDP induced by $\pi$ is a Markov chain (with transition matrix $P^{\pi(x)}_{xy}$). Both $R(x,\pi)$ and $R(x)$ solve the equation $R(x) = r(x,\pi(x)) + \beta \mathbb E_{x,\pi(x)} [ R(\hat{X})]$. So by Prop 2 in Markov Chains: A Quick Review, $R(x)=R(x,\pi)$. $\square$
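The theorem suggests a practical recipe: iterate the Bellman recursion until it (approximately) reaches its fixed point, then read off the maximizing action in each state. A sketch on a made-up two-state, two-action MDP (all numbers are illustration data), checking that the greedy policy's reward solves the same linear equation as $V$:

```python
import numpy as np

# Toy discounted MDP: P[a, x, y] is the transition probability from x to y
# under action a; r[x, a] is the reward; beta is the discount factor.
beta = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # action 0
              [[0.3, 0.7], [0.6, 0.4]]])  # action 1
r = np.array([[1.0, 0.0],
              [2.0, 0.5]])                # r[x, a]

# Value iteration: repeatedly apply V <- max_a { r(x,a) + beta E_{x,a}[V] }.
V = np.zeros(2)
for _ in range(1000):
    Q = r + beta * np.einsum('axy,y->xa', P, V)
    V = Q.max(axis=1)

# The greedy policy attains the maximum in the Bellman equation.
pi = Q.argmax(axis=1)

# Per the theorem, this policy's reward equals V: solve the induced
# Markov chain's linear system R = r_pi + beta P_pi R and compare.
P_pi = np.array([P[pi[x], x] for x in range(2)])
r_pi = r[np.arange(2), pi]
R_pi = np.linalg.solve(np.eye(2) - beta * P_pi, r_pi)
assert np.allclose(V, R_pi, atol=1e-6)
```

Convergence of this iteration is justified by the contraction property established below; the final assertion is exactly the last step of the proof, with the linear system solved numerically.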

It is worth collating a similar result for $Q$-factors. Given the facts accrued about the value function and Bellman's equation, the following proposition should not come as too great a surprise (and can be skipped on a first reading).

Prop. a) Stationary $Q$-factors satisfy the recursion

$$Q_\pi(x,a) = r(x,a) + \beta\, \mathbb E_{x,a}\big[ Q_\pi(\hat X, \pi(\hat X)) \big].$$

b) Bellman's equation can be re-expressed in terms of $Q$-factors as follows:

$$Q^*(x,a) = r(x,a) + \beta\, \mathbb E_{x,a}\Big[ \max_{a'\in\mathcal A} Q^*(\hat X, a') \Big].$$

The optimal value function satisfies

$$V(x) = \max_{a\in\mathcal A} Q^*(x,a).$$

c) The operation

$$\mathcal B Q(x,a) := r(x,a) + \beta\, \mathbb E_{x,a}\Big[ \max_{a'\in\mathcal A} Q(\hat X, a') \Big]$$

is a contraction with respect to the supremum norm, that is,

$$\| \mathcal B Q - \mathcal B Q' \|_\infty \le \beta\, \| Q - Q' \|_\infty.$$

Proof. a) We can think of extending the state space of our MDP to include states $\mathcal{X}_0 = \{ (x,a) : x\in \mathcal{X}, a\in \mathcal{A}\}$ as well as $\mathcal X$. In this new MDP we can assume that the process initially starts in state $(x,a)$ and then moves to a state $\hat X \in \mathcal X$ according to the transition probabilities $P_{x\hat x}^a$. Thereafter it remains in $\mathcal X$, moving according to policy $\pi$. Thus by Prop 2 in Markov Chains: A Quick Review,

$$Q_\pi(x,a) = r(x,a) + \beta\, \mathbb E_{x,a}\big[ R(\hat X, \pi) \big],$$

where $R(x,\pi)$ is the reward function of policy $\pi$. Further, since $Q_\pi(x,a)$ is the value from taking $a$ and thereafter following policy $\pi$, it should also be clear that

$$R(x,\pi) = Q_\pi(x, \pi(x)).$$

Thus, as required,

$$Q_\pi(x,a) = r(x,a) + \beta\, \mathbb E_{x,a}\big[ Q_\pi(\hat X, \pi(\hat X)) \big].$$

b) Further, it should be clear that the optimal value function for the extended MDP discussed has a Bellman equation of the form

$$Q^*(x,a) = r(x,a) + \beta\, \mathbb E_{x,a}\big[ V(\hat X) \big], \qquad V(x) = \max_{a\in\mathcal A}\big\{ r(x,a) + \beta\, \mathbb E_{x,a}[ V(\hat X) ] \big\}.$$

Comparing the first equation above with the second, it should be clear that $V(x) = \max_a Q^*(x,a)$, and substituting this back into the first equation gives, as required,

$$Q^*(x,a) = r(x,a) + \beta\, \mathbb E_{x,a}\Big[ \max_{a'\in\mathcal A} Q^*(\hat X, a') \Big].$$

c) The proof of this part is already embedded in the previous theorem. Note that

$$\big| \mathcal B Q(x,a) - \mathcal B Q'(x,a) \big| = \beta\, \Big| \mathbb E_{x,a}\Big[ \max_{a'} Q(\hat X,a') - \max_{a'} Q'(\hat X,a') \Big] \Big| \le \beta\, \| Q - Q' \|_\infty.$$

Thus, as required,

$$\| \mathcal B Q - \mathcal B Q' \|_\infty \le \beta\, \| Q - Q' \|_\infty. \qquad \square$$
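Part (c) can also be checked numerically: on a randomly generated MDP, one application of the operator $\mathcal B$ should shrink the sup-norm distance between any two $Q$-tables by at least a factor $\beta$, and iterating $\mathcal B$ converges to the unique fixed point $Q^*$. A sketch (all numbers below are arbitrary test data):

```python
import numpy as np

# Random test MDP: P[a, x, :] is a probability distribution over next
# states; r[x, a] is an arbitrary bounded reward; beta is the discount.
rng = np.random.default_rng(0)
beta = 0.9
P = rng.dirichlet(np.ones(3), size=(2, 3))  # 2 actions, 3 states
r = rng.normal(size=(3, 2))                 # r[x, a]

def bellman_op(Q):
    """B Q(x,a) = r(x,a) + beta E_{x,a}[ max_a' Q(X', a') ]."""
    return r + beta * np.einsum('axy,y->xa', P, Q.max(axis=1))

# Contraction: applying B to two arbitrary Q-tables shrinks their
# sup-norm distance by at least a factor beta.
Q1 = rng.normal(size=(3, 2))
Q2 = rng.normal(size=(3, 2))
d0 = np.abs(Q1 - Q2).max()
d1 = np.abs(bellman_op(Q1) - bellman_op(Q2)).max()
assert d1 <= beta * d0 + 1e-12

# By the Banach fixed point theorem, iterating B converges to Q*.
Q = np.zeros((3, 2))
for _ in range(500):
    Q = bellman_op(Q)
assert np.abs(bellman_op(Q) - Q).max() < 1e-8
```

The contraction property is what makes $Q$-factor iteration (and value iteration) converge geometrically at rate $\beta$, independent of the starting guess.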