# Temporal Difference Learning – Linear Function Approximation

For a Markov chain $\hat{x} = (\hat x_t : t\in\mathbb Z_+)$ and a discount factor $\beta \in (0,1)$, consider the reward function

$$
R(x) = \mathbb E\bigg[ \sum_{t=0}^\infty \beta^t r(\hat x_t) \,\bigg|\, \hat x_0 = x \bigg]
$$

associated with rewards given by $r = (r(x) : x\in\mathcal X)$. We approximate the reward function $R(x)$ with a linear approximation,

$$
R(x; w) = \sum_{j \in \mathcal J} w_j \phi_j(x) .
$$

Here we have taken our state $x$ and extracted features, $\phi_j(x)$ for $j$ in finite set $\mathcal J$, that we believe to be important to determining the overall reward function $R(x)$. We then apply a vector of weights $w = (w_j : j \in \mathcal J)$ to each of these features. Our job is to find weights that give a good approximation to $R(x)$.
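
For concreteness, a linear approximation of this form might look as follows in Python; the feature map and weights here are purely hypothetical choices, not anything prescribed above.

```python
import numpy as np

# A hypothetical feature map for integer states: a bias feature and the state itself.
def phi(x):
    return np.array([1.0, float(x)])

w = np.array([0.5, -0.2])   # illustrative weights w = (w_j : j in J)
approx = w @ phi(3)         # R(x; w) = sum_j w_j phi_j(x), evaluated at x = 3
```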

We know, for instance, that $R(x)$ is a solution to the fixed point equation

$$
R(x) = r(x) + \beta\, \mathbb E\big[ R(\hat x_1) \,\big|\, \hat x_0 = x \big] .
$$
The target, $\text{Target}(x)$, is an estimate of the true value $R(x)$. Here the target random variable considered is the TD(0) target, $r(x) + \beta R(\hat x; w)$, where $\hat x$ is the next state after $x$. Other targets can be used: the discounted sum of rewards, $\sum_{t=0}^\infty \beta^t r(\hat x_t)$, would be the Monte Carlo target, and there are various options in between, cf. TD($\lambda$). With function approximation we cannot, in general, make $R(x;w)$ equal its target at every state. So we attempt to minimize the objective

$$
\frac{1}{2}\, \mathbb E_{x\sim\mu} \Big[ \big( \text{Target}(x) - R(x; w) \big)^2 \Big] .
$$
Here the expectation is over $\mu=(\mu(x):x\in\mathcal X)$, the stationary distribution of our Markov chain. We can’t minimize this directly since we do not know the stationary distribution $\mu$; we can only draw samples. So we can instead apply a Robbins-Monro/stochastic gradient descent update to $w$:

$$
w \leftarrow w + \alpha \big( \text{Target}(x) - R(x; w) \big)\, \phi(x) ,
$$

where $\alpha$ is the learning rate, we use $\nabla_w R(x;w) = \phi(x) = (\phi_j(x) : j\in\mathcal J)$, and the target is treated as a constant when taking the gradient.
As with tabular methods, these updates can be applied online or offline, and with first-visit or every-visit variants.
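
As a sketch of the online case, here is a minimal TD(0) loop in Python. The helpers `phi`, `r` and `sample_next`, and all parameter values, are assumptions made for illustration.

```python
import numpy as np

def td0_linear(phi, r, sample_next, w0, beta=0.9, alpha=0.01, n_steps=10_000, x0=0):
    """Online TD(0) with the linear approximation R(x; w) = w . phi(x).

    phi         : maps a state to its feature vector
    r           : maps a state to its reward
    sample_next : samples the chain's next state given the current one
    """
    w = np.array(w0, dtype=float)
    x = x0
    for _ in range(n_steps):
        x_next = sample_next(x)
        # TD(0) target r(x) + beta R(x'; w), treated as constant in the gradient.
        td_error = r(x) + beta * (w @ phi(x_next)) - w @ phi(x)
        w += alpha * td_error * phi(x)
        x = x_next
    return w
```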

### Quick Analysis of TD(0).

Let’s do an informal analysis of TD(0). For TD(0) our target is $r(x) + \beta R(\hat{x}; w)$, where $\hat x$ is the next state after $x$. Under a linear function approximation this gives the update

$$
w_{t+1} = w_t + \alpha_t \big( r(\hat x_t) + \beta\, \phi(\hat x_{t+1})^\top w_t - \phi(\hat x_t)^\top w_t \big)\, \phi(\hat x_t) .
$$
Suppose that our Markov chain $(\hat x_t)$ is stationary with stationary distribution $\mu(x)$. If we look at the expected change in our update term we get

$$
\mathbb E \big[ w_{t+1} - w_t \big]
= \alpha_t \sum_{x\in\mathcal X} \mu(x)\, \phi(x) \Big( r(x) + \beta \sum_{x'\in\mathcal X} P_{xx'}\, \phi(x')^\top w_t - \phi(x)^\top w_t \Big)
= \alpha_t \big( \Phi^\top M r - \Phi^\top M (I - \beta P) \Phi\, w_t \big)
=: \alpha_t \big( b - A w_t \big) .
$$
Above, $\Phi$ is the $\mathcal X \times \mathcal J$ matrix with entries $\Phi_{xj} = \phi_j(x)$, and $M$ is the $\mathcal X \times \mathcal X$ diagonal matrix with diagonal entries given by $\mu(x)$. We use these to define the length-$\mathcal J$ vector $b = \Phi^\top M r$ and the $\mathcal J \times \mathcal J$ matrix $A = \Phi^\top M (I - \beta P) \Phi$, as in the display above.
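
These quantities are easy to compute exactly on a small example. The sketch below builds $b$, $A$ and the fixed point $w^*$ of $A w^* = b$ for a toy three-state chain; the transition matrix, rewards and features are arbitrary illustrative choices (later sketches reuse these variables).

```python
import numpy as np

# Toy 3-state chain: transition matrix P, rewards r, discount beta,
# and a 2-dimensional feature matrix Phi (all illustrative choices).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])
beta = 0.9
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])

# Stationary distribution mu: left eigenvector of P for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu /= mu.sum()
M = np.diag(mu)

# b = Phi^T M r and A = Phi^T M (I - beta P) Phi; then solve A w* = b.
b = Phi.T @ M @ r
A = Phi.T @ M @ (np.eye(3) - beta * P) @ Phi
w_star = np.linalg.solve(A, b)
```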

So, roughly, $w$ moves according to the differential equation

$$
\frac{\mathrm{d} w}{\mathrm{d} t} = b - A w .
$$
Now $P$ is the transition matrix of a Markov chain. Since its rows sum to $1$, its largest eigenvalue is $1$. So we can expect that $I - \beta P$ is in some sense “positive”; specifically, it can be shown that $v^\top M (I - \beta P) v \geq (1-\beta)\, \| v \|^2_\mu$, where $\| v \|^2_\mu = \sum_x \mu(x) v(x)^2$. This then implies that

$$
w^\top A w = (\Phi w)^\top M (I - \beta P) (\Phi w) \geq (1-\beta)\, \| \Phi w \|^2_\mu > 0
$$

whenever $\Phi w \neq 0$; that is, $A$ is positive definite (provided the features are linearly independent).
This is sufficient to give convergence of the above differential equation: take $w^*$ such that $A w^* = b$ and take $L(w) = \frac{1}{2} \| w- w^*\|^2$; then, using $b = A w^*$,

$$
\frac{\mathrm{d} L(w(t))}{\mathrm{d} t}
= (w - w^*)^\top \frac{\mathrm{d} w}{\mathrm{d} t}
= - (w - w^*)^\top A\, (w - w^*)
\leq - (1-\beta)\, \| \Phi (w - w^*) \|^2_\mu
< 0
$$

for $w \neq w^*$.
Thus we see that $w(t) \rightarrow w^*$.
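
A crude Euler discretization of this differential equation, reusing `A`, `b` and `w_star` from the sketch above, shows the trajectory settling at $w^*$.

```python
# Euler discretization of dw/dt = b - A w (A, b, w_star as computed above).
w = np.zeros(2)
dt = 0.5                      # small enough for stability on this toy example
for _ in range(500):
    w += dt * (b - A @ w)
print(np.linalg.norm(w - w_star))  # effectively 0: w(t) -> w*
```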

Convergence is great and everything, but we must verify that the solution obtained, $w^*$, is a “good” solution. First, notice that the reward function $R= (R(x) : x\in \mathcal X)$ satisfies

$$
R = r + \beta P R =: T_0(R) .
$$
This is just the earlier fixed point equation with $R$ interpreted as a vector and the conditional expectation written as multiplication by the transition matrix $P$.
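
On the toy chain from the earlier sketch this is immediate to check: in vector form, $R = (I - \beta P)^{-1} r$.

```python
# R solves R = r + beta P R, i.e. (I - beta P) R = r  (P, r, beta as above).
R = np.linalg.solve(np.eye(3) - beta * P, r)
print(np.allclose(R, r + beta * P @ R))  # True
```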

Second, notice that the approximation $R (w) = ( R(x;w) : x\in\mathcal X)$ that is closest to the rewards $R$ is given by a projection; specifically,

$$
\Pi R := \underset{\bar R \in \operatorname{span}(\Phi)}{\arg\min} \| R - \bar R \|_\mu = \Phi \big( \Phi^\top M \Phi \big)^{-1} \Phi^\top M R .
$$
Third, we see that the equations satisfied by $w^*$ can be expressed in a form somewhat similar to the above expression $R=T_0(R)$. Specifically, rearranging the expression $Aw^* =b$,

$$
\Phi^\top M (I - \beta P) \Phi w^* = \Phi^\top M r
\;\;\Longleftrightarrow\;\;
\Phi^\top M \Phi w^* = \Phi^\top M \big( r + \beta P \Phi w^* \big)
\;\;\Longleftrightarrow\;\;
\Phi w^* = \Pi T_0 ( \Phi w^* ) .
$$
So, while $R$ satisfies $R = T_0(R)$, we see that $R(w^*) = \Phi w^*$ satisfies

$$
R(w^*) = \Pi T_0 \big( R(w^*) \big) .
$$
We can use these identities satisfied by $R$ and $R (w^*)$ to show that the approximation is comparable to the best approximation of $R$. Since $T_0$ is a $\beta$-contraction in $\|\cdot\|_\mu$ and projections are non-expansive (both properties are relatively easy to verify),

$$
\| R(w^*) - R \|_\mu
\leq \| R(w^*) - \Pi R \|_\mu + \| \Pi R - R \|_\mu
= \| \Pi T_0 (R(w^*)) - \Pi T_0 (R) \|_\mu + \| \Pi R - R \|_\mu
\leq \beta\, \| R(w^*) - R \|_\mu + \| \Pi R - R \|_\mu .
$$

So, rearranging,

$$
\| R(w^*) - R \|_\mu \leq \frac{1}{1-\beta}\, \| \Pi R - R \|_\mu .
$$
Thus we reach an approximation whose error is within a constant factor, $\frac{1}{1-\beta}$, of the error of the “best” linear approximation.
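
Continuing the toy example, both the projected fixed point identity and the $\frac{1}{1-\beta}$ bound can be confirmed numerically; all variables are those defined in the sketches above.

```python
# Projection Pi = Phi (Phi^T M Phi)^{-1} Phi^T M, then check the two identities
# R(w*) = Pi T_0(R(w*)) and ||R(w*) - R||_mu <= ||Pi R - R||_mu / (1 - beta).
Pi = Phi @ np.linalg.solve(Phi.T @ M @ Phi, Phi.T @ M)

def mu_norm(v):
    return np.sqrt(v @ M @ v)

def T0(V):
    return r + beta * (P @ V)

R_w = Phi @ w_star                     # R(w*)
print(np.allclose(R_w, Pi @ T0(R_w)))  # True: the projected fixed point
print(mu_norm(R_w - R) <= mu_norm(Pi @ R - R) / (1 - beta))  # True: the bound
```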

### References.

The exposition here is due to Van Roy and Tsitsiklis.

Tsitsiklis, John N., and Benjamin Van Roy. “Analysis of temporal-difference learning with function approximation.” Advances in Neural Information Processing Systems. 1997.