# Temporal Difference Learning – Linear Function Approximation

For a Markov chain $\hat{x} = (\hat x_t : t\in\mathbb Z_+)$ and a discount factor $\beta \in (0,1)$, consider the reward function

$$
R(x) = \mathbb E\bigg[ \sum_{t=0}^\infty \beta^t r(\hat x_t) \,\bigg|\, \hat x_0 = x \bigg]
$$

associated with rewards given by $r = (r(x) : x\in\mathcal X)$. We approximate the reward function $R(x)$ with a linear approximation,

$$
R(x; w) = \sum_{j \in \mathcal J} w_j \phi_j(x) .
$$

Here we have taken our state $x$ and extracted features, $\phi_j(x)$ for $j$ in finite set $\mathcal J$, that we believe to be important to determining the overall reward function $R(x)$. We then apply a vector of weights $w = (w_j : j \in \mathcal J)$ to each of these features. Our job is to find weights that give a good approximation to $R(x)$.
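
For concreteness, a linear approximation of this form might look as follows in Python; the feature map and weights here are purely hypothetical choices, not anything prescribed above.

```python
import numpy as np

# A hypothetical feature map for integer states: a bias feature and the state itself.
def phi(x):
    return np.array([1.0, float(x)])

w = np.array([0.5, -0.2])   # illustrative weights w = (w_j : j in J)
approx = w @ phi(3)         # R(x; w) = sum_j w_j phi_j(x), evaluated at x = 3
```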

We know, for instance, that $R(x)$ is a solution to the fixed point equation

$$
R(x) = r(x) + \beta\, \mathbb E\big[ R(\hat x_1) \,\big|\, \hat x_0 = x \big] .
$$
The target, $\text{Target}(x)$, is an estimate of the true value $R(x)$. Here the target random variable considered is the TD(0) target, $r(x) + \beta R(\hat x; w)$, where $\hat x$ is the next state after $x$. Other targets can be used: the discounted sum of rewards, $\sum_{t=0}^\infty \beta^t r(\hat x_t)$, would be the Monte Carlo target, and there are various options in between, cf. TD($\lambda$). With function approximation we cannot, in general, make $R(x;w)$ equal its target at every state. So we attempt to minimize the objective

$$
\frac{1}{2}\, \mathbb E_{x\sim\mu} \Big[ \big( \text{Target}(x) - R(x; w) \big)^2 \Big] .
$$
Here the expectation is over $\mu=(\mu(x):x\in\mathcal X)$, the stationary distribution of our Markov chain. We can’t minimize this directly since we do not know the stationary distribution $\mu$; we can only draw samples. So we can instead apply a Robbins-Monro/stochastic gradient descent update to $w$:

$$
w \leftarrow w + \alpha \big( \text{Target}(x) - R(x; w) \big)\, \phi(x) ,
$$

where $\alpha$ is the learning rate, we use $\nabla_w R(x;w) = \phi(x) = (\phi_j(x) : j\in\mathcal J)$, and the target is treated as a constant when taking the gradient.
As with tabular methods, these updates can be applied online or offline, and with first-visit or every-visit variants.
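
As a sketch of the online case, here is a minimal TD(0) loop in Python. The helpers `phi`, `r` and `sample_next`, and all parameter values, are assumptions made for illustration.

```python
import numpy as np

def td0_linear(phi, r, sample_next, w0, beta=0.9, alpha=0.01, n_steps=10_000, x0=0):
    """Online TD(0) with the linear approximation R(x; w) = w . phi(x).

    phi         : maps a state to its feature vector
    r           : maps a state to its reward
    sample_next : samples the chain's next state given the current one
    """
    w = np.array(w0, dtype=float)
    x = x0
    for _ in range(n_steps):
        x_next = sample_next(x)
        # TD(0) target r(x) + beta R(x'; w), treated as constant in the gradient.
        td_error = r(x) + beta * (w @ phi(x_next)) - w @ phi(x)
        w += alpha * td_error * phi(x)
        x = x_next
    return w
```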

### Quick Analysis of TD(0).

Let’s do an informal analysis of TD(0). For TD(0) our target is $r(x) + \beta R(\hat{x}; w)$, where $\hat x$ is the next state after $x$. Under a linear function approximation this gives the update

$$
w_{t+1} = w_t + \alpha_t \big( r(\hat x_t) + \beta\, \phi(\hat x_{t+1})^\top w_t - \phi(\hat x_t)^\top w_t \big)\, \phi(\hat x_t) .
$$
Suppose that our Markov chain $(\hat x_t)$ is stationary with stationary distribution $\mu(x)$. If we look at the expected change in our update term we get

$$
\mathbb E \big[ w_{t+1} - w_t \big]
= \alpha_t \sum_{x\in\mathcal X} \mu(x)\, \phi(x) \Big( r(x) + \beta \sum_{x'\in\mathcal X} P_{xx'}\, \phi(x')^\top w_t - \phi(x)^\top w_t \Big)
= \alpha_t \big( \Phi^\top M r - \Phi^\top M (I - \beta P) \Phi\, w_t \big)
=: \alpha_t \big( b - A w_t \big) .
$$
Above, $\Phi$ is the $\mathcal X \times \mathcal J$ matrix with entries $\Phi_{xj} = \phi_j(x)$, and $M$ is the $\mathcal X \times \mathcal X$ diagonal matrix with diagonal entries given by $\mu(x)$. We use these to define the length-$\mathcal J$ vector $b = \Phi^\top M r$ and the $\mathcal J \times \mathcal J$ matrix $A = \Phi^\top M (I - \beta P) \Phi$, as in the display above.
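
These quantities are easy to compute exactly on a small example. The sketch below builds $b$, $A$ and the fixed point $w^*$ of $A w^* = b$ for a toy three-state chain; the transition matrix, rewards and features are arbitrary illustrative choices (later sketches reuse these variables).

```python
import numpy as np

# Toy 3-state chain: transition matrix P, rewards r, discount beta,
# and a 2-dimensional feature matrix Phi (all illustrative choices).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])
beta = 0.9
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])

# Stationary distribution mu: left eigenvector of P for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu /= mu.sum()
M = np.diag(mu)

# b = Phi^T M r and A = Phi^T M (I - beta P) Phi; then solve A w* = b.
b = Phi.T @ M @ r
A = Phi.T @ M @ (np.eye(3) - beta * P) @ Phi
w_star = np.linalg.solve(A, b)
```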

So, roughly, $w$ moves according to the differential equation

$$
\frac{\mathrm{d} w}{\mathrm{d} t} = b - A w .
$$
Now $P$ is the transition matrix of a Markov chain. Since its rows sum to $1$, its largest eigenvalue is $1$. So we can expect that $I - \beta P$ is in some sense “positive”; specifically, it can be shown that $v^\top M (I - \beta P) v \geq (1-\beta)\, \| v \|^2_\mu$, where $\| v \|^2_\mu = \sum_x \mu(x) v(x)^2$. This then implies that

$$
w^\top A w = (\Phi w)^\top M (I - \beta P) (\Phi w) \geq (1-\beta)\, \| \Phi w \|^2_\mu > 0
$$

whenever $\Phi w \neq 0$; that is, $A$ is positive definite (provided the features are linearly independent).
This is sufficient to give convergence of the above differential equation: take $w^*$ such that $A w^* = b$ and take $L(w) = \frac{1}{2} \| w- w^*\|^2$; then, using $b = A w^*$,

$$
\frac{\mathrm{d} L(w(t))}{\mathrm{d} t}
= (w - w^*)^\top \frac{\mathrm{d} w}{\mathrm{d} t}
= - (w - w^*)^\top A\, (w - w^*)
\leq - (1-\beta)\, \| \Phi (w - w^*) \|^2_\mu
< 0
$$

for $w \neq w^*$.
Thus we see that $w(t) \rightarrow w^*$.
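
A crude Euler discretization of this differential equation, reusing `A`, `b` and `w_star` from the sketch above, shows the trajectory settling at $w^*$.

```python
# Euler discretization of dw/dt = b - A w (A, b, w_star as computed above).
w = np.zeros(2)
dt = 0.5                      # small enough for stability on this toy example
for _ in range(500):
    w += dt * (b - A @ w)
print(np.linalg.norm(w - w_star))  # effectively 0: w(t) -> w*
```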

Convergence is great and everything, but we must verify that the solution obtained, $w^*$, is a “good” solution. First, notice that the reward function $R= (R(x) : x\in \mathcal X)$ satisfies

$$
R = r + \beta P R =: T_0(R) .
$$
This is just the earlier fixed point equation with $R$ interpreted as a vector and the conditional expectation written as multiplication by the transition matrix $P$.
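
On the toy chain from the earlier sketch this is immediate to check: in vector form, $R = (I - \beta P)^{-1} r$.

```python
# R solves R = r + beta P R, i.e. (I - beta P) R = r  (P, r, beta as above).
R = np.linalg.solve(np.eye(3) - beta * P, r)
print(np.allclose(R, r + beta * P @ R))  # True
```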

Second, notice that the approximation $R (w) = ( R(x;w) : x\in\mathcal X)$ that is closest to the rewards $R$ is given by a projection; specifically,

$$
\Pi R := \underset{\bar R \in \operatorname{span}(\Phi)}{\arg\min} \| R - \bar R \|_\mu = \Phi \big( \Phi^\top M \Phi \big)^{-1} \Phi^\top M R .
$$
Third, we see that the equations satisfied by $w^*$ can be expressed in a form somewhat similar to the above expression $R=T_0(R)$. Specifically, rearranging the expression $Aw^* =b$,

$$
\Phi^\top M (I - \beta P) \Phi w^* = \Phi^\top M r
\;\;\Longleftrightarrow\;\;
\Phi^\top M \Phi w^* = \Phi^\top M \big( r + \beta P \Phi w^* \big)
\;\;\Longleftrightarrow\;\;
\Phi w^* = \Pi T_0 ( \Phi w^* ) .
$$
So, while $R$ satisfies $R = T_0(R)$, we see that $R(w^*) = \Phi w^*$ satisfies

$$
R(w^*) = \Pi T_0 \big( R(w^*) \big) .
$$
We can use these identities satisfied by $R$ and $R (w^*)$ to show that the approximation is comparable to the best approximation of $R$. Since $T_0$ is a $\beta$-contraction in $\|\cdot\|_\mu$ and projections are non-expansive (both properties are relatively easy to verify),

$$
\| R(w^*) - R \|_\mu
\leq \| R(w^*) - \Pi R \|_\mu + \| \Pi R - R \|_\mu
= \| \Pi T_0 (R(w^*)) - \Pi T_0 (R) \|_\mu + \| \Pi R - R \|_\mu
\leq \beta\, \| R(w^*) - R \|_\mu + \| \Pi R - R \|_\mu .
$$

So, rearranging,

$$
\| R(w^*) - R \|_\mu \leq \frac{1}{1-\beta}\, \| \Pi R - R \|_\mu .
$$
Thus we reach an approximation whose error is within a constant factor, $\frac{1}{1-\beta}$, of the error of the “best” linear approximation.
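
Continuing the toy example, both the projected fixed point identity and the $\frac{1}{1-\beta}$ bound can be confirmed numerically; all variables are those defined in the sketches above.

```python
# Projection Pi = Phi (Phi^T M Phi)^{-1} Phi^T M, then check the two identities
# R(w*) = Pi T_0(R(w*)) and ||R(w*) - R||_mu <= ||Pi R - R||_mu / (1 - beta).
Pi = Phi @ np.linalg.solve(Phi.T @ M @ Phi, Phi.T @ M)

def mu_norm(v):
    return np.sqrt(v @ M @ v)

def T0(V):
    return r + beta * (P @ V)

R_w = Phi @ w_star                     # R(w*)
print(np.allclose(R_w, Pi @ T0(R_w)))  # True: the projected fixed point
print(mu_norm(R_w - R) <= mu_norm(Pi @ R - R) / (1 - beta))  # True: the bound
```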

### References.

The exposition here is due to Van Roy and Tsitsiklis.

Tsitsiklis, John N., and Benjamin Van Roy. “Analysis of temporal-difference learning with function approximation.” Advances in Neural Information Processing Systems. 1997.