For a Markov chain , consider the reward function

associated with rewards given by . We approximate the reward function with a linear approximation,

Here we have taken our state and extracted features, for in finite set , that we believe to be important to determining the overall reward function . We then apply a vector of weights to each of these features. Our job is to find weights that give a good approximation to .

We know for instance that is a solution to the fixed point equation

The target, , is an estimate of the true value of Here the target random variable considered is the TD(0) target. Other targets can be used, e.g. the term in the sum, , would be the Monte-carlo target, and there are various options in between, c.f. TD(). In function approximation, we cannot get the expected reward to equal its target. So we attempt to minimize the objective

Here the expectation is over , the stationary distribution of our Markov process. We can’t minimize this since we do not know the stationary distribution . We can only get samples and so we can instead apply Robbin’s-Munro/Stochastic Gradient Descent update to

Like with tabular methods these updates can be applied online, offline, first-visit, every-visit.

### Quick Analysis of TD(0).

Let’s do an informal analysis of TD(0). For TD(0) our target is , where is the next state after . Under a linear function approximation this gives an update

Suppose that our Markov chain for is stationary with stationary distribution . If we look at the expected change in our update term we get

Above is the matrix with entries and is the diagonal matrix with diagonal entries given by . We use these to define the length vector and the matrix , as defined above.

So, roughly, moves according to the differential equation

Now is the transition matrix of a Markov chain. Since its rows sum to , its biggest eigenvalue is . So we can expect that is in some sense “negative”, specifically it can be shown that , where . This then implies that

This is sufficient to give convergence of the above differential equation: take such that and take then

Thus we see that .

Convergence is great and everything, but we must verify that the solution obtained, , is a “good” solution. First, notice that the reward function satisfies

This is just with interpreted as a vector and the expectation as a matrix operation with respect to transition matrix .

Second, notice the approximation that is closest to the rewards , is given by a projection, specifically

Third, we see the equations satisfied by can be expressed in a form somewhat similar to above expression . Specifically, we rearrange the expression

So, while satisfies , we see that satisfies

We can use these identities satisfied by and to show that approximation is comparable to the best approximation of . Since is a -contraction and that projections always move distances closer (both properties are relatively easy to verify):

So

Thus we see we reach an approximation that is a constant away from the “best” linear approximation.

**References.**

The exposition here is due to Van Roy and Tsitsiklis.

Tsitsiklis, John N., and Benjamin Van Roy. “Analysis of temporal-diffference learning with function approximation.” *Advances in neural information processing systems*. 1997.