Temporal Difference Learning – Linear Function Approximation

For a Markov chain \hat{x} = (\hat x_t : t\in\mathbb Z_+) with state space \mathcal X and discount factor \beta \in (0,1), consider the reward function

R(x) = \mathbb E\Big[ \sum_{t=0}^\infty \beta^t r(\hat x_t) \,\Big|\, \hat x_0 = x \Big],

associated with rewards given by r = (r(x) : x\in\mathcal X). We approximate the reward function R(x) with a linear approximation,

R(x; w) = \sum_{j \in \mathcal J} w_j \phi_j(x) = w^\top \phi(x).

Here we have taken our state x and extracted features, \phi_j(x) for j in a finite set \mathcal J, that we believe are important in determining the overall reward function R(x). We then apply a vector of weights w = (w_j : j \in \mathcal J) to these features. Our job is to find weights that give a good approximation to R(x).
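As a concrete illustration, here is a minimal sketch in Python of what the linear approximation R(x; w) = \sum_j w_j \phi_j(x) looks like in code. The three-state chain, the two features and the weights below are made-up assumptions for illustration only, not part of the analysis.

```python
import numpy as np

# Hypothetical example: 3 states, 2 hand-picked features.
# Row x of Phi is the feature vector phi(x).
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])

def R_approx(x, w):
    """Linear approximation R(x; w) = sum_j w_j * phi_j(x)."""
    return Phi[x] @ w

w = np.array([0.5, -0.2])               # some candidate weights
print([R_approx(x, w) for x in range(3)])
```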

We know for instance that R(x) is a solution to the fixed point equation

R(x) = r(x) + \beta\, \mathbb E\big[ R(\hat x_1) \mid \hat x_0 = x \big].

The target, \text{Target}(x), is an estimate of the true value R(x), which we approximate with R(x;w). Here the target considered is the TD(0) target, r(x) + \beta R(\hat x; w), i.e. the right-hand side of the fixed point equation with R replaced by its current approximation. Other targets can be used: for example, the random sum inside the expectation defining R(x), \sum_{t=0}^\infty \beta^t r(\hat x_t), would be the Monte Carlo target, and there are various options in between, cf. TD(\lambda). With function approximation we cannot, in general, make R(x;w) equal its target at every state. So we attempt to minimize the objective

\frac{1}{2}\, \mathbb E_\mu \Big[ \big( \text{Target}(x) - R(x; w) \big)^2 \Big].

Here the expectation is over \mu=(\mu(x):x\in\mathcal X), the stationary distribution of our Markov process. We cannot minimize this directly since we do not know the stationary distribution \mu; we can only obtain samples. So we instead apply a Robbins-Monro / stochastic gradient descent update to w,

w \leftarrow w + \alpha \big( \text{Target}(x) - R(x; w) \big)\, \nabla_w R(x; w),

where x is the current sampled state and \alpha is the learning rate.
As with tabular methods, these updates can be applied online or offline, and first-visit or every-visit.
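A sketch of a single stochastic-gradient step in Python (the function name and step size below are illustrative). For the linear approximation, \nabla_w R(x;w) = \phi(x), so the update only needs the feature vector of the visited state and whichever target is being used.

```python
import numpy as np

def sgd_step(w, phi_x, target, alpha=0.1):
    """One Robbins-Monro / SGD step on 1/2 (Target(x) - R(x; w))^2,
    where R(x; w) = w . phi(x), so grad_w R(x; w) = phi(x)."""
    error = target - phi_x @ w          # Target(x) - R(x; w)
    return w + alpha * error * phi_x

# Example: one update with a (made-up) Monte Carlo style target of 3.0
w = np.zeros(2)
w = sgd_step(w, np.array([1.0, 0.0]), target=3.0)
print(w)
```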

Quick Analysis of TD(0).

Let’s do an informal analysis of TD(0). For TD(0) our target is r(x) + \beta R(\hat{x}; w), where \hat x is the next state after x. Under a linear function approximation, \nabla_w R(x;w) = \phi(x), so this gives an update

w \leftarrow w + \alpha \big( r(x) + \beta\, \phi(\hat x)^\top w - \phi(x)^\top w \big)\, \phi(x).

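Putting this together, here is a small simulation sketch of TD(0) with linear features in Python. The transition matrix P, rewards r, features \Phi, discount \beta and step size \alpha below are illustrative assumptions (the same toy chain as in the earlier sketch), and a constant step size is used purely for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.5, 0.5, 0.0],      # made-up transition matrix
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])       # reward r(x) for each state
Phi = np.array([[1.0, 0.0],         # row x is phi(x)
                [1.0, 1.0],
                [0.0, 1.0]])
beta, alpha = 0.9, 0.01

w = np.zeros(2)
x = 0
for t in range(200_000):
    x_next = rng.choice(3, p=P[x])
    # TD(0): w <- w + alpha * (r(x) + beta R(x'; w) - R(x; w)) phi(x)
    td_error = r[x] + beta * Phi[x_next] @ w - Phi[x] @ w
    w = w + alpha * td_error * Phi[x]
    x = x_next

print(w)   # with a small constant step size this hovers near w*
```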
Suppose that our Markov chain for \hat x_t is stationary with stationary distribution \mu(x). If we look at the expected change in our update term we get

\mathbb E_\mu\Big[ \big( r(\hat x_t) + \beta\, \phi(\hat x_{t+1})^\top w - \phi(\hat x_t)^\top w \big)\, \phi(\hat x_t) \Big]
= \Phi^\top M r - \Phi^\top M (I - \beta P) \Phi\, w
=: b - A w.

Above \Phi is the \mathcal X \times \mathcal J matrix with entries \Phi_{xj} = \phi_j(x) and M is the \mathcal X \times \mathcal X diagonal matrix with diagonal entries given by \mu(x). These give the length-|\mathcal J| vector b := \Phi^\top M r and the |\mathcal J| \times |\mathcal J| matrix A := \Phi^\top M (I - \beta P) \Phi, as defined above.

So, roughly, w moves according to the differential equation

\frac{dw}{dt} = b - A w.

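Continuing the toy example, the quantities \mu, M, b, A and the ODE can be written out directly. This is a sketch: the chain and features are the same made-up ones as above, and the Euler discretisation is purely illustrative.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])
beta = 0.9

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
M = np.diag(mu)

b = Phi.T @ M @ r
A = Phi.T @ M @ (np.eye(3) - beta * P) @ Phi

# Euler scheme for dw/dt = b - A w.
w, dt = np.zeros(2), 0.01
for _ in range(100_000):
    w = w + dt * (b - A @ w)

print(w)                      # w(t) settles at w* with A w* = b
print(np.linalg.solve(A, b))  # direct solution, for comparison
```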
Now P is the transition matrix of a Markov chain. Since its rows sum to 1, its largest eigenvalue is 1. So we can expect \beta P - I to be in some sense “negative”; specifically, it can be shown that v^\top M (I - \beta P) v \geq (1-\beta) \| v \|^2_\mu, where \|v\|^2_\mu = \sum_x \mu(x) v(x)^2. This then implies that

w^\top A w = (\Phi w)^\top M (I - \beta P)\, \Phi w \geq (1-\beta) \| \Phi w \|^2_\mu \geq 0,

so A is positive definite provided the features \phi_j are linearly independent.
This is sufficient to give convergence of the above differential equation: take w^* such that A w^* = b and take L(w) = \frac{1}{2} \| w - w^*\|^2; then

\frac{d}{dt} L(w(t)) = (w - w^*)^\top \frac{dw}{dt} = (w - w^*)^\top (b - A w) = -(w - w^*)^\top A (w - w^*) \leq -(1-\beta) \| \Phi (w - w^*) \|^2_\mu \leq 0.

Thus we see that w(t) \rightarrow w^*.
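The key inequality, and the resulting positive definiteness of A, can also be checked numerically on the toy example. This is only a sketch reusing the same made-up chain; random test vectors are of course an illustration, not a proof.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])
beta = 0.9

evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
M = np.diag(mu)

rng = np.random.default_rng(1)
for _ in range(5):
    v = rng.normal(size=3)
    lhs = v @ M @ (np.eye(3) - beta * P) @ v    # v^T M (I - beta P) v
    rhs = (1 - beta) * np.sum(mu * v**2)        # (1 - beta) ||v||_mu^2
    print(lhs >= rhs - 1e-12)                   # expected: True

# Hence A = Phi^T M (I - beta P) Phi is positive definite:
A = Phi.T @ M @ (np.eye(3) - beta * P) @ Phi
print(np.linalg.eigvalsh((A + A.T) / 2))        # expected: all positive
```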

Convergence is great and everything, but we must verify that the solution obtained, w^*, is a “good” solution. First, notice that the reward function R = (R(x) : x\in \mathcal X) satisfies

R = r + \beta P R =: T_0(R).

This is just the fixed point equation above with R interpreted as a vector and the expectation written as multiplication by the transition matrix P.

Second, notice that the approximation R(w) = ( R(x;w) : x\in\mathcal X) closest to the reward function R, measured in \|\cdot\|_\mu, is given by a projection; specifically,

\Pi R := \Phi (\Phi^\top M \Phi)^{-1} \Phi^\top M\, R = \operatorname{argmin}_{\{ R(w) \,:\, w \in \mathbb R^{\mathcal J} \}} \| R(w) - R \|_\mu.

Third, we see the equations satisfied by w^* can be expressed in a form similar to the expression R = T_0(R) above. Specifically, rearranging the expression A w^* = b,

\Phi^\top M (I - \beta P) \Phi w^* = \Phi^\top M r
\;\Longrightarrow\;
\Phi^\top M \Phi w^* = \Phi^\top M \big( r + \beta P \Phi w^* \big)
\;\Longrightarrow\;
\Phi w^* = \Phi (\Phi^\top M \Phi)^{-1} \Phi^\top M \big( r + \beta P \Phi w^* \big) = \Pi\, T_0(\Phi w^*).

So, while R satisfies R = T_0(R), we see that R(w^*) = \Phi w^* satisfies

R(w^*) = \Pi\, T_0\big( R(w^*) \big).

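The projected fixed-point identity R(w^*) = \Pi T_0(R(w^*)) can be checked directly on the toy example. Again a sketch with the same made-up chain and features.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])
beta = 0.9

evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
M = np.diag(mu)

# Projection onto span(Phi) in the mu-weighted norm, and T_0(R) = r + beta P R.
Pi = Phi @ np.linalg.inv(Phi.T @ M @ Phi) @ Phi.T @ M
T0 = lambda R: r + beta * P @ R

A = Phi.T @ M @ (np.eye(3) - beta * P) @ Phi
w_star = np.linalg.solve(A, Phi.T @ M @ r)
R_w_star = Phi @ w_star
print(np.allclose(R_w_star, Pi @ T0(R_w_star)))   # expected: True
```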
We can use these identities satisfied by R and R(w^*) to show that this approximation is comparable to the best linear approximation of R. Since T_0 is a \beta-contraction with respect to \|\cdot\|_\mu and the projection \Pi is non-expansive, i.e. it can only move points closer together (both properties are relatively easy to verify):

\| R(w^*) - R \|_\mu \leq \| R(w^*) - \Pi R \|_\mu + \| \Pi R - R \|_\mu = \| \Pi T_0( R(w^*) ) - \Pi T_0( R ) \|_\mu + \| \Pi R - R \|_\mu \leq \beta \| R(w^*) - R \|_\mu + \| \Pi R - R \|_\mu.
So

\| R(w^*) - R \|_\mu \leq \frac{1}{1-\beta}\, \| \Pi R - R \|_\mu.

Thus we see that we reach an approximation whose error is within a constant factor, 1/(1-\beta), of that of the “best” linear approximation \Pi R.
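Finally, the error bound can be verified numerically on the same toy example. This is a sketch: here R is the exact reward function, obtained by solving (I - \beta P) R = r.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])
beta = 0.9

evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
M = np.diag(mu)
norm_mu = lambda v: np.sqrt(np.sum(mu * v**2))

R_true = np.linalg.solve(np.eye(3) - beta * P, r)        # R = T_0(R)
Pi = Phi @ np.linalg.inv(Phi.T @ M @ Phi) @ Phi.T @ M
A = Phi.T @ M @ (np.eye(3) - beta * P) @ Phi
w_star = np.linalg.solve(A, Phi.T @ M @ r)

lhs = norm_mu(Phi @ w_star - R_true)
rhs = norm_mu(Pi @ R_true - R_true) / (1 - beta)
print(lhs, rhs, lhs <= rhs + 1e-12)                      # expected: True
```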

References.

The exposition here is due to Van Roy and Tsitsiklis.

Tsitsiklis, John N., and Benjamin Van Roy. “Analysis of temporal-difference learning with function approximation.” Advances in Neural Information Processing Systems. 1997.
