For a Markov chain $(x_t : t \ge 0)$ on a finite state space $\mathcal{X}$, consider the reward function
$$R(x) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \beta^t r(x_t) \,\Big|\, x_0 = x \Big],$$
associated with rewards given by $r(x)$ and a discount factor $\beta \in (0,1)$. We approximate the reward function with a linear approximation,
$$R(x) \approx \tilde{R}(x) := \sum_{j \in \mathcal{J}} w_j \phi_j(x) = w^\top \phi(x).$$
Here we have taken our state $x$ and extracted features, $\phi_j(x)$ for $j$ in a finite set $\mathcal{J}$, that we believe to be important in determining the overall reward function $R(x)$. We then apply a vector of weights $w = (w_j : j \in \mathcal{J})$ to each of these features. Our job is to find weights that give a good approximation to $R(x)$.
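To fix notation, here is a minimal sketch in Python/NumPy; the state space, feature map and dimensions below are invented purely for illustration and are not part of the analysis.

```python
import numpy as np

n_states = 5                      # |X|: a small, made-up state space {0, ..., 4}
n_features = 2                    # |J|: number of features

def phi(x):
    """Feature vector phi(x); an arbitrary illustrative choice."""
    return np.array([1.0, float(x)])   # a constant feature and the state index

w = np.zeros(n_features)          # weight vector w = (w_j : j in J)

def R_tilde(x, w):
    """Linear approximation R~(x) = w . phi(x) = sum_j w_j phi_j(x)."""
    return w @ phi(x)
```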
We know, for instance, that $R(x)$ is a solution to the fixed point equation
$$R(x) = \mathbb{E}\big[ r(x_0) + \beta R(x_1) \,\big|\, x_0 = x \big].$$
The target, $T$, is an estimate of the true value of $R(x)$. Here the target random variable considered is the TD(0) target, $T = r(x_0) + \beta \tilde{R}(x_1)$. Other targets can be used, e.g. the discounted sum in the definition of $R$, $\sum_{t=0}^{\infty} \beta^t r(x_t)$, would be the Monte Carlo target, and there are various options in between, cf. TD($\lambda$). In function approximation, we cannot get the expected reward to equal its target. So we attempt to minimize the objective
$$\mathcal{L}(w) = \tfrac{1}{2}\, \mathbb{E}_{x \sim \mu}\big[ \big( T - w^\top \phi(x) \big)^2 \big].$$
Here the expectation is over $\mu$, the stationary distribution of our Markov process. We can't minimize this directly since we do not know the stationary distribution $\mu$. We can only get samples, and so we can instead apply a Robbins-Monro/stochastic gradient descent update (treating the target $T$ as fixed),
$$w \leftarrow w + \alpha \big( T - w^\top \phi(x) \big) \phi(x).$$
As with tabular methods, these updates can be applied online or offline, first-visit or every-visit.
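As an illustration, a rough sketch of this update applied online along a simulated trajectory, using the TD(0) target, might look as follows; the transition matrix, rewards, discount and step size are all invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features = 5, 2
beta, alpha = 0.9, 0.05                       # discount and step size (illustrative)

P = rng.dirichlet(np.ones(n_states), size=n_states)  # made-up transition matrix
r = rng.standard_normal(n_states)                    # made-up rewards r(x)
phi = lambda x: np.array([1.0, float(x)])            # illustrative features

w = np.zeros(n_features)
x = 0
for _ in range(100_000):
    x_next = rng.choice(n_states, p=P[x])        # sample the next state
    target = r[x] + beta * w @ phi(x_next)       # TD(0) target T
    w += alpha * (target - w @ phi(x)) * phi(x)  # Robbins-Monro / SGD step
    x = x_next
```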
Quick Analysis of TD(0).
Let's do an informal analysis of TD(0). For TD(0) our target is $T = r(x) + \beta \tilde{R}(\hat{x})$, where $\hat{x}$ is the next state after $x$. Under a linear function approximation this gives the update
$$w \leftarrow w + \alpha \big( r(x) + \beta\, w^\top \phi(\hat{x}) - w^\top \phi(x) \big) \phi(x).$$
Suppose that our Markov chain $(x_t : t \ge 0)$ is stationary with stationary distribution $\mu$. If we look at the expected change in our update term we get
$$\mathbb{E}\Big[ \big( r(x) + \beta\, w^\top \phi(\hat{x}) - w^\top \phi(x) \big) \phi(x) \Big] = \Phi^\top D r + \Phi^\top D (\beta P - I) \Phi w = b - A w.$$
Above, $\Phi$ is the $|\mathcal{X}| \times |\mathcal{J}|$ matrix with entries $\Phi_{xj} = \phi_j(x)$, and $D$ is the $|\mathcal{X}| \times |\mathcal{X}|$ diagonal matrix with diagonal entries given by $\mu(x)$. We use these to define the length-$|\mathcal{J}|$ vector $b = \Phi^\top D r$ and the $|\mathcal{J}| \times |\mathcal{J}|$ matrix $A = \Phi^\top D (I - \beta P) \Phi$ appearing above.
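In the small made-up example from before, these objects can be written down explicitly. A rough sketch (the chain, rewards and features are again invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, beta = 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)        # made-up transition matrix
r = rng.standard_normal(n_states)                           # made-up rewards r(x)
Phi = np.array([[1.0, float(x)] for x in range(n_states)])  # |X| x |J| feature matrix

# Stationary distribution mu: the left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu = mu / mu.sum()
D = np.diag(mu)                                             # |X| x |X| diagonal matrix

b = Phi.T @ D @ r                                           # length-|J| vector
A = Phi.T @ D @ (np.eye(n_states) - beta * P) @ Phi         # |J| x |J| matrix
```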
So, roughly, $w$ moves according to the differential equation
$$\frac{dw}{dt} = b - A w.$$
Now $P$ is the transition matrix of a Markov chain. Since its rows sum to $1$, its biggest eigenvalue is $1$. So we can expect that $\Phi^\top D (\beta P - I) \Phi$ is in some sense "negative"; specifically it can be shown that $\| P v \|_D \le \| v \|_D$ for all vectors $v$, where $\| v \|_D^2 := v^\top D v = \sum_x \mu(x) v(x)^2$. This then implies that
$$v^\top D (\beta P - I) v \le -(1 - \beta) \| v \|_D^2 < 0 \qquad \text{for } v \neq 0,$$
and hence, provided the features are linearly independent, $u^\top A u \ge (1 - \beta) \| \Phi u \|_D^2 > 0$ for $u \neq 0$. This is sufficient to give convergence of the above differential equation: take $w^*$ such that
$$A w^* = b$$
and take
$$L(w) = \tfrac{1}{2} \| w - w^* \|_2^2,$$
then
$$\frac{dL}{dt} = (w - w^*)^\top \frac{dw}{dt} = (w - w^*)^\top (b - A w) = -(w - w^*)^\top A (w - w^*) < 0 \qquad \text{for } w \neq w^*.$$
Thus we see that $w(t) \to w^*$.
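Under the same invented chain, rewards and features as in the earlier sketch, this limit can be checked numerically: $w^*$ solves $A w^* = b$, and an Euler discretisation of the differential equation drifts towards it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, beta = 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.standard_normal(n_states)
Phi = np.array([[1.0, float(x)] for x in range(n_states)])
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); mu = mu / mu.sum()
D = np.diag(mu)
b = Phi.T @ D @ r
A = Phi.T @ D @ (np.eye(n_states) - beta * P) @ Phi

w_star = np.linalg.solve(A, b)           # the fixed point A w* = b

w = np.zeros_like(w_star)
h = 0.1                                  # Euler step size (illustrative)
for _ in range(5000):
    w = w + h * (b - A @ w)              # Euler step of dw/dt = b - A w
print(np.linalg.norm(w - w_star))        # small: w(t) -> w*
```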
Convergence is great and everything, but we must verify that the solution obtained, $w^*$, is a "good" solution. First, notice that the reward function $R$ satisfies
$$R = r + \beta P R.$$
This is just the fixed point equation $R(x) = \mathbb{E}[\, r(x_0) + \beta R(x_1) \mid x_0 = x \,]$ with $R$ interpreted as a vector and the expectation as a matrix operation with respect to the transition matrix $P$.
Second, notice that the approximation $\Phi w$ that is closest to the reward function $R$, in the norm $\| \cdot \|_D$, is given by a projection, specifically
$$\Pi R := \Phi \big( \Phi^\top D \Phi \big)^{-1} \Phi^\top D\, R = \operatorname*{arg\,min}_{\tilde{R} \in \operatorname{span}(\Phi)} \| \tilde{R} - R \|_D.$$
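For the same invented example, the projection can be formed directly; here `Pi @ R` is the closest point to $R$ in the span of the features, measured in $\| \cdot \|_D$, with `R` computed from $R = r + \beta P R$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, beta = 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.standard_normal(n_states)
Phi = np.array([[1.0, float(x)] for x in range(n_states)])
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); mu = mu / mu.sum()
D = np.diag(mu)

R = np.linalg.solve(np.eye(n_states) - beta * P, r)     # R = r + beta P R
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)  # projection matrix
best = Pi @ R                                           # closest point to R in span(Phi)

# Sanity check: Pi @ R coincides with the D-weighted least squares fit.
w_ls = np.linalg.lstsq(np.sqrt(D) @ Phi, np.sqrt(D) @ R, rcond=None)[0]
assert np.allclose(Phi @ w_ls, best)
```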
Third, we see that the equations satisfied by $w^*$ can be expressed in a form somewhat similar to the above expression $R = r + \beta P R$. Specifically, we rearrange the expression $A w^* = b$:
$$\Phi^\top D (I - \beta P) \Phi w^* = \Phi^\top D r \;\; \Longrightarrow \;\; \Phi^\top D\, \Phi w^* = \Phi^\top D \big( r + \beta P \Phi w^* \big) \;\; \Longrightarrow \;\; \Phi w^* = \Pi \big( r + \beta P \Phi w^* \big).$$
So, while $R$ satisfies
$$R = r + \beta P R,$$
we see that $\Phi w^*$ satisfies
$$\Phi w^* = \Pi \big( r + \beta P \Phi w^* \big).$$
We can use these identities satisfied by $R$ and $\Phi w^*$ to show that the approximation $\Phi w^*$ is comparable to the best approximation of $R$, namely $\Pi R$. Since the map $v \mapsto r + \beta P v$ is a $\beta$-contraction in $\| \cdot \|_D$ and projections always move distances closer (both properties are relatively easy to verify):
$$\| \Phi w^* - R \|_D \le \| \Phi w^* - \Pi R \|_D + \| \Pi R - R \|_D = \| \Pi ( r + \beta P \Phi w^* ) - \Pi ( r + \beta P R ) \|_D + \| \Pi R - R \|_D \le \beta \| \Phi w^* - R \|_D + \| \Pi R - R \|_D.$$
So
$$\| \Phi w^* - R \|_D \le \frac{1}{1 - \beta} \| \Pi R - R \|_D.$$
Thus we see we reach an approximation that is a constant factor away from the "best" linear approximation.
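As a final sanity check on the same made-up example, the sketch below verifies the fixed point identity for $\Phi w^*$ and that $\| \Phi w^* - R \|_D \le \frac{1}{1-\beta} \| \Pi R - R \|_D$ numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, beta = 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)
r = rng.standard_normal(n_states)
Phi = np.array([[1.0, float(x)] for x in range(n_states)])
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); mu = mu / mu.sum()
D = np.diag(mu)

R = np.linalg.solve(np.eye(n_states) - beta * P, r)     # true reward function
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)  # projection onto span(Phi)
A = Phi.T @ D @ (np.eye(n_states) - beta * P) @ Phi
b = Phi.T @ D @ r
w_star = np.linalg.solve(A, b)                          # TD(0) limit

d_norm = lambda v: np.sqrt(v @ D @ v)                   # ||v||_D

assert np.allclose(Phi @ w_star, Pi @ (r + beta * P @ Phi @ w_star))  # fixed point identity
print(d_norm(Phi @ w_star - R), d_norm(Pi @ R - R) / (1 - beta))      # lhs <= rhs
```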
References.
The exposition here is due to Van Roy and Tsitsiklis.
Tsitsiklis, John N., and Benjamin Van Roy. "Analysis of temporal-difference learning with function approximation." Advances in Neural Information Processing Systems. 1997.