## Temporal Difference Learning – Linear Function Approximation

## Mixing Times

## Bayesian Online Learning

## Stochastic Linear Regression

## Q-learning

## Robbins-Monro

## Merton Portfolio Optimization

For a Markov chain $(X_t : t \geq 0)$, consider the reward function

$$R(x) = \mathbb{E}\left[ \sum_{t=0}^\infty \beta^t r(X_t) \,\middle|\, X_0 = x \right]$$

associated with rewards given by $r(x)$ and a discount factor $\beta \in (0,1)$. We approximate the reward function with a linear approximation,

$$R(x) \approx \phi(x)^\top w,$$

for a feature vector $\phi(x)$ and a weight vector $w$.
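As a sketch of this setup, the toy example below runs TD(0) with a linear approximation on a small Markov chain. The chain, rewards, discount factor and one-hot features are all illustrative choices, not taken from the post; with one-hot features the algorithm's fixed point is the exact reward function, which gives a sanity check.

```python
import numpy as np

# Hypothetical 3-state Markov chain: transition matrix P, rewards r,
# discount beta, and one-hot features phi (all illustrative choices).
rng = np.random.default_rng(0)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 2.0])
beta = 0.9
phi = np.eye(3)          # feature map: phi(x) is the x-th standard basis vector

w = np.zeros(3)          # weights of the linear approximation R(x) ~ phi(x)^T w
x = 0
for t in range(50000):
    x_next = rng.choice(3, p=P[x])
    # TD(0) update: move w along phi(x) by the temporal-difference error
    td_error = r[x] + beta * phi[x_next] @ w - phi[x] @ w
    w += (0.5 / (1 + t) ** 0.6) * td_error * phi[x]
    x = x_next

# with one-hot features the fixed point is the exact reward function
R_exact = np.linalg.solve(np.eye(3) - beta * P, r)
print(w, R_exact)
```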

In loose terms, the mixing time is the amount of time to wait before you can expect a Markov chain to be close to its stationary distribution. We give an upper bound for this.
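As a rough illustration of the definition, the snippet below computes the worst-case total variation distance to stationarity for a small two-state chain (an invented example, not from the post) and reads off the mixing time at the usual threshold of 1/4.

```python
import numpy as np

# Hypothetical two-state chain (illustrative numbers only).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# stationary distribution: the left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

def tv_from_stationarity(t):
    """Worst-case total variation distance between P^t(x, .) and pi."""
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(len(pi)))

# mixing time: smallest t with worst-case TV distance at most 1/4
t_mix = next(t for t in range(1, 200) if tv_from_stationarity(t) <= 0.25)
print(t_mix, tv_from_stationarity(t_mix))
```

For this chain the second eigenvalue is 0.7, so the distance decays geometrically and the threshold is crossed after a handful of steps.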

We briefly describe an online Bayesian framework which is sometimes referred to as Assumed Density Filtering (ADF). We also review a heuristic proof of its convergence in the Gaussian case.
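A minimal sketch of the Gaussian case: for a linear-Gaussian observation model the ADF projection step is exact, so the update below is just the conjugate Gaussian update applied one observation at a time. All model parameters here are illustrative, not from the post.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: scalar observations y = theta^T x + noise, Gaussian
# prior and noise, so the ADF projection back to a Gaussian is exact.
d, sigma2 = 2, 0.25
theta_true = np.array([1.0, -2.0])

mu = np.zeros(d)          # current Gaussian posterior mean
Sigma = 10.0 * np.eye(d)  # current Gaussian posterior covariance

for _ in range(500):
    x = rng.normal(size=d)
    y = theta_true @ x + rng.normal(scale=np.sqrt(sigma2))
    # ADF / conjugate update: condition the Gaussian on one observation
    s = Sigma @ x
    denom = sigma2 + x @ s
    mu = mu + s * (y - x @ mu) / denom
    Sigma = Sigma - np.outer(s, s) / denom

print(mu)
```

The posterior mean concentrates on the true parameter as observations accumulate, which is the behaviour the heuristic convergence argument formalises.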

We consider the following formulation of Lai, Robbins and Wei (1979), and Lai and Wei (1982). Consider the regression problem,

$$y_n = \beta^\top x_n + \epsilon_n$$

for $n = 1, 2, \dots$, where $\epsilon_n$ are unobservable random errors and $\beta$ is a vector of unknown parameters.

Typically, for a regression problem, it is assumed that the inputs $x_n$ are given and the errors $\epsilon_n$ are IID random variables. However, we now want to consider a setting where we sequentially choose inputs $x_n$ and then receive observations $y_n$, and the errors $(\epsilon_n)$ are a martingale difference sequence with respect to the filtration generated by the past inputs and observations.
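To make the sequential setting concrete, here is a hedged sketch of recursive least squares in which the input at each step may depend on the data seen so far. The design rule and parameter values are invented for illustration; the point is only that the inputs need not be i.i.d.

```python
import numpy as np

rng = np.random.default_rng(2)

beta_true = np.array([2.0, -1.0])   # unknown parameters (illustrative values)

# Running least-squares statistics; V is regularised to stay invertible.
V = 1e-3 * np.eye(2)   # running sum of x_n x_n^T
b = np.zeros(2)        # running sum of y_n x_n
beta_hat = np.zeros(2)

for n in range(2000):
    # sequential design: alternate exploration with an input that depends
    # on the current estimate (any rule measurable w.r.t. the past is allowed)
    if n % 2 == 0:
        x = rng.normal(size=2)
    else:
        x = np.array([1.0, np.tanh(beta_hat @ np.ones(2))])
    eps = rng.normal()                 # martingale-difference error
    y = beta_true @ x + eps
    V += np.outer(x, x)
    b += y * x
    beta_hat = np.linalg.solve(V, b)

print(beta_hat)
```

The exploration steps provide enough excitation for the least-squares estimate to be consistent, which is the kind of condition the Lai-Wei results make precise.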

Q-learning is an algorithm that contains many of the basic structures required for reinforcement learning and acts as the basis for many more sophisticated algorithms. The Q-learning algorithm can be seen as an (asynchronous) implementation of the Robbins-Monro procedure for finding fixed points. For this reason, we will require results from Robbins-Monro when proving convergence.
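A minimal tabular sketch on an invented two-state, two-action MDP. The step sizes decay per state-action pair, reflecting the asynchronous Robbins-Monro view: each entry of the Q-table is updated only when its state-action pair is visited.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 2-state, 2-action MDP (illustrative numbers).
# P[s, a] gives next-state probabilities, R[s, a] the expected reward.
P = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
counts = np.zeros((2, 2))   # per-(s, a) visit counts for asynchronous step sizes
s = 0
for _ in range(200000):
    a = rng.integers(2)     # behaviour policy: explore uniformly
    s_next = rng.choice(2, p=P[s, a])
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a] ** 0.7
    # Robbins-Monro style step towards the Bellman operator's fixed point
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

# exact Q* by value iteration, for comparison
Q_star = np.zeros((2, 2))
for _ in range(500):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

print(Q, Q_star)
```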

We review a method for finding fixed points, then extend it to slightly more general, modern proofs. This is a much more developed version of an earlier post. We now cover the basic Robbins-Monro proof, the Robbins-Siegmund Theorem, Stochastic Gradient Descent, and asynchronous updates (as required for Q-learning).
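The basic procedure can be sketched in a few lines: given only noisy evaluations of a function, iterate towards its root with step sizes $a_n$ satisfying $\sum_n a_n = \infty$ and $\sum_n a_n^2 < \infty$. The target function below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Robbins-Monro: find the root of f(x) = E[F(x, noise)] from noisy
# evaluations only. Here f(x) = x - 2 (root at 2), plus zero-mean noise.
def noisy_f(x):
    return (x - 2.0) + rng.normal()

x = 0.0
for n in range(1, 200001):
    # step sizes a_n = 1/n satisfy sum a_n = inf, sum a_n^2 < inf
    x -= (1.0 / n) * noisy_f(x)

print(x)
```

Stochastic gradient descent is the special case where the noisy evaluations are gradients of a loss, and the asynchronous variant updates only one coordinate at a time, as in Q-learning.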

- HJB equation for the Merton problem; CRRA utility solution; proof of optimality.
- Multiple assets; dual value function approach.
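For reference, the classical form of these objects for a single risky asset with drift $\mu$, volatility $\sigma$ and riskless rate $r$, written in standard notation (the notation is mine and need not match the post's):

```latex
% Merton problem with CRRA utility U(c) = c^{1-\gamma}/(1-\gamma).
% With wealth w, consumption c and risky fraction \pi, the HJB equation is
\[
0 = \max_{\pi, c} \Big\{ U(c) - \rho V
    + \big( r w + \pi w (\mu - r) - c \big) V_w
    + \tfrac{1}{2} \pi^2 w^2 \sigma^2 V_{ww} \Big\},
\]
% and the optimal fraction of wealth in the risky asset is the Merton ratio
\[
\pi^\star = \frac{\mu - r}{\gamma \sigma^2}.
\]
```

The constant Merton ratio is the hallmark of the CRRA solution: the investor holds a fixed fraction of wealth in the risky asset regardless of the wealth level.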