Q-learning is an algorithm, that contains many of the basic structures required for reinforcement learning and acts as the basis for many more sophisticated algorithms. The Q-learning algorithm can be seen as an (asynchronous) implementation of the Robbins-Monro procedure for finding fixed points. For this reason we will require results from Robbins-Monro when proving convergence.

A key ingredient is the notion of a Q-factor as described in Section [IDP]. Recall that optimal Q-factor, Q(x,a), is the value of starting in state x taking action a and thereafter following the optimal policy. In Infinite Time Horizon, Prop 2 we showed that this solved the recursion:

Def. [Q-learning] Given a state x, an action a, its reward r(x,a) and the next state \hat{x}, Q-learning performs the update

where \alpha positive (learning rate) parameter. Recall x \xleftarrow[]{\alpha} dx means reset x with x' such that x' = x + \alpha dx.

To implement this as an algorithm, we assume that we have a sequence of state-action-reward-next_state quadruplets \{ (x_t,a_t,r_t,\hat{x}_t) \}_{t=0}^\infty and we apply the above update to each of the terms in this sequence.

Thrm 1. For a sequence of state-action-reward triples \{ (x_t,a_t,r_t,\hat{x}_t) \}_{t=0}^\infty Consider the Q-learning update for (x,a,r,\hat{x}) = (x_t,a_t,r_t,\hat{x}_t)

if the sequence of state-action-reward triples visits each state and action infinitely often, and if the learning rate \alpha_t(x,a) is an adapted sequence satisfying the Robbins-Monro condition

then, with probability 1,

where Q^*(x,a) is the optimal value function.

Proof. We essentially show that the result is a consequence of Theorem 3 in Robbins-Monro. We note that the optimal Q-function, \mathbf Q = (Q(x,a) : x\in \mathcal X, a\in \mathcal A) satisfies a fixed point equation


Screenshot 2019-01-26 at 17.23.06.png

for each x\in \mathcal X and a\in \mathcal A. We know from Prop 2 in the post Infinite Time Horizon, that for discounted programming \mathbf F (\cdot) is a contraction. I.e.

Now notice that the Q-learning algorithm performs the update


for (x_t,a_t,r_t,\hat x_t )=(x,a,r,\hat{x}). The update above is a Robbin’s Monro update. Further Notice Q(x',a') remains the same for all other values of x,a, the update is asynchronous. It is not hard to see that when we condition on \mathcal F_t the set of previous actions and states that

and, a quick calculation shows,1 that

From this we see that we are working in the setting of Theorem 3 in Robbins-Monro and that the conditions of that theorem are satisfied. Thus it must be that

where Q^*(x,a) satisfies \mathbf Q^* = \mathbf F(\mathbf Q^*). In otherwords, as required, it satisfies the Bellman equation for the optimal Q-function and thus is optimal. \square


The argument here is depends heavily on the fixed point results. But the main source used is

Tsitsiklis, John N. “Asynchronous stochastic approximation and Q-learning.” Machine learning 16.3 (1994): 185-202.


  1. Note (x+y)^2 \leq 2 x^2+2 y^2

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: