Q-learning is an algorithm that contains many of the basic structures required for reinforcement learning and acts as the basis for many more sophisticated algorithms. The Q-learning algorithm can be seen as an (asynchronous) implementation of the Robbins-Monro procedure for finding fixed points. For this reason, we will require results on Robbins-Monro when proving convergence.

A key ingredient is the notion of a Q-factor, as described in Section [IDP]. Recall that the optimal Q-factor, Q(x,a), is the value of starting in state x, taking action a, and thereafter following the optimal policy. In Infinite Time Horizon, Prop 2, we showed that this solves the recursion

Q(x,a) = r(x,a) + \beta \sum_{\hat x \in \mathcal X} P(\hat x \,|\, x,a) \max_{\hat a \in \mathcal A} Q(\hat x, \hat a) .

Def. [Q-learning] Given a state x, an action a, its reward r(x,a) and the next state \hat{x}, Q-learning performs the update

Q(x,a) \xleftarrow[]{\alpha} r(x,a) + \beta \max_{\hat a \in \mathcal A} Q(\hat x, \hat a) - Q(x,a)

where \alpha is a positive (learning rate) parameter. Recall that x \xleftarrow[]{\alpha} dx means we reset x to x' where x' = x + \alpha\, dx.

To implement this as an algorithm, we assume that we have a sequence of state-action-reward-next-state quadruples \{ (x_t,a_t,r_t,\hat{x}_t) \}_{t=0}^\infty and we apply the above update to each term in this sequence.
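As a concrete, illustrative sketch, the update loop might look as follows in Python. The environment interface (`env.reset()`, `env.step(a)` returning a next state and reward) is a hypothetical stand-in for whatever generates the quadruples, and the \epsilon-greedy rule is one common way of ensuring every state-action pair keeps being visited:

```python
import random
from collections import defaultdict

def q_learning(env, actions, beta=0.9, alpha=0.1, episodes=500, horizon=100, eps=0.1):
    """Tabular Q-learning sketch; beta is the discount, alpha a constant learning rate."""
    Q = defaultdict(float)  # Q[(x, a)], initialised to zero
    for _ in range(episodes):
        x = env.reset()
        for _ in range(horizon):
            # epsilon-greedy action choice, so each (x, a) keeps being visited
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda u: Q[(x, u)])
            x_hat, r = env.step(a)
            # the Q-learning update: Q(x,a) <-alpha- r + beta * max_a' Q(x_hat, a') - Q(x,a)
            td = r + beta * max(Q[(x_hat, u)] for u in actions) - Q[(x, a)]
            Q[(x, a)] += alpha * td
            x = x_hat
    return Q
```

With a decaying, per-pair learning rate in place of the constant `alpha`, this is exactly the scheme covered by the convergence theorem below.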

Thrm 1. Given a sequence of state-action-reward-next-state quadruples \{ (x_t,a_t,r_t,\hat{x}_t) \}_{t=0}^\infty, consider the Q-learning update for (x,a,r,\hat{x}) = (x_t,a_t,r_t,\hat{x}_t):

Q(x_t,a_t) \xleftarrow[]{\alpha_t(x_t,a_t)} r_t + \beta \max_{\hat a \in \mathcal A} Q(\hat x_t, \hat a) - Q(x_t,a_t) .

If the sequence visits each state and action infinitely often, and if the learning rate \alpha_t(x,a) is an adapted sequence satisfying the Robbins-Monro conditions

\sum_{t=0}^\infty \alpha_t(x,a) = \infty, \qquad \sum_{t=0}^\infty \alpha_t(x,a)^2 < \infty,

then, with probability 1,

Q_t(x,a) \xrightarrow[t\to\infty]{} Q^*(x,a) \quad \text{for all } x \in \mathcal X,\ a \in \mathcal A,

where Q^*(x,a) is the optimal Q-factor.
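A standard choice satisfying the Robbins-Monro conditions is \alpha_t(x,a) = 1/n_t(x,a), where n_t(x,a) counts visits to the pair (x,a): the harmonic series diverges while the sum of its squares converges. A quick numerical sanity check (illustrative only):

```python
def rm_partial_sums(N):
    """Partial sums of sum 1/n (diverges, ~ log N) and sum 1/n^2 (converges to pi^2/6)."""
    s = sum(1.0 / n for n in range(1, N + 1))
    s_sq = sum(1.0 / n ** 2 for n in range(1, N + 1))
    return s, s_sq
```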

Proof. We essentially show that the result is a consequence of Theorem 3 in Robbins-Monro. We note that the optimal Q-function, \mathbf Q = (Q(x,a) : x\in \mathcal X, a\in \mathcal A), satisfies the fixed point equation \mathbf Q = \mathbf F(\mathbf Q), where

(\mathbf F \mathbf Q)(x,a) = r(x,a) + \beta \sum_{\hat x \in \mathcal X} P(\hat x \,|\, x,a) \max_{\hat a \in \mathcal A} Q(\hat x, \hat a)

for each x\in \mathcal X and a\in \mathcal A. We know from Prop 2 in the post Infinite Time Horizon that, for discounted programming, \mathbf F (\cdot) is a contraction, i.e.

\| \mathbf F(\mathbf Q) - \mathbf F(\mathbf Q') \|_\infty \leq \beta \| \mathbf Q - \mathbf Q' \|_\infty .
Now notice that the Q-learning algorithm performs the update

Q(x,a) \xleftarrow[]{\alpha_t(x,a)} r(x,a) + \beta \max_{\hat a \in \mathcal A} Q(\hat x, \hat a) - Q(x,a)

for (x_t,a_t,r_t,\hat x_t )=(x,a,r,\hat{x}). The update above is a Robbins-Monro update. Further, notice that Q(x',a') remains the same for all other pairs (x',a') \neq (x,a); the update is asynchronous. It is not hard to see that, conditional on \mathcal F_t, the set of previous states and actions,

\mathbb E\left[\, r(x,a) + \beta \max_{\hat a \in \mathcal A} Q(\hat x, \hat a) \,\middle|\, \mathcal F_t \right] = (\mathbf F \mathbf Q)(x,a)

and, a quick calculation shows,1 that

\mathbb E\left[ \Big( r(x,a) + \beta \max_{\hat a \in \mathcal A} Q(\hat x, \hat a) \Big)^2 \,\middle|\, \mathcal F_t \right] \leq 2 r_{\max}^2 + 2 \beta^2 \| \mathbf Q \|_\infty^2 .
From this we see that we are working in the setting of Theorem 3 in Robbins-Monro and that the conditions of that theorem are satisfied. Thus it must be that

Q_t(x,a) \xrightarrow[t\to\infty]{} Q^*(x,a) \quad \text{with probability } 1,

where \mathbf Q^* satisfies \mathbf Q^* = \mathbf F(\mathbf Q^*). In other words, as required, it satisfies the Bellman equation for the optimal Q-function and thus is optimal. \square
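The contraction property of \mathbf F is the crux of the proof, and it can be checked numerically on a randomly generated toy MDP. The sketch below is illustrative only; the array shapes and the name `bellman_F` are my own choices, not from the source:

```python
import numpy as np

def bellman_F(Q, P, r, beta):
    """(F Q)(x,a) = r(x,a) + beta * sum_x' P(x'|x,a) * max_a' Q(x',a')."""
    V = Q.max(axis=1)                               # V(x') = max_a' Q(x', a')
    return r + beta * np.einsum('xay,y->xa', P, V)  # expected discounted next-step value

# random toy MDP: 3 states, 2 actions, discount beta = 0.9
rng = np.random.default_rng(0)
X, A, beta = 3, 2, 0.9
P = rng.random((X, A, X))
P /= P.sum(axis=2, keepdims=True)  # normalise rows: a valid transition kernel
r = rng.random((X, A))

# applying F shrinks the sup-norm distance between any two Q-tables by at least beta
Q1, Q2 = rng.random((X, A)), rng.random((X, A))
gap_after = np.abs(bellman_F(Q1, P, r, beta) - bellman_F(Q2, P, r, beta)).max()
gap_before = np.abs(Q1 - Q2).max()
assert gap_after <= beta * gap_before + 1e-12
```

The assertion holds for every draw, not just this seed: the reward terms cancel, and averaging over P and taking a max are both nonexpansive in the sup-norm.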


The argument here depends heavily on the fixed point results. The main source used is

Tsitsiklis, John N. “Asynchronous stochastic approximation and Q-learning.” Machine learning 16.3 (1994): 185-202.


  1. Note that (x+y)^2 \leq 2 x^2 + 2 y^2.
