- Weighted majority algorithm its variant for Bandit Problems.
We consider the problem of learning the ‘best’ action amongst a fixed reference set. Although our modelling assumptions will be quite different, we follow notation is somewhat similar to Section [DP]. We consider the following setting:
Consider actions and outcomes . After choosing an action , an outcome occurs and you receives a reward . We assume that the set of actions in of size .
Def [Policy] Over time outcomes , occur. We do not (currently) assume that there is an underlying stochastic process determining the values of ; they may be chosen arbitrarily.
At each time , a policy chooses an action as a function of the past outcomes , and their rewards
In other words at each time the past states and rewards are known the policy. The policy then must do the best it can to accumulate rewards.
Def [Rewards/Regret] The total reward by time of policy is
The regret of policy is
- It is good to think of outcome as an outcome in a probability space. (Indeed sometimes we will assume this.)
- It is good to think of rewards in the same way as we consider rewards for an MDP (Though since there is no state we do not require notation for ).
- The regret compare how well our policy compares with the best fixed reward. I.e. retrospectively, we “regret” not making the best fixed choice.
- A regret of zero as means we do as well as the best fixed choice. This is sometimes called Hannan consistency.
We are interested in how a policy performs in comparison to each fixed policy , that is a policy that chooses the same action at each time. Notice this is a weaker assumption compared to finding the best of all policies. This weaker notion of optimality is required because we are making much weaker assumptions about the evolution of the process .
Ex 1. [A Regret Lower-Bound] Consider that there are two outcomes and two actions where and . I.e. There is an action to go to or and if you match the outcome you get a reward of , otherwise the reward is .
Suppose that each is chosen uniformly at random from and is independent. Show that, regardless of what policy is used,
for some constant . [Hint: Use the central limit theorem.]
Ans 1. Note that, since rewards are IID with probability for any policy we have that
Further note that thus
We know that is the error of an iid sequence about its mean so the central limit theorem applies convergence in distribution
These last to observations give the result.
The Weighted Majority Algorithm
We now consider a policy, that can achieve the lower bound suggested by [[OL:Ran]].
We prove the following result for the Weighted Majority Algorithm.
Other related results are also possible. We also require the following probability bound:
Lemma [Hoeffding’s Inequality] For a Random variable bounded on
Ex 2 Prove that
Ex 3 [Continued] Show that
Ans 3 Apply Hoeffding’s Inequality.
Ex 4 [Continued]
Ans 4 Trivial from definitions.
Ex 5 [Continued]
Ans 5 Combine  and , take logs and rearrange.
Ex 6 [Continued] Finally show that
Ans 6 Minimize  over .
Multi-armed Bandits: Non-Stochastic
We consider a multi-armed bandit setting for online learning.
Def [Bandit Policy] A bandit policy is a policy, cf. [[OL:Policy]], where is a function of the previous actions chosen , and their costs . I.e. you cannot observe what cost would have happened if you had chosen a different action.
We consider a version of the weighted majority algorithm for this multi-armed bandit problem.
Note here that, since the policy choice is random, is a random variable. We show the following
and for appropriate
First we collect together a few facts then the proof proceeds very similar to the weight majority algorithm proof.
Ex 7 Show that for
(We do the result for losses rather than rewards because of this bound.1)
Ans 7 Follows as 4th term in the Taylor expansion is positive.
Ex 8 Show that
Ex 9 Show that
Ans 9 Similar to ,
Ex 10 Show that
(Note: here we are not taking an expectation with respect to but we are just taking a sum weighted by .)
Ex 11 Show that
Ans 11 This is the same as [2-3] but use  at the last step instead of Hoeffding’s Inequality. Specifically,
Now multiply for .
Ex 12 Show that for all
Ans 12 Similar to  ,
Now take logs.
Ex 13 Show that
Ans 13 Combine inequalities  and  and taking logs gives
and rearrange. (Note: you need to use that for all .)
Ex 14 Show that
Ans 14 Take expectations on both sides of 
(Here the first “[??]” is Ex 8 the second is Ex 9) Now minimize the lefthand-side to give the regret.
Ex 15. Show that, for an appropriate choice of ,
Ans 15. Now minimize over the righthand bound in .
- We could modify the policy for rewards by un-elegantly subtracting from the max reward.↩