Experts and Bandits (non-stochastic)

  • The weighted majority algorithm and its variant for bandit problems.

We consider the problem of learning the ‘best’ action amongst a fixed reference set. Although our modelling assumptions will be quite different, we follow notation somewhat similar to that of Section [DP]. We consider the following setting:

Consider actions a\in {\mathcal A} and outcomes \omega \in \Omega. After choosing an action a, an outcome \omega occurs and you receive a reward r(a;\omega)\in [0,1]. We assume that the set of actions is of size N.

Def [Policy] Over time outcomes \omega_t, t=1,...,T occur. We do not (currently) assume that there is an underlying stochastic process determining the values of \omega_t; they may be chosen arbitrarily.

At each time t, a policy \Pi chooses an action \pi_t as a function of the past outcomes \omega_s, s=1,...,t-1, and their rewards r(\cdot;\omega_s).

In other words, at each time the past outcomes and their rewards are known to the policy. The policy must then do the best it can to accumulate rewards.
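To make the interaction concrete, here is a minimal sketch in Python of the repeated game between a policy and an arbitrary outcome sequence. The names play, actions and reward are illustrative choices rather than notation from the text; the policy is simply any function of the revealed history.

    # Sketch of the online protocol: the policy is any function of the history of
    # revealed reward functions r(. ; omega_s), s < t (full-information feedback).
    def play(policy, actions, outcomes, reward):
        """Run T = len(outcomes) rounds and return the policy's total reward."""
        history = []                            # one dict {a: r(a; omega_s)} per past round
        total = 0.0
        for omega in outcomes:
            a = policy(history)                 # pi_t depends on the past only
            total += reward(a, omega)           # earn r(a; omega) in [0, 1]
            history.append({b: reward(b, omega) for b in actions})
        return total

For instance, in the matching game of Ex 1 below one could take actions = ["A", "B"] and reward = lambda a, omega: 1.0 if a == omega else 0.0.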

Def [Rewards/Regret] The total reward by time T of policy \Pi is

R_T(\Pi) = \sum_{t=1}^T r(\pi_t;\omega_t) .

The regret of policy \Pi is

\text{Regret}_T(\Pi) = \max_{a\in{\mathcal A}} R_T(a) - R_T(\Pi), \qquad \text{where } R_T(a) = \sum_{t=1}^T r(a;\omega_t).

  • It is good to think of outcome \omega_t as an outcome in a probability space. (Indeed sometimes we will assume this.)
  • It is good to think of rewards R_T(\Pi) in the same way as we consider rewards for an MDP, R_T(x,\Pi) (though, since there is no state, we do not require the notation x).
  • The regret measures how well our policy compares with the best fixed action; i.e., retrospectively, we “regret” not making the best fixed choice.
  • A regret that is sublinear in T, so that the time-averaged regret vanishes as T\rightarrow \infty, means we do (asymptotically) as well as the best fixed choice. This is sometimes called Hannan consistency (made precise in the display below).
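Precisely, Hannan consistency requires the time-averaged regret to vanish; in the notation above (the mode of convergence, almost sure or in expectation, depends on what is assumed about the sequence \omega_t),

\limsup_{T\rightarrow\infty} \, \frac{1}{T}\Big( \max_{a\in{\mathcal A}} R_T(a) - R_T(\Pi) \Big) \,\leq\, 0 .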

We are interested in how a policy \Pi performs in comparison to each fixed policy a \in {\mathcal A}, that is, a policy that chooses the same action a at each time. Notice this is a weaker requirement than finding the best among all policies. This weaker notion of optimality is needed because we make much weaker assumptions about the evolution of the process \omega_t.

Ex 1. [A Regret Lower-Bound] Suppose there are two outcomes \omega=A,B and two actions a=A,B, where [r(A;A),r(A;B)] = [1,0] and [r(B;A),r(B;B)] = [0,1]. I.e., you choose to go to A or B, and if your action matches the outcome you get a reward of 1; otherwise the reward is 0.

Suppose that each \omega_t is chosen uniformly at random from \{A,B\}, independently over time. Show that, regardless of what policy is used,

\mathbb{E}\big[\text{Regret}_T(\Pi)\big] = \mathbb{E}\Big[ \max_{a\in\{A,B\}} R_T(a) - R_T(\Pi) \Big] \geq C\sqrt{T}

for some constant C>0 and all T sufficiently large. [Hint: Use the central limit theorem.]

Ans 1. Note that, since rewards are IID Bernoulli with success probability \frac{1}{2}, for any policy \Pi we have that

\mathbb{E}\, R_T(\Pi) = \frac{T}{2} .

Further note that R_T(A) + R_T(B)=T, thus

\max\{ R_T(A), R_T(B) \} = \frac{T}{2} + \epsilon_T , \qquad \text{where } \epsilon_T := \Big| R_T(A) - \frac{T}{2} \Big| .

We know that \epsilon_T is the error of an IID sequence about its mean, so the central limit theorem gives convergence in distribution,

\frac{\epsilon_T}{\sqrt{T}} \;\xrightarrow{\;d\;}\; |Z| , \qquad Z\sim N\big(0,\tfrac{1}{4}\big),

and hence

\mathbb{E}\,\epsilon_T \geq C\sqrt{T}

for some constant C>0 and all T sufficiently large. These last two observations give the result.
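As a purely illustrative check of this \sqrt{T} growth (not part of the argument above), the following short simulation estimates the expected regret when the outcomes are IID uniform and the policy guesses uniformly at random; the function name expected_regret and the number of Monte Carlo runs are arbitrary choices.

    import random

    def expected_regret(T, runs=2000):
        """Monte Carlo estimate of E[ max(R_T(A), R_T(B)) - R_T(Pi) ] when the
        omega_t are IID uniform on {A, B} and Pi also guesses uniformly at random."""
        total = 0.0
        for _ in range(runs):
            reward_A = reward_B = reward_pi = 0
            for _ in range(T):
                omega = random.choice("AB")
                a = random.choice("AB")          # any policy earns 1/2 per round on average
                reward_A += (omega == "A")       # reward of always playing A
                reward_B += (omega == "B")       # reward of always playing B
                reward_pi += (a == omega)        # reward of the policy
            total += max(reward_A, reward_B) - reward_pi
        return total / runs

Dividing expected_regret(T) by \sqrt{T} for increasing T should settle near a constant; the central limit theorem calculation above suggests roughly 1/\sqrt{2\pi}\approx 0.4.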

The Weighted Majority Algorithm

We now consider a policy that can achieve the lower bound suggested by [[OL:Ran]].
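As a sketch of what such a policy looks like in practice, here is the standard exponential-weights form, in which action a is played at time t with probability proportional to \exp\big(\eta \sum_{s<t} r(a;\omega_s)\big). The implementation details below (and the function name exponential_weights) are illustrative assumptions rather than a verbatim transcription of the algorithm analysed here.

    import math
    import random

    def exponential_weights(rewards, eta):
        """Weighted-majority / exponential-weights sketch.

        rewards[t][a] = r(a; omega_t) in [0, 1] for each of the N actions; in this
        full-information setting the whole row is revealed after round t."""
        N = len(rewards[0])
        cumulative = [0.0] * N                               # sum_{s<t} r(a; omega_s)
        actions = []
        for row in rewards:
            m = max(cumulative)                              # shift for numerical stability
            weights = [math.exp(eta * (c - m)) for c in cumulative]
            total = sum(weights)
            probs = [w / total for w in weights]
            a = random.choices(range(N), weights=probs)[0]   # sample pi_t from P_t
            actions.append(a)
            for b in range(N):                               # full information: update every action
                cumulative[b] += row[b]
        return actions

Taking \eta=\sqrt{2T^{-1}\log(N)} matches the choice in Thrm 1 below.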

We prove the following result for the Weighted Majority Algorithm.

Thrm 1

and, with \eta=\sqrt{2T^{-1}\log(N)},

Other related results are also possible. We also require the following probability bound:

Lemma [Hoeffding’s Inequality] For a random variable X bounded on [a,b],

\mathbb{E}\big[ e^{\eta X} \big] \leq \exp\Big( \eta\, \mathbb{E} X + \frac{\eta^2 (b-a)^2}{8} \Big) .

Ex 2 Prove that

Ans 2 

Ex 3 [Continued] Show that

Ans 3 Apply Hoeffding’s Inequality.

Ex 4 [Continued]

Ans 4 Trivial from definitions.

Ex 5 [Continued]

Ans 5 Combine [3] and [4], take logs and rearrange.

Ex 6 [Continued] Finally show that

Ans 6 Minimize [5] over \eta.
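The minimization over \eta in [6], and again in [15] below, is an instance of the following elementary calculation, stated with placeholder constants A, B > 0:

\min_{\eta > 0}\Big\{ \eta A + \frac{B}{\eta} \Big\} \;=\; 2\sqrt{AB}, \qquad \text{attained at } \eta^{*} = \sqrt{B/A} .

For example, A = T/2 and B = \log N give \eta^{*} = \sqrt{2T^{-1}\log(N)}, the choice appearing in Thrm 1.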

Multi-armed Bandits: Non-Stochastic

We consider a multi-armed bandit setting for online learning.

Def [Bandit Policy] A bandit policy is a policy, cf. [[OL:Policy]], where \pi_t is a function of the previous actions chosen, a_s, s=1,...,t-1, and their costs c(a_s; \omega_s). I.e., you cannot observe the cost that would have occurred had you chosen a different action.

We consider a version of the weighted majority algorithm for this multi-armed bandit problem.
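One standard such variant, often called Exp3, keeps exponential weights on importance-weighted cost estimates, \hat{c}_t(a) = \frac{c(a_t;\omega_t)}{P_t(a_t)}\,\mathbf{1}[a = a_t]. The sketch below follows this common form; its details (in particular the absence of explicit extra exploration) should be read as assumptions rather than the definitive algorithm.

    import math
    import random

    def exp3(T, N, cost, eta):
        """Bandit exponential-weights sketch (Exp3-style).

        cost(a, t) returns c(a; omega_t) in [0, 1]; the policy evaluates it only at
        the single action it actually plays, reflecting bandit feedback."""
        cumulative_hat = [0.0] * N                  # sum_{s<t} hat_c_s(a)
        chosen = []
        for t in range(T):
            m = min(cumulative_hat)                 # shift for numerical stability
            weights = [math.exp(-eta * (c - m)) for c in cumulative_hat]
            total = sum(weights)
            probs = [w / total for w in weights]    # P_t(a)
            a = random.choices(range(N), weights=probs)[0]
            chosen.append(a)
            c_obs = cost(a, t)                      # the only cost observed this round
            cumulative_hat[a] += c_obs / probs[a]   # hat_c_t(b) = 0 for every b != a
        return chosen

Conditioned on the past, the importance-weighted estimate \hat{c}_t(a) is unbiased for c(a;\omega_t), which is the key property behind the estimates used below.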

Note here that, since the policy choice \pi_t is random, \hat{c}_t(a ; \omega_t) is a random variable. We show the following

 

and for appropriate \eta

First we collect together a few facts; then the proof proceeds very similarly to the weighted majority algorithm proof.

Ex 7 Show that for c\geq 0,

e^{-c} \leq 1 - c + \frac{c^2}{2} .

(We do the result for losses rather than rewards because of this bound.1)

Ans 7 Follows since the Taylor remainder after the quadratic term is non-positive when c\geq 0; a direct verification is sketched below.
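A direct way to verify the inequality in [7], with f introduced only for this check, is:

f(c) := 1 - c + \frac{c^2}{2} - e^{-c}, \qquad f(0) = 0, \qquad f'(c) = -1 + c + e^{-c} \geq 0 \text{ for } c \geq 0,

since e^{-c} \geq 1 - c; hence f(c) \geq 0 for all c \geq 0.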

Ex 8 Show that

Ans 8

Ex 9 Show that

Ans 9 Similar to [8],

Ex 10 Show that

(Note: here we are not taking an expectation with respect to \pi_t but we are just taking a sum weighted by P_t(a).)

Ans 10

Ex 11 Show that

Ans 11 This is the same as [2-3] but use [7] at the last step instead of Hoeffding’s Inequality. Specifically,

Now multiply for t=1,...,T.

Ex 12 Show that for all a

Ans 12 Similar to [4],

Now take logs.

Ex 13 Show that

Ans 13 Combining inequalities [11] and [12] and taking logs gives

and then rearrange. (Note: you need to use that \log (1 + x) \leq x for all x > -1.)
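The logarithm bound quoted in the note follows from the elementary inequality e^{y} \geq 1 + y: taking y = \log(1+x) for x > -1 gives

1 + x = e^{\log(1+x)} \geq 1 + \log(1+x), \qquad \text{i.e. } \log(1+x) \leq x .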

Ex 14 Show that

Ans 14 Take expectations on both sides of [13]

(Here the first “[??]” refers to Ex 8 and the second to Ex 9.) Now minimize the left-hand side to give the regret.

Ex 15. Show that, for an appropriate choice of \eta,

Ans 15. Minimize over \eta the right-hand bound in [14].


  1. We could modify the policy to handle rewards by, somewhat inelegantly, subtracting each reward from the maximum reward so as to obtain a loss, e.g. c(a;\omega) = 1 - r(a;\omega) for rewards in [0,1].
