# Experts and Bandits (non-stochastic)

• Weighted majority algorithm its variant for Bandit Problems.

We consider the problem of learning the ‘best’ action amongst a fixed reference set. Although our modelling assumptions will be quite different, we follow notation is somewhat similar to Section [DP]. We consider the following setting:

Consider actions $a\in {\mathcal A}$ and outcomes $\omega \in \Omega$. After choosing an action $a$, an outcome $\omega$ occurs and you receives a reward $r(a;\omega)\in [0,1]$. We assume that the set of actions in of size $N$.

Def [Policy] Over time outcomes $\omega_t$, $t=1,...,T$ occur. We do not (currently) assume that there is an underlying stochastic process determining the values of $\omega_t$; they may be chosen arbitrarily.

At each time $t$, a policy $\Pi$ chooses an action $\pi_t$ as a function of the past outcomes $\omega_s$, $s=1,...,t-1$ and their rewards $r(\cdot;\omega_s)$

In other words at each time the past states and rewards are known the policy. The policy then must do the best it can to accumulate rewards.

Def [Rewards/Regret] The total reward by time $T$ of policy $\Pi$ is

The regret of policy $\Pi$ is

• It is good to think of outcome $\omega_t$ as an outcome in a probability space. (Indeed sometimes we will assume this.)
• It is good to think of rewards $R_T(\Pi)$ in the same way as we consider rewards for an MDP $R_T(x,\Pi)$ (Though since there is no state we do not require notation for $x$).
• The regret compare how well our policy compares with the best fixed reward. I.e. retrospectively, we “regret” not making the best fixed choice.
• A regret of zero as $T\rightarrow \infty$ means we do as well as the best fixed choice. This is sometimes called Hannan consistency.

We are interested in how a policy $\Pi$ performs in comparison to each fixed policy $a \in {\mathcal A}$, that is a policy that chooses the same action $a$ at each time. Notice this is a weaker assumption compared to finding the best of all policies. This weaker notion of optimality is required because we are making much weaker assumptions about the evolution of the process $\omega_t$.

Ex 1. [A Regret Lower-Bound] Consider that there are two outcomes $x=A, B$ and two actions $a=A,B$ where $[r(A,A),r(A,B)] = [1,0]$ and $[r(A,B),r(B,B)] = [0,1]$. I.e. There is an action to go to $A$ or $B$ and if you match the outcome you get a reward of $1$, otherwise the reward is $0$.

Suppose that each $\omega_t$ is chosen uniformly at random from $\{A,B\}$ and is independent. Show that, regardless of what policy is used,

for some constant $C$. [Hint: Use the central limit theorem.]

Ans 1. Note that, since rewards are IID with probability $\frac{1}{2}$ for any policy $\Pi$ we have that

Further note that $R_T(A) + R_T(B)=T$ thus

We know that $\epsilon_T$ is the error of an iid sequence about its mean so the central limit theorem applies convergence in distribution

and

These last to observations give the result.

## The Weighted Majority Algorithm

We now consider a policy, that can achieve the lower bound suggested by [[OL:Ran]].

We prove the following result for the Weighted Majority Algorithm.

Thrm 1

and, $\eta=\sqrt{2T^{-1}\log(N)}$,

Other related results are also possible. We also require the following probability bound:

Lemma [Hoeffding’s Inequality] For a Random variable bounded on

Ex 2 Prove that

Ans 2

Ex 3 [Continued] Show that

Ans 3 Apply Hoeffding’s Inequality.

Ex 4 [Continued]

Ans 4 Trivial from definitions.

Ex 5 [Continued]

Ans 5 Combine [3] and [4], take logs and rearrange.

Ex 6 [Continued] Finally show that

Ans 6 Minimize [5] over $\eta$.

# Multi-armed Bandits: Non-Stochastic

We consider a multi-armed bandit setting for online learning.

Def [Bandit Policy] A bandit policy is a policy, cf. [[OL:Policy]], where $\pi_t$ is a function of the previous actions chosen $a_s$, $s=1,...,t-1$ and their costs $c(a_s; \omega_t)$. I.e. you cannot observe what cost would have happened if you had chosen a different action.

We consider a version of the weighted majority algorithm for this multi-armed bandit problem.

Note here that, since the policy choice $\pi_t$ is random, $\hat{c}_t(a ; \omega_t)$ is a random variable. We show the following

and for appropriate $\eta$

First we collect together a few facts then the proof proceeds very similar to the weight majority algorithm proof.

Ex 7 Show that for $c\geq 0$

(We do the result for losses rather than rewards because of this bound.1)

Ans 7 Follows as 4th term in the Taylor expansion is positive.

Ex 8 Show that

Ans 8

Ex 9 Show that

Ans 9 Similar to [8],

Ex 10 Show that

(Note: here we are not taking an expectation with respect to $\pi_t$ but we are just taking a sum weighted by $P_t(a)$.)

Ans 10

Ex 11 Show that

Ans 11 This is the same as [2-3] but use [7] at the last step instead of Hoeffding’s Inequality. Specifically,

Now multiply for $t=1,...,T$.

Ex 12 Show that for all $a$

Ans 12 Similar to [4] ,

Now take logs.

Ex 13 Show that

Ans 13 Combine inequalities [11] and [12] and taking logs gives

and rearrange. (Note: you need to use that $\log (1 + x) \leq x$ for all $x$.)

Ex 14 Show that

Ans 14 Take expectations on both sides of [13]

(Here the first “[??]” is Ex 8 the second is Ex 9) Now minimize the lefthand-side to give the regret.

Ex 15. Show that, for an appropriate choice of $\eta$,

Ans 15. Now minimize over $\eta$ the righthand bound in [14].

1. We could modify the policy for rewards by un-elegantly subtracting from the max reward.