The weighted majority algorithm and its variant for bandit problems.

We consider the problem of learning the ‘best’ action amongst a fixed reference set. Although our modelling assumptions will be quite different, we follow notation somewhat similar to Section [DP]. We consider the following setting:

Consider actions $a\in {\mathcal A}$ and outcomes $\omega \in \Omega$ . After choosing an action $a$ , an outcome $\omega$ occurs and you receive a reward $r(a;\omega)\in [0,1]$ . We assume that the set of actions is of size $N$ .

Def [Policy] Over time outcomes $\omega_t$ , $t=1,...,T$ occur. We do not (currently) assume that there is an underlying stochastic process determining the values of $\omega_t$ ; they may be chosen arbitrarily.

At each time $t$ , a policy $\Pi$ chooses an action $\pi_t$ as a function of the past outcomes $\omega_s$ , $s=1,...,t-1$ and their rewards $r(\cdot;\omega_s)$ .

In other words, at each time the past outcomes and rewards are known to the policy. The policy then must do the best it can to accumulate rewards.

Def [Rewards/Regret] The total reward by time $T$ of policy $\Pi$ is

$R_T(\Pi):=\sum_{t=1}^T r(\pi_t,\omega_t).$

The regret of policy $\Pi$ is

$\mathcal{R}\! g_T(\Pi) := \max_{a\in {\mathcal A} } R_T(a) - \mathbb E [R_T(\Pi)].$

It is good to think of outcome $\omega_t$ as an outcome in a probability space. (Indeed sometimes we will assume this.)
It is good to think of rewards $R_T(\Pi)$ in the same way as we consider rewards for an MDP $R_T(x,\Pi)$ (though, since there is no state, we do not require notation for $x$ ).
The regret measures how well our policy performs against the best fixed action. I.e. retrospectively, we “regret” not making the best fixed choice.
If the average regret $\mathcal{R}\! g_T(\Pi)/T$ tends to zero as $T\rightarrow \infty$ , we do asymptotically as well as the best fixed choice. This is sometimes called Hannan consistency.

We are interested in how a policy $\Pi$ performs in comparison to each fixed policy $a \in {\mathcal A}$ , that is, a policy that chooses the same action $a$ at each time. Notice this is a weaker requirement than competing with the best of all policies. This weaker notion of optimality is needed because we are making much weaker assumptions about the evolution of the process $\omega_t$ .
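To make the comparison with fixed actions concrete, here is a minimal sketch that computes the regret of one realised action sequence. The table layout `rewards[t][a]` for $r(a;\omega_t)$ and the function names are assumptions made for illustration; for a randomised policy the definition above would additionally average the collected reward over the policy's randomness.

```python
from typing import Sequence


def total_reward(rewards: Sequence[Sequence[float]], actions: Sequence[int]) -> float:
    """Sum of r(a_t; omega_t) along the played actions; rewards[t][a] = r(a; omega_t)."""
    return sum(rewards[t][a] for t, a in enumerate(actions))


def regret(rewards: Sequence[Sequence[float]], actions: Sequence[int]) -> float:
    """Best fixed action's total reward minus the total reward actually collected."""
    n_actions = len(rewards[0])
    best_fixed = max(
        total_reward(rewards, [a] * len(rewards)) for a in range(n_actions)
    )
    return best_fixed - total_reward(rewards, actions)


# Two actions, four rounds; each row is one outcome omega_t (hypothetical values).
rewards = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
played = [1, 0, 0, 0]
print(regret(rewards, played))  # best fixed action 0 earns 3, the played sequence earns 2
```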

Ex 1. [A Regret Lower-Bound] Consider that there are two outcomes $\omega=A, B$ and two actions $a=A,B$ where $[r(A,A),r(A,B)] = [1,0]$ and $[r(B,A),r(B,B)] = [0,1]$ . I.e. there is an action to go to $A$ or $B$ and if you match the outcome you get a reward of $1$ , otherwise the reward is $0$ .

Suppose that each $\omega_t$ is chosen uniformly at random from $\{A,B\}$ and is independent. Show that, regardless of what policy is used,

$\mathbb E\left[ {\mathcal R}\! g_T(\Pi) \right]\geq C \sqrt{T}$

for some constant $C$ . [Hint: Use the central limit theorem.]

Ans 1. Note that, since the outcomes are IID and uniform, every action earns reward $1$ with probability $\frac{1}{2}$ at each time, so for any policy $\Pi$ we have that

$\mathbb E [ R_T(\Pi) ] = \frac{1}{2} T$

Further note that $R_T(A) + R_T(B)=T$ thus

$R_T(A) = \frac{T}{2} + \epsilon_T, \qquad R_T(B) = \frac{T}{2} - \epsilon_T$

We know that $\epsilon_T$ is the deviation of a sum of IID terms from its mean, each term having variance $\frac{1}{4}$ , so the central limit theorem gives convergence in distribution

$\frac{\epsilon_T}{\sqrt{T}} \xrightarrow[T\rightarrow\infty ]{\mathcal D} \mathcal N \big(0,\frac{1}{4}\big)$

and

$\mathcal{R}\! g_T(\Pi) = \max_{a\in {\mathcal A} } R_T(a) - \mathbb E [R_T(\Pi)] = | \epsilon_T | .$

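These last two observations give the result, since $\mathbb E | \epsilon_T | / \sqrt{T}$ converges to the mean absolute value of the limiting normal distribution, which is a positive constant.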
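As a sanity check on the $\sqrt{T}$ rate, the sketch below computes $\mathbb E|\epsilon_T|$ exactly rather than by simulation. It assumes only that $R_T(A)$ is Binomial$(T,\frac{1}{2})$, as in Ex 1; the function name is made up for illustration. The ratio $\mathbb E|\epsilon_T|/\sqrt{T}$ should settle near $1/\sqrt{2\pi}$, the mean absolute value of a centred normal with standard deviation $\frac{1}{2}$.

```python
import math


def expected_abs_deviation(T: int) -> float:
    """Exact E|eps_T| where eps_T = Binomial(T, 1/2) - T/2."""
    # Work in integers (|2k - T| instead of |k - T/2|) so large binomial
    # coefficients are never converted to floats before the final division.
    total = sum(math.comb(T, k) * abs(2 * k - T) for k in range(T + 1))
    return total / 2 ** (T + 1)


for T in (100, 400, 1600):
    ratio = expected_abs_deviation(T) / math.sqrt(T)
    print(T, ratio, 1 / math.sqrt(2 * math.pi))
```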

The Weighted Majority Algorithm

We now consider a policy that can achieve the lower bound suggested by [[OL:Ran]].
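The statement of the algorithm is not reproduced in this text, so the following is a sketch reconstructed from the quantities used in Ans 2 below: action $i$ is played with probability $P_t(i) = e^{\eta R_{t-1}(i)}/W_t$ , where $W_t = \sum_i e^{\eta R_{t-1}(i)}$ . The function names, the `rewards[t][i]` layout and the alternating test sequence are assumptions for illustration. The sketch computes the probabilities and the expected regret exactly, so no sampling is needed.

```python
import math
from typing import List, Sequence


def weighted_majority_probs(
    rewards: Sequence[Sequence[float]], eta: float
) -> List[List[float]]:
    """Return P_t for each round, with P_t(i) proportional to exp(eta * R_{t-1}(i))."""
    n = len(rewards[0])
    cumulative = [0.0] * n                    # R_{t-1}(i) for each action i
    history = []
    for row in rewards:
        top = max(cumulative)                 # shift exponents for numerical stability
        weights = [math.exp(eta * (r - top)) for r in cumulative]
        total = sum(weights)
        history.append([w / total for w in weights])
        # Full information: every action's reward is revealed after the round.
        cumulative = [c + r for c, r in zip(cumulative, row)]
    return history


def expected_regret(rewards: Sequence[Sequence[float]], eta: float) -> float:
    """Best fixed action's reward minus the policy's expected reward."""
    probs = weighted_majority_probs(rewards, eta)
    expected = sum(p * r for P, row in zip(probs, rewards) for p, r in zip(P, row))
    best_fixed = max(sum(col) for col in zip(*rewards))
    return best_fixed - expected


# Alternating outcomes A, B, A, B, ... in the matching game of Ex 1.
T, N = 200, 2
rewards = [[1.0, 0.0] if t % 2 == 0 else [0.0, 1.0] for t in range(T)]
eta = math.sqrt(2 * math.log(N) / T)
print(expected_regret(rewards, eta), math.log(N) / eta + eta * T / 2)
```

The second printed number is the bound $\eta^{-1}\log(N) + \frac{\eta T}{2}$ proved below; on this sequence the expected regret should sit well inside it.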

We prove the following result for the Weighted Majority Algorithm.

Thrm 1

$\mathbb E \left[ \mathcal R\! g_T (\mathcal W_{Maj}) \right] \leq \eta^{-1} \log(N) + \frac{\eta T}{2},$

and, for $\eta=\sqrt{2T^{-1}\log(N)}$ ,

$\mathbb E \left[ \mathcal R\! g_T (\mathcal W_{Maj}) \right] \leq 2\sqrt{2T \log(N)}.$

Other related results are also possible. We also require the following probability bound:

Lemma [Hoeffding’s Inequality] For a random variable $X$ bounded on $[0,1]$ (as our rewards are) and any $\eta$ ,

$\mathbb E [e^{\eta X}] \leq e^{\eta \mathbb E X + \frac{\eta^2}{2}}$

Ex 2 Prove that

$W_{T+1} = N \prod_{t=1}^T {\mathbb E} \left[ e^{\eta r(\pi_t,\omega_t)} \right].$

Ans 2 Write $r_t(i) := r(i;\omega_t)$ and $W_t = \sum_i e^{\eta R_{t-1}(i)}$ , so that $W_1 = N$ since $R_0(i)=0$ . Then $\begin{aligned} W_{T+1} = W_1 \prod_{t=1}^{T} \frac{W_{t+1}}{W_t} = N \prod_{t=1}^{T} \sum_i \frac{e^{\eta R_t(i) } }{W_t} & = N \prod_{t=1}^T \sum_i e^{\eta r_t(i)} \underbrace{ \frac{e^{\eta R_{t-1}(i)}}{W_t} }_{ = P_t(i) }\\ & = N \prod_{t=1}^T \mathbb E\left[ e^{\eta r(\pi_t,\omega_t)} \right] \end{aligned}$

Ex 3 [Continued] Show that

$W_{T+1} \leq N e^{\eta \mathbb E R_T(\mathcal W_{Maj})+ \frac{\eta^2 T}{2}}$

Ans 3 Apply Hoeffding’s Inequality to each factor in [2]:

${\mathbb E} [ e^{\eta r_t(\pi_t)}] \leq \exp\Big\{ \eta {\mathbb E} r_t(\pi_t) + \frac{\eta^2}{2}\Big\}.$
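The bound used in this step can be checked numerically for a simple case. The sketch below assumes a Bernoulli reward, which is one example of a random variable taking values in $[0,1]$ ; the grid of test values and the function names are arbitrary choices for illustration, not part of the proof.

```python
import math


def mgf_bernoulli(p: float, eta: float) -> float:
    """E[exp(eta * X)] for X ~ Bernoulli(p), which takes values in {0, 1}."""
    return 1.0 - p + p * math.exp(eta)


def lemma_bound(p: float, eta: float) -> float:
    """Right-hand side exp(eta * E[X] + eta^2 / 2), with E[X] = p."""
    return math.exp(eta * p + eta**2 / 2)


grid_p = [i / 10 for i in range(11)]
grid_eta = [i / 10 for i in range(-30, 31)]
worst = max(mgf_bernoulli(p, e) / lemma_bound(p, e) for p in grid_p for e in grid_eta)
print(worst)  # at most 1, up to floating-point rounding, on this grid
```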

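Ex 4 [Continued] Show that

$e^{\eta \max_i R_T(i) } \leq W_{T+1}$

Ans 4 Immediate from the definitions: $W_{T+1}=\sum_i e^{\eta R_T(i)}$ is a sum of positive terms, so it is at least its largest term.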

Ex 5 [Continued] Show that

${\mathbb E} [ {\mathcal R}\! g_T({\mathcal W}\! _{Maj})] \leq \eta^{-1} \log(N) + \frac{\eta T}{2}.$

Ans 5 Combine [3] and [4], take logs and rearrange.

Ex 6 [Continued] Finally show that

${\mathbb E}[ {\mathcal R}\! g_T({\mathcal W}\! _{Maj}) ] \leq 2\sqrt{2T \log(N)}.$

Ans 6 Minimize [5] over $\eta$ .
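For completeness, the minimization can be written out as follows. This is a worked version of the step under the bound in [5]; the symbol $\eta^\star$ for the minimizer is introduced here and is not in the original. Note that the minimized value is $\sqrt{2T\log N}$ , which lies within the stated bound.

```latex
\begin{aligned}
\frac{d}{d\eta}\Big( \eta^{-1}\log N + \frac{\eta T}{2} \Big)
  = -\frac{\log N}{\eta^{2}} + \frac{T}{2} = 0
  \quad &\Longrightarrow \quad \eta^\star = \sqrt{\frac{2\log N}{T}}, \\
\frac{\log N}{\eta^\star} + \frac{\eta^\star T}{2}
  &= \sqrt{\frac{T\log N}{2}} + \sqrt{\frac{T\log N}{2}}
  = \sqrt{2T\log N} \;\leq\; 2\sqrt{2T\log N}.
\end{aligned}
```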

Multi-armed Bandits: Non-Stochastic

We consider a multi-armed bandit setting for online learning.

Def [Bandit Policy] A bandit policy is a policy, cf. [[OL:Policy]], where $\pi_t$ is a function of the previous actions chosen $a_s$ , $s=1,...,t-1$ and their costs $c(a_s; \omega_s)$ . I.e. you cannot observe what cost would have occurred had you chosen a different action.

We consider a version of the weighted majority algorithm for this multi-armed bandit problem.
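The statement of the algorithm is not reproduced in this text, so the following is a sketch reconstructed from the exercises below: the estimate $\hat{c}_t(a;\omega_t) = \frac{c(a;\omega_t)}{P_t(a)}{\mathbb I}[\pi_t=a]$ from Ans 10 and the update $w_{t+1}(a) = w_t(a) e^{-\eta \hat{c}_t(a;\omega_t)}$ with $P_t(a) = w_t(a)/W_t$ from Ans 11. The function signature, the callback `cost_of` and the example costs are assumptions for illustration.

```python
import math
import random
from typing import Callable, List


def exp3(
    cost_of: Callable[[int, int], float],
    n_actions: int,
    horizon: int,
    eta: float,
    rng: random.Random,
) -> List[int]:
    """Exponential weights using only the cost of the action actually played.

    cost_of(t, a) returns c(a; omega_t) in [0, 1]; it is called once per
    round, for the chosen action only.
    """
    est_cum = [0.0] * n_actions          # sum over s < t of c_hat_s(a)
    actions = []
    for t in range(horizon):
        low = min(est_cum)               # shift exponents for numerical stability
        weights = [math.exp(-eta * (c - low)) for c in est_cum]
        total = sum(weights)
        probs = [w / total for w in weights]        # P_t(a) = w_t(a) / W_t
        a = rng.choices(range(n_actions), weights=probs)[0]
        cost = cost_of(t, a)
        est_cum[a] += cost / probs[a]    # c_hat_t(a) = c / P_t(a) if played, else 0
        actions.append(a)
    return actions


# Hypothetical example: action 0 always costs 0, the other two always cost 1.
N, T = 3, 2000
eta = math.sqrt(2 * math.log(N) / (T * N))
played = exp3(lambda t, a: 0.0 if a == 0 else 1.0, N, T, eta, random.Random(0))
print(sum(a != 0 for a in played), "costly pulls out of", T)
```

Only the played action's estimate changes in a round; the unplayed actions keep their weights, which is what distinguishes this from the full-information update.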

Note here that, since the policy choice $\pi_t$ is random, the cost estimate $\hat{c}_t(a ; \omega_t) = \frac{c(a;\omega_t)}{P_t(a)} {\mathbb I}[\pi_t=a]$ (cf. [10]) is a random variable. We show the following

${\mathcal R}\! g_T (Exp^3) \leq \frac{1}{\eta} \log N + \frac{\eta}{2} T N.$

and for appropriate $\eta$

${\mathcal R}\! g_T (Exp^3) \leq 2\sqrt{2TN \log N}$

First we collect together a few facts; the proof then proceeds much as in the weighted majority algorithm proof.

Ex 7 Show that for $c\geq 0$

$e^{-c} \leq 1-c + \frac{c^2}{2}.$

(We do the result for losses rather than rewards because of this bound.¹)

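Ans 7 Follows from Taylor's theorem: the remainder after the quadratic term is $-\frac{c^3}{6}e^{-\xi}$ for some $\xi \in [0,c]$ , which is non-positive for $c\geq 0$ .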

Ex 8 Show that

${\mathbb E}[\hat{c}_t(a; \omega_t)] = c(a;\omega_t)$

Ans 8

${\mathbb E}[\hat{c}_t(a ; \omega_t)] = P_t(a) \frac{c(a;\omega_t)}{P_t(a)} = c(a;\omega_t)$

Ex 9 Show that

$P_t(a){\mathbb E}[\hat{c}_t(a,\omega_t)^2] \leq 1$

Ans 9 Similar to [8],

$P_t(a){\mathbb E}[\hat{c}_t(a,\omega_t)^2] = P_t(a)\cdot P_t(a) \frac{c(a;\omega_t)^2}{P_t(a)^2} = c(a;\omega_t)^2 \leq 1 \, .$

Ex 10 Show that

$\sum_{a\in\mathcal A} P_t(a) \hat{c}_t ( a ; \omega_t ) = c(\pi_t ; \omega_t )$

(Note: here we are not taking an expectation with respect to $\pi_t$ but we are just taking a sum weighted by $P_t(a)$ .)

Ans 10

$\sum_a P_t(a) \hat{c}_t(a;\omega_{t}) = \sum_a P_t(a) \frac{c(a;\omega_{t})}{P_t(a)} {\mathbb I}[\pi_{t}=a] = c(\pi_t;\omega_{t})$
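The three facts in Ex 8 to Ex 10 can be checked on a small example by averaging exactly over the played action $\pi_t \sim P_t$ . The cost and probability values below are hypothetical, and `c_hat` is an illustrative helper implementing the estimate $\frac{c(a;\omega_t)}{P_t(a)}{\mathbb I}[\pi_t=a]$ from Ans 10.

```python
from typing import List


def c_hat(cost: List[float], probs: List[float], played: int) -> List[float]:
    """Importance-weighted estimate: c(a)/P(a) for the played action, 0 elsewhere."""
    return [cost[a] / probs[a] if a == played else 0.0 for a in range(len(cost))]


cost = [0.2, 0.9, 0.5]     # c(a; omega_t), hypothetical values
probs = [0.5, 0.3, 0.2]    # P_t(a), hypothetical values
n = len(cost)

# Exact expectations: weight each possible played action b by P_t(b).
mean = [sum(probs[b] * c_hat(cost, probs, b)[a] for b in range(n)) for a in range(n)]
second = [sum(probs[b] * c_hat(cost, probs, b)[a] ** 2 for b in range(n)) for a in range(n)]

print(mean)                                     # Ex 8: matches cost up to rounding
print([p * s for p, s in zip(probs, second)])   # Ex 9: equals cost squared, at most 1

# Ex 10 holds for every realised action b, with no expectation taken.
for b in range(n):
    weighted = sum(p * e for p, e in zip(probs, c_hat(cost, probs, b)))
    print(b, weighted, cost[b])
```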

Ex 11 Show that

$W_{T+1} \leq N \prod_{t=1}^{T} \left( 1 - \eta c(\pi_t, \omega_t) + \frac{\eta^2}{2} \sum_{a\in \mathcal A} P_t(a) \hat{c}_t(a, \omega_t)^2 \right)$

Ans 11 This is the same as [2-3] but use [7] at the last step instead of Hoeffding’s Inequality. Specifically,

$\begin{aligned} \frac{W_{t+1}}{W_t} = \sum_{a\in\mathcal A} \frac{w_{t+1}(a)}{W_t} = \sum_{a\in\mathcal A} \frac{w_{t}(a)}{W_t} e^{-\eta \hat{c}_t(a;\omega_t)} & \overset{[7]}{\leq} \sum_{a\in\mathcal A} P_{t}(a) \left( 1- \eta \hat{c}_t(a ; \omega_t) + \frac{\eta^2}{2} \hat{c}_t(a ; \omega_t )^2 \right) \\ & \overset{[10]}{=} 1 - \eta c(\pi_t,\omega_t) + \frac{\eta^2}{2} \sum_{a\in\mathcal A} P_{t}(a) \hat{c}_t(a;\omega_t )^2.\end{aligned}$ Now multiply for $t=1,...,T$ and use $W_1 = N$ .

Ex 12 Show that for all $a$

$-\eta \sum_{t=1}^T\hat{c}_t(a;\omega_t) \leq \log {W_{T+1}}.$

Ans 12 Similar to [4] ,

$W_{T+1} = \sum_a w_{T+1}(a) \geq w_{T+1}(a) = e^{-\sum_{t=1}^T \eta \hat{c}_{t}(a ; \omega_t) }$

Now take logs.

Ex 13 Show that

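$C_T(Exp^3)- \sum_{t=1}^T \hat{c}_t(a;\omega_t) \leq \frac{1}{\eta}\log N + \frac{\eta}{2} \sum_{t=1}^T \sum_{a\in{\mathcal A}} P_{t}(a)\hat{c}_t(a; \omega_t)^2$

where $C_T(Exp^3) := \sum_{t=1}^T c(\pi_t;\omega_t)$ is the total cost incurred by the policy.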

Ans 13 Combining inequalities [11] and [12] and taking logs gives

$-\eta \sum_{t=1}^T \hat{c}_t(a;\omega_t) \leq \log N -\eta C_T(Exp^3) + \frac{\eta^2}{2} \sum_{t=1}^T \sum_{a\in{\mathcal A}} P_{t}(a)\hat{c}_t(a; \omega_t)^2$ ; now rearrange. (Note: you need to use that $\log (1 + x) \leq x$ for all $x > -1$ .)

Ex 14 Show that

${\mathcal R}\! g_T (Exp^3) \leq \frac{1}{\eta} \log N + \frac{\eta}{2} T N.$

Ans 14 Take expectations on both sides of [13]:

$\mathbb E \left[ C_T(Exp^3) \right] - \underbrace{ \mathbb E \left[ \sum_{t=1}^T \hat{c}_t(a;\omega_t) \right] }_{ \overset{[8]}{=} C_T(a) } \leq \frac{1}{\eta}\log N + \frac{\eta}{2} \sum_{t=1}^T \sum_{a\in{\mathcal A}} \underbrace{ \mathbb E \left[ P_{t}(a)\hat{c}_t(a; \omega_t)^2 \right] }_{ \overset{[9]}{\leq} 1 }$ Now choose the action $a$ that minimizes $C_T(a)$ on the left-hand side to give the regret.

Ex 15. Show that, for an appropriate choice of $\eta$ ,

${\mathcal R}\! g_T (Exp^3) \leq 2\sqrt{2TN \log N}$

Ans 15. Minimize the right-hand bound in [14] over $\eta$ .
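Written out, the minimization mirrors the one for the weighted majority algorithm, with $T$ replaced by $TN$ . This is a worked version of the step under the bound in [14]; the symbol $\eta^\star$ for the minimizer is introduced here and is not in the original. The minimized value is $\sqrt{2TN\log N}$ , which lies within the stated bound.

```latex
\begin{aligned}
\frac{d}{d\eta}\Big( \frac{1}{\eta}\log N + \frac{\eta}{2} T N \Big)
  = -\frac{\log N}{\eta^{2}} + \frac{TN}{2} = 0
  \quad &\Longrightarrow \quad \eta^\star = \sqrt{\frac{2\log N}{TN}}, \\
\frac{\log N}{\eta^\star} + \frac{\eta^\star}{2} T N
  &= \sqrt{\frac{TN\log N}{2}} + \sqrt{\frac{TN\log N}{2}}
  = \sqrt{2TN\log N} \;\leq\; 2\sqrt{2TN\log N}.
\end{aligned}
```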

¹ We could modify the policy for rewards by inelegantly subtracting from the max reward.
