Weighted Majority Algorithm

The Weighted Majority Algorithm is a randomized rule used to learn the best action amongst a fixed reference set.

We consider the following setting

  • There are a fixed set of actions i=1,...,N which one may choose. There are a set of outcomes y\in{\mathcal Y}.
  • After choosing an action i, an outcome y occurs and you receives a reward r(i,y)\in [0,1]. Over time a policy \pi chooses actions \pi_t t=1,...,T and outcomes y_t, t=1,...,T occur.
  • It is assumed that the function r:\{1,...,N\}\times {\mathcal Y} \rightarrow [0,1] is known by the policy.
  • The accumulated a net reward is

We are interested in how our dynamic policy \pi performs in comparison to each fixed policy i=1,...,N, that is a policy that chooses the same action i at each time. In particular, we are interested in how this compares to the fixed policy which is retrospectively the best.

  • For this reason, we consider the regret of policy \pi

  • We can only retrospectively find the best fixed policy, whilst our policy \pi must behave adaptively based on historical information. Thus R(\pi,T) quantifies how we regret not having had the information to have chosen the best fixed policy.
  • Note a ‘good’ policy would have low regret. For instance, we might hope our dynamic policy is as good as the best fixed policy, i.e. R(\pi,T)\leq 0.

Using Blackwell’s Approachability Theorem, we saw that low regret policies exist. See Section [Blackwell]Theorem [blackwell:regret]. One of the key advantages of such algorithms is that is does not place statistical assumptions on the outcome sequence y_1,...,y_T. The Weighted Majority Algorithm is a further example of an algorithm that has asymptotically low regret and does not require statistical assumptions to be placed on the input sequence.

Weighted Majority Algorithm

For parameter \eta>0, the \emph{(exponentially) weighted majority algorithm} is a randomized policy which chooses a policy $\latex i=1,…,N$ at time t with probability P(i,N)=\frac{1}{N} for t=1 and for t\geq 1

P(i,t)=\frac{w_{i,t}}{W(t)},\qquad\text{where}\qquad w_{i,t}=e^{\eta\rho(i,t-1)} \quad\text{and}\quad W(t)=\sum_{i=1}^N e^{\eta \rho(i,t)}.

We can derived the following regret bound

Theorem: If policy \pi is the the weighted majority algorithm

(1) With \eta fixed,

for choice \eta=\sqrt{2T^{-1}\log(N)}, the above bound states

(2) For \eta fixed the policy,

(3) For \eta_t varrying over time and increasing, we have

for choice \eta_t=\sqrt{4t^{-1}\log(N)}, the above bound states


(1) Observe


In the second inequaltiy, we apply the Azuma-Hoeffding Inequality, see Section [Azuma]. Taking logs and rearranging, we gain the required expression . Substituting \eta=\sqrt{2T^{-1}\log(N)}, we gain expression .

(2) We note the following inequality holds

In the first inequality, we bound with the line segment above the exponential. In the second inequality, we apply the bound 1+x\leq e^x. Applying bound to inequality , we have

Taking logs and rearranging gives the required bound.

(3) The crux of the proof (1) was that the function N^{-1} W(T) \exp\{-\eta {\mathbb E} \rho(\pi,T) - \eta^2 T^2/2\} is decreasing and thus bounded above by 1. We apply the same prinicple, but we have to take account of the change in parameter \eta_t.

We can express our weights differently, notice,

S(t)^{\eta_t^{-1}} will play a similar role to that of N^{-1} W(T) \exp\{-\eta {\mathbb E} \rho(\pi,T) - \eta^2 T^2/2\} in the proof of (1):

Since S(0)=1, the sequence is decreasing and bounded above by 1. Thus, as \frac{1}{N}s_{i,T}\leq S(T)\leq 1, we get the result. \square

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: