The Weighted Majority Algorithm is a randomized rule used to learn the best action amongst a fixed reference set.

We consider the following setting:

- There is a fixed set of actions $i = 1, \dots, N$ from which one may choose, and a set of outcomes $\mathcal{Y}$.
- After choosing an action $i$, an outcome $y$ occurs and you receive a reward $r(i, y)$, which we take to lie in $[0, 1]$. Over time a policy $\pi$ chooses actions $i_1, i_2, \dots$ and outcomes $y_1, y_2, \dots$ occur.
- It is assumed that the reward function $r$ is known by the policy.
- The accumulated net reward is

$$R(T) = \sum_{t=1}^{T} r(i_t, y_t).$$

We are interested in how our dynamic policy $\pi$ performs in comparison to each fixed policy, that is, a policy that chooses the same action $i$ at each time and so accumulates reward $R_i(T) = \sum_{t=1}^{T} r(i, y_t)$. In particular, we are interested in how $\pi$ compares to the fixed policy which is retrospectively the best.

- For this reason, we consider the *regret* of policy $\pi$,

$$\mathcal{R}(T) = \max_{i = 1, \dots, N} R_i(T) - R(T), \qquad \text{where } R_i(T) = \sum_{t=1}^{T} r(i, y_t).$$

- We can only retrospectively find the best fixed policy, whilst our policy must behave adaptively based on historical information. Thus $\mathcal{R}(T)$ quantifies how we regret not having had the information to have chosen the best fixed policy.
- Note a ‘good’ policy would have low regret. For instance, we might hope our dynamic policy is asymptotically as good as the best fixed policy, i.e. $\mathcal{R}(T)/T \to 0$ as $T \to \infty$.
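As a small concrete illustration (the reward table and the played actions below are hypothetical), the regret of a sequence of plays can be computed directly from a table of rewards:

```python
import numpy as np

# rewards[t, i] = reward r(i, y_t) that action i would have earned at time t
# (a hypothetical example with N = 3 actions over T = 4 rounds)
rewards = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.5],
    [1.0, 0.0, 0.5],
])
chosen = [1, 0, 1, 0]  # actions i_t actually played by some policy

policy_reward = sum(rewards[t, i] for t, i in enumerate(chosen))
best_fixed = rewards.sum(axis=0).max()  # reward of the best fixed action in hindsight
regret = best_fixed - policy_reward
print(regret)  # → 2.0: the policy earned 1.0 while the best fixed action earned 3.0
```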

Using Blackwell’s Approachability Theorem, we saw that low regret policies exist; see Section [Blackwell], Theorem [blackwell:regret]. One of the key advantages of such algorithms is that they do not place statistical assumptions on the outcome sequence $y_1, y_2, \dots$. The Weighted Majority Algorithm is a further example of an algorithm that has asymptotically low regret and does not require statistical assumptions to be placed on the input sequence.

**Weighted Majority Algorithm**

For parameter $\eta > 0$, the *(exponentially) weighted majority algorithm* is a randomized policy which chooses action $i = 1, \dots, N$ at time $t$ with probability

$$p_t(i) = \frac{e^{\eta R_i(t-1)}}{\sum_{j=1}^{N} e^{\eta R_j(t-1)}}$$

for $t > 1$, and with probability $p_1(i) = 1/N$ for $t = 1$.
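A minimal sketch of this sampling rule in code (the function and variable names are our own; accumulating the rewards is left to the surrounding loop):

```python
import numpy as np

def exp_weights_probs(cum_rewards, eta):
    """Sampling distribution p_t over N actions given cumulative rewards R_{t-1}(i).

    Weights are w(i) = exp(eta * R_{t-1}(i)); subtracting the max before
    exponentiating avoids overflow without changing the normalized probabilities.
    """
    z = eta * (cum_rewards - cum_rewards.max())
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(0)
N, eta = 3, 0.5
cum_rewards = np.zeros(N)          # R_0(i) = 0, so the first round is uniform
p = exp_weights_probs(cum_rewards, eta)
action = rng.choice(N, p=p)        # sample i_t ~ p_t
```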

We can derive the following regret bound.

**Theorem:** If policy $\pi$ is the weighted majority algorithm then, with expectations taken over the policy's randomization:

(1) With $\eta$ fixed,

$$\mathbb{E}\,\mathcal{R}(T) \le \frac{\log N}{\eta} + \frac{\eta T}{8};$$

for choice $\eta = \sqrt{8 \log N / T}$, the above bound states

$$\mathbb{E}\,\mathcal{R}(T) \le \sqrt{\frac{T \log N}{2}}.$$

(2) For $\eta$ fixed, the policy satisfies

$$\mathbb{E}\, R(T) \ge \frac{\eta \max_i R_i(T) - \log N}{e^{\eta} - 1}.$$

(3) For $\eta_t$ varying over time and decreasing in $t$, we have

$$\mathbb{E}\,\mathcal{R}(T) \le \frac{\log N}{\eta_{T+1}} + \frac{1}{8} \sum_{t=1}^{T} \eta_t;$$

for choice $\eta_t = \sqrt{8 \log N / t}$, the above bound states

$$\mathbb{E}\,\mathcal{R}(T) \le \sqrt{2 T \log N}.$$

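To illustrate part (1), the following simulation (the Bernoulli reward model and all names are our own assumptions) runs the algorithm with the tuned $\eta$ and compares the realized regret against the bound; the argument in the proof below is in fact pathwise, so the comparison should hold on any run:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 5, 2000
eta = np.sqrt(8 * np.log(N) / T)          # the tuned choice from part (1)
means = rng.uniform(0.2, 0.8, size=N)     # hypothetical reward probabilities

cum = np.zeros(N)        # cumulative rewards R_{t-1}(i)
expected_reward = 0.0    # sum over t of <p_t, r(., y_t)>
for _ in range(T):
    w = np.exp(eta * (cum - cum.max()))   # exponential weights, overflow-safe
    p = w / w.sum()
    r = rng.binomial(1, means).astype(float)  # rewards r(i, y_t) in [0, 1]
    expected_reward += p @ r
    cum += r

regret = cum.max() - expected_reward
bound = np.log(N) / eta + eta * T / 8     # equals sqrt((T/2) log N) at this eta
print(regret <= bound)
```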

**Proof:**

(1) Observe

$$e^{\eta \max_i R_i(T)} \le \sum_{i=1}^{N} e^{\eta R_i(T)} = N \prod_{t=1}^{T} \frac{\sum_i e^{\eta R_i(t)}}{\sum_i e^{\eta R_i(t-1)}} = N \prod_{t=1}^{T} \sum_{i=1}^{N} p_t(i)\, e^{\eta r(i, y_t)} \le N \prod_{t=1}^{T} e^{\eta \bar{r}_t + \eta^2 / 8},$$

where $\bar{r}_t = \sum_i p_t(i)\, r(i, y_t)$ is the expected reward of the policy at time $t$. In the second inequality, we apply the Azuma-Hoeffding Inequality, see Section [Azuma]. Taking logs and rearranging, we gain the required expression. Substituting $\eta = \sqrt{8 \log N / T}$, we gain the second expression in (1).

(2) We note the following inequality holds: for $x \in [0, 1]$,

$$e^{\eta x} \le 1 + (e^{\eta} - 1)x \le e^{(e^{\eta} - 1)x}.$$

In the first inequality, we bound the exponential by the line segment lying above it on $[0, 1]$. In the second inequality, we apply the bound $1 + x \le e^x$. Applying this to the telescoping product from the proof of (1), we have

$$e^{\eta \max_i R_i(T)} \le N \prod_{t=1}^{T} \sum_{i=1}^{N} p_t(i)\, e^{\eta r(i, y_t)} \le N \prod_{t=1}^{T} e^{(e^{\eta} - 1)\bar{r}_t} = N e^{(e^{\eta} - 1) \sum_{t=1}^{T} \bar{r}_t}.$$

Taking logs and rearranging gives the required bound.

(3) The crux of the proof of (1) was that the sequence

$$\Phi_t = \frac{1}{\eta} \log\Big( \frac{1}{N} \sum_{i=1}^{N} e^{\eta R_i(t)} \Big) - \sum_{s=1}^{t} \Big( \bar{r}_s + \frac{\eta}{8} \Big)$$

is decreasing and thus bounded above by $\Phi_0 = 0$. We apply the same principle, but we have to take account of the change in parameter $\eta_t$.

We can express our weights differently: notice that at time $t$ the policy now samples with probability

$$p_t(i) = \frac{e^{\eta_t R_i(t-1)}}{\sum_{j=1}^{N} e^{\eta_t R_j(t-1)}},$$

and the quantity

$$\Phi_t = \frac{1}{\eta_{t+1}} \log\Big( \frac{1}{N} \sum_{i=1}^{N} e^{\eta_{t+1} R_i(t)} \Big) - \sum_{s=1}^{t} \Big( \bar{r}_s + \frac{\eta_s}{8} \Big)$$

will play a similar role to that of $\Phi_t$ in the proof of (1):

Since the map $\eta \mapsto \frac{1}{\eta} \log\big( \frac{1}{N} \sum_i e^{\eta x_i} \big)$ is increasing in $\eta$ and $\eta_{t+1} \le \eta_t$, the sequence $\Phi_t$ is decreasing and bounded above by $\Phi_0 = 0$. Thus $\Phi_T \le 0$, which rearranges to the stated bound; for the choice $\eta_t = \sqrt{8 \log N / t}$, the estimate $\sum_{t=1}^{T} t^{-1/2} \le 2\sqrt{T}$ gives the result.
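The time-varying rule of part (3) differs from the fixed-$\eta$ algorithm only in recomputing $\eta_t = \sqrt{8 \log N / t}$ each round, which removes the need to know the horizon $T$ in advance. A sketch (again an illustrative simulation with hypothetical Bernoulli rewards; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 5, 2000
means = rng.uniform(0.2, 0.8, size=N)    # hypothetical reward probabilities

cum = np.zeros(N)        # cumulative rewards R_{t-1}(i)
expected_reward = 0.0
for t in range(1, T + 1):
    eta_t = np.sqrt(8 * np.log(N) / t)   # decreasing schedule; no horizon needed
    w = np.exp(eta_t * (cum - cum.max()))
    p = w / w.sum()
    r = rng.binomial(1, means).astype(float)
    expected_reward += p @ r
    cum += r

regret = cum.max() - expected_reward
bound = np.sqrt(2 * T * np.log(N))       # the anytime bound from part (3)
print(regret <= bound)
```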