Randomized weighted majority algorithm

From HandWiki

The randomized weighted majority algorithm is an algorithm in machine learning theory for aggregating expert predictions to a series of decision problems.[1] It is a simple and effective method based on weighted voting which improves on the mistake bound of the deterministic weighted majority algorithm. In fact, in the limit, its prediction rate can be arbitrarily close to that of the best-predicting expert.

Example

Imagine that every morning before the stock market opens, we get a prediction from each of our "experts" about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The principal challenge is that we do not know which experts will give better or worse predictions. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single expert which, in hindsight, gave the most accurate predictions.

Motivation

In machine learning, the weighted majority algorithm (WMA) is a deterministic meta-learning algorithm for aggregating expert predictions. In pseudocode, the WMA is as follows:

initialize all experts to weight 1
for each round:
    add each expert's weight to the option they predicted
    predict the option with the largest weighted sum
    multiply the weights of all experts who predicted wrongly by [math]\displaystyle{ \frac{1}{2} }[/math]


Suppose there are [math]\displaystyle{ n }[/math] experts and the best expert makes [math]\displaystyle{ m }[/math] mistakes. Then, the weighted majority algorithm (WMA) makes at most [math]\displaystyle{ 2.4(\log_2n+ m) }[/math] mistakes. This bound is highly problematic in the case of highly error-prone experts. Suppose, for example, the best expert makes a mistake 20% of the time; that is, in [math]\displaystyle{ N = 100 }[/math] rounds using [math]\displaystyle{ n = 10 }[/math] experts, the best expert makes [math]\displaystyle{ m = 20 }[/math] mistakes. Then, the weighted majority algorithm only guarantees an upper bound of [math]\displaystyle{ 2.4 (\log_2 10 + 20) \approx 56 }[/math] mistakes.

As this is a known limitation of the weighted majority algorithm, various strategies have been explored in order to improve the dependence on [math]\displaystyle{ m }[/math]. In particular, we can do better by introducing randomization.

Drawing inspiration from the Multiplicative Weights Update Method algorithm, we will probabilistically make predictions based on how the experts have performed in the past. Similarly to the WMA, every time an expert makes a wrong prediction, we will decrement their weight. Mirroring the MWUM, we will then use the weights to make a probability distribution over the actions and draw our action from this distribution (instead of deterministically picking the majority vote as the WMA does).[2]

Randomized weighted majority algorithm (RWMA)

The randomized weighted majority algorithm is an attempt to improve the dependence of the mistake bound of the WMA on [math]\displaystyle{ m }[/math]. Instead of predicting based on majority vote, the weights, are used as probabilities for choosing the experts in each round and are updated over time (hence the name randomized weighted majority).

Precisely, if [math]\displaystyle{ w_i }[/math] is the weight of expert [math]\displaystyle{ i }[/math], let [math]\displaystyle{ W=\sum_iw_i }[/math]. We will follow expert [math]\displaystyle{ i }[/math] with probability [math]\displaystyle{ \frac{w_i}{W} }[/math]. This results in the following algorithm:

initialize all experts to weight 1.
for each round:
    add all experts' weights together to obtain the total weight [math]\displaystyle{ W }[/math]
    choose expert [math]\displaystyle{ i }[/math] randomly with probability [math]\displaystyle{ \frac{w_i}{W} }[/math] 
    predict as the chosen expert predicts
    multiply the weights of all experts who predicted wrongly by [math]\displaystyle{ \beta }[/math]

The goal is to bound the worst-case expected number of mistakes, assuming that the adversary has to select one of the answers as correct before we make our coin toss. This is a reasonable assumption in, for instance, the stock market example provided above: the variance of a stock price should not depend on the opinions of experts that influence private buy or sell decisions, so we can treat the price change as if it was decided before the experts gave their recommendations for the day.

The randomized algorithm is better in the worst case than the deterministic algorithm (weighted majority algorithm): in the latter, the worst case was when the weights were split 50/50. But in the randomized version, since the weights are used as probabilities, there would still be a 50/50 chance of getting it right. In addition, generalizing to multiplying the weights of the incorrect experts by [math]\displaystyle{ \beta \lt 1 }[/math] instead of strictly [math]\displaystyle{ \frac{1}{2} }[/math] allows us to trade off between dependence on [math]\displaystyle{ m }[/math] and [math]\displaystyle{ \log_2n }[/math]. This trade-off will be quantified in the analysis section.

Analysis

Let [math]\displaystyle{ W_t }[/math] denote the total weight of all experts at round [math]\displaystyle{ t }[/math]. Also let [math]\displaystyle{ F_t }[/math] denote the fraction of weight placed on experts which predict the wrong answer at round [math]\displaystyle{ t }[/math]. Finally, let [math]\displaystyle{ N }[/math] be the total number of rounds in the process.

By definition, [math]\displaystyle{ F_t }[/math] is the probability that the algorithm makes a mistake on round [math]\displaystyle{ t }[/math]. It follows from the linearity of expectation that if [math]\displaystyle{ M }[/math] denotes the total number of mistakes made during the entire process, [math]\displaystyle{ E[M] = \sum_{t=1}^N F_t }[/math].

After round [math]\displaystyle{ t }[/math], the total weight is decreased by[math]\displaystyle{ \ (1-\beta)F_tW_t }[/math], since all weights corresponding to a wrong answer are multiplied by[math]\displaystyle{ \ \beta \lt 1 }[/math]. It then follows that [math]\displaystyle{ W_{t+1} = W_t(1-(1-\beta)F_t) }[/math]. By telescoping, since [math]\displaystyle{ W_1 = n }[/math], it follows that the total weight after the process concludes is

[math]\displaystyle{ \begin{align} W=n \prod_{t=1}^N (1-(1-\beta)F_t). \end{align} }[/math]

On the other hand, suppose that [math]\displaystyle{ \ m }[/math] is the number of mistakes made by the best-performing expert. At the end, this expert has weight [math]\displaystyle{ \ \beta^m }[/math]. It follows, then, that the total weight is at least this much; in other words, [math]\displaystyle{ \ W\geq \beta^m }[/math]. This inequality and the above result imply

[math]\displaystyle{ \begin{align} n \prod_{t=1}^N (1-(1-\beta)F_t) \geq \beta^m. \end{align} }[/math]

Taking the natural logarithm of both sides yields

[math]\displaystyle{ \begin{align} \ln n + \sum_{t=1}^N \ln (1-(1-\beta)F_t) \geq m \ln\beta . \end{align} }[/math]

Now, the Taylor series of the natural logarithm is

[math]\displaystyle{ \begin{align} \ln (1-x)= - x - \frac {x^2}{2} - \frac {x^3}{3} - \cdots \end{align} }[/math]

In particular, it follows that[math]\displaystyle{ \ \ln (1-(1-\beta)F_t)\lt -(1-\beta)F_t }[/math].
Thus,

[math]\displaystyle{ \begin{align} \ln n - (1-\beta) \sum_{t=1}^N F_t \geq m \ln\beta . \end{align} }[/math]

Recalling that [math]\displaystyle{ E[M] = \sum_{t=1}^N F_t }[/math] and rearranging, it follows that

[math]\displaystyle{ \begin{align} E[M] \leq \frac {m \ln(1/\beta) + \ln(n)}{1-\beta} = \frac{ \ln(1/\beta) }{1-\beta}m + \frac{1}{1-\beta}\ln(n) . \end{align} }[/math]

Now, as [math]\displaystyle{ \beta \to 1 }[/math] from below, the first constant tends to [math]\displaystyle{ 1 }[/math]; however, the second constant tends to [math]\displaystyle{ +\infty }[/math]. To quantify this tradeoff, define [math]\displaystyle{ \varepsilon = 1-\beta }[/math] to be the penalty associated with getting a prediction wrong. Then, again applying the Taylor series of the natural logarithm,

[math]\displaystyle{ \begin{align} \frac{\ln(1/\beta)}{1-\beta} = -\frac{\ln(\beta)}{1-\beta} = \frac{-\ln(1-\varepsilon)}{\varepsilon} = \frac{\varepsilon + \frac {\varepsilon^2}{2} + \frac {\varepsilon^3}{3} + \cdots}{\varepsilon} = 1 + \frac {\varepsilon}{2} + O(\varepsilon^2) \end{align} }[/math]

It then follows that the mistake bound, for small [math]\displaystyle{ \varepsilon }[/math], can be written in the form [math]\displaystyle{ \ \left(1+\frac{\epsilon}{2}+O(\varepsilon^2)\right)m + \epsilon^{-1}\ln(n) }[/math].

In English, the less that we penalize experts for their mistakes, the more that additional experts will lead to initial mistakes but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of [math]\displaystyle{ \varepsilon }[/math] and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert.

In particular, as long as [math]\displaystyle{ m }[/math] is sufficiently large compared to [math]\displaystyle{ \ln(n) }[/math] (so that their ratio is sufficiently small), we can assign

[math]\displaystyle{ \begin{align} \varepsilon = \sqrt{\frac{\ln(n)}{m}} \end{align} }[/math]

we can obtain an upper bound on the number of mistakes equal to

[math]\displaystyle{ \begin{align} m + O(\sqrt{m \ln(n)}). \end{align} }[/math]

This implies that the "regret bound" on the algorithm (that is, how much worse it performs than the best expert) is sublinear, at [math]\displaystyle{ O(\sqrt{m \ln(n)}) }[/math].

Revisiting the motivation

Recall that the motivation for the randomized weighted majority algorithm was given by an example where the best expert makes a mistake 20% of the time. Precisely, in [math]\displaystyle{ N = 100 }[/math] rounds, with [math]\displaystyle{ n = 10 }[/math] experts, where the best expert makes [math]\displaystyle{ m = 20 }[/math] mistakes, the deterministic weighted majority algorithm only guarantees an upper bound of [math]\displaystyle{ 2.4 (\log_2 10 + 20) \approx 56 }[/math]. By the analysis above, it follows that minimizing the number of worst-case expected mistakes is equivalent to minimizing the function

[math]\displaystyle{ \begin{align} \frac{ \ln (1/\beta) }{1-\beta}20 + \frac{1}{1-\beta}\ln (10) . \end{align} }[/math]

Computational methods show that the optimal value is roughly [math]\displaystyle{ \beta \approx 0.641 }[/math], which results in the minimal worst-case number of expected mistakes of [math]\displaystyle{ E[M] \approx 31.19 }[/math]. When the number of rounds is increased (say, to [math]\displaystyle{ N = 1000000 }[/math]) while the accuracy rate of the best expert is kept the same the improvement can be even more dramatic; the weighted majority algorithm guarantees only a worst-case mistake rate of 48.0%, but the randomized weighted majority algorithm, when properly tuned to the optimal value of [math]\displaystyle{ \varepsilon \approx 0.0117 }[/math], achieves a worst-case mistake rate of 20.2%.

Uses of Randomized Weighted Majority Algorithm (RWMA)

The Randomized Weighted Majority Algorithm can be used to combine multiple algorithms in which case RWMA can be expected to perform nearly as well as the best of the original algorithms in hindsight. Note that the RWMA can be generalized to solve problems which do not have binary mistake variables, which makes it useful for a wide class of problems.

Furthermore, one can apply the Randomized Weighted Majority Algorithm in situations where experts are making choices that cannot be combined (or can't be combined easily). For example, RWMA can be applied to repeated game-playing or the online shortest path problem. In the online shortest path problem, each expert is telling you a different way to drive to work. You pick one path using RWMA. Later you find out how well you would have done using all of the suggested paths and penalize appropriately. The goal is to have an expected loss not much larger than the loss of the best expert.

Applications in software

The randomized weighted majority algorithm has been proposed as a new method for several practical software applications, particularly in the domains of bug detection and cyber-security.[3] [4] For instance, Varsha and Madhavu (2021) describe how the randomized weighted majority algorithm can be used to replace conventional voting within a random forest classification approach to detect insider threats. Using experimental results, they show that this approach obtained a higher level of accuracy and recall compared to the standard random forest algorithm. Moustafa et al. (2018) have studied how an ensemble classifier based on the randomized weighted majority algorithm could be used to detect bugs earlier in the software development process, after being trained on existing software repositories.

Extensions

  • Multi-armed bandit problem.
  • Efficient algorithm for some cases with many experts.
  • Sleeping experts/"specialists" setting.

See also

References

  1. Littlestone, N.; Warmuth, M. (1994). "The Weighted Majority Algorithm". Information and Computation 108 (2): 212–261. doi:10.1006/inco.1994.1009. 
  2. "COS 511: Foundations of Machine Learning". 20 March 2006. http://www.cs.princeton.edu/courses/archive/spr06/cos511/scribe_notes/0330.pdf. 
  3. Suresh, P. Varsha; Madhavu, Minu Lalitha (2021). "Insider Attack: Internal Cyber Attack Detection Using Machine Learning". 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT). pp. 1–7. doi:10.1109/ICCCNT51525.2021.9579549. ISBN 978-1-7281-8595-8. 
  4. Moustafa, Sammar; El Nainay, Mustafa Y; El Makky, Nagwa; Abougabal, Mohamed S. (2018). "Software bug prediction using weighted majority voting techniques". Alexandria Engineering Journal 57 (4): 2763–2774. doi:10.1016/j.aej.2018.01.003. 

Further reading