Upper Confidence Bound

Upper Confidence Bound (UCB)
Class: Multi-armed bandit; reinforcement learning
Data structure: Sequential reward observations
Worst-case performance: O(K) per round (K = number of arms)
Average performance: O(K) per round
Worst-case space complexity: O(K)


Upper Confidence Bound (UCB) is a family of algorithms in machine learning and statistics for solving the multi-armed bandit problem and addressing the exploration–exploitation trade-off. UCB methods select actions by computing an upper confidence estimate of each action's potential reward, balancing exploration of uncertain options against exploitation of those known to perform well. Confidence-bound methods for stochastic bandits date back to the work of Tze Leung Lai and Herbert Robbins in 1985, and the first UCB algorithm is due to Lai (1987).[1][2][3] UCB and its variants have become standard techniques in reinforcement learning, online advertising, recommender systems, clinical trials, and Monte Carlo tree search.[4][5][6]

Background

The multi-armed bandit problem models a scenario where an agent chooses repeatedly among K options ("arms"), each yielding stochastic rewards, with the goal of maximizing the sum of collected rewards over time. The main challenge is the exploration–exploitation trade-off: the agent must explore lesser-tried arms to learn their rewards, yet exploit the best-known arm to maximize payoff.[6] Traditional ε-greedy or softmax strategies use randomness to force exploration; UCB algorithms instead use statistical confidence bounds to guide exploration more efficiently.[5]
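As a running example for the algorithms below, the following minimal Python sketch models a stochastic bandit with Bernoulli rewards; the class name BernoulliBandit and the arm probabilities are illustrative assumptions, not part of any standard library.

import random

class BernoulliBandit:
    """A K-armed bandit whose arm i pays 1 with probability probs[i], else 0."""
    def __init__(self, probs):
        self.probs = list(probs)   # true (hidden) mean reward of each arm
        self.n_arms = len(self.probs)

    def pull(self, arm):
        """Play one arm; return a stochastic 0/1 reward."""
        return 1.0 if random.random() < self.probs[arm] else 0.0

# Example: three arms; a good agent should discover that arm 2 is best.
bandit = BernoulliBandit([0.3, 0.5, 0.7])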

The UCB1 algorithm

Figure: behaviour of a UCB algorithm on a bandit run.

UCB1 is a widely used bounded-reward variant of UCB introduced by Auer, Cesa-Bianchi and Fischer (2002).[4] It maintains for each arm $i$:

  • the empirical mean reward $\hat{\mu}_i$,
  • the count $n_i$ of times arm $i$ has been played.

At round $t$, it selects the arm maximizing

$\mathrm{UCB1}_i(t) = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}}.$

Arms with $n_i = 0$ are each played once initially. The bonus term $\sqrt{2 \ln t / n_i}$ shrinks as $n_i$ grows, ensuring exploration of less-tried arms and exploitation of high-mean arms.[4]
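For example, suppose at round $t = 100$ arm 1 has $\hat{\mu}_1 = 0.6$ after $n_1 = 90$ plays, while arm 2 has $\hat{\mu}_2 = 0.5$ after only $n_2 = 10$ plays. Then

$\mathrm{UCB1}_1(100) = 0.6 + \sqrt{2 \ln 100 / 90} \approx 0.92 \qquad \mathrm{UCB1}_2(100) = 0.5 + \sqrt{2 \ln 100 / 10} \approx 1.46,$

so UCB1 plays the less-explored arm 2 despite its lower empirical mean.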

Pseudocode

for each arm i:
    n[i] ← 0; Q[i] ← 0                      # play counts and mean estimates
for t from 1 to T:
    if some arm i has n[i] = 0 then
        a ← i                               # play every arm once first
    else
        for each arm i:
            index[i] ← Q[i] + sqrt((2 * ln t) / n[i])
        a ← arm with the highest index
    play arm a and observe reward r
    n[a] ← n[a] + 1
    Q[a] ← Q[a] + (r − Q[a]) / n[a]         # incremental mean update
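
The pseudocode translates directly into Python. The sketch below is a minimal implementation for illustration, reusing the BernoulliBandit class assumed in the Background section:

import math

def ucb1(bandit, horizon):
    """Run UCB1 for `horizon` rounds; return per-arm counts and mean estimates."""
    K = bandit.n_arms
    counts = [0] * K     # n[i]: times arm i was played
    means = [0.0] * K    # Q[i]: empirical mean reward of arm i
    for t in range(1, horizon + 1):
        untried = [i for i in range(K) if counts[i] == 0]
        if untried:
            arm = untried[0]                 # play every arm once first
        else:
            # Pick the arm with the largest upper confidence index.
            arm = max(range(K), key=lambda i: means[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = bandit.pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean
    return counts, means

counts, means = ucb1(BernoulliBandit([0.3, 0.5, 0.7]), horizon=10000)
print(counts)   # the bulk of the plays should go to arm 2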

Theoretical properties

Auer et al. proved that UCB1 achieves logarithmic regret: after $n$ rounds, the expected regret $R(n)$ satisfies

$R(n) = O\left( \sum_{i : \Delta_i > 0} \frac{\ln n}{\Delta_i} \right),$

where $\Delta_i$ is the gap between the optimal arm's mean and arm $i$'s mean. Thus the average regret per round tends to 0 as $n \to \infty$, and UCB1 is near-optimal with respect to the Lai–Robbins lower bound.[4][1]
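
More precisely, Theorem 1 of Auer et al. bounds the expected regret for rewards in $[0, 1]$ by

$R(n) \le \sum_{i : \Delta_i > 0} \frac{8 \ln n}{\Delta_i} + \left(1 + \frac{\pi^2}{3}\right) \sum_{j=1}^{K} \Delta_j,$

so the leading constant in the logarithmic term is 8.[4]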

Variants

Several extensions improve or adapt UCB to different settings:

UCB2

Introduced in the same paper as UCB1, UCB2 divides plays into epochs controlled by a parameter α, reducing the constant in the regret bound at the cost of more complex scheduling.[4]

UCB1-Tuned

Incorporates an empirical variance estimate $V_i$ to tighten the bonus, using the index $\hat{\mu}_i + \sqrt{\frac{\ln t}{n_i} \min\{1/4,\, V_i\}}$. This often outperforms UCB1 in practice but lacks a simple regret proof.[4]
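
A sketch of the tuned index in Python; following Auer et al., the variance estimate is the empirical variance padded with an exploration term, and all names here are illustrative:

import math

def ucb1_tuned_index(mean, sq_mean, count, t):
    """UCB1-Tuned index for one arm; sq_mean is the mean of squared rewards."""
    # Empirical variance plus an exploration pad, per Auer et al. (2002).
    v = sq_mean - mean ** 2 + math.sqrt(2.0 * math.log(t) / count)
    # Rewards in [0, 1] have variance at most 1/4, hence the min.
    return mean + math.sqrt(math.log(t) / count * min(0.25, v))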

KL-UCB

Replaces Hoeffding's bound with a Kullback–Leibler divergence condition, yielding asymptotically optimal regret (with leading constant 1) for Bernoulli rewards.[7][8]
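
For Bernoulli rewards the KL-UCB index of arm $i$ is the largest $q \ge \hat{\mu}_i$ with $n_i \, d(\hat{\mu}_i, q) \le \ln t$ (up to lower-order terms), where $d(p, q)$ is the Bernoulli KL divergence. A minimal bisection-based sketch in Python:

import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(mean, count, t, iters=32):
    """Largest q in [mean, 1] with count * d(mean, q) <= ln(t), by bisection."""
    target = math.log(t) / count
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= target:
            lo = mid    # mid still satisfies the constraint; search higher
        else:
            hi = mid
    return lo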

Bayesian UCB (Bayes-UCB)

Computes the $(1 - \delta)$-quantile of a Bayesian posterior (e.g. a Beta posterior for Bernoulli rewards) as the index. Proven asymptotically optimal under certain priors.[9]
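
For Bernoulli rewards under a uniform Beta(1, 1) prior, the index is a posterior quantile. The sketch below uses SciPy's Beta quantile function and, as a simplifying assumption, the schedule $\delta = 1/t$; Kaufmann et al.'s analysis uses a logarithmically corrected refinement of this schedule:

from scipy.stats import beta

def bayes_ucb_index(successes, failures, t):
    """(1 - 1/t)-quantile of the Beta posterior under a Beta(1, 1) prior."""
    return beta.ppf(1.0 - 1.0 / t, successes + 1, failures + 1)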

Contextual UCB (e.g., LinUCB)

Extends UCB to contextual bandits by estimating a linear reward model and confidence ellipsoids in parameter space. Widely used in news recommendation.[10]
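
A per-arm sketch of the disjoint LinUCB index from Li et al. (2010) in Python with NumPy; the parameter alpha scales the width of the confidence ellipsoid:

import numpy as np

class LinUCBArm:
    """Per-arm state for disjoint LinUCB (Li et al., 2010)."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)       # ridge-regularised Gram matrix of contexts
        self.b = np.zeros(d)     # accumulated reward-weighted contexts
        self.alpha = alpha

    def index(self, x):
        """Upper confidence estimate of this arm's reward for context x."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b   # ridge-regression estimate of the weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

At each round the agent computes index(x) for every arm's current context vector and plays the argmax.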

Applications

UCB algorithms’ simplicity and strong guarantees make them popular in:

  • Online advertising & A/B testing: adaptively allocate traffic to maximize conversion rates without fixed split ratios.[6]
  • Monte Carlo Tree Search: UCT uses UCB1 at each tree node to guide exploration in games like Go.[11][12]
  • Adaptive clinical trials: assign patients to treatments with highest upper confidence on success, improving outcomes over randomization.[13]
  • Recommender systems: personalized content selection under uncertainty.
  • Robotics & control: efficient exploration of unknown dynamics.

References

  1. Lai, Tze Leung; Robbins, Herbert (1985). "Asymptotically Efficient Adaptive Allocation Rules". Advances in Applied Mathematics 6 (1): 4–22. doi:10.1016/0196-8858(85)90002-8.
  2. Lai, Tze Leung (1987). "Adaptive treatment allocation and the multi-armed bandit problem". The Annals of Statistics 15 (3): 1091–1114. doi:10.1214/AOS/1176350495.
  3. Lattimore, Tor; Szepesvári, Csaba (2020). Bandit Algorithms. Cambridge University Press. ISBN 978-1-108-48682-8.
  4. Auer, Peter; Cesa-Bianchi, Nicolò; Fischer, Paul (2002). "Finite-time Analysis of the Multiarmed Bandit Problem". Machine Learning 47: 235–256. doi:10.1023/A:1013689704352. https://link.springer.com/article/10.1023/A:1013689704352.
  5. Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. ISBN 978-0-262-03924-6.
  6. Bubeck, Sébastien; Cesa-Bianchi, Nicolò (2012). "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems". Foundations and Trends in Machine Learning 5 (1): 1–122. doi:10.1561/2200000024.
  7. Garivier, Aurélien; Cappé, Olivier (2011). "The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond". JMLR Workshop and Conference Proceedings 19 (COLT 2011): 359–376. https://proceedings.mlr.press/v19/garivier11a.html.
  8. Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2011). "A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergence". JMLR Workshop and Conference Proceedings 19 (COLT 2011): 497–514. https://proceedings.mlr.press/v19/maillard11a.html.
  9. Kaufmann, Emilie; Cappé, Olivier; Garivier, Aurélien (2012). "Bayesian Upper Confidence Bounds for Bandit Problems". Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS). pp. 2177–85.
  10. Li, Lihong; Chu, Wei; Langford, John; Schapire, Robert E. (2010). "A contextual-bandit approach to personalized news article recommendation". Proceedings of the 19th International Conference on World Wide Web (WWW '10). pp. 661–670. doi:10.1145/1772690.1772758.
  11. Kocsis, Levente; Szepesvári, Csaba (2006). "Bandit based Monte-Carlo planning". Machine Learning: ECML 2006. pp. 282–293. doi:10.1007/11871842_29.
  12. Silver, David; Huang, Aja; Maddison, Chris J. et al. (2016). "Mastering the game of Go with deep neural networks and tree search". Nature 529 (7587): 484–9. doi:10.1038/nature16961. PMID 26819042. Bibcode: 2016Natur.529..484S.
  13. Safikhani, Maryam; Jo, Yuxin (2022). "Retrospective analysis indicates bandit approaches improve patient outcomes in clinical trials". BMC Medical Research Methodology 22: 117. doi:10.1186/s12874-022-01636-7.