Upper Confidence Bound

Upper Confidence Bound (UCB)
Class: Multi-armed bandit; reinforcement learning
Data structure: Sequential reward observations
Worst-case performance: O(K) per round (K = number of arms)
Average performance: O(K) per round
Worst-case space complexity: O(K)


Upper Confidence Bound (UCB) is a family of algorithms in machine learning and statistics for solving the multi-armed bandit problem and addressing the exploration–exploitation trade-off. UCB methods select actions by computing an upper confidence estimate of each action's potential reward, balancing exploration of uncertain options against exploitation of those known to perform well. Confidence-bound methods for stochastic bandits date back to the work of Tze Leung Lai and Herbert Robbins in 1985, and the first UCB algorithm is due to Lai (1987).[1][2][3] UCB and its variants have become standard techniques in reinforcement learning, online advertising, recommender systems, clinical trials, and Monte Carlo tree search.[4][5][6]

Background

The multi-armed bandit problem models a scenario where an agent chooses repeatedly among K options ("arms"), each yielding stochastic rewards, with the goal of maximizing the sum of collected rewards over time. The main challenge is the exploration–exploitation trade-off: the agent must explore lesser-tried arms to learn their rewards, yet exploit the best-known arm to maximize payoff.[6] Traditional ε-greedy or softmax strategies use randomness to force exploration; UCB algorithms instead use statistical confidence bounds to guide exploration more efficiently.[5]
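As a running example for the algorithms below, the following minimal Python sketch models a stochastic bandit with Bernoulli rewards; the class name BernoulliBandit and the arm probabilities are illustrative assumptions, not part of any standard library.

import random

class BernoulliBandit:
    """A K-armed bandit whose arm i pays 1 with probability probs[i], else 0."""
    def __init__(self, probs):
        self.probs = list(probs)   # true (hidden) mean reward of each arm
        self.n_arms = len(self.probs)

    def pull(self, arm):
        """Play one arm; return a stochastic 0/1 reward."""
        return 1.0 if random.random() < self.probs[arm] else 0.0

# Example: three arms; a good agent should discover that arm 2 is best.
bandit = BernoulliBandit([0.3, 0.5, 0.7])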

The UCB1 algorithm

Figure: behaviour of a UCB algorithm on a bandit run.

UCB1 is a widely used bounded-reward variant of UCB introduced by Auer, Cesa-Bianchi and Fischer (2002).[4] It maintains for each arm $i$:

  • the empirical mean reward $\hat{\mu}_i$,
  • the count $n_i$ of times arm $i$ has been played.

At round $t$, it selects the arm maximizing

$\mathrm{UCB1}_i(t) = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}}.$

Arms with $n_i = 0$ are each played once initially. The bonus term $\sqrt{2 \ln t / n_i}$ shrinks as $n_i$ grows, ensuring exploration of less-tried arms and exploitation of high-mean arms.[4]
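For example, suppose at round $t = 100$ arm 1 has $\hat{\mu}_1 = 0.6$ after $n_1 = 90$ plays, while arm 2 has $\hat{\mu}_2 = 0.5$ after only $n_2 = 10$ plays. Then

$\mathrm{UCB1}_1(100) = 0.6 + \sqrt{2 \ln 100 / 90} \approx 0.92 \qquad \mathrm{UCB1}_2(100) = 0.5 + \sqrt{2 \ln 100 / 10} \approx 1.46,$

so UCB1 plays the less-explored arm 2 despite its lower empirical mean.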

Pseudocode

for each arm i:
    n[i] ← 0; Q[i] ← 0                      # play counts and mean estimates
for t from 1 to T:
    if some arm i has n[i] = 0 then
        a ← i                               # play every arm once first
    else
        for each arm i:
            index[i] ← Q[i] + sqrt((2 * ln t) / n[i])
        a ← arm with the highest index
    play arm a and observe reward r
    n[a] ← n[a] + 1
    Q[a] ← Q[a] + (r − Q[a]) / n[a]         # incremental mean update
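
The pseudocode translates directly into Python. The sketch below is a minimal implementation for illustration, reusing the BernoulliBandit class assumed in the Background section:

import math

def ucb1(bandit, horizon):
    """Run UCB1 for `horizon` rounds; return per-arm counts and mean estimates."""
    K = bandit.n_arms
    counts = [0] * K     # n[i]: times arm i was played
    means = [0.0] * K    # Q[i]: empirical mean reward of arm i
    for t in range(1, horizon + 1):
        untried = [i for i in range(K) if counts[i] == 0]
        if untried:
            arm = untried[0]                 # play every arm once first
        else:
            # Pick the arm with the largest upper confidence index.
            arm = max(range(K), key=lambda i: means[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = bandit.pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean
    return counts, means

counts, means = ucb1(BernoulliBandit([0.3, 0.5, 0.7]), horizon=10000)
print(counts)   # the bulk of the plays should go to arm 2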

Theoretical properties

Auer et al. proved that UCB1 achieves logarithmic regret: after $n$ rounds, the expected regret $R(n)$ satisfies

$R(n) = O\left( \sum_{i : \Delta_i > 0} \frac{\ln n}{\Delta_i} \right),$

where $\Delta_i$ is the gap between the optimal arm's mean and arm $i$'s mean. Thus the average regret per round tends to 0 as $n \to \infty$, and UCB1 is near-optimal with respect to the Lai–Robbins lower bound.[4][1]
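
More precisely, Theorem 1 of Auer et al. bounds the expected regret for rewards in $[0, 1]$ by

$R(n) \le \sum_{i : \Delta_i > 0} \frac{8 \ln n}{\Delta_i} + \left(1 + \frac{\pi^2}{3}\right) \sum_{j=1}^{K} \Delta_j,$

so the leading constant in the logarithmic term is 8.[4]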

Variants

Several extensions improve or adapt UCB to different settings:

UCB2

Introduced in the same paper as UCB1, UCB2 divides plays into epochs controlled by a parameter α, reducing the constant in the regret bound at the cost of more complex scheduling.[4]

UCB1-Tuned

Incorporates an empirical variance estimate $V_i$ to tighten the bonus, using the index $\hat{\mu}_i + \sqrt{\frac{\ln t}{n_i} \min\{1/4,\, V_i\}}$. This often outperforms UCB1 in practice but lacks a simple regret proof.[4]
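
A sketch of the tuned index in Python; following Auer et al., the variance estimate is the empirical variance padded with an exploration term, and all names here are illustrative:

import math

def ucb1_tuned_index(mean, sq_mean, count, t):
    """UCB1-Tuned index for one arm; sq_mean is the mean of squared rewards."""
    # Empirical variance plus an exploration pad, per Auer et al. (2002).
    v = sq_mean - mean ** 2 + math.sqrt(2.0 * math.log(t) / count)
    # Rewards in [0, 1] have variance at most 1/4, hence the min.
    return mean + math.sqrt(math.log(t) / count * min(0.25, v))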

KL-UCB

Replaces Hoeffding's bound with a Kullback–Leibler divergence condition, yielding asymptotically optimal regret (with leading constant 1) for Bernoulli rewards.[7][8]
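
For Bernoulli rewards the KL-UCB index of arm $i$ is the largest $q \ge \hat{\mu}_i$ with $n_i \, d(\hat{\mu}_i, q) \le \ln t$ (up to lower-order terms), where $d(p, q)$ is the Bernoulli KL divergence. A minimal bisection-based sketch in Python:

import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(mean, count, t, iters=32):
    """Largest q in [mean, 1] with count * d(mean, q) <= ln(t), by bisection."""
    target = math.log(t) / count
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= target:
            lo = mid    # mid still satisfies the constraint; search higher
        else:
            hi = mid
    return lo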

Bayesian UCB (Bayes-UCB)

Computes the $(1 - \delta)$-quantile of a Bayesian posterior (e.g. a Beta posterior for Bernoulli rewards) as the index. Proven asymptotically optimal under certain priors.[9]
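
For Bernoulli rewards under a uniform Beta(1, 1) prior, the index is a posterior quantile. The sketch below uses SciPy's Beta quantile function and, as a simplifying assumption, the schedule $\delta = 1/t$; Kaufmann et al.'s analysis uses a logarithmically corrected refinement of this schedule:

from scipy.stats import beta

def bayes_ucb_index(successes, failures, t):
    """(1 - 1/t)-quantile of the Beta posterior under a Beta(1, 1) prior."""
    return beta.ppf(1.0 - 1.0 / t, successes + 1, failures + 1)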

Contextual UCB (e.g., LinUCB)

Extends UCB to contextual bandits by estimating a linear reward model and confidence ellipsoids in parameter space. Widely used in news recommendation.[10]
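
A per-arm sketch of the disjoint LinUCB index from Li et al. (2010) in Python with NumPy; the parameter alpha scales the width of the confidence ellipsoid:

import numpy as np

class LinUCBArm:
    """Per-arm state for disjoint LinUCB (Li et al., 2010)."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)       # ridge-regularised Gram matrix of contexts
        self.b = np.zeros(d)     # accumulated reward-weighted contexts
        self.alpha = alpha

    def index(self, x):
        """Upper confidence estimate of this arm's reward for context x."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b   # ridge-regression estimate of the weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

At each round the agent computes index(x) for every arm's current context vector and plays the argmax.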

Applications

UCB algorithms’ simplicity and strong guarantees make them popular in:

  • Online advertising & A/B testing: adaptively allocate traffic to maximize conversion rates without fixed split ratios.[6]
  • Monte Carlo Tree Search: UCT uses UCB1 at each tree node to guide exploration in games like Go.[11][12]
  • Adaptive clinical trials: assign patients to treatments with highest upper confidence on success, improving outcomes over randomization.[13]
  • Recommender systems: personalized content selection under uncertainty.
  • Robotics & control: efficient exploration of unknown dynamics.

References

  1. Lai, Tze Leung; Robbins, Herbert (1985). "Asymptotically Efficient Adaptive Allocation Rules". Advances in Applied Mathematics 6 (1): 4–22. doi:10.1016/0196-8858(85)90002-8.
  2. Lai, Tze Leung (1987). "Adaptive treatment allocation and the multi-armed bandit problem". The Annals of Statistics 15 (3): 1091–1114. doi:10.1214/AOS/1176350495.
  3. Lattimore, Tor; Szepesvári, Csaba (2020). Bandit Algorithms. Cambridge University Press. ISBN 978-1-108-48682-8.
  4. Auer, Peter; Cesa-Bianchi, Nicolò; Fischer, Paul (2002). "Finite-time Analysis of the Multiarmed Bandit Problem". Machine Learning 47: 235–256. doi:10.1023/A:1013689704352. https://link.springer.com/article/10.1023/A:1013689704352.
  5. Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. ISBN 978-0-262-03924-6.
  6. Bubeck, Sébastien; Cesa-Bianchi, Nicolò (2012). "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems". Foundations and Trends in Machine Learning 5 (1): 1–122. doi:10.1561/2200000024.
  7. Garivier, Aurélien; Cappé, Olivier (2011). "The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond". JMLR Workshop and Conference Proceedings 19 (COLT 2011): 359–376. https://proceedings.mlr.press/v19/garivier11a.html.
  8. Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2011). "A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergence". JMLR Workshop and Conference Proceedings 19 (COLT 2011): 497–514. https://proceedings.mlr.press/v19/maillard11a.html.
  9. Kaufmann, Emilie; Cappé, Olivier; Garivier, Aurélien (2012). "Bayesian Upper Confidence Bounds for Bandit Problems". Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS). pp. 2177–85.
  10. Li, Lihong; Chu, Wei; Langford, John; Schapire, Robert E. (2010). "A contextual-bandit approach to personalized news article recommendation". Proceedings of the 19th International Conference on World Wide Web (WWW '10). pp. 661–670. doi:10.1145/1772690.1772758.
  11. Kocsis, Levente; Szepesvári, Csaba (2006). "Bandit based Monte-Carlo planning". Machine Learning: ECML 2006. pp. 282–293. doi:10.1007/11871842_29.
  12. Silver, David; Huang, Aja; Maddison, Chris J. et al. (2016). "Mastering the game of Go with deep neural networks and tree search". Nature 529 (7587): 484–9. doi:10.1038/nature16961. PMID 26819042. Bibcode: 2016Natur.529..484S.
  13. Safikhani, Maryam; Jo, Yuxin (2022). "Retrospective analysis indicates bandit approaches improve patient outcomes in clinical trials". BMC Medical Research Methodology 22: 117. doi:10.1186/s12874-022-01636-7.