Explore-then-commit algorithm

Short description: Algorithm for the multi-armed bandit problem

Explore-then-commit (ETC) is an algorithm for the multi-armed bandit problem, focused on finding the best trade-off between exploration and exploitation.

Multi-armed bandit problem

The multi-armed bandit problem is a sequential game where one player has to choose at each turn between $K$ actions (arms). Behind every arm $a$ is an unknown distribution $ν_{a}$ that lies in a set $𝒟$ known by the player (for example, $𝒟$ can be the set of Gaussian distributions or Bernoulli distributions).

At each turn $t$, the player chooses (pulls) an arm $a_{t}$ and then receives an observation $X_{t}$ drawn from the distribution $ν_{a_{t}}$.

Regret minimization

The goal is to minimize the regret at time $T$, which is defined as

R_{T} := \sum_{a = 1}^{K} Δ_{a} 𝔼 [N_{a} (T)]

where

$μ_{a} := 𝔼_{X \sim ν_{a}} [X]$ is the mean of arm $a$
$μ^{*} := \max_{a} μ_{a}$ is the highest mean
$Δ_{a} := μ^{*} - μ_{a}$ is the gap between the highest mean and the mean of arm $a$
$N_{a} (t)$ is the number of pulls of arm $a$ up to turn $t$

The player has to find an algorithm that chooses, at each turn $t$, which arm to pull based on the previous actions and observations $(a_{s}, X_{s})_{s < t}$, so as to minimize the regret $R_{T}$.

This is a trade-off problem between exploration (finding the arm with the highest mean) and exploitation (playing the arm which is perceived to be the best as much as possible).^[1]
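As a small illustration of the regret definition above, the following sketch evaluates $R_{T}$ from the arm means and the expected pull counts. The function name `regret` and the three-arm example values are assumptions made for illustration only; they do not come from the cited references.

```python
def regret(means, expected_pulls):
    """Evaluate R_T = sum_a Delta_a * E[N_a(T)].

    means          -- list of arm means mu_a (known here only for illustration)
    expected_pulls -- list of expected pull counts E[N_a(T)], one per arm
    """
    best = max(means)  # mu^*, the highest mean
    return sum((best - mu) * n for mu, n in zip(means, expected_pulls))


# Hypothetical example: three arms and horizon T = 100.
means = [0.5, 0.7, 0.4]
uniform_pulls = [100 / 3] * 3   # pure exploration: every arm pulled equally often
oracle_pulls = [0, 100, 0]      # pure exploitation by a player who knows the best arm

print(regret(means, uniform_pulls))
print(regret(means, oracle_pulls))
```

Only pulls of suboptimal arms contribute to the sum, which is why a player who always pulls the best arm has zero regret.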

Algorithm

Figure: Two runs of ETC with the same $M = 10$. In the first run the algorithm finds the best arm after the exploration phase, while in the second run it does not.

The algorithm first explores by pulling each arm $M$ times. For the rest of the game it exploits its discoveries by playing the arm with the highest empirical mean. If the horizon $T$ is known, then the number of explorations $M$ can depend on $T$.

Variants of the algorithm exist,^[2] including adaptations to other settings.^[3]

Pseudocode

The player chooses M
for each arm i do:
    select arm i M times
    set mu[i] to the empirical mean of the M observed rewards
a = arm with the highest empirical mean mu[a]
for t from MK+1 to T do:
    select arm a
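The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the interface (arms as functions that take a random generator and return a reward), the helper `gaussian_arm`, and the `seed` parameter are all assumptions introduced here.

```python
import random


def gaussian_arm(mu, sigma=1.0):
    """Illustrative helper: reward sampler for a Gaussian arm with mean mu."""
    return lambda rng: rng.gauss(mu, sigma)


def explore_then_commit(arms, M, T, seed=None):
    """Run ETC for T turns, exploring each arm M times.

    Returns (committed arm index, pull counts per arm, total reward).
    """
    K = len(arms)
    if M < 1 or M * K > T:
        raise ValueError("need M >= 1 and M * K <= T")
    rng = random.Random(seed)
    counts = [0] * K
    sums = [0.0] * K
    total_reward = 0.0

    # Exploration phase: pull every arm M times.
    for i in range(K):
        for _ in range(M):
            x = arms[i](rng)
            sums[i] += x
            counts[i] += 1
            total_reward += x

    # Commit to the arm with the highest empirical mean.
    empirical_means = [sums[i] / counts[i] for i in range(K)]
    best = max(range(K), key=lambda i: empirical_means[i])

    # Commit phase: play the chosen arm for the remaining T - M*K turns.
    for _ in range(T - M * K):
        total_reward += arms[best](rng)
        counts[best] += 1

    return best, counts, total_reward


# Hypothetical two-armed instance with gap 0.5.
arms = [gaussian_arm(0.0), gaussian_arm(0.5)]
best, counts, total_reward = explore_then_commit(arms, M=10, T=1000, seed=0)
print(best, counts)
```

Because the commitment is made once and never revised, a run in which the exploration samples are misleading keeps playing a suboptimal arm until the horizon, which is the failure mode shown in the second run of the figure.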

Theoretical results

Figure: Trade-off between exploration (large $M$) and exploitation (small $M$) for ETC

When all arms are $1$-sub-Gaussian and each arm is explored $M$ times, the regret at time $T$ satisfies^[1]

R_{T} \leq M \sum_{i = 1}^{K} Δ_{i} + (T - M K) \sum_{i = 1}^{K} Δ_{i} \exp \left( - \frac{M Δ_{i}^{2}}{4} \right)

The first term,

M \sum_{i = 1}^{K} Δ_{i}

is the cost of the exploration. The second term,

(T - M K) \sum_{i = 1}^{K} Δ_{i} \exp \left( - \frac{M Δ_{i}^{2}}{4} \right)

is the cost of not having explored enough: it accounts for the probability that the arm with the highest empirical mean is not an optimal arm.

Increasing $M$ increases the first term while decreasing the second term. The best possible $M$ must depend on the gaps $(Δ_{i})_{i}$, which are unknown to the player.
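The trade-off in $M$ can be made concrete by evaluating the upper bound numerically. The sketch below assumes the gaps are known, which is not the case for the player; it is meant only to show how the two terms move in opposite directions. The function names `etc_regret_bound` and `best_M` and the example gaps are assumptions made here.

```python
import math


def etc_regret_bound(gaps, M, T):
    """Upper bound M * sum(gaps) + (T - M*K) * sum(gap * exp(-M * gap^2 / 4))."""
    K = len(gaps)
    exploration_cost = M * sum(gaps)
    commit_cost = (T - M * K) * sum(d * math.exp(-M * d * d / 4) for d in gaps)
    return exploration_cost + commit_cost


def best_M(gaps, T):
    """Value of M in 1..T//K that minimizes the bound (brute-force search)."""
    K = len(gaps)
    return min(range(1, T // K + 1), key=lambda M: etc_regret_bound(gaps, M, T))


# Hypothetical two-armed instance: one optimal arm (gap 0) and one arm with gap 0.5.
gaps = [0.0, 0.5]
T = 1000
M_star = best_M(gaps, T)
for M in (1, M_star, T // len(gaps)):
    print(M, etc_regret_bound(gaps, M, T))
```

Note that this minimizes the bound rather than the regret itself, so the resulting $M$ is only a guide to how the exploration length should scale with the gaps and the horizon.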

For two arms with Gaussian distributions of variance $1$, it was proved that ETC cannot achieve the asymptotically optimal regret given by the Lai–Robbins lower bound.^[4]
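For context, the Lai–Robbins lower bound for consistent algorithms can be written in the notation of this article as below, where $\mathrm{KL}$ denotes the Kullback–Leibler divergence and $ν^{*}$ is the distribution of an optimal arm (both symbols are introduced here and are not used elsewhere in the article). The second line specializes the bound to Gaussian arms of variance $1$, for which the divergence equals $Δ_{a}^{2} / 2$.

```latex
\liminf_{T \to \infty} \frac{R_{T}}{\log T}
  \geq \sum_{a : Δ_{a} > 0} \frac{Δ_{a}}{\mathrm{KL}(ν_{a}, ν^{*})}

% Gaussian arms with variance 1: \mathrm{KL}(ν_{a}, ν^{*}) = Δ_{a}^{2} / 2
\liminf_{T \to \infty} \frac{R_{T}}{\log T}
  \geq \sum_{a : Δ_{a} > 0} \frac{2}{Δ_{a}}
```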

References

[1] Lattimore, Tor; Szepesvári, Csaba (2020). Bandit Algorithms. Cambridge University Press. doi:10.1017/9781108571401.
[2] Jin, Tianyuan; Xu, Pan; Xiao, Xiaokui; Gu, Quanquan (2021). "Double Explore-then-Commit: Asymptotic Optimality and Beyond". Proceedings of Machine Learning Research. pp. 2584–2633.
[3] Nie, Guanyu; Agarwal, Mridul; Umrawal, Abhishek Kumar; Aggarwal, Vaneet; Quinn, Christopher John (2022). "An Explore-then-Commit Algorithm for Submodular Maximization under Full-Bandit Feedback". Proceedings of Machine Learning Research. pp. 1541–1551.
[4] Garivier, Aurélien; Kaufmann, Emilie; Lattimore, Tor (2016). "On Explore-Then-Commit Strategies".


Original source: https://en.wikipedia.org/wiki/Explore-then-commit algorithm.
