Bernoulli Bandit Problem

Updated 25 February 2026
  • The Bernoulli Bandit Problem is a sequential decision-making framework where each arm yields a binary reward with unknown success probabilities, posing a challenge in balancing exploration and exploitation.
  • The framework provides rigorous theoretical guarantees including logarithmic regret lower bounds, such as the Lai–Robbins bound, which guide the development of optimal sampling algorithms.
  • Methods like KL-UCB, Thompson Sampling, and forced exploration strategies are employed to minimize cumulative regret, demonstrating the problem's significance in online learning and adaptive optimization.

The Bernoulli Bandit Problem is a central formulation within sequential decision theory and online learning, in which a learner sequentially selects from a set of arms—each associated with an unknown success probability—and observes a binary (Bernoulli) reward upon each pull. The primary objective is typically to maximize cumulative reward or equivalently minimize regret relative to always playing the optimal arm. This framework underpins theoretical studies of exploration–exploitation trade-offs and serves as the canonical testbed for the analysis of regret minimization algorithms, with connections to Bayesian design, empirical process theory, and high-dimensional statistics. The Bernoulli bandit appears in both finite- and infinite-armed settings, with implications for the theory and practice of online optimization, statistical experiment design, reinforcement learning, and adaptive sampling.

1. Formal Problem Definition and Canonical Models

In the classical K-armed Bernoulli bandit, each arm $i \in \{1,\dots,K\}$ yields i.i.d. rewards $X_{i,t} \sim \mathrm{Bernoulli}(\theta_i)$, with $\theta_i \in [0,1]$ unknown. The learner selects an arm $A_t$ at each round $t=1,2,\dots,T$, observes the binary reward, and accumulates total reward or regret. The cumulative regret, typically the primary performance metric, is

$$R_T = \mathbb{E}\left[ \sum_{t=1}^T (\theta^* - \theta_{A_t}) \right] = \sum_{i:\,\theta_i < \theta^*} \Delta_i\, \mathbb{E}[N_i(T)],$$

where $\theta^* = \max_i \theta_i$, $\Delta_i = \theta^* - \theta_i$, and $N_i(T)$ is the number of times arm $i$ is pulled up to time $T$ (Baudry et al., 2020, Qi et al., 2023). In the infinite-armed setting, the $\{\theta_i\}$ are drawn from a prior distribution, and the learner must trade off sampling new arms against exploiting those already explored (Chan et al., 2018).
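For concreteness, the following minimal Python sketch (illustrative, not from any cited paper) simulates a $K$-armed Bernoulli bandit and evaluates the realized version of the regret decomposition above, accumulating $\Delta_i N_i(T)$ from actual pull counts:

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli bandit environment with fixed unknown means theta."""

    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta, dtype=float)
        self.rng = np.random.default_rng(seed)
        self.pulls = np.zeros(len(self.theta), dtype=int)   # N_i(T)

    def pull(self, i):
        """Pull arm i and return a Bernoulli(theta_i) reward."""
        self.pulls[i] += 1
        return float(self.rng.random() < self.theta[i])

    def regret(self):
        """Realized analogue of R_T = sum_i Delta_i * N_i(T)."""
        gaps = self.theta.max() - self.theta                # Delta_i
        return float(gaps @ self.pulls)

# Uniform-random play: regret grows linearly, roughly (mean gap) * T.
env = BernoulliBandit([0.3, 0.5, 0.7], seed=42)
rng = np.random.default_rng(1)
for t in range(10_000):
    env.pull(int(rng.integers(3)))
print(env.regret())   # approximately (0.4 + 0.2 + 0.0) / 3 * 10000 = 2000
```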

The Bayesian variant specifies priors (e.g., Beta distributions) over the $\theta_i$ and considers Bayes-optimal and heuristic solutions in both finite- and infinite-horizon scenarios (Jacko, 2019).

2. Regret Theory: Asymptotics, Minimax, and Lower Bounds

For the finite-armed stochastic Bernoulli bandit, the optimal instance-dependent regret is known to be logarithmic in the time horizon. The Lai–Robbins lower bound states that

$$\mathbb{E}[N_i(T)] \geq \frac{\ln T}{d(\theta_i,\theta^*)} + o(\ln T)$$

for each suboptimal arm $i$, where $d(p,q)$ denotes the Kullback–Leibler divergence between Bernoulli distributions with means $p$ and $q$ (Baudry et al., 2020). The corresponding regret lower bound is thus

$$R_T \geq \sum_{i:\,\Delta_i > 0} \frac{\Delta_i}{d(\theta_i, \theta^*)} \ln T + o(\ln T).$$
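The bound is easy to evaluate numerically. The helper below (a sketch; function names are ours) computes the Bernoulli KL divergence $d(p,q)$ and the constant $\sum_{i:\Delta_i>0} \Delta_i/d(\theta_i,\theta^*)$ multiplying $\ln T$:

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(theta):
    """Leading constant of the lower bound: sum_i Delta_i / d(theta_i, theta*)."""
    best = max(theta)
    return sum((best - t) / bernoulli_kl(t, best) for t in theta if t < best)

print(lai_robbins_constant([0.3, 0.5, 0.7]))  # ~3.47, so R_T >= ~3.47 ln T asymptotically
```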

In infinite-arm models with a prior density $g(\mu) \propto \mu^{\beta-1}$ near zero, the minimal Bayesian expected regret is of order $n^{\beta/(\beta+1)}$ for horizon $n$, with explicit constants depending on the prior (Chan et al., 2018).

For symmetric two-armed Bernoulli bandits whose means sum to one, minimax regret in various gap regimes is characterized sharply via diffusion/PDE approximations: in the small-gap regime, $R_T^* \sim (1/\sqrt{\pi})\sqrt{T} \approx 0.564\sqrt{T}$, while for a fixed nonvanishing gap $\Delta = 2\epsilon$, $R_T^* \sim 1/(2\epsilon)$ (Kobzar et al., 2022).

3. Algorithmic Methods and Regret Bounds

Classical and Non-Parametric Algorithms

  • KL-UCB achieves asymptotically optimal logarithmic regret, with the leading constant matching the Lai–Robbins lower bound. For Bernoulli arms, the confidence intervals are based on the KL-divergence between empirical and true means rather than on subgaussian tail bounds alone (Baudry et al., 2020); a minimal index computation is sketched after this list.
  • Thompson Sampling with a Beta prior is asymptotically optimal in Bernoulli bandits, matching KL-UCB up to constant factors (Baudry et al., 2020).
  • The Random-Block Sub-sampling Duelling Algorithm (RB-SDA) is distribution-free and matches the leading KL-divergence term in the regret for Bernoulli bandits without requiring prior knowledge or distribution-specific tuning (Baudry et al., 2020).
  • Forced Exploration methods alternate between greedy exploitation and schedule-driven forced arm draws, achieving regret scaling as $O((\log T)^2/\Delta_i)$ for Bernoulli arms with a nonparametric, distribution-blind implementation (Qi et al., 2023).
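As a concrete illustration of the first two bullets, the sketch below computes the Bernoulli KL-UCB index by bisection and performs one Thompson Sampling round with Beta posteriors. The plain $\ln t$ exploration rate is the simplest variant (refinements such as additional $\ln\ln t$ terms are omitted), so treat this as a didactic sketch rather than the exact algorithms analyzed in the cited papers:

```python
import math
import random

def bernoulli_kl(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, pulls, t, iters=40):
    """KL-UCB index: largest q >= mean with pulls * d(mean, q) <= ln t, by bisection."""
    budget = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def thompson_choice(successes, failures, rng=random):
    """One round of Thompson Sampling with Beta(1 + s_i, 1 + f_i) posteriors."""
    draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

# Example: index after 10 pulls with empirical mean 0.4, at round t = 100.
print(klucb_index(0.4, 10, 100))        # upper confidence value in (0.4, 1)
print(thompson_choice([3, 7], [7, 3]))  # usually selects arm 1 here
```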

Parametric, Structured, and Contextual Models

  • Parametric (linear–logistic) Bernoulli bandits: when arm means take the form $\theta_i = \sigma(a_i^\top \theta)$ with known feature vectors $a_i$, the Two-Phase Algorithm achieves regret $O(n\,f(T))$ for $n$-dimensional parameters and finitely many arms, independent of the number of arms $m$ for suitably chosen schedules $f(T)$, and $O(\sqrt{n^3 T})$ in infinite-armed settings (Jiang et al., 2011); the parameterization is sketched below.
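To make the parameterization concrete, here is a minimal sketch of the linear–logistic model together with a plain maximum-likelihood gradient step. This illustrates the model structure only; it is not the Two-Phase Algorithm itself, and the feature vectors, parameter values, and learning rate are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Known arm features a_i and an unknown shared parameter theta.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.7, 0.7]])                 # m = 3 arms, n = 2 dimensions
theta_true = np.array([0.5, -0.2])
arm_means = sigmoid(A @ theta_true)        # theta_i = sigma(a_i^T theta)

def mle_gradient_step(theta_hat, a, reward, lr=0.1):
    """One stochastic gradient step on the Bernoulli log-likelihood of a pull."""
    return theta_hat + lr * (reward - sigmoid(a @ theta_hat)) * a
```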

Bayes-Optimal and Myopic/Greedy Policies

  • Bayesian dynamic programming yields the optimal policy for known Beta priors in the finite-horizon case, tractable up to $T \sim 10^3$ via state-space enumeration (Jacko, 2019); a memoized sketch appears after this list.
  • Myopic strategies: for Bernoulli two-armed bandits, the greedy strategy (pulling the arm with the highest posterior mean) is provably optimal for maximizing the expected number of total successes and any thresholded utility, under minimal conditions (Chen et al., 2022).
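A memoized sketch of the two-armed finite-horizon dynamic program follows. It is illustrative: practical implementations enumerate the posterior state space iteratively, which is what makes horizons near $10^3$ feasible, while the naive recursion below suits only small horizons:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(h, s1, f1, s2, f2):
    """Bayes-optimal expected future successes with h pulls remaining and
    independent Beta(1 + s_i, 1 + f_i) posteriors on the two arms' means."""
    if h == 0:
        return 0.0
    p1 = (1 + s1) / (2 + s1 + f1)          # posterior mean of arm 1
    p2 = (1 + s2) / (2 + s2 + f2)          # posterior mean of arm 2
    q1 = p1 * (1 + value(h - 1, s1 + 1, f1, s2, f2)) \
        + (1 - p1) * value(h - 1, s1, f1 + 1, s2, f2)
    q2 = p2 * (1 + value(h - 1, s1, f1, s2 + 1, f2)) \
        + (1 - p2) * value(h - 1, s1, f1, s2, f2 + 1)
    return max(q1, q2)

# Expected successes over horizon 20 from uniform priors. The greedy rule of
# Chen et al. would instead just pull the arm with the larger posterior mean.
print(value(20, 0, 0, 0, 0))
```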

4. Generalizations: Nonstationarity, Infinite Arms, and Kernelized Bandits

Dynamic Bernoulli bandits: in settings where arm means $\mu_t(i)$ drift or switch over time, algorithms with adaptive estimation (e.g., adaptive forgetting factors in sample-mean or Bayesian estimators) outperform static methods. Adaptive-UCB, adaptive-$\epsilon$-greedy, and adaptive Thompson sampling show improved empirical performance without prior knowledge of the rate of change (Lu et al., 2017); theoretical guarantees for dynamic regret remain limited. A simplified forgetting-factor estimator is sketched below.
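A fixed forgetting factor conveys the idea; the cited work adapts the factor online, so the sketch below with a constant $\lambda$ is a simplification:

```python
class ForgettingMean:
    """Exponentially discounted estimate of a drifting Bernoulli mean.

    With forgetting factor lam in (0, 1], older rewards are downweighted by
    lam per step, so the estimate tracks a slowly drifting mu_t(i).
    lam = 1 recovers the ordinary sample mean.
    """

    def __init__(self, lam=0.99):
        self.lam = lam
        self.weighted_sum = 0.0
        self.weight = 0.0

    def update(self, reward):
        self.weighted_sum = self.lam * self.weighted_sum + reward
        self.weight = self.lam * self.weight + 1.0

    @property
    def mean(self):
        return self.weighted_sum / self.weight if self.weight else 0.5
```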

Infinite-armed Bernoulli bandits: under power-law priors on arm means, the Confidence-Bound-Target (CBT) algorithm achieves asymptotic optimality by pruning arms based on empirical upper confidence bounds and a time-dependent threshold $\zeta_n$ derived from the prior. Its regret matches the information-theoretic lower bound of order $n^{\beta/(\beta+1)}$ (Chan et al., 2018).

Kernelized Bernoulli bandits: for correlated arms with mean function $f \in \mathcal{H}_k$ (an RKHS), the standard GP-UCB framework under subgaussian assumptions yields regret $\widetilde{O}(B\sqrt{\gamma_T} + \sqrt{T\gamma_T})$, where $\gamma_T$ is the maximal information gain up to time $T$. However, these methods do not exploit the Bernoulli-specific KL divergence that yields sharper, instance-dependent rates. An open problem is to construct KL-based confidence bounds and UCB algorithms that recover $\mathcal{O}(\ln T)$ instance-dependent regret in kernelized settings (Mussi et al., 2024).
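For reference, a minimal GP-UCB loop over a one-dimensional continuum of arms is sketched below. The RBF kernel, hand-picked hyperparameters, test mean function, and the simple $\beta_t = 2\ln t$ exploration weight are all illustrative assumptions; note the bonus is the generic subgaussian-style $\sqrt{\beta_t}\,\sigma_t(x)$, not a Bernoulli-KL bound:

```python
import numpy as np

def rbf(X, Y, ls=0.2):
    """RBF kernel matrix k(x, y) = exp(-(x - y)^2 / (2 ls^2)) for 1-D inputs."""
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2 * ls ** 2))

def gp_posterior(x_obs, y_obs, x_grid, noise=0.25):
    """GP posterior mean and std on a grid of candidate arms."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_grid, x_obs)
    mu = Ks @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, np.sqrt(np.clip(var, 1e-12, None))

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 100)                 # continuum of arms in [0, 1]
f = lambda x: 0.3 + 0.4 * np.sin(3 * x)       # unknown mean function, values in [0.3, 0.7]
x_obs = np.array([0.5])
y_obs = np.array([float(rng.random() < f(0.5))])
for t in range(2, 50):
    mu, sd = gp_posterior(x_obs, y_obs, grid)
    beta = 2.0 * np.log(t)                    # simple exploration weight
    x_next = grid[np.argmax(mu + np.sqrt(beta) * sd)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, float(rng.random() < f(x_next)))  # Bernoulli reward
```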

5. Extensions, Open Problems, and Computational Aspects

Significant Open Problems

  • Kernelized Bernoulli bandits: no instance-dependent, KL-based high-probability confidence bounds analogous to KL-UCB are known for $f \in \mathcal{H}_k$. Attaining $\mathcal{O}(\ln T)$ regret in structured (kernelized) Bernoulli bandits remains open; current approaches default to subgaussian analyses, which are minimax-optimal but not instance-optimal (Mussi et al., 2024).
  • Bernoulli sampling with Gaussian imagination: using Gaussian belief updates incurs at most $O(\sqrt{T})$ excess regret for sufficiently diffuse priors and likelihoods, formalizing the intuition that Gaussian Bayesian agents perform robustly under Bernoulli noise (Liu et al., 2022); a minimal illustration follows this list.
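The second bullet can be illustrated directly: the sketch below (our construction, with an arbitrary assumed observation precision) runs Thompson Sampling whose beliefs are updated by Gaussian conjugacy even though the rewards are Bernoulli:

```python
import numpy as np

def gaussian_ts_step(mu, prec, theta, rng, obs_prec=1.0):
    """One Thompson round with Gaussian beliefs N(mu_i, 1/prec_i), updated by
    conjugate Gaussian rules even though the actual rewards are Bernoulli."""
    i = int(np.argmax(rng.normal(mu, 1.0 / np.sqrt(prec))))
    r = float(rng.random() < theta[i])                    # Bernoulli reward
    mu[i] = (prec[i] * mu[i] + obs_prec * r) / (prec[i] + obs_prec)
    prec[i] += obs_prec                                   # assumed reward precision
    return i, r

rng = np.random.default_rng(3)
theta = np.array([0.4, 0.6])                              # true Bernoulli means
mu = np.full(2, 0.5)
prec = np.full(2, 1e-2)                                   # diffuse Gaussian prior
for t in range(5000):
    gaussian_ts_step(mu, prec, theta, rng)
print(mu)                                                 # close to theta
```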

Computational Considerations

  • Full dynamic programming for finite-horizon, finite-arm Bernoulli bandits is tractable only up to moderate horizons ($T \lesssim 1440$ for offline computation) due to cubic growth of the posterior state space (Jacko, 2019).
  • Empirical CBT and other algorithms that leverage observable statistics rather than parametric priors are effective in practice when prior information is unavailable or misspecified (Chan et al., 2018).

6. Myths and Misconceptions in the Bernoulli Bandit Literature

Empirical and theoretical studies correct several prevalent misconceptions:

  • Bayes-optimal DP is tractable for far larger horizons than often assumed, especially with modern computational tools.
  • The Gittins index rule does not suffer from incomplete learning in undiscounted, finite-horizon Bernoulli bandits.
  • UCB1 with exploration constant $\alpha = 2$ is not near-optimal for finite-horizon Bernoulli bandits, typically incurring 3–5 times the regret of Bayes-optimal or suitably tuned variants.
  • Regret need not grow monotonically with $T$; in some configurations, it can plateau or even decrease with $T$ in moderately sized problems (Jacko, 2019).

7. Summary of Algorithmic and Theoretical Landscape

| Setting | Regret Lower Bound | Asymptotically Optimal Algorithms |
| --- | --- | --- |
| Finite $K$, fixed means | $\sum_{i:\,\Delta_i>0} \frac{\Delta_i}{d(\theta_i,\theta^*)} \ln T$ | KL-UCB, Thompson Sampling, RB-SDA |
| Infinite arms, prior $g(\mu)$ | $C\,n^{\beta/(\beta+1)}$ | CBT (with known $g$ or empirical threshold) |
| Symmetric 2-arm, means summing to 1 | $(1/\sqrt{\pi})\sqrt{T}$ (small-gap regime) | PDE-based myopic policy |
| Kernelized arms ($f \in \mathcal{H}_k$) | $\widetilde{O}(B\sqrt{\gamma_T}+\sqrt{T\gamma_T})$ (upper bound; no matching KL lower bound) | GP-UCB (subgaussian bound); KL-UCB analogues open |
| Nonstationary Bernoulli bandit | No tight lower bound available | Adaptive-UCB, Adaptive-TS, AFF estimators |

The Bernoulli bandit problem thus provides a rigorous yet flexible substrate for the advanced study of sequential decision-making, with a rich landscape of optimality regimes, algorithmic innovations, and open theoretical challenges spanning nonparametric settings, structural exploitation, and adaptivity to drift and model mismatch. Recent research directions include kernelized Bernoulli bandits, parametric structure exploitation, and robust algorithms for model-misspecified or dynamic environments (Mussi et al., 2024, Jiang et al., 2011, Lu et al., 2017, Liu et al., 2022).
