Why Most Optimism Bandit Algorithms Have the Same Regret Analysis: A Simple Unifying Theorem
(2512.18409v1)
Published 20 Dec 2025 in cs.LG, eess.SY, and stat.ML
Abstract: Several optimism-based stochastic bandit algorithms -- including UCB, UCB-V, linear UCB, and finite-arm GP-UCB -- achieve logarithmic regret using proofs that, despite superficial differences, follow essentially the same structure. This note isolates the minimal ingredients behind these analyses: a single high-probability concentration condition on the estimators, after which logarithmic regret follows from two short deterministic lemmas describing radius collapse and optimism-forced deviations. The framework yields unified, near-minimal proofs for these classical algorithms and extends naturally to many contemporary bandit variants.
The paper introduces a unifying theorem by establishing a minimal high-probability concentration condition on arm reward estimators.
It shows that a deterministically triggered 'radius collapse' after enough samples guarantees logarithmic regret in various bandit settings.
The analysis applies uniformly to algorithms like UCB, Linear UCB, and GP-UCB, highlighting the central role of concentration inequalities.
Unifying Regret Analysis of Optimism-Based Bandit Algorithms
Introduction
This paper offers a rigorous unification of regret analyses for a broad suite of optimism-based stochastic bandit algorithms, including UCB, UCB-V, Linear UCB, and finite-arm GP-UCB. The main contribution is the identification of a minimal and sufficient condition—a high-probability concentration inequality—for the relevant confidence estimators. Once this condition is established, a deterministic argument based on “radius collapse” and optimism-forced deviations universally yields logarithmic regret bounds. The framework is positioned as near-minimal, encompassing much of the classical and modern literature on stochastic multi-armed bandit (MAB) algorithms employing optimism in the face of uncertainty while clearly stating its boundaries relative to adversarial, contextual, and information-gain-based bandit scenarios.
Unified Framework: Concentration as the Linchpin
Central to the analysis is the explicit formulation of a high-probability concentration condition on the reward estimator for each arm, parameterized by a data-dependent confidence radius $r_i(m)$. If, for each arm $i$, the deviation $|\hat{\mu}_i(m) - \mu_i|$ is bounded by $r_i(m)$ (with high probability, uniformly over all $m \le T$), the confidence radius systematically decays as the number of samples $m$ increases. The canonical form of $r_i(m)$, reflecting Hoeffding/Bernstein/Freedman-style inequalities, is
$$r_i(m) = \sqrt{\frac{2\sigma_i^2 \log(1/\delta)}{m}} + \frac{c_1 \log(1/\delta)}{m},$$
where $\sigma_i^2$ is a variance proxy and $c_1$ is a lower-order correction relevant for sharper bounds. This formulation unifies the empirical mean, variance-based UCB, linear bandit, and Gaussian process estimation settings under a common umbrella, provided the action set is finite and repeated arm selection yields confidence shrinkage.
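As a concrete reference point, the following is a minimal Python sketch of this Bernstein-style radius. The function name `bernstein_radius` and the default correction constant `c1 = 1.0` are illustrative choices, not the paper's notation.

```python
import math

def bernstein_radius(m: int, var_proxy: float, delta: float, c1: float = 1.0) -> float:
    """Bernstein/Freedman-style confidence radius after m samples of one arm.

    r_i(m) = sqrt(2 * sigma_i^2 * log(1/delta) / m) + c1 * log(1/delta) / m.
    The square-root term dominates for large m; the c1 term is the lower-order
    correction mentioned in the text.
    """
    if m == 0:
        return math.inf  # no samples yet: the radius is vacuous
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * var_proxy * log_term / m) + c1 * log_term / m
```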
Radius Collapse and Optimism-Forced Deviation: The Deterministic Engine
The analysis proceeds via two deterministic principles:
Radius Collapse: After $m_0 = O\!\left(\frac{\log(1/\delta)}{\Delta_i^2} + \frac{\log(1/\delta)}{\Delta_i}\right)$ samples from a suboptimal arm $i$ (gap $\Delta_i$), the confidence radius $r_i(m)$ drops below a quarter of the gap, shrinking uncertainty sufficiently to “rule out” further exploration of that arm absent confidence violations.
Optimism-Forced Deviation: If a suboptimal arm is pulled after this threshold, then at least one confidence interval (for the suboptimal or the optimal arm) must have been violated. On the high-probability event that all confidence intervals hold, such deviations do not occur, so each suboptimal arm is chosen only $O\!\left(\frac{\log T}{\Delta_i^2}\right)$ times.
This argument, distilled in the paper's master theorem, forgoes the need for complex inductive or probabilistic recursion and operates once the stochastic input (the concentration inequality) is supplied.
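The deterministic engine can be illustrated by computing the collapse threshold numerically. This is a sketch under the radius above; `radius_collapse_threshold`, the direct search, and the default variance proxy are illustrative rather than the paper's construction.

```python
import math

def bernstein_radius(m, var_proxy, delta, c1=1.0):
    """Same Bernstein-style radius as in the previous sketch."""
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * var_proxy * log_term / m) + c1 * log_term / m

def radius_collapse_threshold(gap, delta, var_proxy=0.25, c1=1.0):
    """Smallest m0 with r(m0) < gap / 4, found by direct search.

    Radius collapse: after m0 pulls of a suboptimal arm, its interval is too
    narrow for optimism to select it again unless some confidence interval
    has been violated (the optimism-forced deviation).
    """
    m = 1
    while bernstein_radius(m, var_proxy, delta, c1) >= gap / 4.0:
        m += 1
    return m

# Example: gap 0.2 with delta = 1/T^2 at T = 10_000, so log(1/delta) ~ 18.4.
print(radius_collapse_threshold(gap=0.2, delta=1e-8))  # grows like log(1/delta)/gap^2
```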
Logarithmic Regret and Algorithmic Examples
The deterministic framework yields, for classic and structurally enriched bandit problems:
$$\mathbb{E}[R_T] = O\!\left(\sum_{i:\Delta_i > 0} \frac{\log T}{\Delta_i}\right),$$
where regret scales logarithmically with the horizon $T$ and inverse-linearly with the reward gaps, hiding only universal and mild gap-dependent constants. This applies directly to:
UCB (Hoeffding): Standard bounded-reward MAB.
UCB-V and Empirical Bernstein UCB: Incorporating arm-dependent variance, with Freedman/Bernstein bounds.
Linear UCB: With finite arms, based on martingale self-normalized bounds for regression estimators.
Finite-Arm GP-UCB: Using RKHS posterior contraction for kernelized models.
Heteroskedastic, Heavy-Tailed, and ML-Assisted Bandits: Provided appropriate robust estimators and self-normalized concentration bounds are available, including median-of-means and surrogate-assisted estimators.
Misspecified Linear Bandits: With gap-proportional misspecification, leveraging the contraction of estimated mean with sufficient pulls.
Diversity-Assumed Contextual Bandits: A parametric optimism-cascade argument achieves polylogarithmic regret.
In all these variants, once the confidence interval is justified (even with robust/empirical or surrogate-assisted estimators rather than plain empirical means), the remainder of the argument is deterministic and nearly identical.
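To see the gap-dependent logarithmic scaling concretely, here is a minimal simulation sketch of a generic optimism loop with an empirical-mean estimator and a Hoeffding-style radius. The Gaussian reward model, the `ucb_run` signature, and the $1/t^2$ confidence schedule are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def ucb_run(means, horizon, seed=0):
    """Generic optimism loop: pull the arm maximizing (estimate + radius).

    Any estimator/radius pair satisfying the concentration condition could be
    plugged in here; this sketch uses the empirical mean with a Hoeffding
    radius at confidence level delta_t = 1 / t^2.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:                                   # pull each arm once to start
            arm = t - 1
        else:
            radii = np.sqrt(2.0 * np.log(t ** 2) / counts)
            arm = int(np.argmax(sums / counts + radii))
        reward = rng.normal(means[arm], 0.5)         # sub-Gaussian rewards (assumed)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return counts, regret

counts, regret = ucb_run(means=[0.9, 0.7, 0.5], horizon=20_000)
print(counts, round(regret, 1))  # suboptimal-arm counts grow like log(T) / gap^2
```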
Extensions: Randomized Indices and Limitations
The approach extends robustly to bandits with randomized index perturbations, such as FTPL and certain IDS implementations, as long as the perturbation magnitude admits a uniform bound $\rho_T$ that vanishes as $T$ grows ($\rho_T \to 0$). In this setting, regret guarantees persist by simply inflating the relevant confidence radius by $\rho_T$. However, standard Thompson sampling falls outside the strict applicability of this result, as posterior samples do not generally admit uniform perturbation controls.
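A hedged sketch of this radius-inflation step follows; `perturbed_index` and the uniform noise model are illustrative, and the only point being demonstrated is that a perturbation bounded by `rho_t` folds into the confidence radius.

```python
import random

def perturbed_index(mean_hat, radius, rho_t, rng=random):
    """Optimism index with a bounded random perturbation (FTPL-style).

    If |perturbation| <= rho_t uniformly, the same analysis applies with the
    confidence radius inflated from r_i(m) to r_i(m) + rho_t, so the regret
    guarantee persists whenever rho_t -> 0.
    """
    perturbation = rng.uniform(-rho_t, rho_t)
    index = mean_hat + radius + perturbation
    inflated_radius = radius + rho_t
    return index, inflated_radius
```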
Essentially, the framework is tailored to stochastic finite-armed settings with optimism-based index policies. It does not cover adversarial regimes, arbitrary contextual gaps, or continuous-action GP bandits that rely on information-gain analyses.
Implications and Future Directions
This work rigorously demystifies the widespread observation that a broad range of optimism-based bandit algorithms yields similar regret rates and can be analyzed via “plug-and-play” deterministic arguments, provided the estimator's high-probability deviation can be controlled. The result compels a reevaluation of algorithmic innovation: for finite-action problems, most variations in index construction or estimator type collapse to variations in constant factors or secondary terms. Consequently, genuine improvements in stochastic MAB require either advances in model structure (e.g., non-i.i.d. or non-stationary settings), tighter concentration for specific estimator classes, or extensions to continuous-action, contextual, or adversarial regimes.
Theoretically, the unification clarifies that the exploration-exploitation dilemma, for a wide class of problems, is governed almost entirely by concentration properties and not by higher-level algorithmic mechanics. Practically, this suggests that research effort is best spent on obtaining sharper, more tailored concentration inequalities for sophisticated or robust estimators, or on variants circumventing the limitations of the finite-action optimism paradigm.
Future work may extend this approach to bandit settings with infinite or structured action sets, with instance-dependent information-theoretic regret lower bounds, or may seek to further delineate the precise boundary at which optimism-based deterministic collapse ceases to yield tight minimax-rate guarantees. Additionally, there is significant scope to analyze how non-stationarity or adaptive process noise can be handled under similar unifying frameworks.
Conclusion
This paper rigorously demonstrates that a minimal high-probability confidence interval on arm-reward estimators suffices to reproduce the canonical logarithmic regret guarantees for nearly all optimism-based bandit algorithms with finite action sets. The key technical argument blends confidence radius collapse with optimism-forced deviation, enabling a deterministic and unified analysis. The framework’s extensibility to numerous practical variants (including robust, heteroskedastic, ML-augmented, and gap-misspecified models) clarifies the structural simplicity underlying much of classical and modern bandit theory. As such, it provides both a succinct blueprint for new algorithm analysis and a precise delineation of the essential stochastic-to-deterministic reduction at the heart of exploitation-vs-exploration tradeoffs in the stochastic MAB paradigm.