Tsallis-INF Multi-Armed Bandit
- Tsallis-INF is an FTRL algorithm that leverages Tsallis entropy with exponent 1/2 to optimally adapt to both adversarial and stochastic environments.
- It employs reduced-variance loss estimators and a learning rate of 4/√t to balance implicit exploration with precise loss minimization.
- The algorithm guarantees O(√(KT)) regret in adversarial settings and O(∑(log T)/Δ_i) regret in stochastic regimes, offering best-of-both-worlds performance.
The Tsallis-INF multi-armed bandit algorithm is a Follow-The-Regularized-Leader (FTRL) approach employing Tsallis entropy with exponent $1/2$ as a regularizer. It is designed to attain optimal (within constants) regret bounds both in adversarial and stochastic bandit environments without prior knowledge of the regime or time horizon. Unique among bandit algorithms, Tsallis-INF guarantees $O(\sqrt{KT})$ regret in the adversarial setting and $O\big(\sum_{i\neq i^*} (\log T)/\Delta_i\big)$ regret in the stochastic setting, including various forms of stochastic-adversarial interpolation with adversarial corruptions, thus displaying "best-of-both-worlds" performance (Zimmert et al., 2018, Lee et al., 14 Nov 2025, Masoudian et al., 2021).
1. Algorithmic Principles and Formulation
Tsallis-INF operates in the standard $K$-armed bandit setting where, at each round $t$, a distribution $w_t$ over arms is selected, an arm $I_t \sim w_t$ is drawn, and the corresponding loss $\ell_{t,I_t} \in [0,1]$ is observed.
The regularizer is the negative Tsallis entropy of order $1/2$,
$$\Psi(w) = -2\Big(\sum_{i=1}^{K}\sqrt{w_i} - 1\Big).$$
At each round $t$, with a learning rate $\eta_t$, the FTRL subproblem is solved:
$$w_t = \arg\min_{w \in \Delta_{K-1}} \Big\{ \langle w, \hat{L}_{t-1}\rangle + \tfrac{1}{\eta_t}\Psi(w) \Big\},$$
where $\hat{L}_{t-1} = \sum_{s=1}^{t-1}\hat{\ell}_s$ is the cumulative vector of reduced-variance loss estimates.
Reduced-variance loss estimation (cf. Audibert-Bubeck, Zimmert-Seldin) is used to control estimator variance:
$$\hat{\ell}_{t,i} = \frac{(\ell_{t,I_t} - B_t)\,\mathbb{1}\{I_t = i\}}{w_{t,i}} + B_t,$$
where $B_t \in [0,1]$ is an observable baseline (control variate) that recenters the observed loss before importance weighting.
The learning rate schedule is set as $\eta_t = 4/\sqrt{t}$, which is critical for simultaneously achieving both adversarial and stochastic regret.
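To make the update concrete, the following minimal Python sketch solves the FTRL subproblem above by bisection on the simplex normalizer and runs the resulting sampling/estimation loop. The function names (`tsallis_inf_weights`, `reduced_variance_estimate`, `tsallis_inf`), the constant baseline $B_t = 1/2$, and the bisection settings are illustrative choices rather than the cited papers' exact implementation; the pairing of the regularizer normalization above with the $\eta_t = 4/\sqrt{t}$ schedule follows the text.

```python
import numpy as np


def tsallis_inf_weights(L_hat, eta, iters=100):
    """Solve min_w <w, L_hat> - (2/eta) * sum_i sqrt(w_i) over the simplex.

    The KKT conditions give w_i = (eta * (L_hat_i + nu))**(-2), where the
    multiplier nu is chosen so the weights sum to one; the sum is
    monotonically decreasing in nu, so bisection finds it.
    """
    L_hat = np.asarray(L_hat, dtype=float)
    K = len(L_hat)

    def weights(nu):
        return 1.0 / (eta * (L_hat + nu)) ** 2

    lo = -L_hat.min() + 1.0 / eta            # best arm's weight equals 1 here, so sum >= 1
    hi = -L_hat.min() + np.sqrt(K) / eta     # every weight is <= 1/K here, so sum <= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weights(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    w = weights(0.5 * (lo + hi))
    return w / w.sum()                       # renormalize away residual bisection error


def reduced_variance_estimate(K, arm, loss, w_arm, baseline=0.5):
    """Reduced-variance loss estimate: every arm receives the baseline B_t, and the
    played arm additionally gets the importance-weighted, baseline-centred loss.
    (A constant baseline of 1/2 is an illustrative simplification.)"""
    ell_hat = np.full(K, baseline)
    ell_hat[arm] += (loss - baseline) / w_arm
    return ell_hat


def tsallis_inf(env, K, T, seed=0):
    """Run T rounds against env(t, arm) -> loss in [0, 1]; returns total loss."""
    rng = np.random.default_rng(seed)
    L_hat = np.zeros(K)                      # cumulative loss-estimate vector
    total_loss = 0.0
    for t in range(1, T + 1):
        eta_t = 4.0 / np.sqrt(t)             # learning-rate schedule from the text
        w = tsallis_inf_weights(L_hat, eta_t)
        arm = rng.choice(K, p=w)
        loss = float(env(t, arm))
        total_loss += loss
        L_hat += reduced_variance_estimate(K, arm, loss, w[arm])
    return total_loss
```

For a quick sanity check, one can pass a Bernoulli environment such as `env = lambda t, arm: float(np.random.rand() < mu[arm])` for some hypothetical mean vector `mu`.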
2. Regret Guarantees across Bandit Regimes
Tsallis-INF yields tight regret bounds in a variety of regimes, most notably:
- Adversarial regime: For $T$ time steps and $K$ arms, the expected regret satisfies
$$\mathrm{Reg}_T = O\big(\sqrt{KT}\big).$$
This matches the minimax lower bound $\Omega(\sqrt{KT})$ up to constants.
- Stochastic regime: When losses are i.i.d. with a unique optimal arm $i^*$ and gaps $\Delta_i > 0$ ($i \neq i^*$), the regret is
$$\mathrm{Reg}_T = O\Big(\sum_{i \neq i^*}\frac{\log T}{\Delta_i}\Big).$$
Subsumed herein is the classical $(\log T)/\Delta$ dependence.
- Corrupted stochastic regime (adversarial corruptions of total magnitude $C$):
$$\mathrm{Reg}_T = O\Big(\sum_{i \neq i^*}\frac{\log T}{\Delta_i} + \sqrt{C \sum_{i \neq i^*}\frac{\log T}{\Delta_i}}\Big).$$
This guarantees smooth interpolation between logarithmic and sublinear regret rates depending on the corruption level (Masoudian et al., 2021).
The self-bounding constraint formalism subsumes all these regimes, enabling this unification. When $C = 0$ and the gaps $\Delta_i$ are fixed, the bound specializes to the stochastic setting; when no self-bounding constraint is imposed, adversarial scaling is recovered.
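In symbols, the constraint and the way it enters the analysis have roughly the following shape; this is a sketch following the standard self-bounding presentation, and $\mathrm{UB}_T$ is a placeholder name for the algorithm's variance-type upper bound.

```latex
% Adversarial regime with a self-bounding constraint: for gaps \Delta_i and a
% corruption budget C >= 0, the environment satisfies
\mathrm{Reg}_T \;\ge\; \sum_{t=1}^{T}\sum_{i \neq i^*} \Delta_i\, \mathbb{E}[w_{t,i}] \;-\; C .
% The regret is then bounded by combining this with the algorithm's
% variance-type upper bound \mathrm{UB}_T: for any \lambda \in (0, 1],
\mathrm{Reg}_T \;=\; (1+\lambda)\,\mathrm{Reg}_T - \lambda\,\mathrm{Reg}_T
\;\le\; (1+\lambda)\,\mathrm{UB}_T
      \;-\; \lambda\Big(\sum_{t,\,i\neq i^*}\Delta_i\,\mathbb{E}[w_{t,i}] - C\Big),
% after which the right-hand side is optimized term by term over E[w_{t,i}]
% and over \lambda, yielding the stochastic and corrupted bounds above.
```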
3. Implementation and Theoretical Structure
Tsallis-INF is an FTRL algorithm with a non-Euclidean regularizer:
$$w_t = \arg\min_{w \in \Delta_{K-1}} \Big\{ \langle w, \hat{L}_{t-1}\rangle + \tfrac{1}{\eta_t}\Psi(w) \Big\}.$$
This update is implemented by evaluating the convex conjugate of $\Psi$ (in practice, a one-dimensional search for the normalizing multiplier), though in recent analyses (Lee et al., 14 Nov 2025) the update rule and regret can be analyzed and implemented directly via spectral properties of the Tsallis entropy, avoiding explicit Fenchel conjugation.
Reduced-variance loss estimators, using a baseline (control-variate) correction, ensure the estimator's second moment remains controlled even for small-probability arms (see the numerical check below):
- When $w_{t,I_t}$ is small, the importance weight $1/w_{t,I_t}$ is large; centering the observed loss at the baseline $B_t$ before reweighting shrinks the magnitude of the importance-weighted term, improving stability.
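The variance-reduction effect can be checked numerically. The following self-contained snippet (with illustrative numbers, not taken from the cited papers) compares the plain importance-weighted estimator with the baseline-corrected one for a small-probability arm.

```python
import numpy as np

rng = np.random.default_rng(1)
w, ell, B = 0.05, 0.6, 0.5            # small arm probability, true loss, baseline
pulls = rng.random(200_000) < w       # indicator that the arm was played each round

iw = ell * pulls / w                  # plain importance-weighted estimate
rv = (ell - B) * pulls / w + B        # reduced-variance (baseline-corrected) estimate

print(iw.mean(), rv.mean())                 # both close to 0.6: unbiased
print((iw ** 2).mean(), (rv ** 2).mean())   # second moment is far smaller for rv
```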
Choice of the exponent $\alpha = 1/2$ is essential: only this exponent achieves both adversarial and stochastic optimality in the bandit setting, due to the interplay between implicit exploration and loss minimization enforced by the square-root regularizer.
The learning rate $\eta_t = 4/\sqrt{t}$ is determined by calibration between the local-norm penalty and telescoping potential differences inherent in the FTRL local-norm regret lemma, as sketched below.
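As a rough illustration of this calibration (constants not tracked; the decomposition follows the generic FTRL penalty/stability split rather than any single paper's bookkeeping):

```latex
% Penalty: range of the regularizer over the simplex, scaled by 1/\eta_T
\frac{\max_w \Psi(w) - \min_w \Psi(w)}{\eta_T}
  \;=\; \frac{2(\sqrt{K}-1)}{\eta_T}
  \;=\; \Theta\!\big(\sqrt{KT}\big)
  \qquad \text{for } \eta_T = 4/\sqrt{T};
% Stability: accumulated local-norm variance
\sum_{t=1}^{T} \frac{\eta_t}{2}\,
  \mathbb{E}\big[\|\hat{\ell}_t\|^2_{(\nabla^2\Psi(w_t))^{-1}}\big]
  \;\lesssim\; \sum_{t=1}^{T} \eta_t \sqrt{K}
  \;=\; \Theta\!\big(\sqrt{KT}\big).
% A faster-decaying schedule inflates the penalty term and a slower one
% inflates the stability term, which forces \eta_t \propto 1/\sqrt{t}.
```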
4. Analytical Techniques and Proof Innovations
Recent proofs (Lee et al., 14 Nov 2025, Masoudian et al., 2021) eschew Fenchel conjugate arguments in favor of local-norm analyses rooted in modern online convex optimization theory. The key lemma (FTRL local-norm regret bound) yields, for the time-varying regularizer $\Psi/\eta_t$,
$$\mathrm{Reg}_T \;\lesssim\; \frac{\max_{w}\Psi(w) - \min_{w}\Psi(w)}{\eta_T} \;+\; \sum_{t=1}^{T}\frac{\eta_t}{2}\,\mathbb{E}\Big[\|\hat{\ell}_t\|^2_{(\nabla^2\Psi(z_t))^{-1}}\Big],$$
with $z_t$ on the segment between $w_t$ and $w_{t+1}$. For the Tsallis entropy in question, $\nabla^2\Psi(w) = \tfrac{1}{2}\,\mathrm{diag}\big(w_i^{-3/2}\big)$, and the spectral inverse is correspondingly simple. Analysis leverages the fact that the second moment of the reduced-variance losses can be upper bounded in terms of $\sum_i \sqrt{w_{t,i}} \le \sqrt{K}$, allowing tight control over variance terms and summation by integral comparisons bounded by $\sum_{t=1}^{T} t^{-1/2} \le 2\sqrt{T}$ or $\sum_{t=1}^{T} t^{-1} \le 1 + \log T$ as appropriate.
In the self-bounding setting, the analysis further leverages constrained optimization over per-arm probabilities, producing refined logarithmic terms via tight single-step optimizations with quadratic-linear objectives (solved in closed form).
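A representative closed-form step of this quadratic-linear type (an illustrative template, not the exact objective from the cited proofs) is:

```latex
% Maximizing a square-root gain against a linear gap penalty in closed form:
\max_{x \in [0,1]} \Big( \sqrt{\tfrac{x}{t}} \;-\; \lambda\,\Delta_i\, x \Big)
  \;\le\; \frac{1}{4\,\lambda\,\Delta_i\, t},
% and summing over rounds yields the logarithmic dependence:
\sum_{t=1}^{T} \frac{1}{4\,\lambda\,\Delta_i\, t}
  \;\le\; \frac{1 + \log T}{4\,\lambda\,\Delta_i}.
```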
5. Comparative Landscape and Extensions
Tsallis-INF achieves a unique position among bandit algorithms:
- EXP3 achieves minimax adversarial regret but not stochastic optimality.
- UCB achieves stochastic optimality but suffers linear regret in adversarial settings.
- Thompson Sampling exhibits near-linear regret in adversarial environments (Zimmert et al., 2018).
Tsallis-INF, by contrast, universally adapts to the underlying regime. Its self-bounding analysis generalizes to stochastic MDPs (episodic settings) and other online learning frameworks where regret bounds of the self-bounding form apply.
A summary comparison is given in the following table:
| Algorithm | Adversarial Regret | Stochastic Regret |
|---|---|---|
| UCB1 | linear in the worst case | $O\big(\sum_{i\neq i^*} (\log T)/\Delta_i\big)$ |
| EXP3 | $O\big(\sqrt{KT\log K}\big)$ | $O\big(\sqrt{KT\log K}\big)$ (not logarithmic) |
| Thompson Sampling | near-linear in the worst case | $O\big(\sum_{i\neq i^*} (\log T)/\Delta_i\big)$ |
| Tsallis-INF | $O\big(\sqrt{KT}\big)$ | $O\big(\sum_{i\neq i^*} (\log T)/\Delta_i\big)$ |
6. Parameter Choices, Constants, and Practical Implications
The critical parameters are:
- Exponent $\alpha = 1/2$: unique for best-of-both-worlds optimality.
- Learning rate $\eta_t = 4/\sqrt{t}$: required for balancing exploration and exploitation.
- Constants: Non-optimized in several analyses (e.g., 32, 256 in bounds), with the focus on tightness of scaling laws rather than sharp constants.
No explicit prior knowledge of the horizon $T$, the regime, or the loss distribution is needed. The analysis and implementation are robust to these unknowns.
Despite non-optimized constants, empirical evaluation shows Tsallis-INF significantly outperforms UCB1 and EXP3 in stochastic settings, and does not suffer catastrophic linear regret in adversarial settings, unlike Thompson Sampling (Zimmert et al., 2018).
7. Extensions and Generalizations
The Tsallis-INF approach extends through the self-bounding formalism to a range of bandit and online learning problems, including:
- Stochastically constrained adversarial MABs,
- Stochastic MABs with adversarial data corruptions,
- Utility-based dueling bandits,
- Best-of-both-worlds reinforcement learning in MDPs (Masoudian et al., 2021).
The optimization-based regret bounds allow direct application to any algorithm with matching potential-variance trade-offs, suggesting broader applicability in stochastic control and reinforcement learning settings.
A plausible implication is that further generalizations to structured action spaces or contextual bandits may be possible by modifying the regularizer and estimator structure accordingly, while retaining optimal interpolation between adversarial and stochastic regimes.