Tsallis-INF Multi-Armed Bandit
- Tsallis-INF is an FTRL algorithm that leverages Tsallis entropy with exponent 1/2 to optimally adapt to both adversarial and stochastic environments.
- It employs reduced-variance loss estimators and a learning rate of 4/√t to balance implicit exploration with precise loss minimization.
- The algorithm guarantees O(√(KT)) regret in adversarial settings and O(∑(log T)/Δ_i) regret in stochastic regimes, offering best-of-both-worlds performance.
The Tsallis-INF multi-armed bandit algorithm is a Follow-The-Regularized-Leader (FTRL) approach employing Tsallis entropy with exponent $1/2$ as a regularizer. It is designed to attain optimal (within constants) regret bounds both in adversarial and stochastic bandit environments without prior knowledge of the regime or time horizon. Unique among bandit algorithms, Tsallis-INF guarantees $O(\sqrt{KT})$ regret in the adversarial setting and $O\big(\sum_{i\neq i^*} (\log T)/\Delta_i\big)$ regret in the stochastic setting, including various forms of stochastic-adversarial interpolation with adversarial corruptions, thus displaying "best-of-both-worlds" performance (Zimmert et al., 2018, Lee et al., 14 Nov 2025, Masoudian et al., 2021).
1. Algorithmic Principles and Formulation
Tsallis-INF operates in the standard $K$-armed bandit setting where, at each round $t$, a distribution $w_t$ over arms is selected, an arm $I_t \sim w_t$ is drawn, and the corresponding loss $\ell_{t,I_t} \in [0,1]$ is observed.
The regularizer is the negative Tsallis entropy of order $1/2$,
$$\Psi(w) = -2\Big(\sum_{i=1}^{K}\sqrt{w_i} - 1\Big).$$
At each round $t$, with a learning rate $\eta_t$, the FTRL subproblem is solved:
$$w_t = \arg\min_{w \in \Delta_{K-1}} \Big\{ \langle w, \hat{L}_{t-1}\rangle + \tfrac{1}{\eta_t}\Psi(w) \Big\},$$
where $\hat{L}_{t-1} = \sum_{s=1}^{t-1}\hat{\ell}_s$ is the cumulative vector of reduced-variance loss estimates.
Reduced-variance loss estimation (cf. Audibert-Bubeck, Zimmert-Seldin) is used to control estimator variance:
$$\hat{\ell}_{t,i} = \frac{(\ell_{t,I_t} - B_t)\,\mathbb{1}\{I_t = i\}}{w_{t,i}} + B_t,$$
where $B_t \in [0,1]$ is an observable baseline (control variate) that recenters the observed loss before importance weighting.
The learning rate schedule is set as $\eta_t = 4/\sqrt{t}$, which is critical for simultaneously achieving both adversarial and stochastic regret.
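To make the update concrete, the following minimal Python sketch solves the FTRL subproblem above by bisection on the simplex normalizer and runs the resulting sampling/estimation loop. The function names (`tsallis_inf_weights`, `reduced_variance_estimate`, `tsallis_inf`), the constant baseline $B_t = 1/2$, and the bisection settings are illustrative choices rather than the cited papers' exact implementation; the pairing of the regularizer normalization above with the $\eta_t = 4/\sqrt{t}$ schedule follows the text.

```python
import numpy as np


def tsallis_inf_weights(L_hat, eta, iters=100):
    """Solve min_w <w, L_hat> - (2/eta) * sum_i sqrt(w_i) over the simplex.

    The KKT conditions give w_i = (eta * (L_hat_i + nu))**(-2), where the
    multiplier nu is chosen so the weights sum to one; the sum is
    monotonically decreasing in nu, so bisection finds it.
    """
    L_hat = np.asarray(L_hat, dtype=float)
    K = len(L_hat)

    def weights(nu):
        return 1.0 / (eta * (L_hat + nu)) ** 2

    lo = -L_hat.min() + 1.0 / eta            # best arm's weight equals 1 here, so sum >= 1
    hi = -L_hat.min() + np.sqrt(K) / eta     # every weight is <= 1/K here, so sum <= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weights(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    w = weights(0.5 * (lo + hi))
    return w / w.sum()                       # renormalize away residual bisection error


def reduced_variance_estimate(K, arm, loss, w_arm, baseline=0.5):
    """Reduced-variance loss estimate: every arm receives the baseline B_t, and the
    played arm additionally gets the importance-weighted, baseline-centred loss.
    (A constant baseline of 1/2 is an illustrative simplification.)"""
    ell_hat = np.full(K, baseline)
    ell_hat[arm] += (loss - baseline) / w_arm
    return ell_hat


def tsallis_inf(env, K, T, seed=0):
    """Run T rounds against env(t, arm) -> loss in [0, 1]; returns total loss."""
    rng = np.random.default_rng(seed)
    L_hat = np.zeros(K)                      # cumulative loss-estimate vector
    total_loss = 0.0
    for t in range(1, T + 1):
        eta_t = 4.0 / np.sqrt(t)             # learning-rate schedule from the text
        w = tsallis_inf_weights(L_hat, eta_t)
        arm = rng.choice(K, p=w)
        loss = float(env(t, arm))
        total_loss += loss
        L_hat += reduced_variance_estimate(K, arm, loss, w[arm])
    return total_loss
```

For a quick sanity check, one can pass a Bernoulli environment such as `env = lambda t, arm: float(np.random.rand() < mu[arm])` for some hypothetical mean vector `mu`.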
2. Regret Guarantees across Bandit Regimes
Tsallis-INF yields tight regret bounds in a variety of regimes, most notably:
- Adversarial regime: For $T$ time steps and $K$ arms, the expected regret satisfies
$$\mathrm{Reg}_T = O\big(\sqrt{KT}\big).$$
This matches the minimax lower bound $\Omega(\sqrt{KT})$ up to constants.
- Stochastic regime: When losses are i.i.d. with a unique optimal arm $i^*$ and gaps $\Delta_i > 0$ ($i \neq i^*$), the regret is
$$\mathrm{Reg}_T = O\Big(\sum_{i \neq i^*}\frac{\log T}{\Delta_i}\Big).$$
Subsumed herein is the classical $(\log T)/\Delta$ dependence.
- Corrupted stochastic regime (adversarial corruptions of total magnitude $C$):
$$\mathrm{Reg}_T = O\Big(\sum_{i \neq i^*}\frac{\log T}{\Delta_i} + \sqrt{C \sum_{i \neq i^*}\frac{\log T}{\Delta_i}}\Big).$$
This guarantees smooth interpolation between logarithmic and sublinear regret rates depending on the corruption level (Masoudian et al., 2021).
The self-bounding constraint formalism subsumes all these regimes, enabling this unification. When $C = 0$ and the gaps $\Delta_i$ are fixed, the bound specializes to the stochastic setting; when no self-bounding constraint is imposed, adversarial scaling is recovered.
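In symbols, the constraint and the way it enters the analysis have roughly the following shape; this is a sketch following the standard self-bounding presentation, and $\mathrm{UB}_T$ is a placeholder name for the algorithm's variance-type upper bound.

```latex
% Adversarial regime with a self-bounding constraint: for gaps \Delta_i and a
% corruption budget C >= 0, the environment satisfies
\mathrm{Reg}_T \;\ge\; \sum_{t=1}^{T}\sum_{i \neq i^*} \Delta_i\, \mathbb{E}[w_{t,i}] \;-\; C .
% The regret is then bounded by combining this with the algorithm's
% variance-type upper bound \mathrm{UB}_T: for any \lambda \in (0, 1],
\mathrm{Reg}_T \;=\; (1+\lambda)\,\mathrm{Reg}_T - \lambda\,\mathrm{Reg}_T
\;\le\; (1+\lambda)\,\mathrm{UB}_T
      \;-\; \lambda\Big(\sum_{t,\,i\neq i^*}\Delta_i\,\mathbb{E}[w_{t,i}] - C\Big),
% after which the right-hand side is optimized term by term over E[w_{t,i}]
% and over \lambda, yielding the stochastic and corrupted bounds above.
```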
3. Implementation and Theoretical Structure
Tsallis-INF is an FTRL algorithm with a non-Euclidean regularizer:
$$w_t = \arg\min_{w \in \Delta_{K-1}} \Big\{ \langle w, \hat{L}_{t-1}\rangle + \tfrac{1}{\eta_t}\Psi(w) \Big\}.$$
This update is implemented by evaluating the convex conjugate of $\Psi$ (in practice, a one-dimensional search for the normalizing multiplier), though in recent analyses (Lee et al., 14 Nov 2025) the update rule and regret can be analyzed and implemented directly via spectral properties of the Tsallis entropy, avoiding explicit Fenchel conjugation.
Reduced-variance loss estimators, using a baseline (control-variate) correction, ensure the estimator's second moment remains controlled even for small-probability arms (see the numerical check below):
- When $w_{t,I_t}$ is small, the importance weight $1/w_{t,I_t}$ is large; centering the observed loss at the baseline $B_t$ before reweighting shrinks the magnitude of the importance-weighted term, improving stability.
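The variance-reduction effect can be checked numerically. The following self-contained snippet (with illustrative numbers, not taken from the cited papers) compares the plain importance-weighted estimator with the baseline-corrected one for a small-probability arm.

```python
import numpy as np

rng = np.random.default_rng(1)
w, ell, B = 0.05, 0.6, 0.5            # small arm probability, true loss, baseline
pulls = rng.random(200_000) < w       # indicator that the arm was played each round

iw = ell * pulls / w                  # plain importance-weighted estimate
rv = (ell - B) * pulls / w + B        # reduced-variance (baseline-corrected) estimate

print(iw.mean(), rv.mean())                 # both close to 0.6: unbiased
print((iw ** 2).mean(), (rv ** 2).mean())   # second moment is far smaller for rv
```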
Choice of the exponent $\alpha = 1/2$ is essential: only this exponent achieves both adversarial and stochastic optimality in the bandit setting, due to the interplay between implicit exploration and loss minimization enforced by the square-root regularizer.
The learning rate $\eta_t = 4/\sqrt{t}$ is determined by calibration between the local-norm penalty and telescoping potential differences inherent in the FTRL local-norm regret lemma, as sketched below.
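As a rough illustration of this calibration (constants not tracked; the decomposition follows the generic FTRL penalty/stability split rather than any single paper's bookkeeping):

```latex
% Penalty: range of the regularizer over the simplex, scaled by 1/\eta_T
\frac{\max_w \Psi(w) - \min_w \Psi(w)}{\eta_T}
  \;=\; \frac{2(\sqrt{K}-1)}{\eta_T}
  \;=\; \Theta\!\big(\sqrt{KT}\big)
  \qquad \text{for } \eta_T = 4/\sqrt{T};
% Stability: accumulated local-norm variance
\sum_{t=1}^{T} \frac{\eta_t}{2}\,
  \mathbb{E}\big[\|\hat{\ell}_t\|^2_{(\nabla^2\Psi(w_t))^{-1}}\big]
  \;\lesssim\; \sum_{t=1}^{T} \eta_t \sqrt{K}
  \;=\; \Theta\!\big(\sqrt{KT}\big).
% A faster-decaying schedule inflates the penalty term and a slower one
% inflates the stability term, which forces \eta_t \propto 1/\sqrt{t}.
```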
4. Analytical Techniques and Proof Innovations
Recent proofs (Lee et al., 14 Nov 2025, Masoudian et al., 2021) eschew Fenchel conjugate arguments in favor of local-norm analyses rooted in modern online convex optimization theory. The key lemma (FTRL local-norm regret bound) yields, for the time-varying regularizer $\Psi/\eta_t$,
$$\mathrm{Reg}_T \;\lesssim\; \frac{\max_{w}\Psi(w) - \min_{w}\Psi(w)}{\eta_T} \;+\; \sum_{t=1}^{T}\frac{\eta_t}{2}\,\mathbb{E}\Big[\|\hat{\ell}_t\|^2_{(\nabla^2\Psi(z_t))^{-1}}\Big],$$
with $z_t$ on the segment between $w_t$ and $w_{t+1}$. For the Tsallis entropy in question, $\nabla^2\Psi(w) = \tfrac{1}{2}\,\mathrm{diag}\big(w_i^{-3/2}\big)$, and the spectral inverse is correspondingly simple. Analysis leverages the fact that the second moment of the reduced-variance losses can be upper bounded in terms of $\sum_i \sqrt{w_{t,i}} \le \sqrt{K}$, allowing tight control over variance terms and summation by integral comparisons bounded by $\sum_{t=1}^{T} t^{-1/2} \le 2\sqrt{T}$ or $\sum_{t=1}^{T} t^{-1} \le 1 + \log T$ as appropriate.
In the self-bounding setting, the analysis further leverages constrained optimization over per-arm probabilities, producing refined logarithmic terms via tight single-step optimizations with quadratic-linear objectives (solved in closed form).
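A representative closed-form step of this quadratic-linear type (an illustrative template, not the exact objective from the cited proofs) is:

```latex
% Maximizing a square-root gain against a linear gap penalty in closed form:
\max_{x \in [0,1]} \Big( \sqrt{\tfrac{x}{t}} \;-\; \lambda\,\Delta_i\, x \Big)
  \;\le\; \frac{1}{4\,\lambda\,\Delta_i\, t},
% and summing over rounds yields the logarithmic dependence:
\sum_{t=1}^{T} \frac{1}{4\,\lambda\,\Delta_i\, t}
  \;\le\; \frac{1 + \log T}{4\,\lambda\,\Delta_i}.
```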
5. Comparative Landscape and Extensions
Tsallis-INF achieves a unique position among bandit algorithms:
- EXP3 achieves minimax adversarial regret but not stochastic optimality.
- UCB achieves stochastic optimality but suffers linear regret in adversarial settings.
- Thompson Sampling exhibits near-linear regret in adversarial environments (Zimmert et al., 2018).
Tsallis-INF, by contrast, universally adapts to the underlying regime. Its self-bounding analysis generalizes to stochastic MDPs (episodic settings) and other online learning frameworks where regret bounds of the self-bounding form apply.
A summary comparison is given in the following table:
| Algorithm | Adversarial Regret | Stochastic Regret |
|---|---|---|
| UCB1 | linear in the worst case | $O\big(\sum_{i\neq i^*} (\log T)/\Delta_i\big)$ |
| EXP3 | $O\big(\sqrt{KT\log K}\big)$ | $O\big(\sqrt{KT\log K}\big)$ (not logarithmic) |
| Thompson Sampling | near-linear in the worst case | $O\big(\sum_{i\neq i^*} (\log T)/\Delta_i\big)$ |
| Tsallis-INF | $O\big(\sqrt{KT}\big)$ | $O\big(\sum_{i\neq i^*} (\log T)/\Delta_i\big)$ |
6. Parameter Choices, Constants, and Practical Implications
The critical parameters are:
- Exponent $\alpha = 1/2$: unique for best-of-both-worlds optimality.
- Learning rate $\eta_t = 4/\sqrt{t}$: required for balancing exploration and exploitation.
- Constants: Non-optimized in several analyses (e.g., 32, 256 in bounds), with the focus on tightness of scaling laws rather than sharp constants.
No explicit prior knowledge of the horizon $T$, the regime, or the loss distribution is needed. The analysis and implementation are robust to these unknowns.
Despite non-optimized constants, empirical evaluation shows Tsallis-INF significantly outperforms UCB1 and EXP3 in stochastic settings, and does not suffer catastrophic linear regret in adversarial settings, unlike Thompson Sampling (Zimmert et al., 2018).
7. Extensions and Generalizations
The Tsallis-INF approach extends through the self-bounding formalism to a range of bandit and online learning problems, including:
- Stochastically constrained adversarial MABs,
- Stochastic MABs with adversarial data corruptions,
- Utility-based dueling bandits,
- Best-of-both-worlds reinforcement learning in MDPs (Masoudian et al., 2021).
The optimization-based regret bounds allow direct application to any algorithm with matching potential-variance trade-offs, suggesting broader applicability in stochastic control and reinforcement learning settings.
A plausible implication is that further generalizations to structured action spaces or contextual bandits may be possible by modifying the regularizer and estimator structure accordingly, while retaining optimal interpolation between adversarial and stochastic regimes.