Tsallis-INF Multi-Armed Bandit

Updated 17 November 2025
  • Tsallis-INF is an FTRL algorithm that leverages Tsallis entropy with exponent 1/2 to optimally adapt to both adversarial and stochastic environments.
  • It employs reduced-variance loss estimators and a learning rate of 4/√t to balance implicit exploration with precise loss minimization.
  • The algorithm guarantees O(√(KT)) regret in adversarial settings and O(∑(log T)/Δ_i) regret in stochastic regimes, offering best-of-both-worlds performance.

The Tsallis-INF multi-armed bandit algorithm is a Follow-The-Regularized-Leader (FTRL) approach employing Tsallis entropy with exponent $\alpha=1/2$ as a regularizer. It is designed to attain optimal (within constants) regret bounds in both adversarial and stochastic bandit environments without prior knowledge of the regime or time horizon. Uniquely among bandit algorithms, Tsallis-INF guarantees $\mathcal{O}(\sqrt{KT})$ regret in the adversarial setting and $\mathcal{O}\bigl(\sum_{i\ne i^*} (\log T)/\Delta_i\bigr)$ regret in the stochastic setting, including various forms of stochastic-adversarial interpolation with adversarial corruptions, thus displaying "best-of-both-worlds" performance (Zimmert et al., 2018, Lee et al., 14 Nov 2025, Masoudian et al., 2021).

1. Algorithmic Principles and Formulation

Tsallis-INF operates in the standard $K$-armed bandit setting where, at each round $t=1,\dots,T$, a distribution $w_t$ over arms is selected, an arm $I_t \sim w_t$ is drawn, and the corresponding loss $\ell_{t,I_t}\in[0,1]$ is observed.

The regularizer is the (negative) Tsallis entropy of order $1/2$,

$$\Psi(w) = -4\sum_{i=1}^{K}\Bigl(\sqrt{w_i} - \tfrac{1}{2}w_i\Bigr).$$

At each round, with a learning rate $\eta_t$, the FTRL subproblem

$$w_t = \arg\max_{w\in\Delta^{K-1}}\Bigl\{ -\langle \hat{L}_{t-1}, w \rangle - \Psi_t(w) \Bigr\}, \qquad \Psi_t(w)=\eta_t^{-1}\Psi(w),$$

is solved, where $\hat{L}_{t-1}$ is the cumulative vector of reduced-variance loss estimates.
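For a concrete sense of how this update is computed, the following Python sketch uses the standard Lagrangian closed form $w_{t,i} = 4\,(\eta_t(\hat{L}_{t-1,i} - x))^{-2}$, with the scalar $x$ found by Newton's method so that the weights sum to one. The function name, tolerance, and starting point are illustrative choices, not code from the cited papers.

```python
import numpy as np

def tsallis_inf_weights(L_hat, eta, tol=1e-12, max_iter=100):
    """Solve the 1/2-Tsallis FTRL subproblem over the simplex.

    Uses the closed form w_i = 4 / (eta * (L_hat_i - x))**2, where the scalar
    x is tuned by Newton's method so that the weights sum to one.  Sketch
    under the conventions of the text, not code from the cited papers.
    """
    L_hat = np.asarray(L_hat, dtype=float)
    # Start above the root: at this x the best arm alone gets weight 1, so the
    # total weight is >= 1 and the Newton iterates decrease monotonically.
    x = L_hat.min() - 2.0 / eta
    for _ in range(max_iter):
        w = 4.0 / (eta * (L_hat - x)) ** 2
        f = w.sum() - 1.0                               # normalization error
        if abs(f) < tol:
            break
        f_prime = np.sum(8.0 / (eta ** 2 * (L_hat - x) ** 3))
        x -= f / f_prime                                # Newton step
    w = 4.0 / (eta * (L_hat - x)) ** 2
    return w / w.sum()                                  # guard against round-off
```

Because the normalization constraint is convex and increasing in $x$, starting above the root makes the Newton iterates decrease monotonically to the solution while staying inside the feasible range.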

Reduced-variance loss estimation (cf. Audibert-Bubeck, Zimmert-Seldin) is used to control estimator variance:

$$B_t(i) = \tfrac{1}{2}\,[\![\,w_{t,i} \ge \eta_t^2\,]\!], \qquad \hat\ell_{t,i} = \frac{\mathbb{I}[I_t = i]\,\bigl(\ell_{t,i} - B_t(i)\bigr)}{w_{t,i} + B_t(i)}.$$
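A minimal Python sketch of this estimator, mirroring the displayed formula (the helper name and array-based interface are assumptions for illustration):

```python
import numpy as np

def rv_loss_estimate(loss, chosen_arm, w, eta, K):
    """Reduced-variance loss estimate mirroring the formula above.

    B_t(i) = 1/2 * 1{w_i >= eta^2}; only the pulled arm receives a nonzero
    estimate.  Illustrative sketch, not code from the cited papers.
    """
    B = 0.5 * (np.asarray(w) >= eta ** 2)        # per-arm bias correction
    l_hat = np.zeros(K)
    l_hat[chosen_arm] = (loss - B[chosen_arm]) / (w[chosen_arm] + B[chosen_arm])
    return l_hat
```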

The learning rate schedule is set as $\eta_t = 4/\sqrt{t}$, which is critical for simultaneously achieving both the $\mathcal{O}(\sqrt{T})$ adversarial and the $\mathcal{O}(\log T)$ stochastic regret.
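Putting the pieces together, a minimal Tsallis-INF loop might look as follows; it assumes the `tsallis_inf_weights` and `rv_loss_estimate` helpers sketched above and a hypothetical `bandit_losses(t, arm)` callback supplying the environment's losses.

```python
import numpy as np

def tsallis_inf(bandit_losses, K, T, rng=None):
    """Minimal Tsallis-INF loop combining the two helpers sketched above.

    `bandit_losses(t, arm)` is a hypothetical callback returning the loss in
    [0, 1] of the pulled arm at round t.  Returns the cumulative loss.
    """
    rng = np.random.default_rng() if rng is None else rng
    L_hat = np.zeros(K)                        # cumulative loss estimates
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = 4.0 / np.sqrt(t)                 # learning-rate schedule 4/sqrt(t)
        w = tsallis_inf_weights(L_hat, eta)    # FTRL step (sketched earlier)
        arm = int(rng.choice(K, p=w))          # sample I_t ~ w_t
        loss = bandit_losses(t, arm)           # only this arm's loss is observed
        L_hat += rv_loss_estimate(loss, arm, w, eta, K)
        total_loss += loss
    return total_loss
```

For instance, `tsallis_inf(lambda t, a: float(np.random.rand() > [0.5, 0.6][a]), K=2, T=10_000)` runs the loop on a hypothetical two-armed instance with Bernoulli losses of means 0.5 and 0.4.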

2. Regret Guarantees across Bandit Regimes

Tsallis-INF yields tight regret bounds in a variety of regimes, most notably:

  • Adversarial regime: For $T$ time steps and $K$ arms, the expected regret satisfies

$$\mathrm{Reg}_T \leq 2\sqrt{(K-1)T} + \tfrac{1}{2}\sqrt{T} + O(K\log T).$$

This matches the minimax lower bound up to constants.

  • Stochastic regime: When losses are i.i.d. with a unique optimal arm $i^*$ and gaps $\Delta_i > 0$ for $i\neq i^*$, the regret is
    $$\mathrm{Reg}_T = O\!\left(\Bigl(\sum_{i\ne i^*}\frac{1}{\Delta_i}\Bigr)\log_+\frac{(K-1)T}{\bigl(\sum_{i\ne i^*}1/\Delta_i\bigr)^2}\right),$$
    which subsumes the classical $O\bigl(\sum_{i\ne i^*} (\log T)/\Delta_i\bigr)$ dependence.
  • Corrupted stochastic regime (adversarial corruptions of magnitude $C$):
    $$\mathrm{Reg}_T = O\!\left(\sum_{i\ne i^*}\frac{1}{\Delta_i}\,\log_+\frac{(K-1)T}{\bigl(\sum_{i\ne i^*}1/\Delta_i\bigr)^2} + \sqrt{C\Bigl(\sum_{i\ne i^*}\frac{1}{\Delta_i}\Bigr)\log_+\frac{(K-1)T}{C\sum_{i\ne i^*}1/\Delta_i}}\,\right)$$

This guarantees smooth interpolation between logarithmic and sublinear regret rates depending on the corruption level (Masoudian et al., 2021).

The self-bounding constraint formalism subsumes all these regimes, enabling this unification. When $C=0$, the bound specializes to the stochastic setting; when $C\sim T$, adversarial scaling is recovered.
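As a rough numerical illustration of this interpolation, the snippet below evaluates the corrupted-regime bound with its hidden constant set to 1 and $\log_+ = \max(1,\log)$ on a hypothetical instance. The numbers indicate scaling only; in addition, the algorithm always retains the adversarial $O(\sqrt{KT})$ guarantee.

```python
import numpy as np

def corrupted_bound(gaps, T, C):
    """Evaluate the corrupted-regime bound above with its hidden constant set
    to 1 and log_+ = max(1, log).  Purely illustrative scaling numbers."""
    K = len(gaps) + 1                           # gaps listed for the K-1 suboptimal arms
    H = float(np.sum(1.0 / np.asarray(gaps)))   # sum_i 1/Delta_i
    log_plus = lambda v: max(1.0, np.log(v))
    stochastic_part = H * log_plus((K - 1) * T / H ** 2)
    corruption_part = np.sqrt(C * H * log_plus((K - 1) * T / (C * H))) if C > 0 else 0.0
    return stochastic_part + corruption_part

# C = 0 gives the logarithmic stochastic rate; growing C moves the bound
# toward sqrt-type scaling.
for C in (0, 10**2, 10**4, 10**6):
    print(C, f"{corrupted_bound(gaps=[0.1] * 9, T=10**6, C=C):.0f}")
```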

3. Implementation and Theoretical Structure

Tsallis-INF is an FTRL algorithm with a non-Euclidean regularizer:
$$w_t = \arg\max_{w\in\Delta^{K-1}}\Bigl\{ -\langle \hat{L}_{t-1}, w \rangle - \Psi_t(w) \Bigr\}.$$
This update is implemented by evaluating the convex conjugate of $\Psi_t + I_\Delta$, though in recent analyses (Lee et al., 14 Nov 2025) the update rule and regret can be analyzed and implemented directly via spectral properties of the Tsallis entropy, avoiding explicit Fenchel conjugation.

Reduced-variance loss estimators, using bias correction, ensure variance remains controlled even for small probability arms:

  • When $w_{t,i}$ is small, $B_t(i)>0$ increases the denominator in $\hat\ell_{t,i}$, improving stability.

The choice of $\alpha=1/2$ is essential: only this exponent achieves both adversarial and stochastic optimality in the bandit setting, due to the interplay between implicit exploration and loss minimization enforced by the square-root regularizer.

The learning rate $\eta_t = 4/\sqrt{t}$ is determined by calibration between the local-norm penalty and the telescoping potential differences inherent in the FTRL local-norm regret lemma.

4. Analytical Techniques and Proof Innovations

Recent proofs (Lee et al., 14 Nov 2025, Masoudian et al., 2021) eschew Fenchel-conjugate arguments in favor of local-norm analyses rooted in modern online convex optimization theory. The key lemma (the FTRL local-norm regret bound) yields
$$\sum_{t=1}^T \langle g_t, w_t - u\rangle \leq \Psi_{T+1}(u) - \min_w \Psi_1(w) + \tfrac{1}{2}\sum_{t=1}^T \|g_t\|_{(\nabla^2 \Psi_t(z_t))^{-1}}^2 + \sum_{t=1}^T\bigl[\Psi_t(w_{t+1}) - \Psi_{t+1}(w_{t+1})\bigr].$$
For the Tsallis regularizer above, $\nabla^2\Psi_t(x) = \mathrm{diag}\bigl(\eta_t^{-1} x_i^{-3/2}\bigr)$, so the spectral inverse is correspondingly simple. The analysis leverages the fact that the variance of the reduced-variance loss estimates can be upper bounded in terms of $w_{t,i}^{1/2}$, allowing tight control over the variance terms, whose sums are bounded by integrals of order $O(\sqrt{T})$ or $O(\log T)$ as appropriate.
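To make the variance term concrete, here is a small Python sketch (function name assumed, not from the cited papers) of the squared local norm under the diagonal Hessian above, with a comment noting how the $\sqrt{w_{t,i}}$ terms arise in expectation for importance-weighted estimates.

```python
import numpy as np

def local_norm_sq(g, z, eta):
    """Squared local norm ||g||^2 in the inverse Hessian of the 1/2-Tsallis
    regularizer Psi_t(x) = -(4/eta) * sum_i (sqrt(x_i) - x_i / 2).

    The Hessian is diag(x_i^{-3/2} / eta), so its inverse is
    diag(eta * x_i^{3/2}) and the norm is a simple weighted sum.  Sketch only.
    """
    g, z = np.asarray(g, dtype=float), np.asarray(z, dtype=float)
    return eta * np.sum(g ** 2 * z ** 1.5)

# For the plain importance-weighted estimate g_i = 1{I_t = i} * loss / w_i,
# E[local_norm_sq(g, w, eta)] <= eta * sum_i sqrt(w_i): this is where the
# sqrt(w_{t,i}) terms controlled by the analysis come from.
```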

In the self-bounding setting, the analysis further leverages constrained optimization over per-arm probabilities, producing refined logarithmic terms via tight single-step optimizations with quadratic-linear objectives (solved in closed form).

5. Comparative Landscape and Extensions

Tsallis-INF achieves a unique position among bandit algorithms:

  • EXP3 achieves minimax adversarial regret but not stochastic optimality.
  • UCB achieves stochastic optimality but is linear in adversarial settings.
  • Thompson Sampling exhibits near-linear regret in adversarial environments (Zimmert et al., 2018).

Tsallis-INF, by contrast, universally adapts to the underlying regime. Its self-bounding analysis generalizes to stochastic MDPs (episodic settings) and other online learning frameworks where regret bounds of the form $B\sum_{t,\,i\ne i^*}\sqrt{\mathbb{E}[w_{t,i}]/t} + D$ apply.
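To sketch why a bound of this form yields logarithmic stochastic regret (a standard argument, summarized here for intuition rather than quoted from the cited papers): in the stochastic regime the regret also satisfies $\mathrm{Reg}_T = \sum_{t}\sum_{i\ne i^*}\Delta_i\,\mathbb{E}[w_{t,i}]$, so one can write

$$\begin{aligned}
\mathrm{Reg}_T &= 2\,\mathrm{Reg}_T - \mathrm{Reg}_T
  \le \sum_{t=1}^{T}\sum_{i\ne i^*}\Bigl(2B\sqrt{\tfrac{\mathbb{E}[w_{t,i}]}{t}} - \Delta_i\,\mathbb{E}[w_{t,i}]\Bigr) + 2D \\
 &\le \sum_{t=1}^{T}\sum_{i\ne i^*}\max_{x\ge 0}\Bigl(2B\sqrt{\tfrac{x}{t}} - \Delta_i x\Bigr) + 2D
  = \sum_{t=1}^{T}\sum_{i\ne i^*}\frac{B^2}{\Delta_i\,t} + 2D
  = O\!\Bigl(\sum_{i\ne i^*}\frac{\log T}{\Delta_i}\Bigr).
\end{aligned}$$

The inner maximization is exactly the closed-form quadratic-linear single-step optimization mentioned below.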

A summary comparison is given in the following table:

| Algorithm | Adversarial Regret | Stochastic Regret |
|---|---|---|
| UCB1 | $\Omega(T)$ | $O\bigl(\sum_{i\ne i^*} (\log T)/\Delta_i\bigr)$ |
| EXP3 | $O(\sqrt{KT})$ | $O(\sqrt{KT})$ |
| Thompson Sampling | $\Omega(T)$ | $O\bigl(\sum_{i\ne i^*} (\log T)/\Delta_i\bigr)$ |
| Tsallis-INF | $O(\sqrt{KT})$ | $O\bigl(\sum_{i\ne i^*} (\log T)/\Delta_i\bigr)$ |

6. Parameter Choices, Constants, and Practical Implications

The critical parameters are:

  • Exponent $\alpha=1/2$: Unique for best-of-both-worlds optimality.
  • Learning rate $\eta_t = 4/\sqrt{t}$: Required for balancing exploration and exploitation.
  • Constants: Non-optimized in several analyses (e.g., 32, 256 in bounds), with the focus on tightness of scaling laws rather than sharp constants.

No explicit prior knowledge of $T$, regime, or loss distribution is needed. The analysis and implementation are robust to these unknowns.

Despite non-optimized constants, empirical evaluation shows Tsallis-INF significantly outperforms UCB1 and EXP3 in stochastic settings, and does not suffer catastrophic linear regret in adversarial settings, unlike Thompson Sampling (Zimmert et al., 2018).
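A minimal experiment in this spirit, reusing the `tsallis_inf` loop sketched in Section 1 together with a bare-bones UCB1 baseline; all names, parameters, and printed numbers here are illustrative and are not results from the cited papers.

```python
import numpy as np

def ucb1_total_loss(means, T, rng):
    """Bare-bones UCB1 baseline on Bernoulli losses (loss = 1{reward draw fails}).
    Illustrative sketch, not the evaluation protocol of the cited papers."""
    K = len(means)
    counts, loss_sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                                   # pull each arm once
        else:
            index = loss_sums / counts - np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmin(index))                   # lowest optimistic loss
        loss = float(rng.random() > means[arm])           # Bernoulli loss
        counts[arm] += 1; loss_sums[arm] += loss; total += loss
    return total

# Hypothetical head-to-head on a two-armed instance, reusing the tsallis_inf
# loop sketched earlier; reward means 0.5 and 0.6 (so arm 1 is optimal).
rng = np.random.default_rng(0)
means, T = [0.5, 0.6], 20_000
ti = tsallis_inf(lambda t, a: float(rng.random() > means[a]), K=2, T=T, rng=rng)
ub = ucb1_total_loss(means, T, rng)
print("cumulative loss -- Tsallis-INF:", ti, " UCB1:", ub)
```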

7. Extensions and Generalizations

The Tsallis-INF approach extends through the self-bounding formalism to a range of bandit and online learning problems, including stochastic MDPs in episodic settings and other frameworks admitting self-bounding regret bounds.

The optimization-based regret bounds allow direct application to any algorithm with matching potential-variance trade-offs, suggesting broader applicability in stochastic control and reinforcement learning settings.

A plausible implication is that further generalizations to structured action spaces or contextual bandits may be possible by modifying the regularizer and estimator structure accordingly, while retaining optimal interpolation between adversarial and stochastic regimes.
