Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits
This paper, by Julian Zimmert and Yevgeny Seldin, presents an improved version of the Tsallis-INF algorithm and analyzes its performance in both stochastic and adversarial bandit settings. The paper covers the algorithm's theoretical underpinnings and empirical performance, as well as the revisions the authors made in response to peer review.
Improved Algorithmic Performance
The primary contribution is a refinement of the Tsallis-INF algorithm and its analysis. The authors show that using reduced-variance importance-weighted loss estimators improves the leading constants in both the stochastic and the adversarial setting: in the stochastic regime the bound matches the asymptotic lower bound within a multiplicative factor of 2, while in the adversarial regime it matches the best known leading constant.
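To make the estimator concrete, the sketch below shows the recentring idea behind reduced-variance importance weighting: subtracting a baseline B before importance weighting and adding it back keeps the estimate unbiased while shrinking its variance for losses in [0, 1]. This is a minimal illustration under stated assumptions, not the paper's exact estimator, which additionally gates the baseline on the playing probability; the function name and the choice B = 1/2 are illustrative.

```python
import numpy as np

def recentred_iw_estimate(n_arms, played_arm, observed_loss, probs, baseline=0.5):
    """Recentred importance-weighted loss estimate (variance-reduction sketch).

    lhat_i = 1{I_t = i} * (loss_i - B) / probs_i + B, which is unbiased:
    E[lhat_i] = probs_i * (loss_i - B) / probs_i + B = loss_i.
    """
    lhat = np.full(n_arms, baseline)       # every arm receives the baseline B
    lhat[played_arm] += (observed_loss - baseline) / probs[played_arm]
    return lhat
```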
Empirical and Theoretical Insights
Empirical results reflect these improvements. The revised Tsallis-INF algorithm is competitive with Thompson Sampling in stochastic environments and clearly outperforms alternatives in stochastically constrained adversarial settings. A focal point is the regularization power α = 1/2, which the paper identifies as the optimal choice: α parametrizes the Tsallis-entropy regularizer in the algorithm's online-mirror-descent update, with α → 1 recovering the (negative) Shannon entropy used by Exp3 and α → 0 the log-barrier regularizer.
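For α = 1/2 the mirror-descent update has a convenient closed form: the sampling weights satisfy w_i = 4 / (η (L_i − x))², where L_i is the cumulative loss estimate of arm i and the normalizer x is found by Newton's method so that the weights sum to one. The sketch below implements this step; the initialization of x, the fixed iteration count, and the learning-rate schedule η_t = 2/√t in the usage comment are illustrative assumptions (the paper tunes the constant).

```python
import numpy as np

def tsallis_inf_weights(cum_loss_est, eta, newton_steps=50):
    """Sampling distribution of Tsallis-INF with alpha = 1/2 (sketch).

    Solves w_i = 4 / (eta * (L_i - x))^2 with the normalizer x chosen by
    Newton's method so that sum_i w_i = 1. Starting left of min(L) keeps
    every L_i - x positive, and by convexity each Newton step stays on
    that side while converging to the root.
    """
    L = np.asarray(cum_loss_est, dtype=float)
    x = np.min(L) - 2.0 / eta            # weight of the minimizing arm starts at 1
    for _ in range(newton_steps):
        w = 4.0 / (eta * (L - x)) ** 2
        x -= (np.sum(w) - 1.0) / (eta * np.sum(w ** 1.5))
    return 4.0 / (eta * (L - x)) ** 2

# One round, under the illustrative schedule eta_t = 2 / sqrt(t):
# w = tsallis_inf_weights(L_hat, eta=2.0 / np.sqrt(t))
# arm = np.random.choice(len(w), p=w / w.sum())  # renormalize residual error
```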
Furthermore, the paper deepens the analytical treatment of several points. For instance, it explains why the stochastic analysis assumes a unique best arm, and it introduces a self-bounding analysis framework that handles the stochastic and adversarial settings in a unified way. The new adversarial regime with a self-bounding constraint is a notable theoretical advancement, covering the stochastic regime, stochastically constrained adversaries, and stochastic bandits with adversarial corruptions as special cases; the constraint is sketched below.
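Up to notation, the self-bounding constraint reads as follows, where Δ_i are the suboptimality gaps, i* is the best arm, and w_{t,i} is the probability of playing arm i at round t; the exact constants are hedged here.

```latex
% Adversarial regime with a (\Delta, C, T) self-bounding constraint:
% there exist gaps \Delta \in [0,1]^K and a best arm i^* such that,
% for any algorithm, the pseudo-regret satisfies
R_T \;\ge\; \mathbb{E}\!\left[\sum_{t=1}^{T} \sum_{i \ne i^*} \Delta_i \, w_{t,i}\right] - C.
% Taking C = 0 recovers the stochastic regime and stochastically
% constrained adversaries; stochastic bandits with adversarial
% corruptions satisfy the constraint with C proportional to the
% total corruption budget.
```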
Methodological Advancements
The authors also provide a tighter analysis of stochastic bandits with adversarial corruptions, in particular in the regime where the corruption level C is the dominant term in the regret bound. This sharpened analysis further consolidates the robustness and applicability of the algorithm across varied adversarial scenarios.
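The shape of the resulting bound in the self-bounding regime, with constants and the precise logarithmic arguments omitted, is roughly the following.

```latex
% Regret in the self-bounding regime with constant C (constants and
% the exact logarithmic arguments omitted):
R_T \;=\; O\!\left( \sum_{i \ne i^*} \frac{\log T}{\Delta_i}
      \;+\; \sqrt{\, C \sum_{i \ne i^*} \frac{\log T}{\Delta_i} \,} \right).
% The corruption level C enters only through an additive square-root
% term, rather than multiplying the stochastic part of the bound.
```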
Additionally, the authors' responses to peer review document careful revisions that improve the manuscript's clarity and depth. They clarify terminology such as "logarithmic" regret, refine the explanation of the algorithm's implementation, and expand the theoretical discussion in response to reviewer critiques, strengthening the completeness and reliability of the claims.
Implications and Future Directions
The implications of this work are substantial for research on multi-armed bandits and for applications in reinforcement learning and sequential decision-making. By refining Tsallis-INF to perform well simultaneously in stochastic and adversarial settings, the authors open avenues for broader applicability in environments with uncertain or variable adversarial influence.
These improvements also suggest directions for future work. On the theory side, one question is whether the remaining multiplicative factor of 2 relative to the stochastic lower bound can be closed. On the practical side, the scalability and flexibility of the approach could be explored in real-world applications whose data regimes resemble the intermediate settings analyzed here.
In conclusion, the paper presents significant methodological and empirical advances for the Tsallis-INF algorithm. It provides a foundation for both theoretical insight and practical deployment while addressing the long-standing challenge of performing well simultaneously in stochastic and adversarial bandit environments.