Near-optimal Per-Action Regret Bounds for Sleeping Bandits (2403.01315v2)

Published 2 Mar 2024 in cs.LG and stat.ML

Abstract: We derive near-optimal per-action regret bounds for sleeping bandits, in which both the sets of available arms and their losses in every round are chosen by an adversary. In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln{K}})$, obtained indirectly via minimizing internal sleeping regrets. Compared to the minimax $\Omega(\sqrt{TA})$ lower bound, this upper bound contains an extra multiplicative factor of $K\ln{K}$. We address this gap by directly minimizing the per-action regret using generalized versions of EXP3, EXP3-IX and FTRL with Tsallis entropy, thereby obtaining near-optimal bounds of order $O(\sqrt{TA\ln{K}})$ and $O(\sqrt{T\sqrt{AK}})$. We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way. This leads to new proofs for a number of existing adaptive and tracking regret bounds for standard non-sleeping bandits. Extending our results to the bandit version of experts that report their confidences leads to new bounds for the confidence regret that depends primarily on the sum of experts' confidences. We prove a lower bound, showing that for any minimax optimal algorithms, there exists an action whose regret is sublinear in $T$ but linear in the number of its active rounds.

Near-Optimal Regret Bounds for Sleeping Bandits via Advanced Bandit Techniques

Introduction

Sleeping bandits, a variant of the multi-armed bandit (MAB) problem, accommodate action sets that change from round to round. This flexibility captures a wide range of practical scenarios, from recommender systems to clinical trials, where the availability of options can vary over time. Despite the relevance of the setting, tight regret bounds, and per-action regret bounds in particular, have remained elusive. This paper addresses that gap by directly minimizing the per-action regret with enhanced versions of existing bandit strategies, yielding near-optimal regret bounds. We further extend these results to bandits with advice from sleeping experts, deriving new adaptive and tracking regret bounds with implications for the broader field of sequential decision-making under uncertainty.

Near-optimal Regret Upper Bounds

Previously, the best known upper bound on the per-action regret in sleeping bandits, O(K√(TA ln K)), exceeded the Ω(√(TA)) minimax lower bound by an extra multiplicative factor of K ln K, suggesting substantial room for improvement. By directly minimizing the per-action regret with modified versions of established algorithms, namely EXP3, EXP3-IX, and Follow-The-Regularized-Leader (FTRL) with Tsallis entropy, this work narrows the gap significantly. Specifically, we achieve bounds of order O(√(TA ln K)) and O(√(T√(AK))), improving upon the best known indirect bounds obtained by minimizing internal sleeping regret.
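
For reference, in the abstract's notation ($K$ total arms, at most $A$ available per round, $T$ rounds), the improvement reads:

$$ \underbrace{O\big(K\sqrt{TA\ln K}\big)}_{\text{previous best, via internal sleeping regret}} \;\longrightarrow\; \underbrace{O\big(\sqrt{TA\ln K}\big) \ \text{and}\ O\big(\sqrt{T\sqrt{AK}}\big)}_{\text{this work}}, \qquad \Omega\big(\sqrt{TA}\big) \ \text{(minimax lower bound)}. $$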

Advanced Strategies for Sleeping Bandits

The core of our methodological contribution lies in the derivation of SB-EXP3 (Sleeping Bandits using EXP3) and FTARL (Follow-the-Active-and-Regularized-Leader), two algorithms that perform robustly in the fully adversarial sleeping bandits scenario. SB-EXP3 leverages a novel decomposition technique for bounding the growth of the potential function, accommodating the variability of the active arms across rounds. FTARL, in turn, follows the classic FTRL template, adapted to the sleeping constraints and regularized with Tsallis entropy. Together, these algorithms provide tools for handling the complex dynamics inherent to sleeping bandits.
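
To make the flavor of these algorithms concrete, here is a minimal Python sketch of an EXP3-style update restricted to the active set, in the spirit of SB-EXP3. The fixed learning rate eta, the interfaces active_set_fn and loss_fn, and the uniform treatment of active arms are illustrative assumptions, not the paper's exact algorithm or tuning.

    import numpy as np

    def sb_exp3_sketch(K, T, eta, active_set_fn, loss_fn, rng=None):
        """EXP3-style sketch for sleeping bandits: sample only among the arms
        active this round and update with importance-weighted loss estimates.
        Illustrative only; not the paper's exact SB-EXP3."""
        rng = rng or np.random.default_rng(0)
        L_hat = np.zeros(K)                               # cumulative loss estimates, one per arm
        plays = []
        for t in range(T):
            active = np.array(sorted(active_set_fn(t)))   # adversary reveals the active set A_t
            w = np.exp(-eta * L_hat[active])              # exponential weights of active arms only
            p = w / w.sum()                               # sampling distribution over A_t
            idx = rng.choice(len(active), p=p)
            arm = int(active[idx])
            loss = loss_fn(t, arm)                        # observed loss of the pulled arm in [0, 1]
            L_hat[arm] += loss / p[idx]                   # importance-weighted loss estimate
            plays.append(arm)
        return plays

As a sanity check, setting active_set_fn = lambda t: range(K) makes every arm active in every round and recovers standard EXP3-style behavior.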

Generalizing to Bandits with Advice from Sleeping Experts

Extending the insights gained from sleeping bandits, we consider bandits that receive advice from intermittently available experts. By generalizing the EXP4 algorithm to account for sleeping experts, we obtain analogous guarantees in this setting. The resulting SE-EXP4 (Sleeping Experts version of EXP4) algorithm shows that the techniques developed for sleeping bandits translate directly to the broader class of expert-advised bandit problems.
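
As an illustration of how advice from intermittently available experts can be combined, the following Python sketch mixes only the experts awake in a round, each recommending a distribution over the currently active arms. The interfaces awake_fn, advice_fn, active_set_fn, and loss_fn are hypothetical placeholders, and the update is a plain EXP4-style importance-weighted step rather than the paper's exact SE-EXP4.

    import numpy as np

    def se_exp4_sketch(N, K, T, eta, awake_fn, advice_fn, active_set_fn, loss_fn, rng=None):
        """EXP4-style sketch with sleeping experts: weight the awake experts,
        mix their advice over the active arms, and charge each awake expert the
        importance-weighted loss of its recommendation. Illustrative only."""
        rng = rng or np.random.default_rng(0)
        G_hat = np.zeros(N)                                  # cumulative estimated losses per expert
        for t in range(T):
            active = np.array(sorted(active_set_fn(t)))      # active arms this round
            awake = np.array(sorted(awake_fn(t)))            # experts giving advice this round
            q = np.exp(-eta * G_hat[awake])
            q /= q.sum()                                     # weights over awake experts
            advice = np.array([advice_fn(t, e, active) for e in awake])  # shape (|awake|, |active|)
            p = q @ advice                                   # induced distribution over active arms
            idx = rng.choice(len(active), p=p)
            loss = loss_fn(t, int(active[idx]))
            ell_hat = np.zeros(len(active))
            ell_hat[idx] = loss / p[idx]                     # importance-weighted arm loss
            G_hat[awake] += advice @ ell_hat                 # estimated expected loss per awake expert
        return G_hat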

Implications for Adaptive and Tracking Regret

A crucial aspect of this work is the application of sleeping bandits-based approaches to obtain new proofs for adaptive and tracking regret bounds in standard (non-sleeping) bandit scenarios. By conceptualizing changes in the action set as changes in the availability of expert advice, we draw a direct parallel that enriches our understanding of adaptivity and tracking in sequential decision contexts. Through this lens, sleeping bandits are not merely a variant of the MAB problem but a framework through which the dynamics of learning and decision-making can be understood more holistically.
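
One concrete way to picture this reduction: create one sleeping expert per time interval, awake exactly on that interval, so that the per-expert regret of a sleeping-experts algorithm becomes regret on every interval simultaneously. The sketch below only builds the awake sets; it illustrates the mapping rather than the paper's construction, and a practical variant would restrict attention to far fewer intervals.

    def interval_experts(T):
        """One sleeping expert per interval [s, e): awake only on rounds s <= t < e.
        Plugging the resulting awake_fn into a sleeping-experts bandit algorithm
        (e.g., the se_exp4_sketch above) turns per-expert regret into adaptive,
        per-interval regret. Illustrative only."""
        intervals = [(s, e) for s in range(T) for e in range(s + 1, T + 1)]

        def awake_fn(t):
            return [i for i, (s, e) in enumerate(intervals) if s <= t < e]

        return intervals, awake_fn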

Directions for Future Research

While this paper makes significant strides toward optimal regret bounds in sleeping bandits and related settings, intriguing questions remain open. An important direction for future work is to determine whether the achieved O(√(TA ln K)) upper bound is minimax optimal or whether further refinement is possible. This inquiry may require new methodological approaches or deeper theoretical insights, potentially expanding the frontier of what is achievable in sleeping bandits and beyond.

Conclusion

This paper represents a significant step forward in the quest to minimize regret in the dynamically evolving environments captured by sleeping bandits. By refining and extending established bandit algorithms, we offer near-optimal solutions to longstanding challenges and open new avenues for research in adaptive learning and stochastic optimization.

Authors

  1. Quan Nguyen
  2. Nishant A. Mehta