Near-optimal Per-Action Regret Bounds for Sleeping Bandits (2403.01315v2)

Published 2 Mar 2024 in cs.LG and stat.ML

Abstract: We derive near-optimal per-action regret bounds for sleeping bandits, in which both the sets of available arms and their losses in every round are chosen by an adversary. In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln{K}})$, obtained indirectly via minimizing internal sleeping regrets. Compared to the minimax $\Omega(\sqrt{TA})$ lower bound, this upper bound contains an extra multiplicative factor of $K\ln{K}$. We address this gap by directly minimizing the per-action regret using generalized versions of EXP3, EXP3-IX and FTRL with Tsallis entropy, thereby obtaining near-optimal bounds of order $O(\sqrt{TA\ln{K}})$ and $O(\sqrt{T\sqrt{AK}})$. We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way. This leads to new proofs for a number of existing adaptive and tracking regret bounds for standard non-sleeping bandits. Extending our results to the bandit version of experts that report their confidences leads to new bounds for the confidence regret that depends primarily on the sum of experts' confidences. We prove a lower bound, showing that for any minimax optimal algorithms, there exists an action whose regret is sublinear in $T$ but linear in the number of its active rounds.

Near-Optimal Regret Bounds for Sleeping Bandits via Advanced Bandit Techniques

Introduction

Sleeping bandits, a variant of the multi-armed bandit (MAB) problem, accommodate action sets that change from round to round. This flexibility captures a wide range of practical scenarios, from recommender systems to clinical trials, where the availability of options can vary over time. Despite the relevance of the setting, tight regret bounds, and per-action regret bounds in particular, have remained elusive. This paper addresses that gap by directly minimizing the per-action regret with enhanced versions of existing bandit strategies, yielding near-optimal regret bounds. We further extend these results to bandits with advice from sleeping experts, deriving new adaptive and tracking regret bounds with implications for the broader field of sequential decision-making under uncertainty.

Near-optimal Regret Upper Bounds

Previously, the best known upper bound on the per-action regret in sleeping bandits, O(K√(TA ln K)), exceeded the Ω(√(TA)) minimax lower bound by an extra multiplicative factor of K ln K, suggesting substantial room for improvement. By directly minimizing the per-action regret with modified versions of established algorithms, namely EXP3, EXP3-IX, and Follow-The-Regularized-Leader (FTRL) with Tsallis entropy, this work narrows the gap significantly. Specifically, we achieve bounds of order O(√(TA ln K)) and O(√(T√(AK))), improving upon the best known indirect bounds obtained by minimizing internal sleeping regret.
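
For reference, in the abstract's notation ($K$ total arms, at most $A$ available per round, $T$ rounds), the improvement reads:

$$ \underbrace{O\big(K\sqrt{TA\ln K}\big)}_{\text{previous best, via internal sleeping regret}} \;\longrightarrow\; \underbrace{O\big(\sqrt{TA\ln K}\big) \ \text{and}\ O\big(\sqrt{T\sqrt{AK}}\big)}_{\text{this work}}, \qquad \Omega\big(\sqrt{TA}\big) \ \text{(minimax lower bound)}. $$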

Advanced Strategies for Sleeping Bandits

The core of our methodological contribution lies in the derivation of SB-EXP3 (Sleeping Bandits using EXP3) and FTARL (Follow-the-Active-and-Regularized-Leader), two algorithms that perform robustly in the fully adversarial sleeping bandits scenario. SB-EXP3 leverages a novel decomposition technique for bounding the growth of the potential function, accommodating the variability of the active arms across rounds. FTARL, in turn, follows the classic FTRL template, adapted to the sleeping constraints and regularized with Tsallis entropy. Together, these algorithms provide tools for handling the complex dynamics inherent to sleeping bandits.
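
To make the flavor of these algorithms concrete, here is a minimal Python sketch of an EXP3-style update restricted to the active set, in the spirit of SB-EXP3. The fixed learning rate eta, the interfaces active_set_fn and loss_fn, and the uniform treatment of active arms are illustrative assumptions, not the paper's exact algorithm or tuning.

    import numpy as np

    def sb_exp3_sketch(K, T, eta, active_set_fn, loss_fn, rng=None):
        """EXP3-style sketch for sleeping bandits: sample only among the arms
        active this round and update with importance-weighted loss estimates.
        Illustrative only; not the paper's exact SB-EXP3."""
        rng = rng or np.random.default_rng(0)
        L_hat = np.zeros(K)                               # cumulative loss estimates, one per arm
        plays = []
        for t in range(T):
            active = np.array(sorted(active_set_fn(t)))   # adversary reveals the active set A_t
            w = np.exp(-eta * L_hat[active])              # exponential weights of active arms only
            p = w / w.sum()                               # sampling distribution over A_t
            idx = rng.choice(len(active), p=p)
            arm = int(active[idx])
            loss = loss_fn(t, arm)                        # observed loss of the pulled arm in [0, 1]
            L_hat[arm] += loss / p[idx]                   # importance-weighted loss estimate
            plays.append(arm)
        return plays

As a sanity check, setting active_set_fn = lambda t: range(K) makes every arm active in every round and recovers standard EXP3-style behavior.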

Generalizing to Bandits with Advice from Sleeping Experts

Extending the insights gained from sleeping bandits, we consider bandits that receive advice from intermittently available experts. By generalizing the EXP4 algorithm to account for sleeping experts, we obtain analogous guarantees in this setting. The resulting SE-EXP4 (Sleeping Experts version of EXP4) algorithm shows that the techniques developed for sleeping bandits translate directly to the broader class of expert-advised bandit problems.
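
As an illustration of how advice from intermittently available experts can be combined, the following Python sketch mixes only the experts awake in a round, each recommending a distribution over the currently active arms. The interfaces awake_fn, advice_fn, active_set_fn, and loss_fn are hypothetical placeholders, and the update is a plain EXP4-style importance-weighted step rather than the paper's exact SE-EXP4.

    import numpy as np

    def se_exp4_sketch(N, K, T, eta, awake_fn, advice_fn, active_set_fn, loss_fn, rng=None):
        """EXP4-style sketch with sleeping experts: weight the awake experts,
        mix their advice over the active arms, and charge each awake expert the
        importance-weighted loss of its recommendation. Illustrative only."""
        rng = rng or np.random.default_rng(0)
        G_hat = np.zeros(N)                                  # cumulative estimated losses per expert
        for t in range(T):
            active = np.array(sorted(active_set_fn(t)))      # active arms this round
            awake = np.array(sorted(awake_fn(t)))            # experts giving advice this round
            q = np.exp(-eta * G_hat[awake])
            q /= q.sum()                                     # weights over awake experts
            advice = np.array([advice_fn(t, e, active) for e in awake])  # shape (|awake|, |active|)
            p = q @ advice                                   # induced distribution over active arms
            idx = rng.choice(len(active), p=p)
            loss = loss_fn(t, int(active[idx]))
            ell_hat = np.zeros(len(active))
            ell_hat[idx] = loss / p[idx]                     # importance-weighted arm loss
            G_hat[awake] += advice @ ell_hat                 # estimated expected loss per awake expert
        return G_hat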

Implications for Adaptive and Tracking Regret

A crucial aspect of this work is the application of sleeping bandits-based approaches to obtain new proofs for adaptive and tracking regret bounds in standard (non-sleeping) bandit scenarios. By conceptualizing changes in the action set as changes in the availability of expert advice, we draw a direct parallel that enriches our understanding of adaptivity and tracking in sequential decision contexts. Through this lens, sleeping bandits are not merely a variant of the MAB problem but a framework through which the dynamics of learning and decision-making can be understood more holistically.
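
One concrete way to picture this reduction: create one sleeping expert per time interval, awake exactly on that interval, so that the per-expert regret of a sleeping-experts algorithm becomes regret on every interval simultaneously. The sketch below only builds the awake sets; it illustrates the mapping rather than the paper's construction, and a practical variant would restrict attention to far fewer intervals.

    def interval_experts(T):
        """One sleeping expert per interval [s, e): awake only on rounds s <= t < e.
        Plugging the resulting awake_fn into a sleeping-experts bandit algorithm
        (e.g., the se_exp4_sketch above) turns per-expert regret into adaptive,
        per-interval regret. Illustrative only."""
        intervals = [(s, e) for s in range(T) for e in range(s + 1, T + 1)]

        def awake_fn(t):
            return [i for i, (s, e) in enumerate(intervals) if s <= t < e]

        return intervals, awake_fn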

Directions for Future Research

While this paper makes significant strides toward optimal regret bounds in sleeping bandits and related settings, intriguing questions remain open. An important direction for future work is to determine whether the achieved O(√(TA ln K)) upper bound is minimax optimal or whether further refinement is possible. This inquiry may require new methodological approaches or deeper theoretical insights, potentially expanding the frontier of what is achievable in sleeping bandits and beyond.

Conclusion

This paper represents a significant step forward in the quest to minimize regret in the dynamically evolving environments captured by sleeping bandits. By refining and extending established bandit algorithms, we offer near-optimal solutions to longstanding challenges and open new avenues for research in adaptive learning and stochastic optimization.

Authors

  1. Quan Nguyen
  2. Nishant A. Mehta