
Best-of-Both-Worlds Linear Contextual Bandits (2312.16489v1)

Published 27 Dec 2023 in cs.LG, cs.AI, econ.EM, stat.ME, and stat.ML

Abstract: This study investigates the problem of $K$-armed linear contextual bandits, an instance of the multi-armed bandit problem, under adversarial corruption. At each round, a decision-maker observes an independent and identically distributed context and then selects an arm based on the context and past observations. After selecting an arm, the decision-maker incurs the loss corresponding to that arm. The decision-maker aims to minimize the cumulative loss over the trial. The goal of this study is to develop a strategy that is effective in both stochastic and adversarial environments, with theoretical guarantees. We first formulate the problem by introducing a novel setting of bandits with adversarial corruption, referred to as the contextual adversarial regime with a self-bounding constraint. We assume linear models for the relationship between the loss and the context. Then, we propose a strategy that extends RealLinExp3 by Neu & Olkhovskaya (2020) and Follow-The-Regularized-Leader (FTRL). The regret of our proposed algorithm is shown to be upper-bounded by $O\left(\min\left\{\frac{(\log(T))^3}{\Delta_{*}} + \sqrt{\frac{C(\log(T))^3}{\Delta_{*}}},\ \sqrt{T}(\log(T))^2\right\}\right)$, where $T \in \mathbb{N}$ is the number of rounds, $\Delta_{*} > 0$ is the constant minimum gap between the best and suboptimal arms for any context, and $C \in [0, T]$ is an adversarial corruption parameter. This regret upper bound implies $O\left(\frac{(\log(T))^3}{\Delta_{*}}\right)$ regret in a stochastic environment and $O\left(\sqrt{T}(\log(T))^2\right)$ regret in an adversarial environment. We refer to our strategy as the Best-of-Both-Worlds (BoBW) RealFTRL, due to its theoretical guarantees in both stochastic and adversarial regimes.
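To make the setting concrete, the loop below is a minimal illustrative sketch of an FTRL-style (exponential-weights) strategy for $K$-armed bandits with linear loss models. It is not the paper's BoBW RealFTRL: the paper uses Matrix Geometric Resampling loss estimators and an adaptive learning rate, whereas this sketch uses a plain importance-weighted estimator, a fixed learning rate, and a hypothetical synthetic environment (`theta`, sigmoid-bounded losses) chosen only so the example is self-contained and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, T = 3, 5, 2000
# Hypothetical true loss parameters, one vector per arm (stochastic environment).
theta = rng.normal(size=(K, d)) / np.sqrt(d)

# FTRL with a negative-entropy regularizer reduces to exponential weights:
# the arm distribution is a softmax of the negated cumulative loss estimates.
eta = np.sqrt(2.0 * np.log(K) / (K * T))  # standard EXP3-style learning rate
cum_loss_est = np.zeros(K)                # cumulative importance-weighted losses

total_loss = 0.0
per_arm_loss = np.zeros(K)  # realized loss of every arm, for hindsight comparison
for t in range(T):
    x = rng.normal(size=d) / np.sqrt(d)       # i.i.d. context
    losses = 1.0 / (1.0 + np.exp(-(theta @ x)))  # losses bounded in (0, 1)

    w = np.exp(-eta * cum_loss_est)
    p = w / w.sum()                            # FTRL arm distribution
    a = rng.choice(K, p=p)                     # play an arm

    total_loss += losses[a]
    cum_loss_est[a] += losses[a] / p[a]        # unbiased importance-weighted estimate
    per_arm_loss += losses

# Regret against the best fixed arm in hindsight; for EXP3-type methods this
# scales as O(sqrt(T K log K)), far below the trivial bound of T.
regret = total_loss - per_arm_loss.min()
print(round(float(regret), 2))
```

The self-bounding ingredient that the paper adds on top of such a scheme (and that this sketch omits) is what lets the same algorithm adapt to the gap $\Delta_{*}$ and achieve polylogarithmic regret when the environment happens to be stochastic.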

References (38)
  1. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NeurIPS), 2011.
  2. Associative reinforcement learning using linear probabilistic concepts. In International Conference on Machine Learning (ICML), 1999.
  3. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory (COLT), 2016.
  4. Online decision making with high-dimensional covariates. Operations Research, 68(1), 2020.
  5. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
  6. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory (COLT), 2012.
  7. Efficient and robust high-dimensional linear contextual bandits. In International Joint Conference on Artificial Intelligence (IJCAI), 2020.
  8. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
  9. Stochastic linear optimization under bandit feedback. In Annual Conference Computational Learning Theory (COLT), 2008.
  10. Best of both worlds policy optimization. In International Conference on Machine Learning (ICML), 2023.
  11. Robust stochastic linear contextual bandits under adversarial attacks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.
  12. A linear response bandit problem. Stochastic Systems, 2013.
  13. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory (COLT), 2019.
  14. Online learning with low rank experts. In Conference on Learning Theory (COLT), 2016.
  15. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  16. Nearly optimal best-of-both-worlds algorithms for online learning with feedback graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  17. Improved best-of-both-worlds guarantees for multi-armed bandits: Ftrl with general regularizers and multiple optimal arms. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  18. Learning hurdles for sleeping experts. ACM Transactions on Computation Theory, 6(3), 2014.
  19. Best-of-three-worlds analysis for linear bandits with follow-the-regularized-leader algorithm. In Conference on Learning Theory (COLT), 2023.
  20. The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
  21. Achieving near instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously. In International Conference on Machine Learning (ICML), 2021.
  22. Regret lower bound and optimal algorithm for high-dimensional contextual linear bandit. Electronic Journal of Statistics, 15(2):5652 – 5695, 2021.
  23. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web (WWW), 2010.
  24. Bypassing the simulator: Near-optimal adversarial linear contextual bandits. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  25. Competitive caching with machine learned advice. In International Conference on Machine Learning (ICML), 2018.
  26. Efficient and robust algorithms for adversarial linear contextual bandits. In Conference on Learning Theory (COLT), 2020.
  27. Bistro: An efficient relaxation-based method for contextual bandits. In International Conference on Machine Learning (ICML), 2016.
  28. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  29. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory (COLT), 2017.
  30. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning (ICML), 2014.
  31. Improved regret bounds for oracle-based adversarial contextual bandits. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
  32. From ads to interventions: Contextual bandits in mobile health. In Mobile Health: Sensors, Analytic Methods, and Applications, pp.  495–517, 2017.
  33. Best-of-both-worlds algorithms for partial monitoring. In International Conference on Algorithmic Learning Theory (ALT), 2023a.
  34. Stability-penalty-adaptive follow-the-regularized-leader: Sparsity, game-dependency, and best-of-both-worlds. In Advances in Neural Information Processing Systems (NeurIPS), 2023b.
  35. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning (ICML), 2018.
  36. More adaptive algorithms for adversarial bandits. In Conference on Learning Theory (COLT), 2018.
  37. Linear contextual bandits with adversarial corruptions, 2021. URL https://openreview.net/forum?id=Wz-t1oOTWa.
  38. Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22(1), 2021.
Authors (2)
  1. Masahiro Kato (49 papers)
  2. Shinji Ito (31 papers)