
Nash Regret Guarantees for Linear Bandits (2310.02023v1)

Published 3 Oct 2023 in cs.LG and cs.GT

Abstract: We obtain essentially tight upper bounds for a strengthened notion of regret in the stochastic linear bandits framework. The strengthening -- referred to as Nash regret -- is defined as the difference between the (a priori unknown) optimum and the geometric mean of expected rewards accumulated by the linear bandit algorithm. Since the geometric mean corresponds to the well-studied Nash social welfare (NSW) function, this formulation quantifies the performance of a bandit algorithm as the collective welfare it generates across rounds. NSW is known to satisfy fairness axioms and, hence, an upper bound on Nash regret provides a principled fairness guarantee. We consider the stochastic linear bandits problem over a horizon of $T$ rounds and with a set of arms $\mathcal{X}$ in ambient dimension $d$. Furthermore, we focus on settings in which the stochastic reward -- associated with each arm in $\mathcal{X}$ -- is a non-negative, $\nu$-sub-Poisson random variable. For this setting, we develop an algorithm that achieves a Nash regret of $O\left( \sqrt{\frac{d\nu}{T}} \log( T |\mathcal{X}|)\right)$. In addition, addressing linear bandit instances in which the set of arms $\mathcal{X}$ is not necessarily finite, we obtain a Nash regret upper bound of $O\left( \frac{d^{5/4}\nu^{1/2}}{\sqrt{T}} \log(T)\right)$. Since bounded random variables are sub-Poisson, these results hold for bounded, positive rewards. Our linear bandit algorithm is built upon the successive elimination method with novel technical insights, including tailored concentration bounds and the use of sampling via John ellipsoid in conjunction with the Kiefer-Wolfowitz optimal design.
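To make the regret notion concrete, here is a minimal Python sketch (not taken from the paper; the function names and the toy reward sequences are illustrative assumptions) contrasting Nash regret, which compares the optimum against the geometric mean of expected rewards, with the usual time-averaged regret, which uses the arithmetic mean:

```python
import numpy as np

def nash_regret(expected_rewards, optimum):
    """Nash regret: gap between the optimum and the geometric mean of the
    expected rewards collected over the horizon. The geometric mean is the
    Nash social welfare (NSW) of the reward sequence."""
    r = np.asarray(expected_rewards, dtype=float)
    # Geometric mean via log-space averaging; rewards are assumed strictly
    # positive, matching the bounded, positive-reward setting in the paper.
    geometric_mean = np.exp(np.mean(np.log(r)))
    return optimum - geometric_mean

def average_regret(expected_rewards, optimum):
    """Standard time-averaged regret uses the arithmetic mean instead."""
    return optimum - np.mean(expected_rewards)

# Toy example: two reward sequences with the same arithmetic mean (0.8).
optimum = 1.0
steady = [0.8, 0.8, 0.8, 0.8]
spiky = [1.0, 1.0, 1.0, 0.2]

print(average_regret(steady, optimum), nash_regret(steady, optimum))  # 0.2, 0.2
print(average_regret(spiky, optimum), nash_regret(spiky, optimum))    # 0.2, ~0.33
```

Although both sequences have the same arithmetic mean, the one containing a near-zero round incurs a noticeably larger Nash regret; this sensitivity to low-reward rounds is the fairness property that the NSW objective captures.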

Authors (3)
  1. Ayush Sawarni (5 papers)
  2. Soumyabrata Pal (1 paper)
  3. Siddharth Barman (65 papers)
Citations (3)