
Multi-Armed Bandits with Abstention (2402.15127v1)

Published 23 Feb 2024 in cs.LG, cs.IT, math.IT, and stat.ML

Abstract: We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention. In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the stochastic instantaneous reward before observing it. When opting for abstention, the agent either suffers a fixed regret or gains a guaranteed reward. Given this added layer of complexity, we ask whether we can develop efficient algorithms that are both asymptotically and minimax optimal. We answer this question affirmatively by designing and analyzing algorithms whose regrets meet their corresponding information-theoretic lower bounds. Our results offer valuable quantitative insights into the benefits of the abstention option, laying the groundwork for further exploration in other online decision-making problems with such an option. Numerical results further corroborate our theoretical findings.
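
To make the protocol concrete, below is a minimal Python sketch of the interaction loop the abstract describes, in the variant where abstention yields a guaranteed reward. The problem instance (Bernoulli arms, guaranteed reward `r0`) and the abstention rule (a UCB1-style heuristic that abstains whenever even the optimistic index of the chosen arm falls below `r0`) are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: Bernoulli arms with means unknown to the agent.
means = np.array([0.3, 0.5, 0.7])
r0 = 0.55          # guaranteed reward obtained when abstaining (assumed value)
K, T = len(means), 10_000

counts = np.zeros(K, dtype=int)
sums = np.zeros(K)
total_reward = 0.0

for t in range(1, T + 1):
    if t <= K:                      # pull each arm once to initialize
        arm, abstain = t - 1, False
    else:
        # UCB1-style optimistic index for each arm.
        index = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
        arm = int(np.argmax(index))
        # Illustrative abstention rule (not the paper's algorithm):
        # abstain when even the optimistic estimate is below r0.
        # The decision is made *before* the reward is observed.
        abstain = index[arm] < r0

    if abstain:
        total_reward += r0          # guaranteed reward; nothing is observed
    else:
        reward = rng.binomial(1, means[arm])  # stochastic reward, now observed
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```

Note that when the agent abstains, no reward is observed and the empirical statistics are left unchanged, matching the setting in which the abstention decision precedes observation of the stochastic reward.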
