The Survival Bandit Problem (2206.03019v4)

Published 7 Jun 2022 in cs.LG and stat.ML

Abstract: We introduce and study a new variant of the multi-armed bandit problem (MAB), called the survival bandit problem (S-MAB). While the objective in both problems is to maximize the so-called cumulative reward, in this new variant the procedure is interrupted if the cumulative reward falls below a preset threshold. This simple yet unexplored extension of the MAB arises in many practical applications. For example, when testing two medicines against each other on voluntary patients, people's health is at stake, and it is necessary to be able to interrupt the experiment if serious side effects occur or if the treatment does not alleviate the disease symptoms. From a theoretical perspective, the S-MAB is the first variant of the MAB in which the procedure may or may not be interrupted. We start by formalizing the S-MAB, and we define its objective as the minimization of the so-called survival regret, which naturally generalizes the regret of the MAB. We then show that the objective of the S-MAB is considerably harder than that of the MAB, in the sense that, contrary to the MAB, no policy can achieve a reasonably small (i.e., sublinear) survival regret. Instead, we minimize the survival regret in the sense of Pareto, i.e., we seek a policy whose cumulative reward cannot be improved for some problem instance without being sacrificed for another one. To that end, we identify two key components of the survival regret: the regret given no ruin (which corresponds to the regret in the MAB) and the probability that the procedure is interrupted, called the probability of ruin. We derive a lower bound on the probability of ruin, as well as policies whose probability of ruin matches the lower bound. Finally, based on a doubling trick applied to those policies, we derive a policy which minimizes the survival regret in the sense of Pareto, giving an answer to an open problem by Perotto et al. (COLT 2019).
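
The setting sketched in the abstract is straightforward to simulate. Below is a minimal illustrative sketch, not the paper's algorithm: the two-armed environment with ±1 rewards, the initial budget, the ruin threshold, and the use of plain UCB1 as the learner are all assumptions made here for illustration. The sketch only shows the interruption rule (ruin when the cumulative reward falls below the threshold) and how the two components of the survival regret, the probability of ruin and the regret given no ruin, can be estimated empirically.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_episode(means, horizon, threshold, start_budget):
    """Play UCB1 on {-1, +1} rewards; stop early (ruin) if the cumulative
    reward ever falls below `threshold`."""
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)               # rewards rescaled to [0, 1] for UCB1
    cumulative = start_budget
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:                   # pull each arm once to initialize
            arm = t - 1
        else:                        # standard UCB1 index
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        # reward in {-1, +1} with mean means[arm]
        reward = 1.0 if rng.random() < (1.0 + means[arm]) / 2.0 else -1.0
        counts[arm] += 1
        sums[arm] += (reward + 1.0) / 2.0
        cumulative += reward
        total += reward
        if cumulative < threshold:   # the procedure is interrupted (ruin)
            return total, True
    return total, False

means = [0.1, -0.05]                 # two arms, best expected reward 0.1 per round
horizon, threshold, budget = 5_000, 0.0, 20.0
episodes = [run_episode(means, horizon, threshold, budget) for _ in range(200)]

ruined = sum(1 for _, was_ruined in episodes if was_ruined)
print(f"estimated probability of ruin: {ruined / len(episodes):.2f}")

# regret given no ruin: expected reward of always playing the best arm
# minus the realized reward, averaged over the surviving runs
survivor_regrets = [max(means) * horizon - total
                    for total, was_ruined in episodes if not was_ruined]
if survivor_regrets:
    print(f"mean regret given no ruin: {np.mean(survivor_regrets):.1f}")
```

Under these assumptions, a larger initial budget or a more conservative policy lowers the estimated probability of ruin at the cost of a larger regret given no ruin, which is the trade-off the survival regret captures.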

References (63)
  1. Improved algorithms for linear stochastic bandits. In NeurIPS, pages 2312–2320, 2011.
  2. S. Agrawal and N.R. Devanur. Linear contextual bandits with knapsacks. In NeurIPS, pages 3458–3467, 2016.
  3. S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT, pages 1–39, 2012.
  4. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
  5. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
  6. On multi-armed bandit designs for dose-finding clinical trials. Journal of Machine Learning Research, 22:1–38, 2021.
  7. Bandits with knapsacks. In FOCS, pages 1–55, 2013.
  8. S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 2012.
  9. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.
  10. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.
  11. A general approach to multi-armed bandits under risk criteria. In COLT, pages 1295–1306, 2018.
  12. Budget-constrained bandits over general cost and reward distributions. In AISTATS, pages 4388–4398, 2020.
  13. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  14. N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, pages 1404–1422, 2012.
  15. N. Cesa-Bianchi and F. Orabona. Online Learning Algorithms. Annual Review of Statistics and Its Application, 2021.
  16. O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In NeurIPS, pages 2249–2257, 2011.
  17. Strategies for safe multi-armed bandits with logarithmic regret and risk. In ICML, pages 3123–3148, 2022.
  18. Bandits with budgets: Regret lower bounds and optimal algorithms. In ACM SIGMETRICS Performance Evaluation Review, pages 245–257, 2015.
  19. A. Dembo and O. Zeitouni. Large Deviation Techniques and Applications. Springer Science, 2009.
  20. Multi-armed bandit with budget constraint and variable costs. In AAAI, pages 232–238, 2013.
  21. US Food and Drug Administration. Adaptive designs for clinical trials of drugs and biologics. Guidance for Industry, 2019.
  22. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In ACML, pages 245–260, 2013.
  23. On explore-then-commit strategies. In NeurIPS, pages 784–792, 2016.
  24. Optimal contextual bandits with knapsacks under realizability via regression oracles. In AISTATS, pages 5011–5035, 2023.
  25. Portfolio allocation for Bayesian optimization. In UAI, pages 327–336, 2011.
  26. X. Huo and F. Fu. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society Open Science, 2017.
  27. Adversarial bandits with knapsacks. In FOCS, pages 202–219, 2019.
  28. Online learning with vector costs and bandits with knapsacks. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2286–2305. PMLR, 2020.
  29. Thompson sampling for 1-dimensional exponential family bandits. In NeurIPS, pages 1448–1456, 2013.
  30. R. Kumar and R. Kleinberg. Non-monotonic resource utilization in the bandits with knapsacks problem. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 19248–19259. Curran Associates, Inc., 2022.
  31. T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
  32. T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge University Press, 2020.
  33. Z. Li and G. Stoltz. Contextual bandits with knapsacks for a conversion model. In Advances in Neural Information Processing Systems, volume 35, pages 35590–35602, 2022.
  34. Combinatorial bandits with linear constraints: Beyond knapsacks and fairness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 2997–3010. Curran Associates, Inc., 2022a.
  35. Non-stationary bandits with knapsacks. In Advances in Neural Information Processing Systems, volume 35, pages 16522–16532, 2022b.
  36. O-A. Maillard. Robust risk-averse stochastic multi-armed bandits. In Algorithmic Learning Theory, pages 218–233, 2013.
  37. Simple modification of the upper confidence bound algorithm by generalized weighted averages. arXiv preprint, 2023.
  38. Open problem: Risk of ruin in multiarmed bandits. In COLT, pages 3194–3197, 2019.
  39. Time is budget: A heuristic for reducing the risk of ruin in multi-armed gambler bandits. In International Conference on Innovative Techniques and Applications of Artificial Intelligence, pages 346–352, 2022.
  40. Unifying the stochastic and the adversarial bandits with knapsack. In IJCAI, pages 3311–3317, 2019.
  41. C. Riou and J. Honda. Bandit algorithms based on Thompson sampling for bounded reward distributions. In Algorithmic Learning Theory, pages 777–826, 2020.
  42. Risk-aversion in multi-armed bandits. In NeurIPS, pages 3284–3292, 2012.
  43. K. A. Sankararaman and A. Slivkins. Combinatorial semi-bandits with knapsacks. In AISTATS, pages 1760–1770, 2018.
  44. K. A. Sankararaman and A. Slivkins. Bandits with knapsacks beyond the worst case. In NeurIPS, pages 23191–23204, 2021.
  45. W. Shen and J. Wang. Portfolio blending via Thompson sampling. In IJCAI, pages 1983–1989, 2016.
  46. Portfolio choices with orthogonal bandit learning. In IJCAI, pages 974–980, 2015.
  47. Robert H. Shmerling. Are early detection and treatment always best? Harvard Health Publishing, 2021.
  48. Multi-armed bandits with cost subsidy. In AISTATS, pages 3016–3024, 2021.
  49. Smoothed adversarial linear contextual bandits with knapsacks. In ICML, pages 20253–20277, 2022.
  50. ϵ-first policies for budget-limited multi-armed bandits. In AAAI, pages 1211–1216, 2010.
  51. Knapsack based optimal policies for budget-limited multi-armed bandits. In AAAI, pages 1134–1140, 2012.
  52. S. Vakili and Q. Zhao. Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 10:1093–1111, 2016.
  53. Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statistical science, 30:199–215, 2015.
  54. Bandit problems with side observations. IEEE Transactions on Automatic Control, pages 338–355, 2005.
  55. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 433–441. Curran Associates, Inc., 2015.
  56. Conservative bandits. In ICML, pages 1254–1262, 2016.
  57. Budgeted bandit problems with continuous random costs. In ACML, pages 317–332, 2015a.
  58. Thompson sampling for budgeted multi-armed bandits. In IJCAI, pages 3960–3966, 2015b.
  59. Budgeted multi-armed bandits with multiple plays. In IJCAI, pages 773–818, 2016.
  60. Finite budget analysis of multi-armed bandit problems. Neurocomputing, 258:13–29, 2017.
  61. D. Zhou and C. Tomlin. Budget-constrained multi-armed bandits with multiple plays. In AAAI, pages 4572–4579, 2018.
  62. Q. Zhu and V. Tan. Thompson sampling algorithms for mean-variance bandits. In ICML, pages 11599–11608, 2020.
  63. Generalized risk-aversion in stochastic multi-armed bandits. arXiv preprint, 2014.