Optimal $\delta$-Correct Best-Arm Selection for Heavy-Tailed Distributions (1908.09094v3)
Abstract: Given a finite set of unknown distributions, or arms, that can be sampled, we consider the problem of identifying the one with the maximum mean using a $\delta$-correct algorithm (an adaptive, sequential algorithm that restricts the probability of error to a specified $\delta$) that has minimum sample complexity. Lower bounds for $\delta$-correct algorithms are well known. $\delta$-correct algorithms that match the lower bound asymptotically as $\delta$ reduces to zero have been previously developed when arm distributions are restricted to a single-parameter exponential family. In this paper, we first observe a negative result showing that some restriction is essential: otherwise, under a $\delta$-correct algorithm, distributions with unbounded support would require an infinite number of samples in expectation. We then propose a $\delta$-correct algorithm that matches the lower bound as $\delta$ reduces to zero under the mild restriction that a known bound on the $(1+\epsilon)^{\text{th}}$ moment of the underlying random variables exists, for some $\epsilon > 0$. We also propose batch processing and identify near-optimal batch sizes to speed up the proposed algorithm substantially. The best-arm problem has many learning applications, including recommendation systems and product selection. It is also a well-studied classic problem in the simulation community.
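To make the problem setup concrete, below is a minimal, illustrative sketch of a $\delta$-correct-style sampling loop for heavy-tailed arms. It is not the paper's asymptotically optimal algorithm: the truncated-mean estimator, the confidence radii of order $n^{-\epsilon/(1+\epsilon)}$, and the helper names (`truncated_mean`, `best_arm_sketch`) are illustrative assumptions, chosen only to show how a known bound $B$ on the $(1+\epsilon)^{\text{th}}$ moment lets one estimate means robustly and stop once one arm's lower confidence bound clears every other arm's upper bound.

```python
import numpy as np


def truncated_mean(samples, eps, B, delta):
    """Truncated empirical mean for heavy-tailed data, given a bound B on
    E|X|^(1+eps). Truncation levels grow with the sample index; this is a
    standard robust-estimation device, not the paper's exact estimator."""
    n = len(samples)
    idx = np.arange(1, n + 1)
    levels = (B * idx / np.log(2.0 / delta)) ** (1.0 / (1.0 + eps))
    clipped = np.where(np.abs(samples) <= levels, samples, 0.0)
    return clipped.mean()


def best_arm_sketch(arms, delta=0.05, eps=1.0, B=2.0, batch=100,
                    max_samples=200_000, rng=None):
    """Illustrative loop: sample arms in batches, maintain robust mean
    estimates and heuristic confidence radii, and stop when one arm's
    lower bound exceeds every other arm's upper bound. The radii below
    are placeholders, not the paper's optimal stopping thresholds."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(arms)
    data = [np.empty(0) for _ in range(K)]
    while sum(len(d) for d in data) < max_samples:
        for k in range(K):
            data[k] = np.concatenate([data[k], arms[k](batch, rng)])
        n = np.array([len(d) for d in data], dtype=float)
        mu = np.array([truncated_mean(data[k], eps, B, delta) for k in range(K)])
        # heuristic deviation bound of order n^{-eps/(1+eps)} for (1+eps)-moment tails
        rad = 4.0 * B ** (1.0 / (1.0 + eps)) * \
            (np.log(4.0 * K * n ** 2 / delta) / n) ** (eps / (1.0 + eps))
        best = int(np.argmax(mu))
        if all(mu[best] - rad[best] > mu[k] + rad[k] for k in range(K) if k != best):
            return best, n.sum()
    return int(np.argmax(mu)), n.sum()


if __name__ == "__main__":
    # three heavy-tailed (Pareto-based) arms with shifted means 0.0, 0.2, 0.5
    def pareto_arm(shift):
        return lambda m, rng: shift + (rng.pareto(3.0, size=m) - 0.5)

    arms = [pareto_arm(0.0), pareto_arm(0.2), pareto_arm(0.5)]
    print(best_arm_sketch(arms, delta=0.05))
```

With $\epsilon = 1$ the moment condition reduces to a known bound on the second moment; the Pareto arms above satisfy it, while their tails are heavy enough that plain sub-Gaussian confidence intervals would not apply.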