Optimal Best-Arm Identification in Bandits with Access to Offline Data (2306.09048v1)
Abstract: Learning paradigms based purely on offline data as well as those based solely on sequential online learning have been well-studied in the literature. In this paper, we consider combining offline data with online learning, an area less studied but of obvious practical importance. We consider the stochastic $K$-armed bandit problem, where our goal is to identify the arm with the highest mean in the presence of relevant offline data, with confidence $1-\delta$. We conduct a lower bound analysis on policies that provide such $1-\delta$ probabilistic correctness guarantees. We develop algorithms that match the lower bound on sample complexity when $\delta$ is small. Our algorithms are computationally efficient with an average per-sample acquisition cost of $\tilde{O}(K)$, and rely on a careful characterization of the optimality conditions of the lower bound problem.
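To make the problem setup concrete, here is a minimal sketch of $\delta$-correct best-arm identification seeded with offline data, assuming unit-variance Gaussian rewards. It is an illustration of the setting, not the paper's algorithm: the `pull` callback, the least-sampled sampling rule, and the heuristic stopping threshold are simplifying assumptions, whereas the paper derives the optimal allocation from the optimality conditions of the lower bound problem.

```python
import numpy as np

def glr_stat(counts, sums, sigma2=1.0):
    """GLR statistic between the empirical best arm and its closest
    challenger, for Gaussian rewards with known variance sigma2."""
    means = sums / counts
    best = int(np.argmax(means))
    stat = np.inf
    for a in range(len(counts)):
        if a == best:
            continue
        # The common mean minimizing the likelihood ratio is the
        # count-weighted average of the two empirical means.
        pooled = (sums[best] + sums[a]) / (counts[best] + counts[a])
        s = (counts[best] * (means[best] - pooled) ** 2
             + counts[a] * (means[a] - pooled) ** 2) / (2.0 * sigma2)
        stat = min(stat, s)
    return best, stat

def bai_with_offline(pull, K, offline_counts, offline_sums, delta,
                     max_online=10**6):
    """delta-correct best-arm identification seeded with offline data.

    pull(a) draws one fresh online sample from arm a.  Offline samples
    enter the sufficient statistics exactly like online ones; only the
    online pulls contribute to the reported sample complexity.
    """
    counts = np.asarray(offline_counts, dtype=float).copy()
    sums = np.asarray(offline_sums, dtype=float).copy()
    n_online = 0
    for a in range(K):                     # ensure every arm is observed once
        if counts[a] == 0:
            sums[a] += pull(a); counts[a] += 1; n_online += 1
    while n_online < max_online:
        best, stat = glr_stat(counts, sums)
        # Heuristic threshold; the paper uses a sharper delta-correct one.
        if stat > np.log((1.0 + np.log(counts.sum())) / delta):
            return best, n_online
        a = int(np.argmin(counts))         # crude rule: pull least-sampled arm
        sums[a] += pull(a); counts[a] += 1; n_online += 1
    return int(np.argmax(sums / counts)), n_online
```

Offline samples enter only through the initial counts and sums: the more informative they are about the gap between the best arm and its challengers, the sooner the GLR statistic crosses the threshold and the fewer online pulls are charged, which is the regime the paper's lower bound quantifies as $\delta$ becomes small.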