Finite-Time Logarithmic Bayes Regret Upper Bounds (2306.09136v3)
Abstract: We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits. In a multi-armed bandit, we obtain $O(c_\Delta \log n)$ and $O(c_h \log^2 n)$ upper bounds for an upper confidence bound algorithm, where $c_h$ and $c_\Delta$ are constants depending on the prior distribution and the gaps of bandit instances sampled from it, respectively. The latter bound asymptotically matches the lower bound of Lai (1987). Our proofs are a major technical departure from prior works, while being simple and general. To show the generality of our techniques, we apply them to linear bandits. Our results provide insights on the value of the prior in the Bayesian setting, both in the objective and as side information given to the learner. They significantly improve upon the existing $\tilde{O}(\sqrt{n})$ bounds, which have become standard in the literature despite the logarithmic lower bound of Lai (1987).
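To make the setting concrete, the sketch below shows a generic prior-informed (Bayesian) UCB rule on a Gaussian multi-armed bandit with a known Gaussian prior over arm means. This is only an illustration of the kind of algorithm the abstract refers to, not the exact algorithm analyzed in the paper: the Gaussian model, the $\sqrt{2\hat{\sigma}_t^2 \log t}$ confidence bonus, and the function name `bayes_ucb` are all assumptions made for the example.

```python
# Minimal Bayesian UCB sketch (illustrative assumptions, not the paper's exact algorithm):
# Gaussian arms, Gaussian prior over arm means, posterior-width confidence bonus.
import numpy as np

def bayes_ucb(mu0, sigma0, sigma, n, rng=None):
    """Run n rounds of a Bayesian UCB rule on len(mu0) arms.

    mu0, sigma0 : prior means and standard deviations of the arm means
    sigma       : known observation noise standard deviation
    """
    rng = np.random.default_rng(rng)
    theta = rng.normal(mu0, sigma0)            # bandit instance sampled from the prior

    # Posterior parameters start at the prior.
    post_mean = np.array(mu0, dtype=float)
    post_var = np.array(sigma0, dtype=float) ** 2

    regret = 0.0
    for t in range(1, n + 1):
        # UCB index: posterior mean plus a bonus proportional to the posterior width.
        # The sqrt(2 * var * log t) schedule is one common choice (an assumption here).
        ucb = post_mean + np.sqrt(2.0 * post_var * np.log(t + 1))
        arm = int(np.argmax(ucb))

        reward = rng.normal(theta[arm], sigma)
        regret += theta.max() - theta[arm]

        # Conjugate Gaussian posterior update for the pulled arm.
        precision = 1.0 / post_var[arm] + 1.0 / sigma**2
        post_mean[arm] = (post_mean[arm] / post_var[arm] + reward / sigma**2) / precision
        post_var[arm] = 1.0 / precision
    return regret

# Example: 5 arms, shared prior N(0, 1), unit noise, horizon 10,000.
print(bayes_ucb(mu0=np.zeros(5), sigma0=np.ones(5), sigma=1.0, n=10_000, rng=0))
```

Because the prior is known to the learner, the bonus shrinks with the posterior variance rather than with a worst-case confidence width, which is the mechanism behind prior-dependent constants such as $c_\Delta$ and $c_h$ in the bounds above.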
- Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pages 2312–2320, 2011.
- Linear Thompson sampling revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
- Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, pages 39.1–39.26, 2012.
- Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pages 127–135, 2013.
- Mixed-effect Thompson sampling. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, 2023.
- Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
- Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331, 1995.
- Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
- Meta dynamic pricing: Transfer learning across experiments. CoRR, abs/1902.10918, 2019. URL https://arxiv.org/abs/1902.10918.
- No regrets for learning the prior in bandits. In Advances in Neural Information Processing Systems 34, 2021.
- Christopher Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.
- An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257, 2012.
- Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pages 355–366, 2008.
- The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.
- John Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), 41:148–177, 1979.
- Latent bandits revisited. In Advances in Neural Information Processing Systems 33, 2020.
- Hierarchical Bayesian bandits. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 2022.
- On Bayesian upper confidence bounds for bandit problems. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 592–600, 2012.
- Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems 28, pages 1297–1305, 2015.
- Meta-Thompson sampling. In Proceedings of the 38th International Conference on Machine Learning, 2021.
- Tze Leung Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, 15(3):1091–1114, 1987.
- Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- Bandit Algorithms. Cambridge University Press, 2019.
- A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 2010.
- Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.
- Collaborative filtering bandits. In Proceedings of the 39th Annual International ACM SIGIR Conference, 2016.
- Xiuyuan Lu and Benjamin Van Roy. Information-theoretic confidence bounds for reinforcement learning. In Advances in Neural Information Processing Systems 32, 2019.
- On the sub-Gaussianity of the beta and Dirichlet distributions. Electronic Communications in Probability, 22:1–14, 2017.
- Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
- Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(68):1–30, 2016.
- A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
- Bayesian decision-making under misspecified priors with applications to meta-learning. In Advances in Neural Information Processing Systems 34, 2021.
- William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
- John Tsitsiklis. A short proof of the Gittins index theorem. The Annals of Applied Probability, 4(1):194–199, 1994.
- Instance-optimality in interactive decision making: Toward a non-asymptotic theory. In Proceedings of the 36th Annual Conference on Learning Theory, 2023.
- Multitask bandit learning through heterogeneous feedback aggregation. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, 2021.
- Interactive collaborative filtering. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 1411–1420, 2013.
Authors: Alexia Atsidakou, Branislav Kveton, Sumeet Katariya, Constantine Caramanis, Sujay Sanghavi