Only Pay for What Is Uncertain: Variance-Adaptive Thompson Sampling (2303.09033v2)
Abstract: Most bandit algorithms assume that the reward variances, or upper bounds on them, are known and identical across arms. This naturally leads to suboptimal performance and higher regret when the variances are overestimated. On the other hand, underestimating the reward variances may cause linear regret, because the algorithm commits early to a suboptimal arm. This motivated prior work on variance-adaptive frequentist algorithms, which attain strong instance-dependent regret bounds but cannot incorporate prior knowledge of the reward variances. We lay foundations for the Bayesian setting, which does incorporate such knowledge. This results in lower regret in practice, because the prior is used in the algorithm design, and also in improved regret guarantees. Specifically, we study Gaussian bandits with unknown heterogeneous reward variances and develop a Thompson sampling algorithm with prior-dependent Bayes regret bounds. The regret decreases with lower reward variances and more informative priors on them, which is precisely why we pay only for what is uncertain. This is the first result of its kind. Finally, we corroborate our theory with extensive experiments, which show the superiority of our variance-adaptive Bayesian algorithm over prior frequentist approaches. We also show that our approach is robust to model misspecification and can be applied with estimated priors.
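To make the setting concrete, below is a minimal sketch of variance-adaptive Thompson sampling for Gaussian bandits, not the paper's exact algorithm or notation. It assumes a per-arm Normal-Inverse-Gamma conjugate prior, the standard choice when both the mean and the variance of a Gaussian reward are unknown; all class names, hyperparameters, and defaults here are illustrative.

```python
import numpy as np

class VarianceAdaptiveTS:
    """Thompson sampling for Gaussian bandits with unknown, heterogeneous
    reward variances (illustrative sketch). Per-arm conjugate prior:
        sigma_i^2 ~ InvGamma(alpha_i, beta_i),
        mu_i | sigma_i^2 ~ N(m_i, sigma_i^2 / kappa_i).
    """

    def __init__(self, n_arms, m0=0.0, kappa0=1.0, alpha0=2.0, beta0=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.m = np.full(n_arms, m0)          # posterior mean of mu_i
        self.kappa = np.full(n_arms, kappa0)  # pseudo-count on the mean
        self.alpha = np.full(n_arms, alpha0)  # InvGamma shape for sigma_i^2
        self.beta = np.full(n_arms, beta0)    # InvGamma scale for sigma_i^2

    def select_arm(self):
        # Sample sigma_i^2 from its InvGamma posterior (reciprocal of a
        # Gamma(alpha, rate=beta) sample), then mu_i from its conditional
        # Gaussian posterior; play the arm with the highest sampled mean.
        sigma2 = 1.0 / self.rng.gamma(self.alpha, 1.0 / self.beta)
        mu = self.rng.normal(self.m, np.sqrt(sigma2 / self.kappa))
        return int(np.argmax(mu))

    def update(self, arm, reward):
        # Standard one-observation Normal-Inverse-Gamma posterior update.
        k, m = self.kappa[arm], self.m[arm]
        self.kappa[arm] = k + 1.0
        self.m[arm] = (k * m + reward) / (k + 1.0)
        self.alpha[arm] += 0.5
        self.beta[arm] += k * (reward - m) ** 2 / (2.0 * (k + 1.0))
```

In this sketch, an arm whose variance posterior concentrates at a small sigma_i^2, whether through an informative prior or through observed rewards, yields a narrow sampling distribution for mu_i and is therefore explored less; this mirrors the abstract's point that exploration cost scales with residual uncertainty.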