
Meta-Learning for Simple Regret Minimization (2202.12888v2)

Published 25 Feb 2022 in cs.LG, cs.AI, and stat.ML

Abstract: We develop a meta-learning framework for simple regret minimization in bandits. In this framework, a learning agent interacts with a sequence of bandit tasks, which are sampled i.i.d. from an unknown prior distribution, and learns its meta-parameters to perform better on future tasks. We propose the first Bayesian and frequentist meta-learning algorithms for this setting. The Bayesian algorithm has access to a prior distribution over the meta-parameters and its meta simple regret over $m$ bandit tasks with horizon $n$ is merely $\tilde{O}(m / \sqrt{n})$. In contrast, the meta simple regret of the frequentist algorithm is $\tilde{O}(\sqrt{m}\, n + m / \sqrt{n})$. While its regret is worse, the frequentist algorithm is more general because it does not need a prior distribution over the meta-parameters, and it can also be analyzed in more settings. We instantiate our algorithms for several classes of bandit problems. Our algorithms are general, and we complement our theory by evaluating them empirically in several environments.
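To make the setup concrete, the snippet below is a minimal simulation sketch of the meta-learning loop the abstract describes: an agent faces a sequence of bandit tasks drawn from an unknown prior, explores each task for a horizon of n rounds, recommends an arm, and updates its estimate of the meta-parameters. It is an illustration under assumptions not stated in the abstract (K-armed Bernoulli tasks, a Beta task prior, Thompson sampling with the current prior estimate for within-task exploration, and a crude moment-matching meta-update); it is not the authors' algorithm.

```python
import numpy as np

# Minimal sketch of the meta-learning loop from the abstract (illustrative only).
# Assumptions: K-armed Bernoulli bandits whose mean rewards are drawn from an
# unknown Beta(a*, b*) task prior; Thompson sampling with the current prior
# estimate for exploration; posterior-mean recommendation after the horizon;
# moment-matching update of the prior estimate across tasks.

rng = np.random.default_rng(0)
K, m, n = 5, 50, 200          # arms, tasks, per-task horizon
a_star, b_star = 2.0, 5.0     # unknown meta-parameters (true task prior)
a_hat, b_hat = 1.0, 1.0       # agent's running prior estimate
seen_means = []               # per-arm posterior means collected across tasks

simple_regrets = []
for task in range(m):
    mu = rng.beta(a_star, b_star, size=K)         # sample a new bandit task
    alpha = np.full(K, a_hat)
    beta = np.full(K, b_hat)
    for t in range(n):                            # within-task exploration
        arm = int(np.argmax(rng.beta(alpha, beta)))
        reward = float(rng.random() < mu[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
    rec = int(np.argmax(alpha / (alpha + beta)))  # recommend after horizon n
    simple_regrets.append(mu.max() - mu[rec])     # per-task simple regret

    # Meta-update: re-fit Beta(a_hat, b_hat) to the observed arm estimates.
    seen_means.extend((alpha / (alpha + beta)).tolist())
    mean, var = np.mean(seen_means), np.var(seen_means) + 1e-6
    common = mean * (1.0 - mean) / var - 1.0
    if common > 0:                                # keep fit only if valid
        a_hat, b_hat = mean * common, (1.0 - mean) * common

print(f"average simple regret over {m} tasks: {np.mean(simple_regrets):.4f}")
```

Averaging the per-task simple regret over the m tasks, as in the last line, gives the empirical counterpart of the meta simple regret that the abstract bounds.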

Authors (4)
  1. Branislav Kveton (98 papers)
  2. Mohammad Ghavamzadeh (97 papers)
  3. Sumeet Katariya (20 papers)
  4. MohammadJavad Azizi (3 papers)
Citations (9)
