Transfer in Sequential Multi-armed Bandits via Reward Samples (2403.12428v1)

Published 19 Mar 2024 in cs.LG and stat.ML

Abstract: We consider a sequential stochastic multi-armed bandit problem in which the agent interacts with the bandit over multiple episodes. The reward distributions of the arms remain constant within an episode but can change across episodes. We propose a UCB-based algorithm that transfers reward samples from previous episodes to improve the cumulative regret over all episodes. We provide a regret analysis and empirical results for our algorithm, which show a significant improvement over the standard UCB algorithm without transfer.
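
The abstract only sketches the idea, so the following Python snippet illustrates one plausible way to warm-start UCB1 with reward samples carried over from earlier episodes. This is a minimal sketch under the assumption that transferred samples are simply pooled into each arm's empirical statistics; the paper's actual transfer rule is not reproduced here, and the interface (`ucb_transfer_episode`, `transferred`) is hypothetical.

```python
import math
import random

def ucb_transfer_episode(means_true, horizon, transferred, c=2.0, rng=random):
    """Run one episode of UCB1, seeded with transferred reward samples.

    A minimal sketch, not the paper's algorithm: all samples in
    `transferred[a]` (rewards for arm a from earlier episodes) are
    naively pooled into that arm's count and empirical mean.
    """
    k = len(means_true)
    counts = [len(transferred[a]) for a in range(k)]
    means = [sum(transferred[a]) / counts[a] if counts[a] else 0.0
             for a in range(k)]
    new_samples = [list(transferred[a]) for a in range(k)]

    for t in range(1, horizon + 1):
        untried = [a for a in range(k) if counts[a] == 0]
        if untried:
            a = untried[0]  # play each unseen arm once before using indices
        else:
            # UCB1 index: empirical mean plus exploration bonus
            a = max(range(k),
                    key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means_true[a] else 0.0  # simulated Bernoulli reward
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        new_samples[a].append(r)
    return new_samples  # pass forward to the next episode

# Usage: two episodes with unchanged arm means; episode 2 starts warm.
samples = [[] for _ in range(3)]
for episode in range(2):
    samples = ucb_transfer_episode([0.2, 0.5, 0.8], horizon=1000,
                                   transferred=samples)
```

Note that pooling everything is only sensible when the reward distributions have not changed between episodes; since the paper allows them to change, its algorithm must decide how much past data to trust, which this sketch deliberately omits.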

Authors (2)
  1. Rahul N R (1 paper)
  2. Vaibhav Katewa (17 papers)
Citations (1)