HELLINGER-UCB: A novel algorithm for the stochastic multi-armed bandit problem and the cold-start problem in recommender systems (2404.10207v1)
Abstract: In this paper, we study the stochastic multi-armed bandit problem, where the reward is driven by an unknown random variable. We propose a new variant of the Upper Confidence Bound (UCB) algorithm, called Hellinger-UCB, which leverages the squared Hellinger distance to build the upper confidence bound. We prove that Hellinger-UCB achieves the theoretical lower bound and show that it has a solid statistical interpretation. Numerical experiments comparing Hellinger-UCB with other UCB variants demonstrate its effectiveness over finite time horizons. As a real-world example, we apply Hellinger-UCB to the cold-start problem in the content recommender system of a financial app. Under reasonable assumptions, Hellinger-UCB enjoys a convenient but practically important low-latency property. An online experiment further shows that Hellinger-UCB outperforms both KL-UCB and UCB1 in terms of click-through rate (CTR).
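The abstract does not spell out the index, but for Bernoulli rewards the squared Hellinger distance is H^2(p, q) = 1 - sqrt(p*q) - sqrt((1-p)(1-q)), and the KL-UCB-style index sup{q : H^2(p_hat, q) <= log(t)/N} happens to admit a closed form, which is one plausible reading of the low-latency claim. The sketch below is ours, not the paper's reference implementation: the exploration rate log(t)/N and the names hellinger_ucb_index and choose_arm are assumptions for illustration.

```python
import math

def hellinger_ucb_index(p_hat: float, n_pulls: int, t: int) -> float:
    """Largest q in [0, 1] with H^2(p_hat, q) <= log(t) / n_pulls, where
    H^2(p, q) = 1 - sqrt(p*q) - sqrt((1-p)*(1-q)) for Bernoulli arms."""
    eps = math.log(max(t, 2)) / n_pulls   # exploration radius (assumed rate)
    a, b = math.sqrt(p_hat), math.sqrt(1.0 - p_hat)
    c = 1.0 - eps  # constraint becomes a*sqrt(q) + b*sqrt(1-q) >= c
    if c <= a:     # even q = 1 satisfies the constraint
        return 1.0
    # Boundary solution of a*x + b*sqrt(1 - x^2) = c with x = sqrt(q)
    x = a * c + b * math.sqrt(1.0 - c * c)
    return x * x

def choose_arm(counts: list[int], sums: list[float], t: int) -> int:
    """Pull each arm once, then play the arm with the largest index."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: hellinger_ucb_index(sums[i] / counts[i], counts[i], t))
```

Unlike KL-UCB, whose Bernoulli index is typically found by Newton or bisection search per arm per round, this index costs only a few square roots, which would account for the latency advantage claimed in the abstract.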
- Deepak Agarwal and Bee-Chung Chen “Regression-based latent factor models” In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 19–28
- Rajeev Agrawal “Sample mean based index policies by O(log n) regret for the multi-armed bandit problem” In Advances in Applied Probability 27.4 Cambridge University Press, 1995, pp. 1054–1078
- Shipra Agrawal and Navin Goyal “Analysis of Thompson sampling for the multi-armed bandit problem” In Conference on Learning Theory, 2012, pp. 39.1–39.26 JMLR Workshop and Conference Proceedings
- Jean-Yves Audibert and Sébastien Bubeck “Minimax Policies for Adversarial and Stochastic Bandits” In COLT, 2009
- Peter Auer, Nicolò Cesa-Bianchi and Paul Fischer “Finite-time analysis of the multiarmed bandit problem” In Machine Learning 47 Springer, 2002, pp. 235–256
- Omar Besbes, Yonatan Gur and Assaf Zeevi “Stochastic multi-armed-bandit problem with non-stationary rewards” In Advances in Neural Information Processing Systems 27, 2014
- Apostolos N Burnetas and Michael N Katehakis “Optimal adaptive policies for sequential allocation problems” In Advances in Applied Mathematics 17.2 Elsevier, 1996, pp. 122–142
- Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos and Gilles Stoltz “Kullback-Leibler upper confidence bounds for optimal sequential allocation” In The Annals of Statistics 41.3, 2013, pp. 1516–1541
- Crícia Z. Felício, Klérisson V.R. Paixão, Celia A.Z. Barcelos and Philippe Preux “A multi-armed bandit model selection for cold-start user recommendation” In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, 2017, pp. 32–40
- Aurélien Garivier and Olivier Cappé “The KL-UCB algorithm for bounded stochastic bandits and beyond” In Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 359–376 JMLR Workshop and Conference Proceedings
- Aurélien Garivier, Hédi Hadiji, Pierre Ménard and Gilles Stoltz “KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints” In Journal of Machine Learning Research 23, 2022
- Asela Gunawardana and Christopher Meek “Tied Boltzmann machines for cold start recommendations” In Proceedings of the 2008 ACM Conference on Recommender Systems, 2008, pp. 19–26
- Junya Honda and Akimichi Takemura “Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards” In Journal of Machine Learning Research 16, 2015, pp. 3721–3756
- Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer and Peter Stone “PAC subset selection in stochastic multi-armed bandits” In ICML, 2012, pp. 655–662
- Tze Leung Lai and Herbert Robbins “Asymptotically efficient adaptive allocation rules” In Advances in Applied Mathematics 6.1 Academic Press, 1985, pp. 4–22
- Tor Lattimore “Refining the confidence level for optimistic bandit strategies” In Journal of Machine Learning Research 19.1, 2018, pp. 765–796
- E.L. Lehmann and Joseph P. Romano “Testing Statistical Hypotheses”, Springer Texts in Statistics Springer New York, 2006 URL: https://books.google.com/books?id=K6t5qn-SEp8C
- Lihong Li, Wei Chu, John Langford and Robert E. Schapire “A contextual-bandit approach to personalized news article recommendation” In Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 661–670
- David Liau, Eric Price, Zhao Song and Ger Yang “Stochastic multi-armed bandits in constant space” In International Conference on Artificial Intelligence and Statistics, 2018, pp. 386–394 PMLR
- Pierre Ménard and Aurélien Garivier “A minimax and asymptotically optimal algorithm for stochastic bandits” In International Conference on Algorithmic Learning Theory, 2017, pp. 223–237 PMLR
- Seung-Taek Park and Wei Chu “Pairwise preference regression for cold-start recommendation” In Proceedings of the Third ACM Conference on Recommender Systems, 2009, pp. 21–28
- Lijing Qin, Shouyuan Chen and Xiaoyan Zhu “Contextual combinatorial bandit and its application on diversified online recommendation” In Proceedings of the 2014 SIAM International Conference on Data Mining, 2014, pp. 461–469 SIAM
- Herbert E. Robbins “Some aspects of the sequential design of experiments” In Bulletin of the American Mathematical Society 58, 1952, pp. 527–535 URL: https://api.semanticscholar.org/CorpusID:15556973
- Daniel Russo and Benjamin Van Roy “Learning to optimize via posterior sampling” In Mathematics of Operations Research 39.4 INFORMS, 2014, pp. 1221–1243
- William R Thompson “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples” In Biometrika 25.3-4 Oxford University Press, 1933, pp. 285–294