Biased Dueling Bandits with Stochastic Delayed Feedback (2408.14603v1)
Abstract: The dueling bandit problem, an important variant of the traditional multi-armed bandit problem, has recently gained prominence due to its broad applications in online advertising, recommendation systems, information retrieval, and more. In many real-world applications, however, the feedback for actions is subject to unavoidable delays and is not immediately available to the agent. This partial observability poses a significant challenge to the existing dueling bandit literature, as it affects how quickly and accurately the agent can update its policy on the fly. In this paper, we introduce and study the biased dueling bandit problem with stochastic delayed feedback, a more realistic setting in which a preference bias exists between the two selections. We present two algorithms designed to handle delayed feedback. Our first algorithm, which requires full knowledge of the delay distribution, matches the optimal regret bound for the dueling bandit problem without delay. The second algorithm is tailored to the case where the delay distribution is unknown and only the expected delay is available. We provide a comprehensive regret analysis for both algorithms and evaluate their empirical performance on synthetic and real datasets.
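To make the setting concrete, here is a minimal sketch of the interaction protocol the abstract describes, not the authors' algorithms: at each round the agent duels two arms, the binary outcome is biased toward the first-selected arm, and the feedback is revealed only after a stochastic delay. All names and parameters (`BIAS`, `MEAN_DELAY`, the uniform placeholder policy, the exponential delay) are illustrative assumptions.

```python
import random
import heapq

# Illustrative sketch of a biased dueling bandit with stochastic delayed
# feedback. Not the paper's algorithm; the policy below is a placeholder.

K = 5          # number of arms
T = 1000       # horizon (rounds)
random.seed(0)

# Underlying preference probabilities p[i][j] = P(arm i beats arm j),
# built from arbitrary utilities for illustration (p[i][j] + p[j][i] = 1).
utility = [random.random() for _ in range(K)]
p = [[0.5 + (utility[i] - utility[j]) / 4 for j in range(K)] for i in range(K)]

BIAS = 0.05        # assumed additive bias toward the first-selected arm
MEAN_DELAY = 10.0  # mean of the exponential delay, truncated to whole rounds

pending = []                        # min-heap of (arrival_round, (i, j, i_won))
wins = [[0] * K for _ in range(K)]  # observed pairwise win counts

for t in range(T):
    # Placeholder policy: uniformly random pair. A real learner would choose
    # the pair from the accumulated `wins` statistics.
    i, j = random.sample(range(K), 2)

    # Biased duel: the first-selected arm i gets a small advantage.
    i_won = random.random() < min(1.0, p[i][j] + BIAS)

    # The outcome becomes observable only after a stochastic delay.
    delay = int(random.expovariate(1.0 / MEAN_DELAY))
    heapq.heappush(pending, (t + delay, (i, j, i_won)))

    # Collect every piece of feedback whose delay has elapsed by round t.
    while pending and pending[0][0] <= t:
        _, (a, b, a_won) = heapq.heappop(pending)
        if a_won:
            wins[a][b] += 1
        else:
            wins[b][a] += 1
```

The sketch shows exactly where the delay bites: the `wins` counts available to the policy at round `t` lag behind the rounds that produced them, which is the difficulty the paper's two algorithms are designed to handle.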