Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays (2407.15439v3)
Abstract: We study the stochastic combinatorial semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints. This setting is motivated by applications such as crowdsourcing and online advertising, where feedback is not immediately available and fairness among different choices (or arms) is crucial. We consider two types of unrestricted feedback delays: reward-independent delays, where the delays are independent of the rewards, and reward-dependent delays, where the delays are correlated with the rewards. Furthermore, we introduce merit-based fairness constraints to ensure fair selection of the arms. We define the reward regret and the fairness regret and present new bandit algorithms that select arms under unrestricted feedback delays based on their merits. We prove that all of our algorithms achieve sublinear expected reward regret and expected fairness regret, with a dependence on the quantiles of the delay distribution. We also conduct extensive experiments on synthetic and real-world data and show that our algorithms fairly select arms in the presence of different feedback delays.
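To make the merit-based selection idea concrete, below is a minimal sketch (simplified to single-arm plays; the paper's setting is combinatorial) of selecting arms with probability proportional to the merit of optimistic mean estimates built only from feedback that has already arrived. The merit function, the [0, 1] reward range, the UCB form, and all names here are illustrative assumptions, not the paper's actual algorithm.

```python
import math
import random

def merit_fair_probabilities(estimates, merit_fn):
    """Target fair policy: select each arm with probability
    proportional to the merit of its estimated mean reward."""
    merits = [merit_fn(m) for m in estimates]
    total = sum(merits)
    return [m / total for m in merits]

def delayed_ucb(arrived, t):
    """UCB indices computed only from observations whose delay has
    elapsed; arms with no arrived feedback stay fully optimistic."""
    indices = []
    for rewards in arrived:              # rewards received so far for one arm
        n = len(rewards)
        if n == 0:
            indices.append(1.0)          # optimistic default in [0, 1]
        else:
            mean = sum(rewards) / n
            indices.append(min(1.0, mean + math.sqrt(2 * math.log(t) / n)))
    return indices

# Illustrative use: 3 arms, merit f(x) = 0.1 + x (positive and increasing).
arrived = [[1, 0, 1], [0, 0], []]        # feedback that has arrived by round t
probs = merit_fair_probabilities(delayed_ucb(arrived, t=10), lambda x: 0.1 + x)
arm = random.choices(range(3), weights=probs)[0]
```

Sampling from merit-proportional probabilities, rather than playing the argmax, is what lets such a scheme trade reward regret against fairness regret; the paper's actual algorithms, confidence radii, and treatment of reward-dependent delays differ.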