Ad-load Balancing via Off-policy Learning in a Content Marketplace (2309.11518v2)
Abstract: Ad-load balancing is a critical challenge in online advertising systems, particularly on social media platforms, where the goal is to maximize user engagement and revenue while maintaining a satisfactory user experience. This requires optimizing conflicting objectives, such as user satisfaction and ads revenue. Traditional approaches to ad-load balancing rely on static allocation policies, which fail to adapt to changing user preferences and contextual factors. In this paper, we present an approach that leverages off-policy learning and evaluation from logged bandit feedback. We start with a motivating analysis of the ad-load balancing problem, highlighting the conflicting objectives between user satisfaction and ads revenue, and emphasizing the nuances that arise from user heterogeneity and the dependence on a user's position within a session. Based on this analysis, we define the problem as determining the optimal ad-load for a particular feed fetch. To tackle it, we propose an off-policy learning framework that leverages unbiased estimators such as Inverse Propensity Scoring (IPS) and Doubly Robust (DR) estimation to learn policies and estimate their values from offline-collected stochastic data. We present insights from online A/B experiments deployed at scale across over 80 million users generating over 200 million sessions, where we find statistically significant improvements in both user satisfaction metrics and ads revenue for the platform.
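To make the estimators named in the abstract concrete, here is a minimal sketch of how IPS and DR policy-value estimation operate on logged bandit feedback. This is a generic illustration, not the paper's implementation: the function names, array shapes, and reward models are assumptions, and production systems typically add propensity clipping or self-normalization on top of these vanilla forms.

```python
import numpy as np

def ips_estimate(rewards, logging_propensities, target_propensities):
    """Inverse Propensity Scoring: reweight each logged reward by the
    ratio of the target policy's action probability to the logging
    policy's, then average. Unbiased when propensities are correct."""
    weights = target_propensities / logging_propensities
    return np.mean(weights * rewards)

def dr_estimate(rewards, logging_propensities, target_propensities,
                reward_model_logged, reward_model_target):
    """Doubly Robust: a direct reward-model estimate under the target
    policy, plus an IPS correction applied to the model's residuals
    on the logged actions. Unbiased if either the propensities or
    the reward model are correct."""
    weights = target_propensities / logging_propensities
    return np.mean(reward_model_target
                   + weights * (rewards - reward_model_logged))
```

As a sanity check, when the target policy equals the logging policy the IPS weights are all one and both estimators reduce to the empirical mean reward; DR's advantage appears when the policies differ, where an accurate reward model shrinks the residuals and hence the variance of the correction term.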