Optimal Baseline Corrections for Off-Policy Contextual Bandits (2405.05736v2)
Abstract: The off-policy learning paradigm allows recommender systems and general ranking applications to be framed as decision-making problems, where we aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric. With unbiasedness comes potentially high variance, and prevalent methods exist to reduce estimation variance. These methods typically make use of control variates, either additive (i.e., baseline corrections or doubly robust methods) or multiplicative (i.e., self-normalization). Our work unifies these approaches by proposing a single framework built on their equivalence in learning scenarios. The foundation of our framework is the derivation of an equivalent baseline correction for all existing control variates. Consequently, our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it. This optimal estimator brings significantly improved performance in both evaluation and learning, and it minimizes data requirements. Empirical observations corroborate our theoretical findings.
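To make the baseline-correction idea concrete, below is a minimal sketch in Python of a baseline-corrected inverse-propensity-scoring (IPS) estimator together with a plug-in, variance-minimizing baseline obtained from the standard control-variate argument. The function names, the synthetic data, and the specific closed form Cov(wr, w)/Var(w) are illustrative assumptions for this sketch, not necessarily the exact estimator derived in the paper.

```python
import numpy as np

# Minimal sketch (not a reference implementation): baseline-corrected IPS for
# off-policy evaluation. Per logged interaction we only need
#   r_i -- observed reward,
#   w_i -- importance weight pi_target(a_i | x_i) / pi_logging(a_i | x_i).
# The estimator  V_beta = mean(w_i * (r_i - beta)) + beta  is unbiased for any
# constant beta, since E[w] = 1 under the logging policy.

def ips_with_baseline(rewards: np.ndarray, weights: np.ndarray, beta: float) -> float:
    """Baseline-corrected IPS estimate of the target policy's value."""
    return float(np.mean(weights * (rewards - beta)) + beta)

def variance_minimizing_baseline(rewards: np.ndarray, weights: np.ndarray) -> float:
    """Plug-in estimate of the beta that minimizes the estimator's variance.

    Var[w(r - beta) + beta] is quadratic in beta; setting its derivative to zero
    gives beta* = Cov(wr, w) / Var(w). This closed form is an illustrative
    assumption, estimated here from the logged sample.
    """
    wr = weights * rewards
    cov = np.mean(wr * weights) - np.mean(wr) * np.mean(weights)
    var = np.var(weights)
    return float(cov / var) if var > 0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic logs (assumption): Bernoulli rewards, lognormal importance weights.
    r = rng.binomial(1, 0.3, size=10_000).astype(float)
    w = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)
    w /= w.mean()  # normalize so weights average to one, mimicking the unbiased setting
    beta_star = variance_minimizing_baseline(r, w)
    print("beta*              :", beta_star)
    print("vanilla IPS        :", ips_with_baseline(r, w, beta=0.0))
    print("baseline-corrected :", ips_with_baseline(r, w, beta=beta_star))
```

In this sketch, setting beta to zero recovers vanilla IPS, while any constant beta keeps the estimate unbiased and only shifts its variance; the plug-in beta* simply picks the constant that empirically minimizes that variance.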