$\Delta\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies (2405.10024v2)
Abstract: The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to unbiasedly estimate online metrics using offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that accompanies unbiasedness is typically the crux that complicates practical applications. An important insight is that the difference between policy values can often be estimated with significantly reduced variance, if the estimators of said policies' values have positive covariance. This allows us to formulate a pairwise off-policy estimation task: $\Delta\text{-}{\rm OPE}$. $\Delta\text{-}{\rm OPE}$ subsumes the common use-case of estimating improvements of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce $\Delta\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring estimator and its extensions. Moreover, we characterise a variance-optimal additive control variate that further enhances efficiency. Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
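To make the abstract's core idea concrete, below is a minimal NumPy sketch of pairwise off-policy estimation: estimating the value *difference* between a target and a production policy from a single log collected under a stochastic logging policy. The synthetic setup (the policies `pi_0`, `pi_p`, `pi_t`, the softmax parameterisation, and the per-action reward vector `q`) is entirely illustrative and not the paper's experimental design; likewise, the control-variate ratio $\beta = \mathbb{E}[d^2 r]/\mathbb{E}[d^2]$ is the standard variance-minimising choice for an additive baseline of this form, stated here as an assumption rather than as the paper's exact derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n, n_trials = 10, 10_000, 500

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative setup: expected reward per action, and three softmax
# policies -- logging (pi_0), production (pi_p), and a similar target
# (pi_t). Similar policies yield positively correlated value estimates.
q = rng.uniform(size=n_actions)
logits = rng.normal(size=n_actions)
pi_0 = softmax(logits)
pi_p = softmax(logits + 0.5 * rng.normal(size=n_actions))
pi_t = softmax(np.log(pi_p) + 0.3 * rng.normal(size=n_actions))
true_delta = (pi_t - pi_p) @ q

def log_data(size):
    """Simulate logged bandit feedback collected under pi_0."""
    a = rng.choice(n_actions, size=size, p=pi_0)
    r = rng.binomial(1, q[a]).astype(float)
    return a, r

est_unpaired, est_paired, est_cv = [], [], []
for _ in range(n_trials):
    # Unpaired baseline: two independent logs, one IPS estimate per
    # policy, then subtract. The covariance term is lost.
    (a1, r1), (a2, r2) = log_data(n), log_data(n)
    v_t = np.mean(pi_t[a1] / pi_0[a1] * r1)
    v_p = np.mean(pi_p[a2] / pi_0[a2] * r2)
    est_unpaired.append(v_t - v_p)

    # Paired (Delta-OPE): one log, per-sample difference of importance
    # weights. E[d] = 0 under the logging distribution.
    a, r = log_data(n)
    d = pi_t[a] / pi_0[a] - pi_p[a] / pi_0[a]
    est_paired.append(np.mean(d * r))

    # Additive control variate: because E[d] = 0, shifting rewards by a
    # constant beta keeps the estimator unbiased; this ratio minimises
    # the (approximate) variance of the paired estimator.
    beta = np.sum(d**2 * r) / np.sum(d**2)
    est_cv.append(np.mean(d * (r - beta)))

for name, e in [("unpaired IPS difference", est_unpaired),
                ("paired Delta-IPS", est_paired),
                ("paired + control variate", est_cv)]:
    e = np.asarray(e)
    print(f"{name:26s} bias={e.mean() - true_delta:+.4f}  std={e.std():.4f}")
```

Across repeated trials, all three estimators are (approximately) unbiased, but the paired estimator's empirical standard deviation is markedly smaller than the unpaired one's, since pairing subtracts twice the positive covariance from the variance of the difference; the control variate shrinks it further.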