$\Delta\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies (2405.10024v2)

Published 16 May 2024 in cs.LG and cs.IR

Abstract: The off-policy paradigm casts recommendation as a counterfactual decision-making task, allowing practitioners to unbiasedly estimate online metrics using offline data. This leads to effective evaluation metrics, as well as learning procedures that directly optimise online success. Nevertheless, the high variance that comes with unbiasedness is typically the crux that complicates practical applications. An important insight is that the difference between policy values can often be estimated with significantly reduced variance, if said policies have positive covariance. This allows us to formulate a pairwise off-policy estimation task: $\Delta\text{-}{\rm OPE}$. $\Delta\text{-}{\rm OPE}$ subsumes the common use-case of estimating improvements of a learnt policy over a production policy, using data collected by a stochastic logging policy. We introduce $\Delta\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring estimator and its extensions. Moreover, we characterise a variance-optimal additive control variate that further enhances efficiency. Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
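To make the abstract's central insight concrete, the following is a minimal sketch of the pairwise IPS estimator it describes, written for a target policy $\pi_t$ and a production policy $\pi_p$ with logged data $\{(x_i, a_i, r_i)\}_{i=1}^{n}$ collected under a stochastic logging policy $\pi_0$. The notation and exact form are assumptions reconstructed from the abstract, not taken verbatim from the paper:

$$\widehat{\Delta}_{\rm IPS}(\pi_t, \pi_p) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_t(a_i \mid x_i) - \pi_p(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i,$$

which equals $\hat{V}_{\rm IPS}(\pi_t) - \hat{V}_{\rm IPS}(\pi_p)$ by linearity, and whose variance decomposes as

$$ {\rm Var}\big(\widehat{\Delta}_{\rm IPS}\big) \;=\; {\rm Var}\big(\hat{V}_{\rm IPS}(\pi_t)\big) + {\rm Var}\big(\hat{V}_{\rm IPS}(\pi_p)\big) - 2\,{\rm Cov}\big(\hat{V}_{\rm IPS}(\pi_t), \hat{V}_{\rm IPS}(\pi_p)\big).$$

When both per-policy estimates are computed on the same logged sample and the policies overlap, this covariance is typically positive, so the paired estimate of the improvement has lower variance than estimating each policy value in isolation. The additive control variate mentioned in the abstract would, in this notation, replace $r_i$ with $r_i - \beta$ for a constant $\beta$ (unbiasedness is preserved because the paired importance weights have zero mean under $\pi_0$); the variance-optimal choice of $\beta$ is characterised in the paper itself.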
