Long-term Off-Policy Evaluation and Learning (2404.15691v1)
Abstract: Short- and long-term outcomes of an algorithm often differ, with damaging downstream effects. A well-known example is a click-bait algorithm, which may increase short-term clicks but damage long-term user engagement. A possible way to estimate the long-term outcome is to run an online experiment or A/B test for the candidate algorithms, but it takes months or even longer to observe the long-term outcomes of interest, making the algorithm selection process unacceptably slow. This work therefore studies the problem of feasibly yet accurately estimating the long-term outcome of an algorithm using only historical and short-term experiment data. Existing approaches to this problem either require a restrictive assumption about the short-term outcomes called surrogacy or cannot effectively use short-term outcomes, resulting in inefficient estimation. We therefore propose a new framework called Long-term Off-Policy Evaluation (LOPE), which is based on reward function decomposition. LOPE works under a more relaxed assumption than surrogacy and effectively leverages short-term rewards to substantially reduce the variance. Synthetic experiments show that LOPE outperforms existing approaches, particularly when surrogacy is severely violated and the long-term reward is noisy. In addition, real-world experiments on large-scale A/B test data collected on a music streaming platform show that LOPE can estimate the long-term outcome of actual algorithms more accurately than existing feasible methods.
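To make the setting concrete, below is a minimal sketch in Python of the problem the abstract describes: off-policy evaluation of a target policy's long-term value from logged data, contrasting plain inverse-propensity scoring (IPS) on a noisy long-term reward with a surrogate-index-style estimator that predicts the long-term reward from short-term outcomes. This is not the LOPE estimator itself (the paper's reward decomposition is defined in the full text); the policies, reward model, and all variable names here are illustrative assumptions.

```python
# Illustrative sketch only: IPS vs. a surrogate-based estimator of long-term value.
# All quantities below are synthetic assumptions, not the paper's LOPE estimator.
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 10_000, 5

pi_0 = np.full(n_actions, 1.0 / n_actions)        # uniform logging policy
pi_e = np.array([0.05, 0.05, 0.1, 0.3, 0.5])       # target policy to evaluate
a = rng.choice(n_actions, size=n, p=pi_0)          # logged actions

# Short-term rewards (e.g., clicks) and noisy long-term rewards (e.g., engagement).
base = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
s = rng.binomial(1, base[a])                                # short-term outcome
r_long = 2.0 * s + 0.5 * base[a] + rng.normal(0, 3.0, n)    # noisy long-term reward

w = pi_e[a] / pi_0[a]                              # importance weights

# 1) Plain IPS on the long-term reward: unbiased but high variance,
#    because the long-term signal is noisy.
v_ips = np.mean(w * r_long)

# 2) Surrogate-style estimator: regress the long-term reward on the short-term
#    outcome (and action), then importance-weight the *predicted* long-term reward.
#    Under a surrogacy-like assumption this trades variance for potential bias.
X = np.column_stack([s, np.eye(n_actions)[a]])
coef, *_ = np.linalg.lstsq(X, r_long, rcond=None)
v_surrogate = np.mean(w * (X @ coef))

print(f"IPS estimate of long-term value: {v_ips:.3f}")
print(f"Surrogate-based estimate:        {v_surrogate:.3f}")
```

Running the sketch typically shows the surrogate-based estimate varying far less across random seeds than plain IPS, which is the variance-reduction effect the abstract attributes to exploiting short-term rewards.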