Long-term Off-Policy Evaluation and Learning (2404.15691v1)

Published 24 Apr 2024 in cs.LG and stat.ML

Abstract: Short- and long-term outcomes of an algorithm often differ, with damaging downstream effects. A known example is a click-bait algorithm, which may increase short-term clicks but damage long-term user engagement. A possible solution to estimate the long-term outcome is to run an online experiment or A/B test for the potential algorithms, but it takes months or even longer to observe the long-term outcomes of interest, making the algorithm selection process unacceptably slow. This work thus studies the problem of feasibly yet accurately estimating the long-term outcome of an algorithm using only historical and short-term experiment data. Existing approaches to this problem either need a restrictive assumption about the short-term outcomes called surrogacy or cannot effectively use short-term outcomes, which is inefficient. Therefore, we propose a new framework called Long-term Off-Policy Evaluation (LOPE), which is based on reward function decomposition. LOPE works under a more relaxed assumption than surrogacy and effectively leverages short-term rewards to substantially reduce the variance. Synthetic experiments show that LOPE outperforms existing approaches particularly when surrogacy is severely violated and the long-term reward is noisy. In addition, real-world experiments on large-scale A/B test data collected on a music streaming platform show that LOPE can estimate the long-term outcome of actual algorithms more accurately than existing feasible methods.
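To make the setting concrete, the toy sketch below illustrates the general recipe the abstract describes: estimating a target policy's long-term value from logged data by combining importance weighting with a short-term surrogate model. This is not the paper's LOPE estimator; the synthetic data, the uniform logging policy, the linear surrogate fit, and all variable names are illustrative assumptions added here.

```python
import numpy as np

# Illustrative sketch only (assumed synthetic data, uniform logging policy,
# linear surrogate model). NOT the paper's LOPE estimator; it shows the
# general idea of reusing logged short-term surrogates plus importance
# weighting to estimate a policy's long-term value offline.

rng = np.random.default_rng(0)
n, n_actions = 100_000, 5

pi_0 = np.full(n_actions, 1.0 / n_actions)              # logging policy pi_0
actions = rng.choice(n_actions, size=n, p=pi_0)          # logged actions
short_term = rng.normal(loc=0.2 * actions, scale=1.0)    # short-term reward s
long_term = 0.5 * short_term + rng.normal(scale=2.0, size=n)  # noisy long-term reward r

pi_e = np.array([0.05, 0.05, 0.10, 0.30, 0.50])          # target policy to evaluate
w = pi_e[actions] / pi_0[actions]                         # importance weights

# (1) Vanilla IPS on the long-term reward: unbiased but high variance when
#     the long-term signal is noisy.
v_ips = np.mean(w * long_term)

# (2) Surrogate-index-style direct method: regress the long-term reward on
#     the short-term surrogate, then average the prediction per action
#     under the target policy.
slope, intercept = np.polyfit(short_term, long_term, deg=1)
r_hat = intercept + slope * short_term                    # predicted long-term reward
q_hat = np.array([r_hat[actions == a].mean() for a in range(n_actions)])
v_dm = pi_e @ q_hat

# (3) Doubly-robust-style combination: model-based term plus an
#     importance-weighted correction for the model's residual.
v_dr = v_dm + np.mean(w * (long_term - q_hat[actions]))

print(f"IPS estimate              : {v_ips:.3f}")
print(f"Surrogate direct estimate : {v_dm:.3f}")
print(f"DR-style combined estimate: {v_dr:.3f}")
```

The trade-off the three estimates illustrate mirrors the abstract: a pure importance-weighting estimate is noisy when the long-term reward is noisy, a pure surrogate model is biased when surrogacy is violated, and a combined estimator attempts to get the best of both. LOPE's reward-function decomposition addresses this trade-off under a weaker assumption than strict surrogacy; see the paper for the actual estimator.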
