
Distributional Off-Policy Evaluation for Slate Recommendations (2308.14165v2)

Published 27 Aug 2023 in cs.IR, cs.AI, and cs.LG

Abstract: Recommendation strategies are typically evaluated on previously logged data, using off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but estimation of the entire performance distribution has remained elusive. Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along axes such as risk and fairness, which rely on metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures.
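
To make the target quantity concrete, here is a minimal sketch that estimates an off-policy reward CDF with plain whole-slate inverse propensity scoring and then reads a risk metric (lower-tail CVaR) off that CDF. This is an illustrative baseline, not the paper's estimator: the paper's contribution is precisely to avoid whole-slate importance weights, whose variance grows with the combinatorial action space. All function names (`ips_reward_cdf`, `cvar_from_cdf`) and the toy data are hypothetical.

```python
import numpy as np

def ips_reward_cdf(rewards, logging_probs, target_probs, grid):
    """Self-normalized IPS estimate of the target policy's reward CDF.

    rewards       : logged slate-level rewards, shape (n,)
    logging_probs : logging policy's probability of each logged slate, shape (n,)
    target_probs  : target policy's probability of the same slates, shape (n,)
    grid          : reward values v at which to evaluate F(v) = P(R <= v)
    """
    w = target_probs / logging_probs   # whole-slate importance weights (high variance)
    w = w / w.sum()                    # self-normalize: small bias, lower variance
    return np.array([(w * (rewards <= v)).sum() for v in grid])

def cvar_from_cdf(grid, cdf, alpha=0.1):
    """Lower-tail CVaR at level alpha, computed from a CDF on a finite grid."""
    idx = min(int(np.searchsorted(cdf, alpha)), len(grid) - 1)  # alpha-quantile index
    pmf = np.diff(np.concatenate([[0.0], cdf]))                 # mass at each grid point
    tail_pmf, tail_vals = pmf[: idx + 1], grid[: idx + 1]
    return float((tail_pmf * tail_vals).sum() / max(tail_pmf.sum(), 1e-12))

# Toy usage with synthetic rewards and propensities (purely illustrative).
rng = np.random.default_rng(0)
n = 10_000
rewards = rng.normal(size=n)
logging_probs = np.full(n, 0.5)
target_probs = rng.uniform(0.3, 0.7, size=n)
grid = np.linspace(-4.0, 4.0, 201)
F = ips_reward_cdf(rewards, logging_probs, target_probs, grid)
print("estimated CVaR(0.1):", cvar_from_cdf(grid, F, alpha=0.1))
```

Once an estimate of the full CDF is available, any distributional functional, such as quantiles, CVaR, or other risk and fairness metrics, can be plugged in; the paper's estimator targets the same CDF for slates while exploiting slate structure to achieve unbiasedness, consistency, and far lower variance than this naive baseline.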

