Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction (2402.02171v2)

Published 3 Feb 2024 in stat.ML and cs.LG

Abstract: We study off-policy evaluation (OPE) in the problem of slate contextual bandits, where a policy selects multi-dimensional actions known as slates. This problem is widespread in applications ranging from recommender systems and search engines to marketing and medicine; however, the typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces, making effective OPE a significant challenge. The PseudoInverse (PI) estimator was introduced to mitigate the variance issue by assuming linearity in the reward function, but this can result in significant bias, as the assumption is hard to verify from observed data and is often substantially violated. To address the limitations of previous estimators, we develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space, where we optimize slate abstractions to minimize the bias and variance of LIPS in a data-driven way. By doing so, LIPS can substantially reduce the variance of IPS without imposing restrictive assumptions on the reward function structure, such as linearity. Through empirical evaluation, we demonstrate that LIPS substantially outperforms existing estimators, particularly in scenarios with non-linear rewards and large slate spaces.
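The contrast between IPS and the abstraction-based weighting described in the abstract can be illustrated with a small numerical sketch. The snippet below is not the authors' implementation: the slate space, the logging and evaluation policies, the abstraction map `phi`, and the reward model are all toy assumptions (context is omitted, and the abstraction is fixed rather than optimized as in the paper). It only shows how importance weights defined over a low-dimensional abstraction can replace weights over the raw slate space.

```python
# Minimal sketch (assumed setup, not the paper's code) contrasting vanilla IPS
# with a LIPS-style estimator whose weights live in a slate-abstraction space.
import numpy as np

rng = np.random.default_rng(0)
n, n_slates, n_clusters = 10_000, 50, 5

# Hypothetical logging and evaluation policies over the slate space.
pi_0 = rng.dirichlet(np.ones(n_slates))           # logging policy pi_0(s)
pi_e = rng.dirichlet(np.ones(n_slates))           # evaluation policy pi_e(s)
phi = rng.integers(0, n_clusters, size=n_slates)  # slate -> abstraction id

# Logged data: slates sampled from pi_0; rewards depend only on phi(s) here,
# so the abstraction is (by construction) sufficient for the reward.
s = rng.choice(n_slates, size=n, p=pi_0)
r = rng.binomial(1, 0.2 + 0.1 * phi[s] / n_clusters)

# Vanilla IPS: weights over the raw slate space (high variance for large slate spaces).
w_ips = pi_e[s] / pi_0[s]
v_ips = np.mean(w_ips * r)

# LIPS-style: weights over the abstraction space,
# w(z) = P(phi(s) = z under pi_e) / P(phi(s) = z under pi_0).
p_e_z = np.array([pi_e[phi == z].sum() for z in range(n_clusters)])
p_0_z = np.array([pi_0[phi == z].sum() for z in range(n_clusters)])
w_lips = p_e_z[phi[s]] / p_0_z[phi[s]]
v_lips = np.mean(w_lips * r)

print(f"IPS estimate:  {v_ips:.4f}")
print(f"LIPS estimate: {v_lips:.4f}")
```

Because the abstraction space has only 5 values while the slate space has 50, the LIPS-style weights have a much smaller range than the raw IPS weights, which is the variance-reduction effect the abstract describes; the paper additionally learns the abstraction so that this sufficiency holds approximately on real data.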
