Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces (2308.03443v3)
Abstract: We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces. Benchmark estimators suffer from a severe bias-variance tradeoff: parametric approaches are biased because the correct reward model is hard to specify, whereas importance-weighting approaches suffer from high variance. To overcome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was proposed, which mitigates variance through action embeddings. However, MIPS is unbiased only under the no-direct-effect assumption, which requires the action embedding to completely mediate the effect of an action on the reward. To remove the dependence on this unrealistic assumption, we propose a Marginalized Doubly Robust (MDR) estimator. Theoretical analysis shows that the proposed estimator is unbiased under weaker assumptions than MIPS while reducing variance relative to MIPS. Empirical experiments verify the superiority of MDR over existing estimators in settings with large action spaces.
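To situate the proposal, the sketch below contrasts the standard IPS estimator with the embedding-based MIPS estimator and a doubly robust combination in the spirit of MDR. The notation (context $x_i$, logged action $a_i$, action embedding $e_i$, reward $r_i$, logging policy $\pi_0$, target policy $\pi$, estimated reward model $\hat{q}$) follows common OPE conventions; the last line is only a schematic DR-style combination under these assumptions, not necessarily the paper's exact definition.

\[
\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i,
\qquad
\hat{V}_{\mathrm{MIPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{p(e_i \mid x_i, \pi)}{p(e_i \mid x_i, \pi_0)}\, r_i,
\]
\[
% schematic DR-style combination, not the paper's exact estimator
\hat{V}_{\mathrm{MDR}}(\pi) \approx \frac{1}{n}\sum_{i=1}^{n}\Big[ \frac{p(e_i \mid x_i, \pi)}{p(e_i \mid x_i, \pi_0)}\,\big(r_i - \hat{q}(x_i, e_i)\big) + \mathbb{E}_{a \sim \pi(\cdot \mid x_i),\, e \sim p(\cdot \mid x_i, a)}\big[\hat{q}(x_i, e)\big] \Big].
\]

The marginal importance weight over embeddings replaces the per-action weight, which is what controls variance when the action space is large, while the baseline $\hat{q}$ supplies the doubly robust correction that removes the reliance on the no-direct-effect assumption.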