Distributionally Robust Policy Evaluation under General Covariate Shift in Contextual Bandits (2401.11353v2)
Abstract: We introduce a distributionally robust approach that improves the reliability of offline policy evaluation in contextual bandits under general covariate shift. Our method aims to deliver robust policy evaluation results when both the context distribution and the policy differ between the logging and target data. Central to our methodology is robust regression, a distributionally robust technique tailored here to improve estimation of the conditional reward distribution from logging data. Using the reward model obtained from robust regression, we develop a comprehensive suite of policy value estimators by integrating this model into established evaluation frameworks, namely the direct method and doubly robust methods. Through theoretical analysis, we further establish that the proposed policy value estimators admit a finite-sample upper bound on their bias, a clear advantage over traditional methods, especially when the shift is large. Finally, we design an extensive range of policy evaluation scenarios covering diverse shift magnitudes and a spectrum of logging and target policies. Our empirical results indicate that our approach significantly outperforms baseline methods, most notably in 90% of cases under the policy-shift-only settings and in 72% of scenarios under the general covariate shift settings.
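The abstract describes plugging a learned reward model into the standard doubly robust (DR) estimator. As a minimal sketch of that estimator's form, the snippet below uses numpy; the function names and the toy policy/model are illustrative assumptions, and in the paper the `reward_model` would come from robust regression rather than an arbitrary regressor.

```python
import numpy as np

def dr_policy_value(contexts, actions, rewards, logging_probs,
                    target_policy, reward_model, n_actions):
    """Doubly robust off-policy value estimate for a contextual bandit.

    contexts:      (n, d) array of logged contexts
    actions:       (n,) logged actions (ints in [0, n_actions))
    rewards:       (n,) observed rewards
    logging_probs: (n,) propensity of each logged action under the logging policy
    target_policy: fn(context) -> (n_actions,) action probabilities
    reward_model:  fn(context, action) -> predicted reward
    """
    n = len(actions)
    total = 0.0
    for i in range(n):
        pi = target_policy(contexts[i])                       # target action probs
        dm = sum(pi[a] * reward_model(contexts[i], a)         # direct-method term
                 for a in range(n_actions))
        correction = (pi[actions[i]] / logging_probs[i]) * (  # IPW correction term
            rewards[i] - reward_model(contexts[i], actions[i]))
        total += dm + correction
    return total / n

# Toy sanity check: with a perfect reward model and deterministic rewards,
# the correction term vanishes and DR reduces to the direct method.
ctx = np.zeros((2, 1))
acts = np.array([0, 1])
rews = np.array([1.0, 0.0])
props = np.array([0.5, 0.5])
policy = lambda x: np.array([1.0, 0.0])      # target policy always picks action 0
model = lambda x, a: 1.0 if a == 0 else 0.0  # exact reward model for this toy data
v = dr_policy_value(ctx, acts, rews, props, policy, model, n_actions=2)
```

The DR form is what makes the reward model's quality matter: when the model is accurate, the importance-weighted correction term is small, which is why the paper's bias bound improves as robust regression yields a better conditional reward estimate under shift.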