OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators (2405.17708v2)
Abstract: Offline policy evaluation (OPE) allows us to estimate a new sequential decision-making policy's performance by leveraging historical interaction data collected from other policies. Evaluating a new policy online without a confident estimate of its performance can lead to costly, unsafe, or hazardous outcomes, especially in education and healthcare. Several OPE estimators have been proposed in the last decade, many of which have hyperparameters and require training. Unfortunately, how to choose the best OPE algorithm for each task and domain remains unclear. In this paper, we propose a new algorithm that uses a statistical procedure to adaptively blend a set of OPE estimators for a given dataset, without relying on explicit estimator selection. We prove that our estimator is consistent and satisfies several desirable properties for policy evaluation. Additionally, we demonstrate that, compared to alternative approaches, our estimator can be used to select higher-performing policies in healthcare and robotics. Our work contributes a general-purpose, estimator-agnostic, and easy-to-use off-policy evaluation framework for offline RL.
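To make the re-weighted aggregation idea concrete, the sketch below combines the point estimates of several OPE estimators using simplex-constrained weights chosen to minimize a bootstrap-based MSE proxy. This is a minimal illustration under assumed details: the function names, the use of scipy for the constrained solve, and the particular bootstrap construction are not taken from the paper.

```python
# Hypothetical sketch: blend K OPE point estimates by re-weighting them to
# minimize a bootstrap-estimated error second moment. Illustrative only.
import numpy as np
from scipy.optimize import minimize


def blend_ope_estimates(estimates, bootstrap_estimates):
    """Combine K point estimates with weights on the probability simplex.

    estimates:           shape (K,)   -- one value per OPE estimator
    bootstrap_estimates: shape (B, K) -- each estimator recomputed on B
                                         bootstrap resamples of the logged data
    """
    estimates = np.asarray(estimates, dtype=float)
    bootstrap_estimates = np.asarray(bootstrap_estimates, dtype=float)
    K = estimates.shape[0]

    # Use the spread of bootstrap replicates around the original estimates
    # as a stand-in for the (unknown) K x K error second-moment matrix.
    errors = bootstrap_estimates - estimates            # (B, K)
    second_moment = errors.T @ errors / errors.shape[0]  # (K, K)

    # Minimize w^T M w subject to w >= 0 and sum(w) = 1.
    objective = lambda w: w @ second_moment @ w
    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * K
    result = minimize(objective, np.full(K, 1.0 / K),
                      bounds=bounds, constraints=constraints)

    weights = result.x
    return float(weights @ estimates), weights


# Example with three hypothetical OPE estimates and synthetic bootstrap noise.
est = np.array([10.2, 11.5, 9.8])
boot = est + np.random.default_rng(0).normal(scale=[0.5, 2.0, 1.0], size=(200, 3))
value, w = blend_ope_estimates(est, boot)
```

In this toy example the solver downweights the noisier estimators, which is the qualitative behavior one would want from an adaptive blend; the actual procedure and guarantees are those described in the paper.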