Conformal Off-Policy Evaluation in Markov Decision Processes
Abstract: Reinforcement learning aims to identify and evaluate efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data online (for instance, when experimentation is expensive, risky, or unethical). In such settings, the reward of a given policy (the target policy) must be estimated from historical data gathered under a different policy (the behavior policy). Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), come with no accuracy or certainty guarantees. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty. The main challenge in OPE stems from the distribution shift induced by the discrepancy between the target and behavior policies. We propose and empirically evaluate several ways to handle this shift. Some of them yield conformalized intervals that are shorter than those of existing approaches while maintaining the same certainty level.
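The core tool the abstract refers to, conformal prediction under distribution shift, can be sketched as a weighted split-conformal quantile: calibration scores are reweighted by likelihood ratios between the target and behavior distributions before extracting the quantile that defines the interval. The function below is an illustrative sketch, not the paper's implementation; the interface and the assumption of known likelihood ratios are ours.

```python
import numpy as np

def weighted_conformal_quantile(cal_scores, cal_weights, test_weight, alpha=0.1):
    """Weighted split-conformal quantile under distribution shift.

    cal_scores  : nonconformity scores on a held-out calibration set,
                  e.g. |y_i - yhat_i|
    cal_weights : likelihood ratios w(x_i) = p_target(x_i) / p_behavior(x_i)
    test_weight : likelihood ratio at the test point
    alpha       : miscoverage level; the interval covers with prob. >= 1 - alpha
    """
    # Append +inf for the test point, as in weighted conformal prediction.
    scores = np.append(np.asarray(cal_scores, dtype=float), np.inf)
    weights = np.append(np.asarray(cal_weights, dtype=float), test_weight)
    probs = weights / weights.sum()                 # normalize to a distribution
    order = np.argsort(scores)
    cum = np.cumsum(probs[order])
    # Smallest score whose cumulative weight reaches 1 - alpha.
    idx = np.searchsorted(cum, 1.0 - alpha)
    return scores[order][idx]
```

Given a point prediction `yhat` at the test input, the conformal interval is `[yhat - q, yhat + q]` with `q` the returned quantile. With uniform weights (no shift) this reduces to ordinary split-conformal prediction; a large test-point weight inflates the quantile, possibly to infinity, which reflects how severe shift degrades the interval and motivates the paper's alternative ways of handling it.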