Low Variance Off-policy Evaluation with State-based Importance Sampling (2212.03932v5)
Abstract: In many domains, the exploration process of reinforcement learning is too costly because it requires trying out suboptimal policies, creating a need for off-policy evaluation, in which a target policy is evaluated from data collected under a known behaviour policy. In this setting, importance sampling estimators estimate the expected return by weighting each trajectory by the probability ratio of the target policy to the behaviour policy. Unfortunately, such estimators have high variance and therefore a large mean squared error. This paper proposes state-based importance sampling estimators, which reduce the variance by dropping certain states from the computation of the importance weight. To illustrate their applicability, we demonstrate state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation. Experiments in four domains show that the state-based methods consistently yield lower variance and improved accuracy compared to their traditional counterparts.
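To make the core idea concrete, the sketch below contrasts ordinary (trajectory-wise) importance sampling with a state-based variant in the spirit of the abstract: time steps whose state is excluded do not contribute a likelihood ratio to the importance weight, which shortens the product of ratios and tends to reduce variance. This is a minimal illustration under assumptions of my own; the selection rule for `dropped_states` (here simply supplied by the caller), the function names, and the toy policies are not taken from the paper.

```python
import numpy as np

def ordinary_is(trajectories, pi_e, pi_b, gamma=0.99):
    """Ordinary importance sampling estimate of the target policy's
    expected return from behaviour-policy trajectories.

    Each trajectory is a list of (state, action, reward) tuples;
    pi_e(a, s) and pi_b(a, s) return action probabilities.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)   # per-step likelihood ratio
            ret += gamma ** t * r               # discounted return
        estimates.append(weight * ret)
    return np.mean(estimates)

def state_based_is(trajectories, pi_e, pi_b, dropped_states, gamma=0.99):
    """State-based variant (illustrative): steps whose state lies in
    `dropped_states` are skipped when accumulating the importance weight,
    shrinking the product of ratios and hence the estimator's variance."""
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            if s not in dropped_states:         # only the remaining states re-weight
                weight *= pi_e(a, s) / pi_b(a, s)
            ret += gamma ** t * r
        estimates.append(weight * ret)
    return np.mean(estimates)

# Toy usage with hypothetical policies over two actions and integer states.
rng = np.random.default_rng(0)
pi_b = lambda a, s: 0.5                        # behaviour policy: uniform over 2 actions
pi_e = lambda a, s: 0.8 if a == 0 else 0.2     # target policy prefers action 0
trajs = [[(s, int(rng.integers(2)), float(rng.normal()))
          for s in range(5)] for _ in range(100)]
print(ordinary_is(trajs, pi_e, pi_b))
print(state_based_is(trajs, pi_e, pi_b, dropped_states={3, 4}))
```

Which states can safely be dropped without biasing the estimate is exactly what distinguishes the paper's estimators from this sketch; here the caller decides, whereas the paper develops state-based counterparts of ordinary, weighted, per-decision, incremental, doubly robust, and stationary-density-ratio estimators.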