Statistical Efficiency of Distributional Temporal Difference Learning and Freedman's Inequality in Hilbert Spaces (2403.05811v4)
Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ of a given policy $\pi$. Distributional temporal difference (TD) learning has been proposed for this task, extending classic TD learning in RL. In this paper, we focus on the non-asymptotic statistical rates of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD with a generative model needs $\tilde{O}(\varepsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3})$ interactions with the environment to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $1$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors. In addition, we revisit categorical distributional TD (CTD), showing that the same non-asymptotic convergence bound holds for CTD under the $1$-Wasserstein distance. We also extend our analysis to the more general setting where the data generating process is Markovian. In the Markovian setting, we propose variance-reduced variants of NTD and CTD and show that both achieve a $\tilde{O}(\varepsilon^{-2}\mu_{\pi,\min}^{-1}(1-\gamma)^{-3}+t_{\mathrm{mix}}\mu_{\pi,\min}^{-1}(1-\gamma)^{-1})$ sample complexity bound under the $1$-Wasserstein distance, which matches the state-of-the-art statistical results for classic policy evaluation. To achieve these sharp statistical rates, we establish a novel Freedman's inequality in Hilbert spaces, which may be of independent interest for the statistical analysis of various infinite-dimensional online learning problems.
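To make the categorical variant concrete, below is a minimal sketch of a tabular CTD update with a generative model. It uses the standard categorical projection onto a fixed grid of atoms; the support `Z`, the range `[V_MIN, V_MAX]`, the step size, the state-indexed table `eta`, and any transition sampler are illustrative assumptions, not necessarily the paper's exact NTD/CTD construction or its variance-reduced variants.

```python
import numpy as np

# Illustrative sketch of tabular categorical distributional TD (CTD) for policy
# evaluation. Atom grid, value range, and step size are assumed placeholders.

K = 51                              # number of categorical atoms
V_MIN, V_MAX = 0.0, 10.0            # assumed range of returns
Z = np.linspace(V_MIN, V_MAX, K)    # common support z_1 < ... < z_K
DZ = (V_MAX - V_MIN) / (K - 1)

def categorical_projection(probs_next, r, gamma):
    """Project the pushforward of eta(s') under x -> r + gamma*x back onto Z."""
    target = np.zeros(K)
    g = np.clip(r + gamma * Z, V_MIN, V_MAX)   # shifted/scaled atom locations
    b = (g - V_MIN) / DZ                       # fractional grid index of each atom
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    for j in range(K):                         # split each atom's mass between neighbours
        if lo[j] == hi[j]:
            target[lo[j]] += probs_next[j]
        else:
            target[lo[j]] += probs_next[j] * (hi[j] - b[j])
            target[hi[j]] += probs_next[j] * (b[j] - lo[j])
    return target

def ctd_step(eta, s, r, s_next, gamma, alpha):
    """One CTD update: move eta[s] toward the projected distributional Bellman target."""
    target = categorical_projection(eta[s_next], r, gamma)
    eta[s] = (1.0 - alpha) * eta[s] + alpha * target
    return eta
```

A training loop under a generative model would, for each state $s$, repeatedly draw a reward and next state from the environment and call `ctd_step` with a diminishing step size (for example $\alpha_t = 1/(t+1)$); `eta` can be a `(num_states, K)` array of probability vectors initialized uniformly.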