Analysis of Off-Policy Multi-Step TD-Learning with Linear Function Approximation (2402.15781v2)
Abstract: This paper analyzes multi-step TD-learning algorithms under the 'deadly triad' of linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that n-step TD-learning algorithms converge to a solution provided the sampling horizon n is sufficiently large. The paper is divided into two parts. The first part comprehensively examines the fundamental properties of their model-based deterministic counterparts, including projected value iteration, gradient descent algorithms, and a control-theoretic approach; these serve as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. We prove that these deterministic algorithms converge to meaningful solutions when n is sufficiently large. Based on these findings, the second part proposes and analyzes two n-step TD-learning algorithms, which can be viewed as the model-free reinforcement learning counterparts of the gradient and control-theoretic algorithms.
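To make the setting concrete, the sketch below shows a generic off-policy n-step TD update with linear function approximation and importance-sampling corrections. It is a minimal illustration of the problem class studied in the paper, not the paper's proposed algorithms; all names (`phi`, `behavior_pi`, `target_pi`, the simulated MDP inputs) and the simulation setup are assumptions made for this example.

```python
# Minimal sketch of off-policy n-step TD learning with linear function
# approximation (illustrative only; not the algorithms proposed in the paper).
import numpy as np

def nstep_offpolicy_td(phi, P, R, behavior_pi, target_pi,
                       n=8, gamma=0.9, alpha=0.01, num_updates=50_000, seed=0):
    """phi: |S| x d feature matrix; P[a]: |S| x |S| transition matrix for action a;
    R: |S| x |A| reward table; behavior_pi, target_pi: |S| x |A| policy tables."""
    rng = np.random.default_rng(seed)
    num_states, d = phi.shape
    num_actions = R.shape[1]
    theta = np.zeros(d)                      # linear value-function weights
    s = rng.integers(num_states)
    for _ in range(num_updates):
        s0 = s                               # state whose value estimate is updated
        G, rho, discount = 0.0, 1.0, 1.0
        # Roll out n steps under the behavior policy, accumulating the
        # discounted return and the importance-sampling ratio.
        for _ in range(n):
            a = rng.choice(num_actions, p=behavior_pi[s])
            rho *= target_pi[s, a] / behavior_pi[s, a]
            G += discount * R[s, a]
            discount *= gamma
            s = rng.choice(num_states, p=P[a][s])
        # n-step target bootstraps from the value estimate at the n-th state.
        target = G + discount * phi[s] @ theta
        delta = target - phi[s0] @ theta
        theta += alpha * rho * delta * phi[s0]
    return theta
```

As the paper argues, updates of this form combine the three ingredients of the deadly triad; the analysis examines when a sufficiently large sampling horizon n restores convergence to a meaningful fixed point.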