
Analysis of Off-Policy Multi-Step TD-Learning with Linear Function Approximation (2402.15781v2)

Published 24 Feb 2024 in eess.SY, cs.LG, and cs.SY

Abstract: This paper analyzes multi-step TD-learning algorithms under the "deadly triad" scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that n-step TD-learning algorithms converge to a solution once the sampling horizon n is sufficiently large. The paper is divided into two parts. In the first part, we comprehensively examine the fundamental properties of the model-based deterministic counterparts of these algorithms, including projected value iteration, gradient descent algorithms, and the control-theoretic approach. These can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. In particular, we prove that these algorithms converge to meaningful solutions when n is sufficiently large. Based on these findings, two n-step TD-learning algorithms are proposed and analyzed, which can be seen as the model-free reinforcement learning counterparts of the gradient and control-theoretic algorithms.
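
To make the setting concrete, the sketch below shows a generic n-step off-policy TD update with linear function approximation on a randomly generated MDP, using a trajectory importance-sampling ratio for the off-policy correction. This is a minimal illustration of the problem class only, not the gradient or control-theoretic algorithms proposed in the paper; the chain of random transitions, the feature map, the behavior/target policies, and all hyperparameters are illustrative assumptions, and plain off-policy TD of this form is not guaranteed to converge in general.

```python
import numpy as np

# Minimal sketch of n-step off-policy TD with linear function approximation.
# The MDP, feature map, and policies are randomly generated assumptions.
rng = np.random.default_rng(0)

n_states, n_actions, n_features = 5, 2, 3
gamma, alpha, n = 0.9, 0.01, 5          # discount, step size, sampling horizon

# Random MDP: P[a, s, :] is a transition distribution, R[s, a] an expected reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))

# Fixed linear features phi(s) and distinct behavior/target policies (off-policy).
Phi = rng.normal(size=(n_states, n_features))
behavior = np.full((n_states, n_actions), 1.0 / n_actions)
target = rng.dirichlet(np.ones(n_actions), size=n_states)

theta = np.zeros(n_features)
s = 0
for _ in range(20_000):
    # Collect an n-step trajectory under the behavior policy.
    states, actions, rewards = [s], [], []
    for _ in range(n):
        a = rng.choice(n_actions, p=behavior[states[-1]])
        rewards.append(R[states[-1], a])
        actions.append(a)
        states.append(rng.choice(n_states, p=P[a, states[-1]]))

    # Trajectory importance-sampling ratio and n-step bootstrapped return.
    rho = np.prod([target[states[k], actions[k]] / behavior[states[k], actions[k]]
                   for k in range(n)])
    G = sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * Phi[states[-1]] @ theta

    # TD update on the linear value estimate of the window's first state.
    td_error = G - Phi[states[0]] @ theta
    theta += alpha * rho * td_error * Phi[states[0]]
    s = states[-1]

print("learned weights:", theta)
```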

