Regularized Q-Learning with Linear Function Approximation (2401.15196v3)

Published 26 Jan 2024 in cs.AI

Abstract: Regularized Markov Decision Processes serve as models of sequential decision making under uncertainty wherein the decision maker has limited information processing capacity and/or aversion to model ambiguity. With functional approximation, the convergence properties of learning algorithms for regularized MDPs (e.g. soft Q-learning) are not well understood because the composition of the regularized Bellman operator and a projection onto the span of basis vectors is not a contraction with respect to any norm. In this paper, we consider a bi-level optimization formulation of regularized Q-learning with linear functional approximation. The lower level optimization problem aims to identify a value function approximation that satisfies Bellman's recursive optimality condition and the upper level aims to find the projection onto the span of basis vectors. This formulation motivates a single-loop algorithm with finite time convergence guarantees. The algorithm operates on two time-scales: updates to the projection of state-action values are 'slow' in that they are implemented with a step size that is smaller than the one used for 'faster' updates of approximate solutions to Bellman's recursive optimality equation. We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise. In addition, we provide a performance guarantee for the policies derived from the proposed algorithm.
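
The two-timescale, single-loop scheme described in the abstract can be pictured concretely. The sketch below is illustrative only and is not the paper's algorithm: it assumes entropy (log-sum-exp) regularization with temperature tau, a user-supplied linear feature map phi(s, a), and a sampling routine env_step; the fast iterate omega chases an approximate solution of the regularized Bellman optimality equation with step size alpha, while the slow iterate theta tracks its projection onto the feature span with a smaller step size beta. All names and the specific update rules are assumptions made for illustration.

```python
import numpy as np

# Minimal illustrative sketch (assumptions, not the paper's exact algorithm):
# entropy-regularized (soft) Q-learning with linear features and two step sizes.

def soft_value(q_row, tau):
    """Entropy-regularized state value: tau * log-sum-exp(q / tau), stabilized."""
    m = np.max(q_row)
    return m + tau * np.log(np.sum(np.exp((q_row - m) / tau)))

def two_timescale_soft_q(env_step, phi, n_actions, d, tau=0.1, gamma=0.99,
                         alpha=1e-2, beta=1e-3, n_iters=10_000):
    """env_step() -> (s, a, r, s_next) sampled from the behavior policy;
    phi(s, a) -> feature vector in R^d."""
    theta = np.zeros(d)  # slow iterate: projected state-action value weights
    omega = np.zeros(d)  # fast iterate: approximate regularized Bellman solution
    for _ in range(n_iters):
        s, a, r, s_next = env_step()
        feat = phi(s, a)
        q_next = np.array([phi(s_next, b) @ omega for b in range(n_actions)])
        target = r + gamma * soft_value(q_next, tau)  # regularized Bellman backup
        td_error = target - feat @ omega
        # Fast timescale (larger step size alpha): chase the Bellman fixed point.
        omega = omega + alpha * td_error * feat
        # Slow timescale (smaller step size beta): move the projection toward omega.
        theta = theta + beta * (feat @ (omega - theta)) * feat
    return theta
```

The design point mirrored here is the step-size separation beta << alpha: the slowly updated projection can treat the fast Bellman iterate as nearly converged, which is what the abstract's 'slow'/'faster' distinction refers to.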
