
Finite-Time Error Analysis of Soft Q-Learning: Switching System Approach

Published 11 Mar 2024 in cs.LG (arXiv:2403.06366v3)

Abstract: Soft Q-learning is a variation of Q-learning designed to solve entropy-regularized Markov decision problems, in which an agent aims to maximize the entropy-regularized value function. Despite its empirical success, theoretical studies of soft Q-learning have so far been limited. This paper offers a novel and unified finite-time, control-theoretic analysis of soft Q-learning algorithms. We focus on two types of soft Q-learning algorithms: one utilizing the log-sum-exp operator and the other employing the Boltzmann operator. By using dynamical switching system models, we derive novel finite-time error bounds for both soft Q-learning algorithms. We hope that our analysis deepens the current understanding of soft Q-learning by establishing connections with switching system models, and that it may pave the way for new frameworks in the finite-time analysis of other reinforcement learning algorithms.
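
The two operator variants named in the abstract admit a compact tabular sketch. The Python below is a minimal illustration under stated assumptions, not the paper's analysis or code: the environment interface (`reset`/`step` returning `(state, reward, done)`), the names `env`, `tau`, `alpha`, `gamma`, and the episode count are hypothetical choices for exposition. Only the backup operators themselves follow their standard definitions (log-sum-exp and Boltzmann softmax).

```python
import numpy as np

def logsumexp_backup(q_next, tau):
    """Soft backup: tau * log sum_a exp(Q(s', a) / tau), computed stably."""
    z = q_next / tau
    m = np.max(z)
    return tau * (m + np.log(np.sum(np.exp(z - m))))

def boltzmann_backup(q_next, tau):
    """Boltzmann backup: sum_a softmax(Q(s', .) / tau)_a * Q(s', a)."""
    w = np.exp(q_next / tau - np.max(q_next / tau))
    w /= w.sum()
    return float(np.dot(w, q_next))

def soft_q_learning(env, num_states, num_actions, backup,
                    tau=1.0, alpha=0.1, gamma=0.99, episodes=500):
    """Tabular soft Q-learning with a pluggable backup operator.

    Assumes a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = np.zeros((num_states, num_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: Boltzmann (softmax) exploration over Q(s, .).
            p = np.exp(Q[s] / tau - np.max(Q[s] / tau))
            p /= p.sum()
            a = rng.choice(num_actions, p=p)
            s_next, r, done = env.step(a)
            # TD target uses the chosen soft backup at the next state.
            target = r + (0.0 if done else gamma * backup(Q[s_next], tau))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Passing `logsumexp_backup` gives the log-sum-exp variant analyzed in the paper, while `boltzmann_backup` gives the Boltzmann-operator variant; the paper's contribution is the finite-time error bounds for both, derived via switching system models, not the update rules themselves.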

