Finite-Time Error Analysis of Soft Q-Learning: Switching System Approach (2403.06366v1)
Abstract: Soft Q-learning is a variation of Q-learning designed to solve entropy-regularized Markov decision problems, in which an agent aims to maximize the entropy-regularized value function. Despite its empirical success, soft Q-learning has received limited theoretical study to date. This paper offers a novel and unified finite-time, control-theoretic analysis of soft Q-learning algorithms. We focus on two types of soft Q-learning algorithms: one utilizing the log-sum-exp operator and the other employing the Boltzmann operator. Using dynamical switching system models, we derive novel finite-time error bounds for both algorithms. We hope that our analysis will deepen the current understanding of soft Q-learning by establishing connections with switching system models, and may even pave the way for new frameworks in the finite-time analysis of other reinforcement learning algorithms.
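For concreteness, the sketch below illustrates the two tabular update rules the abstract refers to: one forming the bootstrapped target with the log-sum-exp operator and the other with the Boltzmann (softmax-weighted) operator. This is a minimal illustration assuming a tabular setting with temperature `tau`, step size `alpha`, and discount factor `gamma`; the names and exact notation are illustrative and may differ from the paper.

```python
import numpy as np

def log_sum_exp_target(q_next, tau):
    # Soft value via the log-sum-exp operator:
    # tau * log(sum_a exp(Q(s', a) / tau)), shifted by the max for numerical stability.
    m = np.max(q_next)
    return m + tau * np.log(np.sum(np.exp((q_next - m) / tau)))

def boltzmann_target(q_next, tau):
    # Soft value via the Boltzmann operator:
    # sum_a softmax(Q(s', .) / tau)_a * Q(s', a).
    m = np.max(q_next)
    w = np.exp((q_next - m) / tau)
    return np.dot(w / np.sum(w), q_next)

def soft_q_update(Q, s, a, r, s_next, alpha, gamma, tau, operator="lse"):
    # One stochastic soft Q-learning step on a tabular Q (|S| x |A| array),
    # using either the log-sum-exp ("lse") or the Boltzmann target.
    if operator == "lse":
        soft_v = log_sum_exp_target(Q[s_next], tau)
    else:
        soft_v = boltzmann_target(Q[s_next], tau)
    Q[s, a] += alpha * (r + gamma * soft_v - Q[s, a])
    return Q
```

In the switching-system viewpoint developed for standard Q-learning, which the abstract indicates is extended here to the soft operators, the iterate error $x_k = Q_k - Q^*$ is modeled roughly as a switched affine system $x_{k+1} = A_{\sigma_k} x_k + b_{\sigma_k} + \alpha w_k$, where the switching signal $\sigma_k$ is induced by the current (soft-)greedy policy and $w_k$ is a stochastic noise term; the precise system matrices for the log-sum-exp and Boltzmann cases are presumably constructed and bounded in the paper's analysis.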
Authors: Narim Jeong and Donghwan Lee