Regularized Q-learning (2202.05404v7)

Published 11 Feb 2022 in cs.LG

Abstract: Q-learning is a widely used algorithm in the reinforcement learning community. Under the lookup-table setting, its convergence is well established. However, its behavior is known to be unstable in the linear function approximation case. This paper develops a new Q-learning algorithm that converges when linear function approximation is used. We prove that simply adding an appropriate regularization term ensures convergence of the algorithm. We prove its stability using a recent analysis tool based on switching system models. Moreover, we experimentally show that it converges in environments where Q-learning with linear function approximation is known to diverge. We also provide an error bound on the solution to which the algorithm converges.
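The abstract describes adding a regularization term to Q-learning with linear function approximation to restore convergence. As a rough illustration only (not the paper's construction), the sketch below adds an L2 penalty on the weight vector to the standard semi-gradient Q-learning update; the random MDP, feature map, and the regularization coefficient `eta` are placeholder assumptions chosen for the demo.

```python
import numpy as np

# Minimal sketch: linear Q-learning with an L2 (ridge) regularization term
# on the weight vector. Illustrative only -- not the paper's exact algorithm.

rng = np.random.default_rng(0)

n_states, n_actions, n_features = 6, 2, 4
gamma, alpha, eta = 0.95, 0.05, 0.1   # discount, step size, regularization strength (assumed)

# Random tabular MDP and a fixed random feature map phi(s, a).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition probabilities
R = rng.normal(size=(n_states, n_actions))                         # rewards
Phi = rng.normal(size=(n_states, n_actions, n_features))           # features phi(s, a)

theta = np.zeros(n_features)

def q(s, a, theta):
    """Linear action-value estimate Q_theta(s, a) = phi(s, a)^T theta."""
    return Phi[s, a] @ theta

s = 0
for t in range(50_000):
    a = rng.integers(n_actions)                 # uniform behavior policy (off-policy)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # Semi-gradient Q-learning TD error with a max over next actions.
    td_error = r + gamma * max(q(s_next, b, theta) for b in range(n_actions)) - q(s, a, theta)

    # Regularized update: the extra -eta * theta term shrinks the weights,
    # which is the kind of regularization the abstract credits with
    # restoring convergence under linear function approximation.
    theta += alpha * (td_error * Phi[s, a] - eta * theta)
    s = s_next

print("final weights:", theta)
```

The only change relative to vanilla linear Q-learning is the `-eta * theta` term in the update; how the regularizer is chosen and why it guarantees convergence (via the switching-system analysis) is the subject of the paper itself.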
