
Finite-Time Analysis of Minimax Q-Learning for Two-Player Zero-Sum Markov Games: Switching System Approach (2306.05700v2)

Published 9 Jun 2023 in eess.SY, cs.GT, cs.LG, and cs.SY

Abstract: The objective of this paper is to provide a finite-time analysis of a Q-learning algorithm applied to two-player zero-sum Markov games. Specifically, we establish a finite-time analysis of both the minimax Q-learning algorithm and the corresponding value iteration method. To enhance the analysis of both value iteration and Q-learning, we employ the switching system model of minimax Q-learning and the associated value iteration. This approach provides further insights into minimax Q-learning and facilitates a more straightforward and insightful convergence analysis. We anticipate that these additional insights have the potential to uncover novel connections and foster collaboration between the control theory and reinforcement learning communities.
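
For readers who want a concrete picture of the algorithm under analysis, the sketch below shows tabular minimax Q-learning in the spirit of Littman's formulation: the one-step target uses the minimax value of the stage matrix game at the next state, computed by a small linear program. This is an illustrative sketch, not the paper's implementation; the environment interface (`env.step`), the uniformly random exploration scheme, and all hyperparameters are assumptions made here for the example.

```python
# A minimal sketch of tabular minimax Q-learning for a two-player zero-sum
# Markov game. The environment interface `env.step(s, a, b) -> (next_state,
# reward)` is a hypothetical assumption for illustration.
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(M):
    """Value of the zero-sum matrix game max_pi min_b pi^T M[:, b], via an LP."""
    n_a, n_b = M.shape
    # Decision variables: pi(0), ..., pi(n_a - 1) and the game value v; minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action b: v - sum_a pi(a) * M[a, b] <= 0.
    A_ub = np.hstack([-M.T, np.ones((n_b, 1))])
    b_ub = np.zeros(n_b)
    # The mixed strategy pi must sum to one.
    A_eq = np.zeros((1, n_a + 1))
    A_eq[0, :n_a] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1]  # minimax value of the stage game


def minimax_q_learning(env, n_states, n_a, n_b, gamma=0.9, alpha=0.1, steps=10_000):
    """Run minimax Q-learning with uniformly random exploration by both players."""
    Q = np.zeros((n_states, n_a, n_b))
    s = 0
    for _ in range(steps):
        a, b = np.random.randint(n_a), np.random.randint(n_b)
        s_next, r = env.step(s, a, b)  # assumed environment interface
        target = r + gamma * matrix_game_value(Q[s_next])
        Q[s, a, b] += alpha * (target - Q[s, a, b])
        s = s_next
    return Q
```

The corresponding minimax value iteration replaces the sampled target with a full expectation over next states; the paper's switching system viewpoint analyzes how iterates of this kind evolve under the changing minimizing/maximizing policies.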
