
A Payoff-Based Policy Gradient Method in Stochastic Games with Long-Run Average Payoffs (2405.09811v1)

Published 16 May 2024 in cs.GT

Abstract: Despite their significant potential across applications, stochastic games with long-run average payoffs have received limited scholarly attention, particularly regarding learning algorithms, owing to the difficulty of their mathematical analysis. In this paper, we study stochastic games with long-run average payoffs and present an equivalent formulation of individual payoff gradients by defining advantage functions, which we prove to be bounded. This allows us to show that each individual payoff gradient is Lipschitz continuous with respect to the policy profile, and that the value function of the game satisfies a gradient dominance property. Leveraging these insights, we devise a payoff-based gradient estimation approach and combine it with the Regularized Robbins-Monro method from stochastic approximation theory to construct a bandit learning algorithm suited to stochastic games with long-run average payoffs. We further prove that if all players adopt our algorithm, the policy profile converges asymptotically to a Nash equilibrium with probability one, provided that all Nash equilibria are globally neutrally stable and a globally variationally stable Nash equilibrium exists. This condition covers a wide class of games, including monotone games.
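The abstract describes a payoff-based (bandit) gradient estimator combined with a Robbins-Monro stochastic-approximation update. The sketch below illustrates that general pipeline with a one-point gradient estimate and projected ascent on the probability simplex; the function names, step-size and exploration schedules, and the omission of the paper's regularization are all assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def one_point_gradient_estimate(payoff_fn, x, delta, rng):
    """One-point (bandit) gradient estimate: perturb the policy by a
    random unit direction u, query the payoff once, and rescale.
    In expectation, (d/delta) * payoff(x + delta*u) * u approximates
    the gradient of a smoothed version of payoff_fn at x."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)              # uniform direction on the sphere
    return (d / delta) * payoff_fn(x + delta * u) * u

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    n = v.size
    s = np.sort(v)[::-1]
    css = np.cumsum(s)
    rho = np.nonzero(s * np.arange(1, n + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def robbins_monro_ascent(payoff_fn, x0, steps=2000, seed=0):
    """Projected stochastic-approximation ascent with the one-point
    estimator, a diminishing step size a_t, and a shrinking
    exploration radius delta_t (illustrative schedules)."""
    rng = np.random.default_rng(seed)
    x = project_simplex(np.asarray(x0, dtype=float))
    for t in range(steps):
        a_t = 1.0 / (t + 1)             # Robbins-Monro step size
        delta_t = 1.0 / (t + 1) ** 0.25 # exploration radius
        g = one_point_gradient_estimate(payoff_fn, x, delta_t, rng)
        x = project_simplex(x + a_t * g)
    return x
```

In the single-agent case this reduces to bandit optimization of one payoff; in the game setting of the paper, each player would run such an update on its own policy using only its observed payoff, with convergence to Nash equilibria established under the stability conditions stated in the abstract.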


