Scalable and Independent Learning of Nash Equilibrium Policies in $n$-Player Stochastic Games with Unknown Independent Chains (2312.01587v1)

Published 4 Dec 2023 in cs.GT and cs.LG

Abstract: We study a subclass of $n$-player stochastic games, namely, stochastic games with independent chains and unknown transition matrices. In this class of games, players control their own internal Markov chains whose transitions do not depend on the states/actions of other players. However, players' decisions are coupled through their payoff functions. We assume players can receive only realizations of their payoffs, that they cannot observe the states and actions of other players, and that they do not know the transition probability matrices of their own Markov chains. Relying on a compact dual formulation of the game based on occupancy measures, together with confidence sets that maintain high-probability estimates of the unknown transition matrices, we propose a fully decentralized mirror descent algorithm to learn an $\epsilon$-NE for this class of games. The proposed algorithm has the desired properties of independence, scalability, and convergence. Specifically, under no assumptions on the reward functions, we show the proposed algorithm converges in polynomial time, in a weaker distance (namely, the averaged Nikaido-Isoda gap), to the set of $\epsilon$-NE policies with arbitrarily high probability. Moreover, assuming the existence of a variationally stable Nash equilibrium policy, we show that the proposed algorithm converges asymptotically to the stable $\epsilon$-NE policy with arbitrarily high probability. In addition to Markov potential games and linear-quadratic stochastic games, this work provides another subclass of $n$-player stochastic games that, under some mild assumptions, admit polynomial-time learning algorithms for finding their stationary $\epsilon$-NE policies.
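
To make the mirror-descent idea in the abstract concrete, here is a minimal, hypothetical Python sketch of one player's exponentiated-gradient update on its own occupancy measure, treated simply as a point on the player's state-action simplex. This is not the paper's algorithm: the actual method additionally maintains confidence sets for the player's unknown transition matrix and restricts updates to the induced occupancy-measure polytope (flow constraints), and the payoff feedback below is simulated rather than generated by other players' play.

```python
import numpy as np

def mirror_descent_step(q, grad_est, lr):
    """One entropic mirror-descent (exponentiated-gradient) ascent step on the
    probability simplex over this player's state-action pairs.

    q        : current occupancy-measure iterate, shape (|S|*|A|,)
    grad_est : stochastic estimate of the payoff gradient w.r.t. q
    lr       : learning rate
    """
    logits = np.log(q + 1e-12) + lr * grad_est
    q_new = np.exp(logits - logits.max())   # subtract max for numerical stability
    return q_new / q_new.sum()

# Toy run: 2 states x 2 actions, noisy payoff realizations (purely illustrative).
rng = np.random.default_rng(0)
n_sa = 2 * 2
q = np.full(n_sa, 1.0 / n_sa)               # uniform initial occupancy measure
for t in range(1, 201):
    payoff_realization = rng.normal(loc=[1.0, 0.5, 0.2, 0.0], scale=0.1)
    q = mirror_descent_step(q, payoff_realization, lr=1.0 / np.sqrt(t))
print(np.round(q, 3))                        # mass concentrates on high-payoff pairs
```

The entropic regularizer is what makes the update multiplicative, so iterates remain valid probability distributions without an explicit Euclidean projection; in the paper's setting, the remaining work lies in enforcing the occupancy-measure constraints against a confidence set for the unknown transition kernel.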
