Global Convergence of Policy Gradient Methods in Reinforcement Learning, Games and Control
Abstract: Policy gradient methods, which search for the policy of interest by maximizing the value functions using first-order information, have become increasingly popular for sequential decision making in reinforcement learning, games, and control. Guaranteeing the global optimality of policy gradient methods, however, is highly nontrivial due to the nonconcavity of the value functions. In this exposition, we highlight recent progress in understanding and developing policy gradient methods with global convergence guarantees, with an emphasis on their finite-time convergence rates with respect to salient problem parameters.
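To make the idea of maximizing a value function with first-order information concrete, here is a minimal sketch on the simplest possible decision problem: a single-state, three-action problem (a bandit) with a softmax policy and exact gradients. The reward vector, step size, iteration count, and all function names are illustrative assumptions, not taken from the paper. The objective V(θ) = Σ_a π_θ(a) r(a) is nonconcave in θ, yet plain gradient ascent still approaches the best action in this toy instance.

```python
import math

def softmax(theta):
    """Softmax policy over actions, computed stably."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def value(theta, rewards):
    """Expected reward V(theta) = sum_a pi_theta(a) * r(a)."""
    pi = softmax(theta)
    return sum(p * r for p, r in zip(pi, rewards))

def policy_gradient(theta, rewards):
    """Exact gradient: dV/dtheta_a = pi(a) * (r(a) - V(theta))."""
    pi = softmax(theta)
    v = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - v) for p, r in zip(pi, rewards)]

def run_gradient_ascent(rewards, step_size=0.4, num_iters=2000):
    """Plain gradient ascent on the softmax logits, starting from uniform."""
    theta = [0.0] * len(rewards)
    history = [value(theta, rewards)]
    for _ in range(num_iters):
        grad = policy_gradient(theta, rewards)
        theta = [t + step_size * g for t, g in zip(theta, grad)]
        history.append(value(theta, rewards))
    return theta, history

# Hypothetical rewards in [0, 1]; action 0 is optimal.
REWARDS = [1.0, 0.8, 0.2]
theta_final, values = run_gradient_ascent(REWARDS)
print("initial value:", values[0])
print("final value:  ", values[-1])
print("final policy: ", softmax(theta_final))
```

The value rises slowly toward the optimum of 1.0 rather than reaching it exactly, which is the kind of finite-time behavior that convergence-rate analyses quantify. Natural policy gradient and regularized variants change the update rule but keep the same first-order structure.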