On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes (2403.06806v1)
Abstract: We present the first finite-time global convergence analysis of policy gradient in the context of infinite-horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations. Prior performance bounds for discounted reward MDPs cannot be extended to average reward MDPs because those bounds grow in proportion to the fifth power of the effective horizon. Thus, our primary contribution is in proving that the policy gradient algorithm converges for average-reward MDPs and in obtaining finite-time performance guarantees. In contrast to the existing discounted reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted reward MDPs. We also present simulations to empirically evaluate the performance of the average reward policy gradient algorithm.
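The abstract describes policy gradient ascent on the average reward (gain) of an ergodic tabular MDP. Below is a minimal NumPy sketch of one exact gradient step under a known model, assuming a softmax policy parameterization; the function names (`stationary_dist`, `policy_gradient_step`), the step size, and the parameterization are illustrative assumptions and not necessarily the exact algorithm or parameterization analyzed in the paper.

```python
import numpy as np

def stationary_dist(P_pi):
    """Stationary distribution of an ergodic chain: solve mu @ P_pi = mu with sum(mu) = 1."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def policy_gradient_step(theta, P, r, eta):
    """One exact softmax policy-gradient step on the average-reward objective.

    P: transition kernel of shape (S, A, S), r: rewards of shape (S, A),
    theta: softmax logits of shape (S, A), eta: step size.
    """
    S, A = r.shape
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    P_pi = np.einsum('sa,sat->st', pi, P)          # state transition matrix under pi
    r_pi = (pi * r).sum(axis=1)                    # expected one-step reward under pi
    mu = stationary_dist(P_pi)
    rho = mu @ r_pi                                # average reward (gain)

    # Differential (bias) value function h: (I - P_pi) h = r_pi - rho, normalized by mu @ h = 0.
    M = np.vstack([np.eye(S) - P_pi, mu[None, :]])
    b = np.concatenate([r_pi - rho, [0.0]])
    h, *_ = np.linalg.lstsq(M, b, rcond=None)

    Q = r - rho + P @ h                            # differential action values
    adv = Q - (pi * Q).sum(axis=1, keepdims=True)  # advantage under pi
    grad = mu[:, None] * pi * adv                  # gradient of rho w.r.t. the logits
    return theta + eta * grad, rho

# Illustrative usage on a small random ergodic MDP (hypothetical instance).
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
theta = np.zeros((S, A))
for _ in range(2000):
    theta, rho = policy_gradient_step(theta, P, r, eta=1.0)
print("average reward:", rho)
```

Iterating this update from a uniform initialization illustrates the kind of monotone improvement in the gain $\rho$ that the paper's finite-time analysis quantifies.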
Authors: Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Y. Levy, R. Srikant, Shie Mannor