Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs (2206.02346v3)
Abstract: We study sequential decision making problems aimed at maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization we prove that our method achieves global convergence with sublinear rates for both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by the restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. Finally, we use computational experiments to showcase the merits and effectiveness of our approach.
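To make the abstract's description of the NPG-PD iteration concrete, below is a minimal sketch, not the authors' implementation, of the two updates on a small randomly generated tabular constrained MDP with softmax policies: a natural policy gradient ascent step on the Lagrangian (which, under the softmax parametrization, reduces to a multiplicative-weights update driven by the Lagrangian advantage) and a projected sub-gradient descent step on the dual variable. The random problem data, the step sizes `eta_primal` and `eta_dual`, the threshold `b`, and the dual projection interval `[0, LAMBDA_MAX]` are illustrative assumptions rather than quantities taken from the paper.

```python
# A minimal sketch (not the authors' code) of an NPG-PD loop on a small
# randomly generated tabular constrained MDP with softmax policies.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                       # states, actions, discount factor
P = rng.dirichlet(np.ones(S), size=(S, A))    # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                  # reward function
g = rng.uniform(0.0, 3.0, size=(S, A))        # utility function
b = 15.0                                      # illustrative threshold: V_g(rho) >= b
rho = np.ones(S) / S                          # initial-state distribution

def value(pi, payoff):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = payoff_pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    payoff_pi = np.einsum('sa,sa->s', pi, payoff)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, payoff_pi)

def q_value(pi, payoff):
    return payoff + gamma * P @ value(pi, payoff)   # Q[s, a]

pi = np.ones((S, A)) / A                      # uniform initial softmax policy
lam, LAMBDA_MAX = 0.0, 10.0                   # dual variable, projection interval [0, LAMBDA_MAX]
eta_primal, eta_dual = 0.1, 0.05              # step sizes (assumed, not tuned)

for t in range(500):
    # Primal step: natural policy gradient ascent on the Lagrangian r + lam * g.
    # Under the softmax parametrization this is a multiplicative-weights update
    # driven by the Lagrangian advantage; max-subtraction keeps exp() stable.
    Q_L = q_value(pi, r + lam * g)
    A_L = Q_L - np.einsum('sa,sa->s', pi, Q_L)[:, None]
    logits = np.log(pi) + eta_primal / (1.0 - gamma) * A_L
    logits -= logits.max(axis=1, keepdims=True)
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)

    # Dual step: projected sub-gradient descent on lam using the constraint slack.
    slack = rho @ value(pi, g) - b
    lam = float(np.clip(lam - eta_dual * slack, 0.0, LAMBDA_MAX))

print("reward value  V_r(rho):", rho @ value(pi, r))
print("utility value V_g(rho):", rho @ value(pi, g), "(threshold b =", b, ")")
print("dual variable lambda:  ", lam)
```

The closed-form multiplicative update is what makes the softmax case convenient; for the log-linear and general smooth parametrizations discussed in the abstract, the natural gradient direction would instead be obtained from a compatible function approximation subproblem, which introduces the stated approximation error.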
- Self-learning control of constrained Markov chains - a gradient approach. In IEEE Conference on Decision and Control, volume 2, 2002.
- Policy gradient stochastic approximation algorithms for adaptive control of constrained time-varying Markov decision processes. In IEEE Conference on Decision and Control, volume 3, pages 2823–2828, 2003.
- Optimizing debt collections using constrained reinforcement learning. In International Conference on Knowledge Discovery and Data Mining, pages 75–84, 2010.
- Constrained policy optimization. In International Conference on Machine Learning, volume 70, pages 22–31, 2017.
- PC-PG: Policy cover directed exploration for provable policy gradient learning. In International Conference on Neural Information Processing Systems, 2020.
- On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.
- Eitan Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
- Safe reinforcement learning with linear function approximation. In International Conference on Machine Learning, 2021.
- Kenneth J Arrow. Studies in Linear and Non-linear Programming. Stanford University Press, 1958.
- Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.
- Model-free algorithm and regret analysis for MDPs with peak constraints. arXiv preprint arXiv:2003.05555, 2020.
- Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In AAAI Conference on Artificial Intelligence, 2022.
- Achieving zero constraint violation for constrained reinforcement learning via conservative natural policy gradient primal-dual algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 6737–6744, 2023.
- Amir Beck. First-order Methods in Optimization, volume 25. SIAM, 2017.
- Dimitri P Bertsekas. Nonlinear Programming: Second Edition. Athena Scientific, 2008.
- Dimitri P Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 2014.
- Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
- On the linear convergence of policy gradient methods for finite MDPs. In International Conference on Artificial Intelligence and Statistics, pages 2386–2394, 2021.
- Shalabh Bhatnagar and K Lakshmanan. An online actor–critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.
- Vivek S Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
- OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 2021.
- Prediction, Learning, and Games. Cambridge University Press, 2006.
- Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(1):6070–6120, 2017.
- A Lyapunov-based approach to safe reinforcement learning. In International Conference on Neural Information Processing Systems, 2018.
- Natural policy gradient primal-dual method for constrained Markov decision processes. In International Conference on Neural Information Processing Systems, 2020.
- Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pages 3304–3312, 2021.
- On the global optimum convergence of momentum-based policy gradient. In International Conference on Artificial Intelligence and Statistics, pages 1910–1934, 2022.
- Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
- Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
- Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies. arXiv preprint arXiv:2302.01734, 2023.
- Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476, 2018.
- A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
- Towards painless policy optimization for constrained MDPs. In Uncertainty in Artificial Intelligence, pages 895–905, 2022.
- Jiaming Ji, Jiayi Zhou, Borong Zhang, Juntao Dai, Xuehai Pan, Ruiyang Sun, Weidong Huang, Yiran Geng, Mickel Liu, and Yaodong Yang. OmniSafe: An infrastructure for accelerating safe reinforcement learning research. arXiv preprint arXiv:2305.09304, 2023.
- S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.
- Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pages 267–274, 2002.
- On linear and super-linear convergence of natural policy gradient algorithm. Systems & Control Letters, 164:105214, 2022.
- A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
- Faster algorithm and sharper analysis for constrained Markov decision process. arXiv preprint arXiv:2110.10351, 2021.
- Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480, 2018.
- Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, 2019.
- Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, pages 10564–10575, 2019.
- Learning policies with zero or bounded constraint violation for constrained MDPs. In International Conference on Neural Information Processing Systems, 2021a.
- Policy optimization for constrained MDPs with provable fast global convergence. arXiv preprint arXiv:2111.00552, 2021b.
- An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. In Advances in Neural Information Processing Systems, 2020a.
- IPO: Interior-point policy optimization under constraints. In AAAI Conference on Artificial Intelligence, volume 34, 2020b.
- Linear and Nonlinear Programming, volume 2. Springer, 1984.
- Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, 13(Sep):2503–2528, 2012.
- Derivative-free methods for policy optimization: Guarantees for linear-quadratic systems. Journal of Machine Learning Research, 51:1–51, 2020.
- On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, 2020.
- A simple reward-free approach to constrained reinforcement learning. In International Conference on Machine Learning, pages 15666–15698, 2022.
- Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
- Global exponential convergence of gradient methods over the nonconvex landscape of the linear quadratic regulator. In IEEE Conference on Decision and Control, pages 7474–7479, 2019.
- Learning the model-free linear quadratic regulator via random search. In Conference on Learning for Dynamics and Control, volume 120, pages 1–9, 2020.
- On the linear convergence of random search for discrete-time LQR. IEEE Control Systems Letters, 5(3):989–994, 2021.
- Convergence and sample complexity of gradient methods for the model-free linear-quadratic regulator problem. IEEE Transactions on Automatic Control, 67(5):2435–2450, 2022.
- Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.
- Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.
- Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, pages 7553–7563, 2019.
- Safe policies for reinforcement learning via primal-dual methods. IEEE Transactions on Automatic Control, 2022.
- Upper confidence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions. In Advances in Neural Information Processing Systems, 2020.
- Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 2019.
- Constrained Markov decision processes via backward value functions. In International Conference on Machine Learning, 2020.
- Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
- Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
- Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In AAAI Conference on Artificial Intelligence, 2020.
- Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
- A natural actor-critic algorithm with downside risk constraints. arXiv preprint arXiv:2007.04203, 2020.
- Responsive safety in reinforcement learning by PID Lagrangian methods. In International Conference on Machine Learning, pages 9133–9143, 2020.
- Stagewise safe Bayesian optimization with Gaussian processes. In International Conference on Machine Learning, 2018.
- Reinforcement Learning: An Introduction. MIT Press, 2018.
- Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
- Continuity of optimal solution functions and their conditions on objective functions. SIAM Journal on Optimization, 25(4):2050–2060, 2015.
- Reward constrained policy optimization. In International Conference on Learning Representations, 2019.
- MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
- Constrained reinforcement learning from intrinsic and extrinsic rewards. In International Conference on Development and Learning, pages 163–168, 2007.
- Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations, 2019.
- A provably-efficient model-free algorithm for infinite-horizon average-reward constrained Markov decision processes. In AAAI Conference on Artificial Intelligence, 2022.
- Online primal-dual mirror descent under stochastic constraints. International Conference on Measurement and Modeling of Computer Systems, 2020.
- CRPO: A new approach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning, 2021.
- Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems. In Advances in Neural Information Processing Systems, 2020a.
- Constrained update projection approach to safe policy optimization. In International Conference on Neural Information Processing Systems, 2022.
- Projection-based constrained policy optimization. In International Conference on Learning Representations, 2020b.
- A dual approach to constrained Markov decision processes with entropy regularization. In International Conference on Artificial Intelligence and Statistics, pages 1887–1909, 2022.
- Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1428–1438, 2017.
- Convergent policy optimization for safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 3121–3133, 2019.
- Provably efficient algorithms for multi-objective competitive RL. In International Conference on Machine Learning, 2021.
- Online convex optimization for cumulative constraints. In Advances in Neural Information Processing Systems, pages 6137–6146, 2018.
- Cautiously optimistic policy optimization and exploration with linear function approximation. In Conference on Learning Theory, volume 134, pages 4473–4525, 2021.
- Finite-time complexity of online primal-dual natural actor-critic algorithm for constrained Markov decision processes. In IEEE 61st Conference on Decision and Control, pages 4028–4033, 2022.
- Variational policy gradient method for reinforcement learning with general utilities. Advances in Neural Information Processing Systems, 33:4572–4583, 2020a.
- Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 2019a.
- Non-cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 9482–9493, 2019b.
- First order constrained optimization in policy space. Advances in Neural Information Processing Systems, 33, 2020b.
- Constrained upper confidence reinforcement learning. In Conference on Learning for Dynamics and Control, 2020.