
Truly No-Regret Learning in Constrained MDPs (2402.15776v3)

Published 24 Feb 2024 in cs.LG and stat.ML

Abstract: Constrained Markov decision processes (CMDPs) are a common way to model safety constraints in reinforcement learning. State-of-the-art methods for efficiently solving CMDPs are based on primal-dual algorithms. For these algorithms, all currently known regret bounds allow for error cancellations: one can compensate for a constraint violation in one round with strict constraint satisfaction in another. This makes the online learning process unsafe, since it only guarantees safety for the final (mixture) policy but not during learning. As Efroni et al. (2020) pointed out, it is an open question whether primal-dual algorithms can provably achieve sublinear regret if we do not allow error cancellations. In this paper, we give the first affirmative answer. We first generalize a result on last-iterate convergence of regularized primal-dual schemes to CMDPs with multiple constraints. Building upon this insight, we propose a model-based primal-dual algorithm to learn in an unknown CMDP. We prove that our algorithm achieves sublinear regret without error cancellations.
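
To make the notion of error cancellation concrete, the following minimal sketch contrasts the two ways of accounting for constraint violations across learning rounds. It is a toy illustration, not the paper's algorithm: the function names, the scalar constraint with threshold b, and the regularized dual step are all assumptions made here for exposition.

import numpy as np

def constraint_regret(gaps):
    # gaps[k] = c(pi_k) - b: the signed constraint gap of the policy played
    # in round k (positive = violation, negative = strict satisfaction).
    cancelling = np.sum(gaps)                # slack in one round offsets a violation in another
    strong = np.sum(np.maximum(gaps, 0.0))   # no error cancellations: only violations count
    return cancelling, strong

# A learner that alternates between violating the constraint and strictly
# satisfying it looks safe under the cancelling notion, but accumulates
# regret linearly once cancellations are disallowed.
gaps = np.array([1.0, -1.0] * 50)
print(constraint_regret(gaps))  # -> (0.0, 50.0)

# Toy regularized dual-ascent step of the general kind used in regularized
# primal-dual schemes (an illustrative assumption, not the authors' update):
# the -tau * lam term damps oscillations of the dual variable, which is the
# mechanism behind last-iterate (rather than average-iterate) convergence.
def dual_step(lam, gap, eta=0.1, tau=0.01, lam_max=10.0):
    return float(np.clip(lam + eta * (gap - tau * lam), 0.0, lam_max))

The alternating example above is exactly why averaging-based analyses certify safety only for the final mixture policy: the strong, no-cancellation notion instead tracks every round in which the played policy itself was unsafe.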

References (43)
  1. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
  2. Eitan Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
  3. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008.
  4. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3682–3689, 2022.
  5. Amir Beck. First-Order Methods in Optimization. SIAM, 2017.
  6. Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957. URL http://www.jstor.org/stable/24900506.
  7. Convex Analysis and Optimization, volume 1. Athena Scientific, 2003.
  8. Foundations of Data Science. Cambridge University Press, 2020.
  9. Vivek S. Borkar. A convex analytic approach to Markov decision processes. Probability Theory and Related Fields, 1988.
  10. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5:411–444, 2022.
  11. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
  12. State augmented constrained reinforcement learning: Overcoming the limitations of learning with rewards. IEEE Transactions on Automatic Control, 2023.
  13. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.
  14. Thomas M. Cover. Elements of Information Theory. John Wiley & Sons, 1999.
  15. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
  16. Policy gradient primal-dual mirror descent for constrained MDPs with large state spaces. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 4892–4897. IEEE, 2022.
  17. Natural policy gradient primal-dual method for constrained Markov decision processes. Advances in Neural Information Processing Systems, 33:8378–8390, 2020.
  18. Convergence and optimality of policy gradient primal-dual method for constrained Markov decision processes. In 2022 American Control Conference (ACC), pages 2851–2856. IEEE, 2022.
  19. Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs. arXiv preprint arXiv:2206.02346, 2022.
  20. Last-iterate convergent policy gradient primal-dual methods for constrained MDPs. arXiv preprint arXiv:2306.11700, 2023.
  21. Provably efficient primal-dual reinforcement learning for CMDPs with non-stationary objectives and constraints. arXiv preprint arXiv:2201.11965, 2022.
  22. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578–598. PMLR, 2021.
  23. Tight regret bounds for model-based reinforcement learning with greedy policies. Advances in Neural Information Processing Systems, 32, 2019.
  24. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
  25. Provably efficient model-free constrained RL with linear function approximation. Advances in Neural Information Processing Systems, 35:13303–13315, 2022.
  26. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.
  27. Faster algorithm and sharper analysis for constrained Markov decision process. arXiv preprint arXiv:2110.10351, 2021.
  28. Policy optimization for constrained MDPs with provable fast global convergence, 2021. URL https://arxiv.org/abs/2111.00552.
  29. Learning policies with zero or bounded constraint violation for constrained MDPs. Advances in Neural Information Processing Systems, 34:17183–17193, 2021.
  30. ReLOAD: Reinforcement learning with optimistic ascent-descent for last-iterate convergence in constrained MDPs. arXiv preprint arXiv:2302.01275, 2023.
  31. Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
  32. Constrained reinforcement learning has zero duality gap. arXiv preprint arXiv:1910.13393, 2019.
  33. Safe policies for reinforcement learning via primal-dual methods. IEEE Transactions on Automatic Control, 2022.
  34. Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  35. Upper confidence primal-dual reinforcement learning for CMDP with adversarial loss. Advances in Neural Information Processing Systems, 33:15277–15287, 2020.
  36. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pages 8604–8613. PMLR, 2020.
  37. Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 1958.
  38. Responsive safety in reinforcement learning by PID Lagrangian methods. In International Conference on Machine Learning, pages 9133–9143. PMLR, 2020.
  39. Reinforcement Learning: An Introduction. MIT Press, 2018.
  40. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.
  41. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.
  42. A dual approach to constrained Markov decision processes with entropy regularization. In International Conference on Artificial Intelligence and Statistics, pages 1887–1909. PMLR, 2022.
  43. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.