Constrained Reinforcement Learning Under Model Mismatch (2405.01327v2)

Published 2 May 2024 in cs.LG

Abstract: Existing studies on constrained reinforcement learning (RL) may obtain a policy that performs well in the training environment. However, when deployed in a real environment, that policy may easily violate constraints that were satisfied during training, because of model mismatch between the training and real environments. To address this challenge, we formulate the problem as constrained RL under model uncertainty, where the goal is to learn a policy that optimizes the reward and at the same time satisfies the constraint under model mismatch. We develop a Robust Constrained Policy Optimization (RCPO) algorithm, the first algorithm that applies to large/continuous state spaces and has theoretical guarantees on worst-case reward improvement and constraint violation at each iteration during training. We demonstrate the effectiveness of our algorithm on a set of RL tasks with constraints.
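For concreteness, the robust constrained RL problem described in the abstract can be sketched in standard form as follows (a minimal formulation assuming a discounted setting; the uncertainty set $\mathcal{P}$, reward $r$, utility $c$, and threshold $b$ are generic symbols, not necessarily the paper's exact notation):

$$
\max_{\pi}\ \min_{P \in \mathcal{P}} \ \mathbb{E}_{P,\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
\min_{P \in \mathcal{P}} \ \mathbb{E}_{P,\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \ \ge\ b.
$$

Here $\mathcal{P}$ is a set of transition kernels around the nominal (training) model, so both the optimized reward and the satisfied constraint are evaluated under the worst-case model in the set; with a cost-type constraint the inner $\min$ becomes a $\max$ with a $\le$ threshold.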

Authors (4)
  1. Zhongchang Sun (9 papers)
  2. Sihong He (14 papers)
  3. Fei Miao (33 papers)
  4. Shaofeng Zou (53 papers)
Citations (3)