State-wise Constrained Policy Optimization (2306.12594v3)

Published 21 Jun 2023 in cs.LG and cs.RO

Abstract: Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms under the framework of Constrained Markov Decision Process (CMDP) do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. In particular, we introduce the framework of Maximum Markov Decision Process, and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.
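To make the distinction concrete, here is a minimal sketch of the optimization problem the abstract refers to, in assumed notation (not necessarily the paper's own): a standard CMDP bounds the expected cumulative cost of a policy, whereas the state-wise setting requires the expected cost to stay below its threshold at every step of the trajectory.

\[
\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_{t} \gamma^{t}\, r(s_t,a_t)\Big]
\quad\text{s.t.}\quad
\mathbb{E}_{\tau\sim\pi}\big[c_i(s_t,a_t)\big] \le w_i
\quad\text{for every step } t \text{ and every constraint } i,
\]

as opposed to the single cumulative CMDP constraint \(\mathbb{E}_{\tau\sim\pi}\big[\sum_{t} \gamma^{t} c_i(s_t,a_t)\big] \le d_i\). Per the abstract, the Maximum Markov Decision Process framework is what lets SCPO handle such per-state requirements within a policy search scheme while bounding the worst-case safety violation.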

Authors (5)
  1. Weiye Zhao (24 papers)
  2. Rui Chen (310 papers)
  3. Yifan Sun (183 papers)
  4. Tianhao Wei (25 papers)
  5. Changliu Liu (134 papers)