Probabilistic Constraint for Safety-Critical Reinforcement Learning (2306.17279v2)

Published 29 Jun 2023 in cs.LG and cs.AI

Abstract: In this paper, we consider the problem of learning safe policies for probabilistic-constrained reinforcement learning (RL). Specifically, a safe policy or controller is one that, with high probability, keeps the agent's trajectory within a given safe set. We establish a connection between this probabilistic-constrained setting and the cumulative-constrained formulation that is frequently explored in the existing literature. We provide theoretical bounds showing that the probabilistic-constrained setting offers a better trade-off between optimality and safety (constraint satisfaction). The main challenge in handling probabilistic constraints is the absence of explicit expressions for their gradients. Our prior work provides such an explicit gradient expression for probabilistic constraints, which we term Safe Policy Gradient-REINFORCE (SPG-REINFORCE). In this work, we provide an improved gradient estimator, SPG-Actor-Critic, which has lower variance than SPG-REINFORCE, as substantiated by our theoretical results. Notably, both SPGs are algorithm-agnostic and can be applied across a range of policy-based algorithms. Furthermore, we propose a Safe Primal-Dual algorithm that can leverage either SPG to learn safe policies, and we provide theoretical analyses covering the convergence of the algorithm as well as its near-optimality and feasibility on average. In addition, we evaluate the proposed approaches in a series of empirical experiments that examine the inherent trade-off between optimality and safety and substantiate the efficacy of both SPGs as well as our theoretical contributions.
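To make the setting concrete: the constraint the abstract describes is of the form P(s_t in S for all t = 0, ..., T) >= 1 - delta, i.e., the entire trajectory must remain in the safe set S with probability at least 1 - delta, in contrast to the more common cumulative formulation E[sum_t c(s_t, a_t)] <= C. The sketch below is a minimal, hypothetical primal-dual loop in that spirit, not the paper's actual algorithm or experiments: it estimates the probabilistic constraint with the Monte Carlo indicator of each sampled trajectory being safe and a score-function (REINFORCE-style) gradient, loosely analogous to SPG-REINFORCE, and it updates a dual variable on the constraint violation. The toy 1-D environment, Gaussian policy, learning rates, and all names are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's method): a primal-dual loop that
# maximizes expected return subject to P(trajectory stays in the safe set) >= 1 - DELTA,
# using score-function gradients of both the return and the safety probability.
import numpy as np

rng = np.random.default_rng(0)

HORIZON = 20          # steps per episode
SIGMA = 0.3           # std of the Gaussian policy
SAFE_BOUND = 2.0      # the safe set is {s : |s| < SAFE_BOUND}
DELTA = 0.1           # tolerated probability of ever leaving the safe set
LR_THETA, LR_LAMBDA = 0.005, 0.2
BATCH = 64

def rollout(theta):
    """One episode of a toy 1-D integrator driven by a Gaussian policy a ~ N(theta, SIGMA^2)."""
    s, ret, score, safe = 0.0, 0.0, 0.0, 1.0
    for _ in range(HORIZON):
        a = rng.normal(theta, SIGMA)
        score += (a - theta) / SIGMA**2   # d/dtheta of log N(a; theta, SIGMA^2)
        s += a                            # integrator dynamics
        ret += a                          # reward favors large actions ...
        if abs(s) >= SAFE_BOUND:          # ... which eventually push s out of the safe set
            safe = 0.0
    return ret, safe, score

theta, lam = 0.0, 0.0
for _ in range(1000):
    rets, safes, scores = map(np.array, zip(*(rollout(theta) for _ in range(BATCH))))
    # Score-function (REINFORCE-style) gradient estimates with a mean baseline:
    # one for the expected return, one for P(trajectory stays in the safe set),
    # the latter obtained from the indicator that each sampled trajectory was safe.
    grad_return = np.mean(scores * (rets - rets.mean()))
    grad_prob_safe = np.mean(scores * (safes - safes.mean()))
    # Primal ascent on the Lagrangian  E[return] + lam * (P(safe) - (1 - DELTA)).
    theta += LR_THETA * (grad_return + lam * grad_prob_safe)
    # Dual step with projection onto lam >= 0: lam grows while the estimated
    # probability of staying safe falls short of 1 - DELTA.
    lam = max(0.0, lam - LR_LAMBDA * (safes.mean() - (1.0 - DELTA)))

print(f"theta={theta:.3f}  lambda={lam:.3f}  estimated P(safe)={safes.mean():.2f}")
```

Replacing the indicator-based Monte Carlo estimate of the safety probability with a learned critic would correspond, loosely, to the lower-variance SPG-Actor-Critic variant described in the abstract.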
