Reinforcement Learning in a Safety-Embedded MDP with Trajectory Optimization (2310.06903v2)
Abstract: Safe Reinforcement Learning (RL) is essential for applying RL algorithms to safety-critical real-world applications, where agents must balance maximizing rewards against adhering to safety constraints. This work introduces a novel approach that combines RL with trajectory optimization to manage this trade-off. The approach embeds safety constraints within the action space of a modified Markov Decision Process (MDP): the RL agent produces a sequence of actions that a trajectory optimizer transforms into safe trajectories, thereby ensuring safety and improving training stability. On challenging Safety Gym tasks, the method achieves significantly higher rewards and near-zero safety violations during inference. Its real-world applicability is demonstrated through a safe and effective deployment on a real robot pushing a box around obstacles.
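The core idea described in the abstract can be illustrated with a minimal sketch. The names below (`TrajectoryOptimizer`, `SafetyEmbeddedEnv`, `project`, `execute_trajectory`) are hypothetical and do not correspond to the paper's actual implementation; the sketch only assumes that the agent's action is a sequence of waypoints, that obstacles are represented as point centers, and that safety is a minimum-clearance constraint:

```python
import numpy as np


class TrajectoryOptimizer:
    """Hypothetical optimizer: nudges a proposed waypoint sequence toward the
    set of trajectories that keep a minimum clearance from known obstacles."""

    def __init__(self, obstacles, min_clearance=0.2, step=0.05, iters=50):
        self.obstacles = np.asarray(obstacles, dtype=float)  # (num_obstacles, dim) centers
        self.min_clearance = min_clearance
        self.step = step
        self.iters = iters

    def project(self, waypoints):
        """Push each violating waypoint radially away from obstacles until the
        clearance constraint is (approximately) satisfied."""
        traj = np.array(waypoints, dtype=float, copy=True)   # (num_waypoints, dim)
        for _ in range(self.iters):
            for obs in self.obstacles:
                diff = traj - obs                              # obstacle-to-waypoint vectors
                dist = np.linalg.norm(diff, axis=1, keepdims=True)
                viol = dist < self.min_clearance               # which waypoints violate safety
                traj += viol * self.step * diff / np.maximum(dist, 1e-8)
        return traj


class SafetyEmbeddedEnv:
    """Hypothetical wrapper: the RL agent acts in a modified MDP whose actions are
    waypoint sequences, which are made safe before being executed."""

    def __init__(self, env, optimizer):
        self.env = env                # assumed to expose execute_trajectory(traj)
        self.optimizer = optimizer

    def step(self, waypoint_action):
        safe_traj = self.optimizer.project(waypoint_action)   # enforce safety constraints
        return self.env.execute_trajectory(safe_traj)          # obs, reward, done, info
```

In this reading, any standard RL algorithm (e.g., SAC) interacts only with `SafetyEmbeddedEnv`, so safety enforcement is handled inside the environment's action transformation rather than by the policy itself.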
Authors: Fan Yang, Wenxuan Zhou, Zuxin Liu, Ding Zhao, David Held