Concurrent Learning of Policy and Unknown Safety Constraints in Reinforcement Learning (2402.15893v3)
Abstract: Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet deploying RL policies in real-world scenarios raises the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. This reliance on predefined constraints, however, is limiting in dynamic and unpredictable real-world settings where such constraints may be unavailable or insufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Starting from a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task that integrates constrained policy optimization, using a Lagrangian variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization over the parameters of the pSTL safety specification. Through comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently obtaining safe RL policies with high returns. Our findings further indicate that the learned STL safety constraint parameters conform closely to the true environmental safety constraints. The performance of our model closely mirrors that of an ideal baseline with complete prior knowledge of the safety constraints, demonstrating its ability to accurately identify environmental safety constraints and to learn safe policies that respect them.
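The bilevel structure described in the abstract can be illustrated with a small, hedged sketch. The snippet below is not the authors' implementation: it covers only the outer loop, using scikit-optimize's gp_minimize as the Bayesian optimization routine over a single hypothetical pSTL parameter d_min in the template G(distance > d_min), and scores each candidate by its misclassification rate on a synthetic labeled dataset. The pSTL template, the toy data, and all helper names are illustrative assumptions, and the inner constrained-policy-optimization step (the Lagrangian TD3 training) is elided and only marked with a comment.

```python
import numpy as np
from skopt import gp_minimize  # assumes scikit-optimize is available

rng = np.random.default_rng(0)

# Hidden "true" environment threshold that the learned parameter should recover.
TRUE_D_MIN = 0.5

def sample_trajectory():
    """Toy 1-D 'distance to obstacle' signal with a per-trajectory clearance."""
    clearance = rng.uniform(0.0, 1.0)
    traj = rng.uniform(clearance, 2.0, size=50)
    label = int(traj.min() > TRUE_D_MIN)  # 1 = safe, 0 = unsafe
    return traj, label

# Small labeled dataset standing in for the paper's initial labeled data.
DATA = [sample_trajectory() for _ in range(200)]

def robustness(traj, d_min):
    """Robustness of the pSTL template  G (distance > d_min): min over time."""
    return float(np.min(traj) - d_min)

def objective(params):
    """Outer-loop objective: misclassification rate of the candidate parameter.

    In the full bilevel scheme, a Lagrangian TD3 policy would be trained here
    under the candidate constraint and fresh rollouts would be labeled by the
    environment; this sketch scores against the fixed dataset only.
    """
    d_min = params[0]
    errors = [int((robustness(traj, d_min) > 0.0) != bool(label))
              for traj, label in DATA]
    return float(np.mean(errors))

# Bayesian optimization over the single pSTL parameter d_min in [0, 2].
result = gp_minimize(objective, dimensions=[(0.0, 2.0)], n_calls=25, random_state=0)
print(f"estimated d_min = {result.x[0]:.3f}  (true threshold = {TRUE_D_MIN})")
```

In the full method, each objective evaluation is far more expensive, since it involves training a constrained policy under the candidate constraint; that is precisely the regime where a sample-efficient, GP-based Bayesian optimizer is preferable to grid or random search over the pSTL parameters.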
Authors: Lunet Yifru, Ali Baheri