Safe Reinforcement Learning with Free-form Natural Language Constraints and Pre-Trained Language Models
Abstract: Safe reinforcement learning (RL) agents accomplish given tasks while adhering to specific constraints. Expressing constraints in easily understandable human language offers considerable potential for real-world applications because it is accessible and does not rely on domain expertise. Previous safe RL methods with natural language constraints typically adopt a recurrent neural network, which limits their ability to handle varied forms of human language input. Furthermore, these methods often require a ground-truth cost function, so domain expertise is needed to convert language constraints into a well-defined cost function that determines constraint violation. To address these issues, we propose to use pre-trained language models (LMs) to facilitate RL agents' comprehension of natural language constraints and allow them to infer costs for safe policy learning. By using pre-trained LMs and eliminating the need for a ground-truth cost, our method enhances safe policy learning under a diverse set of human-derived free-form natural language constraints. Experiments on grid-world navigation and robot control show that the proposed method can achieve strong performance while adhering to given constraints. The use of pre-trained LMs allows our method to comprehend complicated constraints and learn safe policies without the need for a ground-truth cost at any stage of training or evaluation. Extensive ablation studies demonstrate the efficacy of each part of our method.
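To make the idea of inferring a cost from a language constraint concrete, the following is a minimal sketch and not the paper's implementation. It assumes that a violation can be flagged when a text description of the current observation is sufficiently similar to the constraint. The bag-of-words `embed` function is a toy stand-in for a pre-trained LM encoder, and the names `infer_cost`, `penalized_reward`, and `update_multiplier`, the similarity threshold, and the Lagrangian-style penalty are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a pre-trained LM encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def infer_cost(constraint, observation_text, threshold=0.5):
    """Return 1.0 when the observation text resembles the constraint, else 0.0.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    similarity = cosine(embed(constraint), embed(observation_text))
    return 1.0 if similarity >= threshold else 0.0


def penalized_reward(reward, cost, multiplier):
    """Lagrangian-style shaping: subtract the weighted inferred cost from the reward."""
    return reward - multiplier * cost


def update_multiplier(multiplier, episode_cost, budget, lr=0.1):
    """Raise the multiplier when the episode cost exceeds the budget; keep it non-negative."""
    return max(0.0, multiplier + lr * (episode_cost - budget))


if __name__ == "__main__":
    constraint = "avoid stepping on lava"
    for obs in ("agent stepping on lava", "agent walking on grass"):
        cost = infer_cost(constraint, obs)
        print(obs, "cost:", cost, "shaped reward:", penalized_reward(1.0, cost, 2.0))
```

In this sketch no ground-truth cost function is consulted: the cost signal comes only from comparing text. A real system would replace `embed` with an actual LM encoder and would need a component that turns observations into text.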