Safe Reinforcement Learning with Free-form Natural Language Constraints and Pre-Trained Language Models

Published 15 Jan 2024 in cs.LG and cs.CL | arXiv:2401.07553v3

Abstract: Safe reinforcement learning (RL) agents accomplish given tasks while adhering to specific constraints. Expressing constraints in easily understandable human language offers considerable potential for real-world applications due to its accessibility and non-reliance on domain expertise. Previous safe RL methods with natural language constraints typically adopt a recurrent neural network, which limits their ability to handle diverse forms of human language input. Furthermore, these methods often require a ground-truth cost function, and converting language constraints into a well-defined cost function that determines constraint violation demands domain expertise. To address these issues, we propose to use pre-trained language models (LMs) to help RL agents comprehend natural language constraints and to infer costs for safe policy learning. By using pre-trained LMs and eliminating the need for a ground-truth cost, our method enhances safe policy learning under a diverse set of human-derived free-form natural language constraints. Experiments on grid-world navigation and robot control show that the proposed method achieves strong performance while adhering to the given constraints. The use of pre-trained LMs allows our method to comprehend complicated constraints and to learn safe policies without ground-truth costs at any stage of training or evaluation. Extensive ablation studies demonstrate the efficacy of each component of our method.
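To make the core idea concrete, the sketch below shows one plausible way a pre-trained sentence-embedding LM (a Sentence-BERT-style encoder) could stand in for a ground-truth cost function: embed the free-form constraint and the entities the agent interacts with, and treat high semantic similarity as a constraint violation. This is a minimal illustration, not the authors' implementation; the model name, the constraint text, the entity strings, and the 0.5 similarity threshold are all illustrative assumptions.

```python
# Hypothetical sketch: inferring a per-step cost from a free-form natural
# language constraint with a pre-trained sentence-embedding model.
from sentence_transformers import SentenceTransformer, util

# Pre-trained Sentence-BERT-style encoder (model choice is an assumption).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# An example free-form constraint supplied by a human.
constraint = "Do not step on the blue tiles or touch the lava."
constraint_emb = encoder.encode(constraint, convert_to_tensor=True)

def infer_cost(observed_entities, threshold=0.5):
    """Return 1.0 (violation) if any entity the agent interacts with this
    step is semantically close to the constraint, else 0.0. This replaces
    the hand-designed ground-truth cost that prior methods required."""
    if not observed_entities:
        return 0.0
    entity_embs = encoder.encode(observed_entities, convert_to_tensor=True)
    sims = util.cos_sim(constraint_emb, entity_embs)  # shape (1, n)
    return 1.0 if float(sims.max()) > threshold else 0.0

# Usage: the agent's step brings it into contact with different entities.
print(infer_cost(["lava"]))        # high similarity -> 1.0, violation
print(infer_cost(["green door"]))  # low similarity  -> 0.0, safe
```

Such an inferred cost could then plug into any constrained policy optimization scheme in place of the ground-truth cost, for instance as a Lagrangian penalty subtracted from the PPO reward.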
