LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models (2311.18232v1)
Abstract: LLMs provide excellent text-generation capabilities, but standard prompting and generation methods generally do not yield intentional or goal-directed agents and may require considerable prompt tuning. This becomes particularly apparent in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions now that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representations of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interaction, such as with humans through coordinated persuasion and carefully crafted questions, or in goal-directed play through text games to bring about desired final outcomes. However, enabling this requires the community to develop stable and reliable reinforcement learning algorithms that can effectively train LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges in improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. The benchmark consists of 8 language tasks that require multiple rounds of language interaction, spanning open-ended dialogue and text games.
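The multi-turn setting the abstract describes can be made concrete with a toy text-game environment and a rollout loop for collecting trajectories, the raw material for offline value-based or policy-based RL. This is a minimal sketch in the spirit of the benchmark's guessing-game tasks; the environment, its interface, and all names here are hypothetical illustrations, not the actual LMRL-Gym API.

```python
import random

class GuessMyCityEnv:
    """Toy multi-turn text game: the agent must name the secret city.
    Hypothetical interface for illustration only (not the LMRL-Gym API)."""

    def __init__(self, secret="paris", max_turns=5, seed=0):
        self.secret = secret
        self.max_turns = max_turns
        self.rng = random.Random(seed)
        self.turn = 0

    def reset(self):
        self.turn = 0
        return "I am thinking of a city. Ask questions or make a guess."

    def step(self, utterance):
        """Take the agent's text action; return (observation, reward, done)."""
        self.turn += 1
        guess = utterance.lower().strip("?! .")
        if guess == f"is it {self.secret}":
            return "Correct!", 1.0, True           # sparse terminal reward
        if self.turn >= self.max_turns:
            return "Out of turns.", 0.0, True
        return "No, keep trying.", 0.0, False      # intermediate turns give 0

def rollout(env, policy):
    """Collect one trajectory of (obs, action, reward) tuples,
    the kind of data an offline RL method would train on."""
    obs, done, traj = env.reset(), False, []
    while not done:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        traj.append((obs, action, reward))
        obs = next_obs
    return traj

env = GuessMyCityEnv()
traj = rollout(env, policy=lambda obs: "Is it Paris?")
print(len(traj), sum(r for _, _, r in traj))  # → 1 1.0
```

The sparse, end-of-episode reward is what makes such tasks hard for standard prompting: the agent only benefits from early information-gathering turns if its training method can propagate credit backward across the whole interaction.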
- Marwa Abdulhai
- Isadora White
- Charlie Snell
- Charles Sun
- Joey Hong
- Yuexiang Zhai
- Kelvin Xu
- Sergey Levine