ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL (2402.19446v1)

Published 29 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: A broad use case of LLMs is in goal-directed decision-making tasks (or "agent" tasks), where an LLM needs to not just generate completions for a given prompt, but rather make intelligent decisions over a multi-turn interaction to accomplish a task (e.g., when interacting with the web, using tools, or providing customer support). Reinforcement learning (RL) provides a general paradigm to address such agent tasks, but current RL methods for LLMs largely focus on optimizing single-turn rewards. By construction, most single-turn RL methods cannot endow LLMs with the ability to intelligently seek information over multiple turns, perform credit assignment, or reason about their past actions -- all of which are critical in agent tasks. This raises the question: how can we design effective and efficient multi-turn RL algorithms for LLMs? In this paper, we develop a framework for building multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility of existing single-turn RL methods for LLMs (e.g., proximal policy optimization), while accommodating multiple turns, long horizons, and delayed rewards effectively. To do this, our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel: a high-level off-policy value-based RL algorithm to aggregate reward over utterances, and a low-level RL algorithm that utilizes this high-level value function to train a token policy within each utterance or turn. Our hierarchical framework, Actor-Critic Framework with a Hierarchical Structure (ArCHer), can also give rise to other RL methods. Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks, attaining a sample efficiency of about 100x over existing methods, while also improving with larger model capacity (up to the 7 billion parameter scale that we tested on).

Hierarchical Reinforcement Learning for LLMs Achieves Improved Sample Efficiency

Introduction to ArCHer Framework

Agent tasks that involve decision-making over multiple turns, where actions in one turn affect outcomes in subsequent interactions, pose unique challenges for reinforcement learning (RL) with LLMs. Existing RL methods for LLMs largely optimize single-turn rewards and therefore fail to address the complexities inherent in multi-turn interaction. To bridge this gap, we introduce the Actor-Critic Framework with a Hierarchical Structure (ArCHer), an algorithmic approach that fine-tunes LLMs for complex, multi-turn agent tasks via hierarchical reinforcement learning.
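To make the hierarchy concrete, the following is a minimal formal sketch; the notation (s_t for the interaction history, a_t for the utterance at turn t, a_t^i for its tokens, gamma for the turn-level discount) is chosen here for illustration and need not match the paper's exact symbols.

```latex
% At turn t, the state s_t is the interaction history so far and the high-level action
% a_t is a whole utterance, itself a sequence of tokens a_t = (a_t^1, \dots, a_t^{L_t}).
% The high level learns an utterance-level value function that aggregates turn rewards:
Q^{\pi}(s_t, a_t) \;=\; \mathbb{E}\Big[\textstyle\sum_{k \ge 0} \gamma^{k}\, r_{t+k} \;\Big|\; s_t, a_t\Big]
% The low level improves the token policy within a turn, using the utterance-level
% advantage A(s_t, a_t) = Q(s_t, a_t) - V(s_t) as the learning signal for its tokens:
\nabla_{\theta} J(\theta) \;\approx\; \mathbb{E}\Big[A(s_t, a_t)\,\textstyle\sum_{i=1}^{L_t}\nabla_{\theta}\log\pi_{\theta}\big(a_t^{i}\mid s_t, a_t^{<i}\big)\Big]
```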

Key Contributions and Framework Overview

ArCHer operates on two levels, a high level over utterances and a low level over tokens, running an RL algorithm at each tier in parallel. This dual structure offers several advantages: improved sample efficiency, better handling of long horizons and delayed rewards, and scalability to larger models. In particular, the hierarchy splits decision-making into turn-level and token-level subproblems, which simplifies credit assignment and supports long-term planning. Empirically, ArCHer achieves significantly better efficiency and performance on multi-turn tasks, with roughly 100x higher sample efficiency than existing on-policy methods, and it continues to improve as model capacity grows, up to the 7-billion-parameter scale tested.
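As a concrete illustration of the low-level update, the sketch below shows how a single utterance-level advantage from the high-level critic can weight a token-level policy-gradient loss. This is a minimal sketch under assumed shapes and names (token_policy_loss, random tensors standing in for an LLM's logits), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def token_policy_loss(policy_logits, utterance_tokens, advantage):
    """policy_logits: (L, vocab) logits for the generated utterance,
    utterance_tokens: (L,) token ids actually generated,
    advantage: scalar A(s, a) = Q(s, a) - V(s) from the utterance-level critic."""
    log_probs = F.log_softmax(policy_logits, dim=-1)                        # (L, vocab)
    chosen = log_probs.gather(1, utterance_tokens.unsqueeze(1)).squeeze(1)  # (L,)
    # REINFORCE-style objective: the same utterance-level advantage weights every
    # token of the utterance, so cross-turn credit assignment is left to the critic.
    return -(advantage.detach() * chosen).sum()

# Toy usage with random tensors standing in for a real LLM's outputs.
L, vocab = 6, 32
logits = torch.randn(L, vocab, requires_grad=True)
tokens = torch.randint(vocab, (L,))
adv = torch.tensor(0.7)          # pretend Q(s, a) - V(s) from the high-level critic
loss = token_policy_loss(logits, tokens, adv)
loss.backward()
print(loss.item(), logits.grad.shape)
```

The key point is that the return estimate comes from the utterance-level critic rather than from token-level rewards, so the token policy only needs to learn within a single turn.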

Theoretical Implications and Practical Benefits

ArCHer's design addresses several key challenges in training LLMs for agent tasks, including coping with long training horizons and ensuring meaningful policy improvement beyond narrowly constrained updates. The framework's flexibility in the choice of high- and low-level components opens new avenues for research on hierarchical RL algorithms. Moreover, our findings suggest that ArCHer provides a more natural and effective way to leverage off-policy data, and they underscore the value of hierarchical structure in overcoming the limitations of existing RL approaches for LLMs.
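To illustrate what leveraging off-policy data can look like at the high level, the sketch below performs one temporal-difference update of an utterance-level critic on a replayed batch of past turns, bootstrapping from a target network. The names (UtteranceCritic) and the use of pre-encoded feature vectors in place of real LLM representations are simplifying assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class UtteranceCritic(nn.Module):
    # Stand-in critic: a real utterance-level critic would read LLM representations of
    # the dialogue history and candidate utterance; here states and actions are
    # pre-encoded feature vectors purely for illustration.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.v = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def q_value(self, state, action):
        return self.q(torch.cat([state, action], dim=-1)).squeeze(-1)

    def value(self, state):
        return self.v(state).squeeze(-1)

dim, batch, gamma = 16, 8, 0.99
critic, target = UtteranceCritic(dim), UtteranceCritic(dim)
target.load_state_dict(critic.state_dict())
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# One gradient step on a replayed batch (s, a, r, s', done) of past turns.
s, a, s_next = torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim)
r, done = torch.randn(batch), torch.zeros(batch)
with torch.no_grad():
    v_next = target.value(s_next)                    # bootstrap with target V(s')
    td_target = r + gamma * (1.0 - done) * v_next
q_pred, v_pred = critic.q_value(s, a), critic.value(s)
# Regress Q toward the TD target and V toward Q (one simple choice among several).
loss = ((q_pred - td_target) ** 2).mean() + ((v_pred - q_pred.detach()) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```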

Future Directions

While our paper has validated the efficacy of ArCHer in computational environments, extending the framework to learn from live interactions with humans is an exciting direction for future work, including settings where only a limited number of interactions is practical or feasible. Exploring model-based RL techniques within the ArCHer framework could also yield further gains in performance and efficiency. Continued research into new instantiations of ArCHer, and its application to a broader range of tasks and models, should deepen our understanding of how to harness LLMs for sophisticated multi-turn decision-making tasks.

Closing Remarks

The Actor-Critic Framework with a Hierarchical Structure marks a significant step forward in applying reinforcement learning to LLMs, particularly in complex, multi-turn environments. By addressing the inherent challenges of sample efficiency, credit assignment, and scalability, ArCHer paves the way for more advanced and efficient LLM-based agents able to tackle a wide array of decision-making problems with notable proficiency.

Authors (5)
  1. Yifei Zhou
  2. Andrea Zanette
  3. Jiayi Pan
  4. Sergey Levine
  5. Aviral Kumar