
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent (2312.10003v1)

Published 15 Dec 2023 in cs.CL

Abstract: Answering complex natural language questions often necessitates multi-step reasoning and integrating external information. Several systems have combined knowledge retrieval with an LLM to answer such questions. These systems, however, suffer from various failure cases, and we cannot directly train them end-to-end to fix such failures, as interaction with external knowledge is non-differentiable. To address these deficiencies, we define a ReAct-style LLM agent with the ability to reason and act upon external knowledge. We further refine the agent through a ReST-like method that iteratively trains on previous trajectories, employing growing-batch reinforcement learning with AI feedback for continuous self-improvement and self-distillation. Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model that achieves comparable performance on challenging compositional question-answering benchmarks with two orders of magnitude fewer parameters.

Introduction to the Concept and Approach

The paper presents an approach to answering complex natural language questions that require multi-step reasoning and external information. Prior systems combine knowledge retrieval with LLMs to handle such questions, but they exhibit recurring failure modes and cannot be trained end-to-end to fix them, because the interaction with external knowledge is non-differentiable. The authors therefore define an agent that equips an LLM with the capacity to reason and act upon external knowledge sources, and refine it with a ReST-like training protocol that iteratively trains on the agent's own past trajectories, combining growing-batch reinforcement learning with AI feedback for continuous self-improvement and self-distillation.

Underlying Agent Architecture

The work builds on the ReAct method, which interleaves chain-of-thought reasoning with actions and observations over multiple rounds. Here, the Search Agent is prompted to produce long-form, attributable answers. The central challenge is improving the agent's robustness and efficacy, which conventionally demands large amounts of human-labeled data, a costly and difficult process. The paper instead takes a self-critical approach, exploiting AI feedback and synthetic data to enhance the agent's capabilities and departing from the traditional reliance on human-labeled training data.
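
To make this concrete, the following is a minimal Python sketch of a ReAct-style reason/act/observe loop. It is an illustration rather than the paper's implementation: call_llm and run_search are hypothetical placeholders for any prompted LLM endpoint and any retrieval backend.

    from dataclasses import dataclass, field

    @dataclass
    class Step:
        thought: str       # chain-of-thought text for this round
        action: str        # "search" or "finish"
        argument: str      # the search query, or the final answer
        observation: str = ""

    @dataclass
    class Trajectory:
        question: str
        steps: list = field(default_factory=list)
        answer: str = ""

    def call_llm(prompt: str) -> Step:
        """Hypothetical placeholder: a prompted LLM proposing the next step."""
        raise NotImplementedError

    def run_search(query: str) -> str:
        """Hypothetical placeholder: the external search tool the agent acts on."""
        raise NotImplementedError

    def react_agent(question: str, max_steps: int = 5) -> Trajectory:
        traj = Trajectory(question)
        for _ in range(max_steps):
            # Ask the model for the next thought/action given the history so far.
            step = call_llm(f"Question: {question}\nSteps so far: {traj.steps}")
            if step.action == "finish":
                traj.answer = step.argument  # long-form, attributable answer
                break
            # The search call is non-differentiable, which is why the paper
            # trains on sampled trajectories rather than end-to-end.
            step.observation = run_search(step.argument)
            traj.steps.append(step)
        return traj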

Improved Training via Self-Improvement Loop

A central contribution is the adaptation of the ReST algorithm to the agent setting: in each iteration, the dataset is grown by sampling trajectories from the most recent policy, and the policy is then improved by fine-tuning on that fixed dataset, with an LLM serving as the ranking model. Rankings are assigned to complete multi-step trajectories via direct AI feedback rather than to individual steps. The agent's prowess is gauged by its ability to tackle compositional questions that evade simple search engines. Through this iterative process, a large model is fine-tuned, and comparatively less resource-intensive models then achieve similar performance, furnishing evidence for both the self-improvement and self-distillation capabilities of the method.
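
A minimal sketch of this grow/improve loop follows, under the assumption that sample_trajectories, rank_with_ai_feedback, and finetune are hypothetical stand-ins for the paper's actual components:

    def sample_trajectories(policy, question, n):
        """Hypothetical placeholder: roll out n multi-step trajectories."""
        raise NotImplementedError

    def rank_with_ai_feedback(trajectories):
        """Hypothetical placeholder: an LLM ranks complete trajectories and
        keeps the best ones (reinforcement learning from AI feedback)."""
        raise NotImplementedError

    def finetune(model, dataset):
        """Hypothetical placeholder: supervised fine-tuning on ranked data."""
        raise NotImplementedError

    def rest_self_improve(policy, questions, iterations=2, samples_per_q=32):
        for _ in range(iterations):
            dataset = []
            # Grow step: expand the dataset by sampling from the latest policy.
            for q in questions:
                candidates = sample_trajectories(policy, q, samples_per_q)
                # Rankings apply to whole trajectories, not individual steps.
                dataset.extend(rank_with_ai_feedback(candidates))
            # Improve step: fine-tune on the fixed, ranked dataset. Fine-tuning
            # a smaller student model here yields the self-distillation variant.
            policy = finetune(policy, dataset)
        return policy

Passing a model with far fewer parameters to finetune in the last step is what turns the same loop into self-distillation: the small student learns from trajectories generated and ranked by the large teacher.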

Evaluating Agent Performance

The paper evaluates the Search Agent on two datasets, Bamboogle and BamTwoogle, whose questions are deliberately crafted to be unanswerable by a single query to a standard search engine; each requires multiple searches to resolve. The task serves as a testbed for the agent's effectiveness under both human and automated evaluation. The synergy between the iterative training process, AI feedback, and careful pacing of training iterations yields models that improve without any human-labeled data, a significant step forward in the autonomous enhancement of LLMs.
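
The automated half of such an evaluation can be sketched as an LLM-as-judge accuracy metric. The judge prompt below is illustrative and judge_llm is a hypothetical placeholder, not the paper's exact rubric:

    def judge_llm(prompt: str) -> str:
        """Hypothetical placeholder: a prompted LLM used as an automated rater."""
        raise NotImplementedError

    def auto_eval(agent, dataset):
        """Fraction of multi-hop questions answered correctly per the judge."""
        correct = 0
        for question, reference in dataset:
            answer = agent(question).answer
            verdict = judge_llm(
                f"Question: {question}\nReference answer: {reference}\n"
                f"Candidate answer: {answer}\nDoes the candidate match? yes/no:"
            )
            correct += verdict.strip().lower().startswith("yes")
        return correct / len(dataset)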

References (26)
  1. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  2. Vladimir Blagojevic. Long-form QA beyond ELI5: an updated dataset and approach, 2022. URL towardsdatascience.com/long-form-qa-beyond-eli5-an-updated-dataset-and-approach-319cb841aabb.
  3. Harrison Chase. LangChain. https://github.com/hwchase17/langchain, 2022.
  4. FireAct: Toward language agent fine-tuning, 2023.
  5. Language model cascades, 2022.
  6. RAFT: Reward ranked finetuning for generative foundation model alignment, 2023.
  7. ELI5: Long form question answering. CoRR, abs/1907.09190, 2019. URL http://arxiv.org/abs/1907.09190.
  8. PAL: Program-aided language models, 2023.
  9. Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  10. Large language models cannot self-correct reasoning yet, 2023.
  11. Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP, 2023a.
  12. DSPy: Compiling declarative language model calls into self-improving pipelines, 2023b.
  13. Hurdles to progress in long-form question answering, 2021.
  14. Let’s verify step by step, 2023.
  15. Jerry Liu. LlamaIndex. https://github.com/jerryjliu/llama_index, 2022.
  16. Self-Refine: Iterative refinement with self-feedback, 2023.
  17. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  18. Measuring and narrowing the compositionality gap in language models, 2023.
  19. Iterated decomposition: Improving science Q&A by supervising reasoning processes, 2023.
  20. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023.
  21. Beyond human data: Scaling self-training for problem-solving with language models, 2023.
  22. Solving math word problems with process- and outcome-based feedback, 2022.
  23. The rise and potential of large language model based agents: A survey, 2023.
  24. HotpotQA: A dataset for diverse, explainable multi-hop question answering. CoRR, abs/1809.09600, 2018. URL http://arxiv.org/abs/1809.09600.
  25. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  26. STaR: Bootstrapping reasoning with reasoning, 2022.
Authors (13)
  1. Renat Aksitov
  2. Sobhan Miryoosefi
  3. Zonglin Li
  4. Daliang Li
  5. Sheila Babayan
  6. Kavya Kopparapu
  7. Zachary Fisher
  8. Ruiqi Guo
  9. Sushant Prakash
  10. Pranesh Srinivasan
  11. Manzil Zaheer
  12. Felix Yu
  13. Sanjiv Kumar