Introduction to the Concept and Approach
The paper presents an enhanced approach to answering complex natural language questions that require multi-step reasoning and retrieval of external knowledge. Substantial recent progress has come from integrating knowledge retrieval with LLMs, but the resulting pipelines still exhibit limitations and cannot be trained end-to-end to correct them. The authors therefore introduce a technique that equips an LLM with the capacity to reason over, and interact with, external knowledge sources. The resulting agent is refined with a ReST-like training protocol that iteratively self-trains on its own past trajectories, combining reinforcement-learning-style updates with AI feedback to enable continual self-improvement and self-distillation into smaller models.
Underlying Agent Architecture
The work builds on the ReAct method, which interleaves chain-of-thought reasoning with actions and observations over multiple rounds. Here, the Search Agent is prompted to produce long-form answers whose claims are traceable to retrieved evidence. The main challenge lies in improving the agent's robustness and effectiveness, which conventionally requires large amounts of human-labeled data, a slow and costly process. Instead, the paper takes a self-critical approach, exploiting AI feedback and synthetic data to enhance the agent's capabilities and thereby avoiding reliance on human-labeled training data.
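To make the interaction pattern concrete, the following is a minimal sketch of a ReAct-style loop. It is illustrative only: call_llm and run_search are hypothetical stand-ins for an LLM completion call and a search tool, not the paper's actual interfaces, and the Thought/Action/Observation text format is one common convention rather than the exact prompt used in the paper.

```python
from typing import Callable

def react_agent(
    question: str,
    call_llm: Callable[[str], str],    # hypothetical: prompt -> completion
    run_search: Callable[[str], str],  # hypothetical: query -> search results
    max_steps: int = 8,
) -> str:
    """Interleave thoughts, search actions, and observations, ReAct-style."""
    trajectory = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask the model for its next thought/action given the trace so far.
        step = call_llm(trajectory + "Thought:")
        trajectory += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            # The model judged it has gathered enough evidence to answer.
            return step.split("Final Answer:", 1)[1].strip()
        if "Search[" in step:
            # Extract the query from an action like: Action: Search[query]
            query = step.split("Search[", 1)[1].split("]", 1)[0]
            observation = run_search(query)
            trajectory += f"Observation: {observation}\n"
    return "No final answer produced within the step budget."
```

Note that the accumulated trajectory string is exactly the kind of multi-step trace that the self-training procedure described next collects and filters.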
Improved Training via Self-Improvement Loop
An essential aspect is the application of the ReST algorithm to the agent setting: the dataset is grown by sampling trajectories from the most recent policy, and the policy is then improved by fine-tuning on this fixed dataset, with an AI model used as a ranking tool. Concretely, multi-step trajectories are assessed as complete rollouts and ranked directly by AI feedback rather than scored step by step. The agent's prowess is gauged by its ability to tackle compositional questions that a single search query cannot answer. Through this iterative process the large model is fine-tuned, and markedly less resource-intensive models trained on its trajectories approach its performance, furnishing evidence for both the self-improvement and the self-distillation capabilities of the method.
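The grow/improve alternation can be sketched as follows. This is a simplified reading of the procedure under stated assumptions: sample_trajectories, rank_with_ai_feedback, and fine_tune are hypothetical placeholders, and the keep-threshold filtering is one plausible way to turn AI rankings into a fine-tuning set, not the paper's exact recipe.

```python
from typing import Callable, List

def rest_self_improvement(
    policy,
    questions: List[str],
    sample_trajectories: Callable,    # hypothetical: (policy, q, n) -> rollouts
    rank_with_ai_feedback: Callable,  # hypothetical: trajectory -> score
    fine_tune: Callable,              # hypothetical: (policy, data) -> policy
    num_iterations: int = 3,
    samples_per_question: int = 4,
    keep_threshold: float = 0.5,
):
    for _ in range(num_iterations):
        # Grow: expand the dataset by sampling rollouts from the current
        # policy on a fixed pool of training questions.
        rollouts = [
            traj
            for q in questions
            for traj in sample_trajectories(policy, q, samples_per_question)
        ]
        # Improve: the AI critic ranks complete trajectories; only the
        # best-scoring rollouts are kept as fine-tuning targets.
        dataset = [
            t for t in rollouts if rank_with_ai_feedback(t) >= keep_threshold
        ]
        # Fine-tune the same model (self-improvement) or a smaller one
        # (self-distillation) on the filtered trajectories.
        policy = fine_tune(policy, dataset)
    return policy
```

The key design choice this sketch highlights is that rewards attach to whole trajectories rather than individual steps, which keeps the ranking problem simple enough for an AI critic to handle reliably.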
Evaluating Agent Performance
The paper adopts two primary datasets, Bamboogle and BamTwoogle, to evaluate the Search Agent. Both consist of questions deliberately crafted so that a single search-engine query cannot answer them; each requires multiple sequential searches to resolve. This task serves as a testbed for the agent's effectiveness under both human and automated evaluation. The combination of iterative training, AI feedback, and a modest number of training iterations yields models that improve without any human-labeled data, a significant step forward in the autonomous enhancement of LLMs.
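As a hedged illustration of how an automated evaluation pass over such datasets might be wired up, the sketch below scores an agent with an LLM judge. The agent, dataset, and judge_llm interfaces are assumptions for illustration, not the paper's evaluation harness, and the yes/no judging prompt is one simple convention among many.

```python
from typing import Callable, List, Tuple

def auto_eval(
    agent: Callable[[str], str],      # hypothetical: question -> answer
    dataset: List[Tuple[str, str]],   # (question, gold answer) pairs
    judge_llm: Callable[[str], str],  # hypothetical: LLM-as-judge call
) -> float:
    """Fraction of questions where the LLM judge accepts the agent's answer."""
    correct = 0
    for question, gold in dataset:
        prediction = agent(question)
        verdict = judge_llm(
            f"Question: {question}\n"
            f"Reference answer: {gold}\n"
            f"Proposed answer: {prediction}\n"
            "Does the proposed answer match the reference? Reply yes or no."
        )
        correct += verdict.strip().lower().startswith("yes")
    return correct / len(dataset)
```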