Introduction
LLMs have evolved from question-answering systems into autonomous language agents that pursue objectives through independent, multi-step tasks. Recent advances such as ReAct, Toolformer, and LangChain demonstrate how LLMs can drive autonomous decision-making by emitting text that triggers API calls and operations within specific environments. Yet while their large parameter counts let these agents generate plausible text and actions, most are not optimized jointly with environment-specific reward functions. The few that do incorporate feedback rely on verbal, in-context refinement and remain incompatible with gradient-based learning via reinforcement learning. Retroformer addresses this gap by using policy gradient optimization to reinforce a model that refines the agent's prompts, leveraging environment feedback to improve action plans and to reflect on prior failures.
Related Work
Retroformer situates itself within a growing body of research on autonomous language agents that complete tasks over multiple stages. Earlier work such as Chain-of-Thought pioneered the decomposition of complex reasoning tasks, while approaches like ReAct harnessed these faculties of LLMs to interact with digital environments. However, most of these models do not learn from environment rewards, which limits their performance. Some, like Reflexion, improve an agent's skills through self-reflection but still make no explicit use of gradient signals. In contrast, the policy gradient optimization at the core of Retroformer enables effective planning and decision-making by learning directly from environment feedback.
Challenges & Intuition
Applying LLM-based agents to problems involving tool use and action presents several challenges: spurious actions, limited prompt lengths, heuristic prompt engineering, and the difficulty of optimizing the LLM directly. Classical reinforcement learning (RL) agents, though weaker in zero-shot, text-rich settings, improve continually from environment feedback. Retroformer harnesses classical RL optimization, such as policy gradient algorithms, to iteratively enhance performance while avoiding direct fine-tuning of the actor LLM, making it a robust yet straightforward approach for equipping agents with state and memory.
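As a rough illustration of the kind of policy gradient update this relies on, the sketch below shows a vanilla REINFORCE step in PyTorch. The function and argument names are hypothetical stand-ins for exposition, not code from the Retroformer implementation.

```python
# Minimal REINFORCE-style policy gradient update (illustrative sketch only).
import torch

def reinforce_step(optimizer, log_probs, rewards, gamma=0.99):
    """One gradient update from a single episode.

    log_probs: list of scalar tensors log pi(a_t | s_t), with grad enabled.
    rewards:   list of floats, the environment rewards r_t for the episode.
    """
    # Discounted returns G_t, computed backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick.
    if returns.numel() > 1:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Policy gradient loss: minimize -log pi(a_t|s_t) * G_t over the episode.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In Retroformer's setting, the rewards come from the environment (for example, task success), and the parameters being updated belong to the smaller retrospective model rather than the frozen actor LLM.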
Reinforcing Retrospective Language Agent
Retroformer pairs two components: an actor LLM and a retrospective language model. The actor is a frozen LLM, while the retrospective model is a smaller LM refined with RL. The retrospective LM is fine-tuned to generate feedback that is fed back to the actor as part of its prompt. By integrating policy gradient optimization, Retroformer can learn from arbitrary reward signals across multiple environments and tasks, and its iterative refinements improve the agent's learning speed and task success rates. Experiments show that Retroformer outperforms baselines on tasks such as HotPotQA, demonstrating the utility of gradient-based learning for reasoning and planning.
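To make the actor/retrospective interaction concrete, here is a minimal sketch of a single trial, assuming hypothetical helpers `actor_llm`, `retrospective_lm`, and `env` that are placeholders for illustration and not the paper's actual interfaces.

```python
def run_trial(env, actor_llm, retrospective_lm, reflections, max_steps=10):
    """One episode: the frozen actor acts, conditioned on accumulated reflections."""
    observation = env.reset()
    trajectory, total_reward = [], 0.0
    for _ in range(max_steps):
        # Past reflections from the retrospective LM are prepended to the prompt.
        prompt = "\n".join(reflections + [observation])
        action = actor_llm.generate(prompt)            # actor LLM stays frozen
        observation, reward, done = env.step(action)
        trajectory.append((prompt, action, reward))
        total_reward += reward
        if done:
            break
    # On failure, the retrospective LM summarizes what went wrong; only its
    # parameters are later updated via policy gradient on episode returns.
    if total_reward <= 0:
        reflections.append(retrospective_lm.reflect(trajectory))
    return trajectory, total_reward
```

The key design choice this sketch highlights is that the actor LLM is never fine-tuned: it is steered purely through the reflections placed in its prompt, while reward-driven updates (for example, the policy gradient step sketched earlier) are applied only to the smaller retrospective model.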