Overview
The field of LLM-based agents has seen a notable shift with the introduction of Reflexion (Shinn et al., 2023), a paradigm that improves language agents through verbal reinforcement. Where traditional reinforcement learning relies on extensive training data and model fine-tuning, Reflexion improves the agent with linguistic feedback delivered entirely in context.
The Essence of Reflexion
Reflexion allows an agent to generate reflective textual feedback on its own performance in a task. In the paper's formulation, an Actor produces actions, an Evaluator scores the resulting trajectory, and a Self-Reflection model converts that score into verbal feedback, which is stored in an episodic memory buffer. Conditioned on these stored reflections, the agent can make more informed decisions in future attempts, mirroring the way humans improve by reflecting on past experience. Reflexion is also flexible about the feedback signal itself, accommodating different types and sources of feedback, whether external or internally simulated.
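To make the cycle concrete, here is a minimal sketch of the trial-reflect-retry loop. It is not the paper's implementation: `llm`, `run_episode`, and `evaluate` are hypothetical stand-ins for a language-model call, a task rollout, and a success check.

```python
def reflexion_loop(task, llm, run_episode, evaluate, max_trials=5):
    """Minimal sketch of the Reflexion cycle: act, evaluate,
    reflect verbally, and retry with reflections in context."""
    memory = []  # episodic memory: plain-text reflections
    trajectory = None
    for _ in range(max_trials):
        # Actor: attempt the task, conditioned on past reflections.
        trajectory = run_episode(task, context=memory)
        # Evaluator: score the attempt (binary here for simplicity).
        success, feedback = evaluate(trajectory)
        if success:
            return trajectory
        # Self-Reflection: turn the outcome into verbal advice.
        reflection = llm(
            f"Task: {task}\nAttempt: {trajectory}\nOutcome: {feedback}\n"
            "In a few sentences, explain what went wrong and what to try "
            "differently next time."
        )
        memory.append(reflection)
    return trajectory  # best effort after max_trials
```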
Comparative Advantages
Traditional reinforcement learning (RL) methods, though effective, come with their own challenges, including substantial computational cost and the difficulty of accurate credit assignment from scalar or vector rewards. Reflexion addresses these challenges by:
- Being computationally efficient, since it requires no fine-tuning of the underlying LLM.
- Offering nuanced feedback that goes beyond scalar or vector rewards, enabling more targeted adjustments to behavior.
- Maintaining a more explicit and interpretable episodic memory of prior experiences (see the prompt-assembly sketch after this list).
- Furnishing more explicit action hints for future episodes.
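Because the memory is just text, it can be inspected directly and spliced into the agent's next prompt. Below is a hedged sketch of such prompt assembly; the format and the `build_prompt` helper are assumptions for illustration, not the paper's prompts.

```python
def build_prompt(task: str, reflections: list[str]) -> str:
    """Assemble the actor prompt, prepending prior reflections so
    past lessons serve as explicit hints for the next episode."""
    hints = "\n".join(f"- {r}" for r in reflections) or "- (none yet)"
    return (
        f"You are solving the following task:\n{task}\n\n"
        f"Lessons from your previous attempts:\n{hints}\n\n"
        "Apply these lessons and produce your next attempt."
    )
```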
Empirical Evidence
The effectiveness of Reflexion is demonstrated across a spectrum of tasks, including sequential decision-making, reasoning, and programming. Notably, it achieved 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state of the art of 80% set by GPT-4, an 11-point absolute improvement.
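For readers unfamiliar with the metric, pass@1 is the probability that a single generated sample passes all unit tests. The standard unbiased pass@k estimator comes from the original HumanEval paper (Chen et al., 2021); the snippet below is illustrative context, not part of Reflexion itself.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability
    that at least one of k samples, drawn without replacement from
    n total samples of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the fraction of correct samples: c / n.
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-12
```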
Experimental Insights
Integrating Reflexion into the AlfWorld suite and HotPotQA boosted agent performance by up to 22% and 20% (absolute) respectively over strong baselines. These experiments underline Reflexion's ability not only to interpret the task at hand but also to leverage past experience to improve future attempts. In programming tasks, Reflexion set new benchmarks in code-generation accuracy and demonstrated language-agnostic capability, with promising implications for a wide range of programming languages.
Limitations and Future Directions
While Reflexion introduces a promising approach to learning from linguistic feedback, it is important to acknowledge its limitations. The episodic memory is capped at a small, fixed number of stored reflections (a sliding window), which may not capture the depth of experience needed for complex decision-making. Future work could expand the memory mechanism (the authors suggest vector-embedding or SQL-backed memory) and explore more sophisticated models spanning a broader range of learning strategies that mirror human cognition more closely.
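One way to picture the fixed-size constraint: if the memory is a bounded window, the oldest reflections are silently evicted. A minimal illustration using Python's collections.deque follows; the cap of 3 illustrates "a small fixed size" and is not a claim about the paper's exact setting.

```python
from collections import deque

# Bounded episodic memory: only the most recent reflections survive.
MAX_REFLECTIONS = 3  # illustrative cap on the sliding window
memory = deque(maxlen=MAX_REFLECTIONS)

lessons = [
    "avoid revisiting room A",
    "check the drawer before the cabinet",
    "pick up the key before trying the door",
    "carry one object at a time",
]
for lesson in lessons:
    memory.append(lesson)

print(list(memory))  # the first lesson has already been evicted
```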
Conclusion
Reflexion represents a significant step forward in the development of intelligent language agents, offering a simple and effective approach to learning through verbal reinforcement. By enabling agents to self-reflect and learn from their own experience, Reflexion is poised to advance the capabilities of generative AI, pushing the boundaries of autonomous decision-making and reasoning.