Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
This presentation examines a novel reinforcement learning framework called EMPO² that addresses critical exploration challenges in training large language model agents. The system combines parametric policy updates with non-parametric episodic memory through hybrid on- and off-policy optimization, achieving substantial performance gains on complex reasoning benchmarks while improving both sample efficiency and generalization to unseen tasks.

Script
When reinforcement learning meets language models, a silent killer emerges: the inability to explore. Agents repeat the same failures, trapped in behavioral loops, unable to discover the novel states that could unlock success.
Consider an agent tasked with turning on a red light bulb. It searches the same empty rooms over and over, never finding the bulb, never changing strategy. More training steps don't help because the agent has no mechanism to break out of its routine and explore differently.
The authors propose a framework that gives agents two complementary tools: memory to guide exploration and a learning mechanism that internalizes those insights.
EMPO² operates on two axes simultaneously. During exploration, the agent can retrieve relevant tips from its episodic memory buffer to guide decision-making. But critically, it also trains in both memory-augmented and memory-free modes, forcing the parametric policy to internalize successful behaviors rather than remaining dependent on external memory.
The system creates a virtuous cycle. After each trajectory, the agent reflects on what happened and generates a tip summarizing the key lesson. These tips accumulate in a searchable memory buffer. When the agent encounters a similar situation later, it retrieves the most relevant past insights, dramatically improving its ability to navigate complex, multi-step tasks where naive exploration would fail.
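The reflect-and-retrieve loop above can be sketched as a small tip buffer. This is an illustrative stand-in, not the paper's implementation: the class name, storage format, and bag-of-words cosine retrieval are all assumptions (a real system would likely use learned embeddings for similarity).

```python
from collections import Counter
import math

class TipMemory:
    """Illustrative episodic tip buffer. Tips are stored as plain text and
    retrieved by bag-of-words cosine similarity to the current observation."""

    def __init__(self):
        self.tips = []  # list of (tip_text, token_count_vector)

    @staticmethod
    def _vec(text):
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def add(self, tip_text):
        """Store a reflection tip generated after a trajectory."""
        self.tips.append((tip_text, self._vec(tip_text)))

    def retrieve(self, observation, k=2):
        """Return the k stored tips most similar to the observation."""
        obs = self._vec(observation)
        scored = [(self._cosine(obs, vec), tip) for tip, vec in self.tips]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [tip for _, tip in scored[:k]]

memory = TipMemory()
memory.add("Check the kitchen drawers first: light bulbs are often stored there.")
memory.add("Activate devices before measuring their temperature.")
print(memory.retrieve("task: turn on the red light bulb", k=1))
```

On a later, similar task the retrieved tip is prepended to the agent's context, which is how past failures become actionable guidance.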
The framework's flexibility comes from combining rollout and update modes in three distinct configurations. The agent can learn purely from its current policy, learn while using memory for guidance, or distill knowledge from memory-augmented rollouts into the base policy. This third mode is crucial: it ensures the agent eventually becomes capable even when memory isn't available.
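The three configurations can be made concrete as pairings of a rollout flag and an update flag. The mode names, flag names, and prompt format below are assumptions for illustration, not the paper's API; the point is that the distillation mode collects trajectories with memory in the context but computes the update against the memory-free context.

```python
# Three rollout/update pairings (names are illustrative assumptions):
#   on_policy        - no memory anywhere (pure parametric learning)
#   memory_guided    - memory in both rollout and update contexts
#   memory_distilled - memory guides the rollout, but the update uses the
#                      memory-free prompt, pushing successful behavior into
#                      the base policy (off-policy w.r.t. that context)
MODES = {
    "on_policy":        dict(rollout_with_memory=False, update_with_memory=False),
    "memory_guided":    dict(rollout_with_memory=True,  update_with_memory=True),
    "memory_distilled": dict(rollout_with_memory=True,  update_with_memory=False),
}

def build_prompt(task, tips, use_memory):
    """Prepend retrieved tips to the task description when memory is on."""
    if use_memory and tips:
        return "Tips:\n- " + "\n- ".join(tips) + "\nTask: " + task
    return "Task: " + task

def training_step_contexts(task, tips, mode):
    """Return the (rollout, update) contexts implied by the chosen mode."""
    cfg = MODES[mode]
    rollout_prompt = build_prompt(task, tips, cfg["rollout_with_memory"])
    update_prompt = build_prompt(task, tips, cfg["update_with_memory"])
    return rollout_prompt, update_prompt

r, u = training_step_contexts("turn on the red light bulb",
                              ["search the kitchen drawers"],
                              "memory_distilled")
print("Tips" in r, "Tips" in u)  # rollout sees tips, update does not
```

Because the distilled update is scored on the memory-free prompt, the base policy is trained to reproduce memory-guided successes on its own, which is what makes memory-free inference viable later.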
To ensure agents don't stop exploring once they find any working strategy, EMPO² adds intrinsic motivation based on state visitation novelty. This reward bonus keeps the policy from collapsing into repetitive behaviors and accelerates discovery of high-quality solutions. Ablations confirm this exploration incentive is essential for reaching optimal performance.
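A standard way to realize a visitation-novelty bonus is a count-based intrinsic reward that decays as a state is revisited. The sketch below is a generic stand-in for the paper's novelty term, not its exact formulation; the coefficient name `beta` and the inverse-square-root decay are common conventions, assumed here.

```python
from collections import defaultdict
import math

class NoveltyBonus:
    """Count-based intrinsic reward: bonus = beta / sqrt(visit_count).
    Rarely seen states earn a large bonus; repeated states earn little."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

bonus = NoveltyBonus(beta=0.1)
print(round(bonus("hallway"), 4))  # first visit: 0.1
print(round(bonus("hallway"), 4))  # second visit: 0.1/sqrt(2) = 0.0707
print(round(bonus("kitchen"), 4))  # new state: full 0.1 bonus again
```

During training the shaped reward is simply the environment reward plus this bonus, so an agent that keeps re-searching the same empty rooms sees its return shrink relative to one that tries somewhere new.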
The empirical results are striking. On ScienceWorld, a benchmark requiring complex multi-step reasoning, EMPO² more than doubles the performance of the baseline GRPO algorithm. Even more impressive: the improvements persist when tested without memory at inference time, proving the agent has genuinely learned better exploration strategies, not just memorized specific solutions.
This comparison reveals why exploration matters. When intrinsic rewards are removed entirely, learning stalls. Different reward coefficients and alternative bonuses like Random Network Distillation produce varied convergence speeds, but all exploration-driven variants substantially outperform purely extrinsic reward learning. The takeaway is clear: systematic exploration isn't optional for these complex reasoning environments.
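Random Network Distillation, mentioned above as an alternative bonus, rewards prediction error against a fixed random network: states the predictor has not yet fit look novel. The minimal linear version below is a sketch of the general RND idea under assumed dimensions and learning rate, not the ablation's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16
W_target = rng.normal(size=(D, H))  # fixed, randomly initialized target net
W_pred = np.zeros((D, H))           # trainable predictor

def rnd_bonus(state, lr=0.05):
    """Intrinsic bonus = predictor's squared error on the random target.
    Each call also takes one SGD step, so familiar states lose their bonus."""
    global W_pred
    target = np.tanh(state @ W_target)
    err = state @ W_pred - target
    bonus = float(np.mean(err ** 2))              # high error = novel state
    W_pred -= lr * (2.0 / H) * np.outer(state, err)  # gradient of mean sq. error
    return bonus

s = rng.normal(size=D)
first = rnd_bonus(s)
later = min(rnd_bonus(s) for _ in range(50))
print(first > later)  # repeated visits to the same state shrink the bonus
```

Whatever the specific bonus, the ablation's message is the same: some mechanism must keep prediction-easy, often-visited states unrewarding, or the policy settles into repetition.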
EMPO² demonstrates that effective exploration in language model agents requires more than just better prompting or larger models. It demands a principled integration of memory, hybrid learning modes, and curiosity-driven incentives that let agents bootstrap from their own experience. To explore this research further and create your own research presentations, visit EmergentMind.com.