MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

This presentation introduces MemRL, a groundbreaking framework that enables language model agents to learn and improve at runtime without modifying their core parameters. By treating episodic memory as a reinforcement learning substrate, MemRL assigns utility scores to past experiences and retrieves them based on both semantic similarity and proven effectiveness. The approach demonstrates substantial performance gains across code generation, embodied navigation, and complex problem-solving benchmarks while avoiding catastrophic forgetting—a persistent challenge in continual learning systems.
Script
What if an agent could learn from every interaction without ever forgetting what it already knows? Today we explore how MemRL solves this challenge by turning episodic memory into a reinforcement learning playground.
Let's first understand why existing approaches fall short.
Building on that challenge, the authors identify a critical gap: standard retrieval methods grab semantically similar memories without asking whether those memories actually worked. Meanwhile, fine-tuning risks destroying what the model already knows.
So how does MemRL break through these limitations?
The core insight is elegant: decouple what you know from what you learn. MemRL keeps the language model untouched while building a dynamic memory system that tracks which past solutions actually deliver results.
This brings us to the retrieval mechanism itself, which operates in two complementary phases. First, semantic recall narrows the search space, then value-aware selection prioritizes memories that have demonstrably solved similar problems before.
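The two-phase retrieval described above can be sketched in a few lines. This is a rough illustration, not the paper's implementation: the field names, the blending weight `alpha`, and the cutoffs `k` and `top_m` are all assumptions introduced here for clarity.

```python
import numpy as np

def retrieve(query_emb, memories, k=20, top_m=5, alpha=0.5):
    """Two-stage retrieval sketch: semantic recall, then value-aware selection.

    `memories` is a list of dicts with 'emb' (a unit-normalized embedding)
    and 'utility' (a learned score in [0, 1]). All names and the blending
    weight `alpha` are illustrative assumptions, not from the MemRL paper.
    """
    # Stage 1: semantic recall -- narrow the search to the k most
    # similar memories by cosine similarity (dot product of unit vectors).
    sims = np.array([float(np.dot(query_emb, m["emb"])) for m in memories])
    recalled = sims.argsort()[::-1][:k]

    # Stage 2: value-aware selection -- re-rank the recalled set by a blend
    # of semantic similarity and demonstrated utility.
    scored = sorted(
        recalled,
        key=lambda i: alpha * sims[i] + (1 - alpha) * memories[i]["utility"],
        reverse=True,
    )
    return scored[:top_m]
```

Note the division of labor: similarity alone decides which memories are even considered, while utility decides which of those candidates the agent actually conditions on.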
When an agent completes a task, MemRL updates the utility estimate for any memory it used, gradually learning which experiences transfer well. Because only memory scores change, never model parameters, previously mastered skills remain intact.
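The update step can likewise be sketched as a simple running estimate. The exact estimator MemRL uses is not specified here; this assumes a standard temporal-difference-style average with a binary episode reward, and the field names and learning rate are illustrative.

```python
def update_utility(memory, reward, lr=0.1):
    """Nudge a memory's utility estimate toward the observed episode outcome.

    `reward` is 1.0 if the episode that used this memory succeeded, 0.0
    otherwise. Only the memory record changes -- the language model's
    parameters are never touched. Names and `lr` are assumptions.
    """
    memory["uses"] = memory.get("uses", 0) + 1
    # Running-average update: move the estimate a fraction `lr` of the way
    # toward the new observation.
    memory["utility"] += lr * (reward - memory["utility"])
    return memory["utility"]
```

Because the update touches only this per-memory score, skills encoded in the frozen model cannot be overwritten, which is where the forgetting-resistance comes from.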
Now let's examine how this plays out across diverse benchmarks.
The results are striking across the board. On sequential reasoning tasks like ALFWorld, MemRL achieves a 56 percent relative improvement, and its utility estimates closely track which memories actually lead to success, acting as an effective trajectory verifier.
These gains rest on solid theoretical foundations: the authors prove that utility estimates converge and remain stable, while empirically the system promotes transfer of genuinely useful strategies rather than superficially similar failures.
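To see why convergence is plausible, consider a standard sample-average form of the update (an illustrative stand-in for the paper's estimator, not its exact statement). With reward $r_n$ from the $n$-th episode that used a memory, the utility estimate

$$Q_{n+1} = Q_n + \frac{1}{n}\left(r_n - Q_n\right)$$

is exactly the running mean of the observed rewards, so by the law of large numbers it converges to the memory's true expected success rate as the memory is reused.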
MemRL transforms memory from a passive archive into an active learning substrate, enabling agents to evolve safely at runtime by remembering not just what happened, but what actually worked. Visit EmergentMind.com to explore the full paper and dive deeper into this paradigm shift in agentic learning.