Decocted Experience Improves Test-Time Inference in LLM Agents
This presentation explores how language model agents can dramatically improve their performance not by scaling up computation, but by learning to construct better input contexts from their own experience. The work introduces experience decoction, a systematic approach to distilling, organizing, and retrieving the most informative parts of an agent's accumulated knowledge. Through rigorous experiments across mathematical reasoning, web interaction, and software engineering tasks, the research demonstrates that well-designed memory systems can achieve substantial performance gains at test time without any model retraining, revealing the critical but often overlooked role of context quality in agentic reasoning.

Script
What if the key to smarter language model agents isn't more computation during inference, but better memory? This work reveals that agents can dramatically improve their reasoning by learning what to remember and how to organize it.
Traditional approaches either rely on manually crafted prompts that fail to generalize, or dump entire agent histories into context windows where they drown in noise. The researchers formalize a third path: systematically extracting the informational essence from self-acquired agent experience.
How do you transform messy, environment-coupled trajectories into compact, reusable knowledge?
For tasks like web navigation and software engineering, where interactions are lengthy and partially observable, distilled lessons outperform raw traces by orders of magnitude in both efficiency and effectiveness. In math reasoning, where detailed traces already capture most cues, the gap narrows, but distillation never hurts.
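To make the distillation step concrete, here is a minimal sketch in Python. It is purely illustrative, not the paper's implementation: the trajectory format, the `distill` function, and the `pivotal` flag are invented stand-ins for what the actual system would delegate to an LLM summarizer.

```python
# Illustrative sketch only: the real system would use an LLM to judge
# which steps matter; here a hand-set "pivotal" flag stands in for that.

def distill(trajectory):
    """Reduce a raw action/observation trace to a compact, reusable lesson,
    dropping environment-specific noise steps."""
    pivotal = [step for step in trajectory if step.get("pivotal")]
    actions = " -> ".join(step["action"] for step in pivotal)
    return {"task": trajectory[0].get("task", "unknown"), "lesson": actions}

trace = [
    {"task": "checkout", "action": "open cart", "pivotal": True},
    {"action": "scroll page"},            # noise: dropped
    {"action": "apply coupon", "pivotal": True},
    {"action": "re-render banner"},       # noise: dropped
    {"action": "confirm order", "pivotal": True},
]
print(distill(trace))
# {'task': 'checkout', 'lesson': 'open cart -> apply coupon -> confirm order'}
```

The point of the sketch is the shape of the transformation: a long, environment-coupled trace goes in, and a short, environment-agnostic lesson comes out that can be dropped into any future context window.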
Flat memory retrieval fails because similarity alone breeds redundancy. The authors introduce hierarchical concept trees that group lessons by topic, then re-rank for diversity. This structure turns out to be essential: empirical evidence shows a strong linear relationship between information gain from context and agent effectiveness, with diversity as the missing ingredient.
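One way to realize "group by topic, then re-rank for diversity" is a greedy maximal-marginal-relevance pass that penalizes revisiting the same concept-tree branch. This is a hypothetical sketch, not the authors' code: the topic labels, toy embedding vectors, and the 0.1 repeat-topic penalty are all invented for illustration.

```python
import math

def cosine(a, b):
    # Plain cosine similarity over small lists standing in for embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(lessons, query, k=2, lam=0.5):
    """Greedy re-ranking: reward relevance to the query, penalize
    redundancy with already-picked lessons and repeated topics."""
    picked = []
    while len(picked) < min(k, len(lessons)):
        used_topics = {lessons[i]["topic"] for i in picked}
        best, best_score = None, -float("inf")
        for i, lesson in enumerate(lessons):
            if i in picked:
                continue
            rel = cosine(lesson["vec"], query)
            red = max((cosine(lesson["vec"], lessons[j]["vec"])
                       for j in picked), default=0.0)
            score = lam * rel - (1 - lam) * red
            if lesson["topic"] in used_topics:
                score -= 0.1   # invented penalty for reusing a branch
            if score > best_score:
                best, best_score = i, score
        picked.append(best)
    return [lessons[i] for i in picked]

lessons = [
    {"topic": "nav",   "vec": [1.0, 0.0], "text": "lesson A"},
    {"topic": "nav",   "vec": [0.9, 0.1], "text": "lesson B"},  # near-duplicate of A
    {"topic": "forms", "vec": [0.6, 0.8], "text": "lesson C"},
]
print([l["text"] for l in retrieve(lessons, [1.0, 0.0], k=2)])
# ['lesson A', 'lesson C']
```

Note what pure similarity retrieval would do here: lesson B is the second-most similar to the query, but it adds almost nothing beyond lesson A. The diversity-aware pass skips it in favor of lesson C from a different branch, which is exactly the redundancy failure of flat retrieval the script describes.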
The implications are striking. By focusing on what the agent remembers and how it organizes that memory, the work achieves measurable improvements across diverse tasks without touching model weights. This reframes test-time inference as a memory design problem, not just a compute scaling challenge.
Experience decoction shows us that the sharpest agents may not be those that think harder, but those that remember smarter. Visit EmergentMind.com to explore this paper further and create your own research video.