Decocted Experience Improves Test-Time Inference in LLM Agents
Abstract: There is growing interest in improving LLMs without updating model parameters. One well-established direction is test-time scaling, where increased inference-time computation (e.g., longer reasoning, sampling, or search) is used to improve performance. However, for complex reasoning and agentic tasks, naively scaling test-time compute can substantially increase cost and still lead to wasted budget on suboptimal exploration. In this paper, we explore context as a complementary scaling axis for improving LLM performance, and systematically study how to construct better inputs that guide reasoning through experience. We show that effective context construction critically depends on decocted experience. We present a detailed analysis of experience-augmented agents, studying how to derive context from experience, how performance scales with accumulated experience, what characterizes good context, and which data structures best support context construction. We identify decocted experience as a key mechanism for effective context construction: extracting essence from experience, organizing it coherently, and retrieving salient information to build effective context. We validate our findings across reasoning and agentic tasks, including math reasoning, web browsing, and software engineering.
Explain it Like I'm 14
Clear, simple explanation of the paper “Decocted Experience Improves Test-Time Inference in LLM Agents”
Overview: What is this paper about?
This paper looks at how to make AI “agents” (LLMs that can plan, decide, and use tools) smarter without changing their brains (their parameters). Instead of making them think longer at test time, the authors show we can make them better by giving them better “context” — the information we put in their prompt. Their main idea is to use “decocted experience”: boil down past experiences into short, useful lessons, organize those lessons well, and fetch the right ones when a new problem appears.
Think of it like this: rather than giving a robot a huge diary of everything it ever did, give it a tidy notebook of tips it learned, file those tips in a good system, and pull out the most helpful ones right before it acts.
Key objectives: What questions did they ask?
The authors set out to answer three simple questions:
- How should an AI turn past experiences into helpful context for new tasks?
- As the AI gathers more and more experience, how should it store and organize those experiences so it doesn’t get overwhelmed?
- What makes a “good” context, and how can we build memory structures that find it reliably?
Methods: How did they study it?
They tested agents on three kinds of tasks:
- Math reasoning: solving problems that require step-by-step thinking.
- Web browsing (WebShop): shopping online to meet a user’s request.
- Software engineering (SWE-bench): fixing real bugs in code.
Here’s the approach, explained in everyday terms:
- Collect experience: Let the agent try tasks and record what it did and what worked.
- Distill lessons (“decoction”): Instead of keeping long, noisy action logs, the agent summarizes each successful trajectory into a short, reusable lesson (like a tip or strategy).
- Build memory:
- Simple version: Keep a flat list of problems and their distilled lessons.
- Smarter version: Organize lessons by “concept groups” in a tree structure (like a library with sections → shelves → books).
- Retrieve context: When a new problem arrives, find the most relevant and diverse lessons to put into the prompt. They tried:
- Top-K similarity (grab the K most similar lessons).
- Concept tree retrieval (pick lessons from several relevant concept groups, then let the model re-rank them).
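The pipeline above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: word overlap (Jaccard) stands in for the embedding similarity the authors actually use, and the stored lessons are invented examples.

```python
# Toy sketch: flat memory of (problem, distilled lesson) pairs plus
# Top-K retrieval, with word overlap standing in for embedding cosine
# similarity. A real system would embed texts with a learned model.

def similarity(a: str, b: str) -> float:
    """Jaccard word overlap -- a cheap stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Flat memory: each entry pairs a problem summary with its distilled lesson.
memory = [
    ("right triangle side lengths", "Try the Pythagorean theorem first."),
    ("filter products by price", "Apply price filters before comparing reviews."),
    ("failing unit test in parser", "Reproduce the failure locally before patching."),
]

def retrieve_top_k(query: str, k: int = 2):
    """Return the k lessons whose stored problems are most similar to the query."""
    ranked = sorted(memory, key=lambda m: similarity(query, m[0]), reverse=True)
    return [lesson for _, lesson in ranked[:k]]

def build_context(query: str, k: int = 2) -> str:
    """Concatenation-based context: retrieved lessons, then the new problem."""
    lessons = retrieve_top_k(query, k)
    return "\n".join(f"Lesson: {l}" for l in lessons) + f"\nProblem: {query}"

print(build_context("a right triangle with unknown side lengths", k=1))
```

The same skeleton supports the smarter variants: swap the flat list for a concept tree, and the `sorted` call for tree descent plus LLM re-ranking.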
They also analyzed why good context helps:
- Information-theoretic idea: If the context gives useful hints, the AI can reach a solution in fewer steps (like being told "use the Pythagorean theorem" before starting a geometry problem). They link this to "information gain": more helpful context lowers the model's uncertainty and shortens its answers.
Main findings: What did they discover and why is it important?
Here are the main takeaways, explained simply:
- Distilled lessons beat raw experience in messy, real-world tasks.
- For web shopping and coding, short lessons worked better than full action logs. Long logs include lots of irrelevant details (like every page click), which can distract the model. Lessons kept the important strategies and dropped the noise.
- For math, detailed examples were sometimes slightly better, likely because math benefits from seeing full reasoning steps.
- Better context doesn’t have to be bigger.
- Adding more and more raw experience to the prompt can actually hurt performance if the prompt gets too long and noisy.
- Distilled lessons achieved strong performance with fewer input tokens (shorter prompts), so the agent was both more accurate and more efficient.
- There’s a “sweet spot” for how much memory to keep.
- If you compress memory too much, you lose important variety.
- If you keep everything, you get too many near-duplicates, which hurts generalization.
- Clustering and keeping representative lessons from each group gave the best balance: relevant and diverse.
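The "keep representatives from each group" idea can be sketched with a tiny k-means pass over lesson embeddings. This is an illustrative simplification under toy 2-D embeddings; the paper clusters high-dimensional text embeddings, and the exact consolidation procedure may differ.

```python
# Sketch of memory consolidation: cluster lesson embeddings and keep one
# representative per cluster, trimming near-duplicates while preserving
# variety. Embeddings here are toy 2-D points for illustration.
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

def consolidate(points, k):
    """Keep the point closest to each cluster centre as its representative."""
    centers, clusters = kmeans(points, k)
    return [
        min(cluster, key=lambda p: math.dist(p, center))
        for center, cluster in zip(centers, clusters) if cluster
    ]

# Two tight groups of near-duplicate lessons collapse to two representatives.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(consolidate(embeddings, k=2))
```

Choosing `k` here is exactly the "sweet spot" question: too small loses variety, too large keeps the duplicates.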
- What makes a good context?
- It should be both relevant (close to the new problem) and diverse (not just copies of the same kind of tip).
- Too much similarity = redundancy; too much diversity = off-topic. The best performance came from balancing both.
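One standard way to operationalize this balance is maximal-marginal-relevance (MMR) style greedy selection. The paper's retrieval differs in detail; this sketch just illustrates the relevance-redundancy trade-off, with the `lam` weight playing the role of the paper's λ.

```python
# MMR-style selection: greedily pick items that are relevant to the query
# but not redundant with items already selected. lam=1 means pure
# relevance; lower values penalize near-duplicates more heavily.

def mmr_select(query_sim, item_sims, k, lam=0.7):
    """query_sim[i]: relevance of item i to the query.
    item_sims[i][j]: similarity between items i and j.
    Returns indices of the k selected items."""
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((item_sims[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Items 0 and 1 are near-duplicates; item 2 is less relevant but novel.
query_sim = [0.9, 0.88, 0.7]
item_sims = [[1.0, 0.95, 0.1],
             [0.95, 1.0, 0.1],
             [0.1, 0.1, 1.0]]
print(mmr_select(query_sim, item_sims, k=2))  # picks 0, then the diverse 2
```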
- A tree-structured memory helps retrieval.
- Organizing lessons into a “concept tree” (broad topics → finer subtopics) helped the agent pick relevant lessons from different groups, increasing diversity without losing relevance.
- This improved performance compared to a flat memory that only used similarity search.
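A minimal version of concept-tree retrieval looks like this. The tree contents, the word-overlap scoring, and the two-level depth are invented for illustration; the paper's tree is built automatically from LLM-extracted concepts and its candidates pass through an LLM re-ranker.

```python
# Toy concept tree: broad topics -> finer concepts -> distilled lessons.
# Retrieval scores every leaf and gathers lessons from the top few leaves,
# so the candidate set spans several relevant concept groups.

tree = {
    "algebra": {
        "quadratics": ["Complete the square when factoring fails."],
        "inequalities": ["Check sign changes at each boundary point."],
    },
    "geometry": {
        "triangles": ["Try the Pythagorean theorem for right triangles."],
        "circles": ["Relate inscribed angles to central angles."],
    },
}

def overlap(a: str, b: str) -> int:
    """Toy relevance score: shared words between query and a concept name."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(query: str, n_leaves: int = 2):
    """Score leaves by topic+concept relevance and pool lessons from the
    top n_leaves leaves (the candidates an LLM re-ranker would then order)."""
    leaves = [
        (overlap(query, topic) + overlap(query, concept), concept, lessons)
        for topic, concepts in tree.items()
        for concept, lessons in concepts.items()
    ]
    leaves.sort(key=lambda t: t[0], reverse=True)
    return [lesson for _, _, lessons in leaves[:n_leaves] for lesson in lessons]

print(retrieve("right triangles in geometry"))
```

Pooling from multiple leaves is what buys diversity: a flat Top-K over one similarity score would likely return several near-identical triangle tips instead.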
- Why this works (intuition from information theory):
- More informative context reduces the number of steps the model needs to reach an answer. The paper shows a tight link between “how uncertain the model is after reading the context” and “how long its solution is.” Less uncertainty → shorter, more efficient solutions.
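The uncertainty measurement behind this finding can be sketched as a Monte Carlo estimate of conditional entropy: average the negative log-probability per generated token over sampled answers. The log-prob values below are made up for illustration; in practice they come from the model's token-level outputs.

```python
# Monte Carlo proxy for H(Y | x, c): average negative log-probability per
# generated token across sampled continuations. Lower values mean the
# context left the model less uncertain about its answer.

def estimate_conditional_entropy(sampled_logprobs):
    """sampled_logprobs: one list of per-token log-probs per sampled answer.
    Returns average negative log-prob per token, in nats/token."""
    total_nll = sum(-lp for sample in sampled_logprobs for lp in sample)
    total_tokens = sum(len(sample) for sample in sampled_logprobs)
    return total_nll / total_tokens

# With helpful context the model is confident (log-probs near 0) and brief;
# without it, tokens are less certain and the answers run longer.
with_context = [[-0.1, -0.2, -0.1], [-0.15, -0.1, -0.2]]
without_context = [[-1.2, -0.9, -1.5, -1.1, -0.8], [-1.0, -1.3, -0.7, -1.2, -1.4]]
h_with = estimate_conditional_entropy(with_context)
h_without = estimate_conditional_entropy(without_context)
print(f"H with context: {h_with:.3f} nats/token, without: {h_without:.3f}")
```

The gap between the two estimates is (a proxy for) the information gain of the context, which the paper links to shorter solutions.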
Implications: Why does this matter?
- Smarter prompts, not just longer thinking: Instead of making the AI spend more time thinking at test time (which can be slow and expensive), we can make it better by feeding it the right distilled lessons.
- Practical for real-world agents: For tasks like browsing the web or fixing code, this approach makes agents faster and more reliable, because they reuse what actually worked before.
- Scales with experience: As agents gather more experience, organizing it well (especially with a concept tree) prevents overload and keeps retrieval sharp.
- A roadmap for future AI agents: This suggests future systems should:
- Continuously learn from their own successes,
- “Boil down” those experiences into tips,
- Store them in structured memory,
- And fetch the best mix of relevant and diverse lessons for each new problem.
In short, the paper shows that teaching AI to learn and organize its own “tips and tricks” can make it significantly better at solving new problems quickly and accurately—without changing its core model.
Knowledge Gaps
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to guide actionable follow-up work:
- Limited task coverage and domains: results are shown on math reasoning, WebShop, and SWE-bench; generalization to other domains (e.g., scientific QA, legal, medical), longer-horizon planning, and multi-modal or tool-heavy tasks remains untested.
- Model diversity: evaluations focus on two open-source LLMs; robustness across different model sizes, architectures, and instruction-tuning regimes is not established.
- Language and locale generality: experiments are in English; performance for multilingual settings, locale-specific tasks, and cross-lingual transfer is unknown.
- When to use raw trajectories vs distilled lessons: math benefits from raw traces while agentic tasks prefer lessons; there is no method to decide per task/query which representation to use, nor a hybrid policy that adapts on the fly.
- Use of failures and near-misses: memory keeps one successful trajectory per problem; how to leverage failed trajectories to encode pitfalls or counter-examples is left unexplored.
- Negative transfer and noise: the impact of low-quality, noisy, or domain-misaligned experience (including adversarial or poisoned lessons) on retrieval and performance is not analyzed.
- Safety, calibration, and hallucinations: effects of decocted context on safety risks, hallucination rates, action reliability, and confidence calibration are not reported.
- Continual/online operation: the system is evaluated offline; mechanisms to incrementally update lessons, clusters, and trees without full rebuilds (and to avoid catastrophic forgetting) are not developed.
- Memory scaling beyond tens of thousands: consolidation and retrieval are tested up to ~14k entries; scalability to orders of magnitude larger memories (e.g., ≥10^6) and distributed/sharded stores is unproven.
- Selecting the “sweet spot” memory size: while an intermediate consolidation level performs best, there is no procedure to automatically identify or track this optimum as the memory evolves.
- Hyperparameter sensitivity: performance dependence on K (retrieved items), α (embedding mix), λ (relevance–diversity), tree depth/branching, and candidate pool size lacks systematic sensitivity analyses or tuning guidance.
- Retrieval method comparisons: the approach uses simple similarity search and LLM re-ranking; comparisons to MMR, learned cross-encoder rerankers, and advanced dense/sparse hybrid retrieval are missing.
- Embedding dependence: results hinge on one embedding model; robustness to embedding quality, domain shift, and alternative encoders (or task-specific retrievers) is not evaluated.
- Re-ranking cost vs benefit: LLM-based re-ranking in the concept tree is compute-heavy; ablations quantifying end-to-end latency, token cost, and wall-clock trade-offs are absent.
- End-to-end cost accounting: token counts/steps are proxies; there is no comprehensive budget including distillation time, clustering/tree build and maintenance, retrieval, and re-ranking overhead.
- Context formatting constraints: only concatenation-based prompts are used; potential gains from structured formats (schemas, tool-aware templates, pointers/refs) are left unexplored.
- Lesson distillation procedure: the distillation relies on the same base model; effects of teacher–student distillation, lesson length control, standardization, and quality assurance of distilled content are not studied.
- Theory applicability: Proposition 4.1 assumes a per-token entropy lower bound and validates only on math; its applicability to agentic trajectories (tool calls, observations) and non-iid token streams is unverified.
- Estimating information gain at test-time: the paper uses Monte Carlo entropy estimates post hoc; practical, low-cost pre-generation proxies for information gain are not developed.
- Modest correlation of proxy metric: the relevance–diversity proxy correlates weakly-to-moderately with gains (r≈0.25); methods to strengthen or learn better proxies and to directly optimize retrieval for the trade-off are missing.
- Tree construction and maintenance: concept extraction quality (LLM-generated descriptors), clustering stability, and robustness to errors/drift in concepts are not assessed; incremental tree updates and rebalancing policies are open.
- Cross-agent and cross-model transfer: experience is (mostly) self-experience; how to share, adapt, or filter experience across different agents/models without negative transfer is unclear.
- Privacy and compliance: storing and reusing trajectories and environment logs may expose sensitive data; policies for redaction, differential privacy, or compliance are not addressed.
- Robustness to distribution shift: how decocted memory performs on out-of-domain or novel tasks, and mechanisms for OOD detection and retrieval fallback, remain open.
- Evaluation rigor: significance tests, variance across random seeds, and error analyses (e.g., where lessons mislead) are not reported, leaving uncertainty about reliability.
- Data leakage safeguards: while train–test splits are stated, explicit checks for overlap or leakage between memory and evaluation (especially in math/code) are not described.
- Heterogeneous memory content: SWE involves code patches and tests; general methods for normalizing, chunking, and retrieving mixed modalities (code, HTML, tool traces) are not developed.
- Adaptive retrieval depth/width: K and n_e are fixed; policies that adapt the number of lessons/leaves per query based on uncertainty or early signals are uninvestigated.
- Combining with test-time compute scaling: interactions between decocted context and sampling/search depth (e.g., CoT length, DFS/BFS) are not systematically studied to find joint optima.
- Failure handling at inference: mechanisms to detect when retrieved context is misleading and to recover (e.g., retry with different concept leaves or suppress context) are not provided.
- Benchmarking against strong memory systems: empirical comparisons to established memory frameworks (e.g., hierarchical memory OSs or RL-optimized memory) are absent, limiting understanding of relative gains.
Practical Applications
Below are actionable, real‑world applications derived from the paper’s findings on “decocted experience” (lesson distillation, memory consolidation, and concept‑tree retrieval) for improving LLM agent performance at test time. Each item notes target sectors, potential tools/workflows, and assumptions/dependencies.
Immediate Applications
These can be deployed with current LLMs, embeddings, and standard RAG/agent stacks.
- Software engineering agents (software)
- Use case: Improve code‑fixing, triage, and CI bot reliability by replacing long raw traces with distilled “lessons” from past tickets (e.g., SWE‑bench‑style patches) and retrieving them via concept‑tree memory.
- Tools/workflows:
- “Lesson Distiller” that converts successful fix trajectories into reusable patterns.
- Clustering-based memory consolidation to maintain a “sweet spot” memory size.
- LLM re‑ranking of candidate lessons retrieved from concept leaves.
- Assumptions/dependencies: Access to historical PRs/issues and execution logs; stable tool‑calling; high‑quality embeddings; governance for code/IP.
- E‑commerce browsing and shopping assistants (retail/e‑commerce)
- Use case: More efficient, higher‑quality product finding and decision logic by distilling browsing sessions into lessons (e.g., ranking heuristics, vendor quirks) and retrieving conceptually diverse guidance.
- Tools/workflows: Concept tree built from category/subcategory “concepts” extracted from sessions; Top‑K lesson retrieval with LLM re‑ranking; context budget tuning to avoid prompt bloat.
- Assumptions/dependencies: Consent for logging sessions; product taxonomy alignment; embedding coverage across long‑tail SKUs.
- Enterprise knowledge agents with “memory OS” upgrades (software, general enterprise)
- Use case: Replace noisy conversation or workflow logs with lesson‑level memory for faster, cheaper test‑time inference in helpdesk, onboarding, and SOP assistants.
- Tools/workflows:
- Consolidate memory via k‑means to reduce redundancy while preserving diversity.
- Monitor relevance‑diversity proxy to tune retrieval and memory size in production.
- Assumptions/dependencies: Data retention and privacy policies; existing RAG infrastructure; periodic offline re‑clustering.
- Customer support and ticket triage (CX/operations)
- Use case: Faster resolution scripts by retrieving distilled “what worked” lessons for similar issues rather than full historical transcripts.
- Tools/workflows: Ticket log ingestion → lesson distillation → concept‑tree retrieval; lightweight dashboards for diversity vs. relevance score tracking.
- Assumptions/dependencies: Sufficient successful case coverage; labeling for outcomes; PII redaction.
- Education and tutoring systems (education)
- Use case: Personalized hints and worked examples selected from distilled lessons spanning diverse problem concepts, reducing token lengths while maintaining solution quality.
- Tools/workflows: Concept extraction per lesson (topic | problem pattern | technique); lesson bank curated for curricular standards; K tuning to maintain a diversity balance.
- Assumptions/dependencies: Alignment with pedagogy; safeguards against hallucinated steps; content licensing.
- Compliance and legal research assistants (legal/regulated industries)
- Use case: Retrieve distilled precedent “reasoning patterns” (e.g., how prior filings solved specific compliance hurdles) to guide drafting and review with fewer tokens.
- Tools/workflows: Ingestion of annotated filings → lesson distillation → hierarchical retrieval across topics/arguments; LLM re‑ranking to ensure relevance.
- Assumptions/dependencies: Secure handling of confidential documents; expert validation of distilled lessons.
- Financial research and operations bots (finance)
- Use case: Produce consistent research notes and workflow decisions (e.g., data‑cleaning, reconciliation) by drawing on distilled, reusable “playbooks” from past tasks.
- Tools/workflows: Lesson banks keyed by asset class/task; consolidation to avoid overspecialization; output‑length monitoring as a proxy for information gain improvements.
- Assumptions/dependencies: Audit trails; model risk management; access to representative historical episodes.
- Process automation/RPA copilots (enterprise operations)
- Use case: Convert successful automations (click paths, tool sequences) into generalizable lessons; retrieve diverse workflows to improve robustness to UI changes.
- Tools/workflows: Action‑observation logs → lesson distillation removing UI noise; concept trees across applications/steps; LLM re‑ranking before execution.
- Assumptions/dependencies: Stable tool interfaces; permissioned logging; fallback policies for mis‑retrievals.
- Healthcare admin and documentation assistants (healthcare; near‑term for non‑clinical decisions)
- Use case: Streamline documentation, coding, and scheduling by retrieving distilled administrative lessons; reduce interaction steps in EHR navigation.
- Tools/workflows: Admin/EHR interaction logs → lesson distillation; concept trees aligned with visit types and coding patterns.
- Assumptions/dependencies: Strict PHI handling; limit to non‑clinical or clinician‑in‑the‑loop use; institution consent.
- Developer tools for context engineering (software tooling)
- Use case: Off‑the‑shelf “Decocted Memory” SDK that adds lesson distillation, consolidation, concept‑tree building, and relevance‑diversity scoring to existing LangChain/LlamaIndex stacks.
- Tools/workflows:
- Embedding + k‑means modules; lesson schema templates; LLM re‑ranker.
- Monitoring of K, input token budget, and retrieval diversity to optimize cost/performance.
- Assumptions/dependencies: Compatible base LLM with tool‑use/CoT; observability hooks for output lengths and success rates.
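As one concrete piece of such an SDK, a lesson record might look like the sketch below. The paper does not prescribe a schema; every field name here (concept, lesson, outcome, source_task_id, tags) is an illustrative assumption for the "lesson schema templates" mentioned above.

```python
# Hypothetical lesson record for a "Decocted Memory" SDK. Field names are
# assumptions, not the paper's schema; they mirror the concept descriptor
# format (topic | problem pattern | technique) and provenance needs above.
from dataclasses import asdict, dataclass, field

@dataclass
class Lesson:
    concept: str          # "topic | problem pattern | technique" descriptor
    lesson: str           # the distilled, reusable tip
    outcome: str          # e.g. "success" -- used to filter what gets stored
    source_task_id: str   # provenance back to the original trajectory
    tags: list = field(default_factory=list)

rec = Lesson(
    concept="web-shopping | price constraint | filter-first",
    lesson="Apply price filters before comparing product reviews.",
    outcome="success",
    source_task_id="webshop-0421",
    tags=["retail"],
)
print(asdict(rec)["concept"])
```

Keeping provenance (`source_task_id`) in the record is what would make the auditing and attribution use cases in the long-term applications below feasible.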
- Personal digital assistants (daily life)
- Use case: Better travel planning, shopping, and routine management by reusing distilled lessons from past successes while avoiding long, noisy histories.
- Tools/workflows: On‑device/consumer cloud lesson stores; small concept trees per domain (travel, recipes, budgets).
- Assumptions/dependencies: User consent; privacy controls; lightweight embeddings for edge deployment.
Long‑Term Applications
These require additional research, scaling, or safeguards (e.g., robust lesson accuracy, autonomy, privacy).
- Continual, closed‑loop self‑evolving agents (software, cross‑sector)
- Application: Agents that continuously collect experience, distill lessons, reorganize memory, and adapt retrieval policies online without parameter updates.
- Dependencies: Reliable online evaluation; drift detection; safeguards against compounding errors.
- RL‑optimized memory management (software, robotics)
- Application: Learn consolidation and retrieval policies (including K and tree navigation) via reinforcement learning to maximize task reward under token budgets.
- Dependencies: Stable reward signals; offline simulators; safety constraints for exploration.
- Safety‑aware clinical decision support (healthcare)
- Application: Bring decocted experience to clinician‑facing reasoning (care pathways, differential diagnoses) with rigorous governance and supervision.
- Dependencies: High‑fidelity, audited clinical logs; bias and error analysis; regulatory approvals (FDA/CE).
- Robotics and embodied agents (robotics)
- Application: Distill episodic control experiences into reusable “procedural lessons” for planning and recovery behaviors; retrieve diverse concepts to handle edge cases.
- Dependencies: Sensor/action abstraction into text concepts; sim‑to‑real transfer; latency constraints.
- Multi‑agent shared lesson repositories (enterprise ecosystems)
- Application: Organizational “lesson banks” that different agents query, with access control and provenance, to accelerate cross‑team problem solving.
- Dependencies: Standardized lesson schemas; permissioning; attribution and versioning.
- Privacy‑preserving lesson distillation (all sectors)
- Application: Federated or differential‑privacy mechanisms that allow agents to learn from experience while protecting sensitive data.
- Dependencies: DP accounting for LLM outputs; secure multi‑party embeddings/clustering.
- Information‑theoretic control of inference budgets (software/finance/healthcare)
- Application: Use entropy‑based estimates (proxying information gain) to adapt stopping criteria and allocate compute dynamically across queries.
- Dependencies: Calibrated token‑level probabilities; robust entropy estimators across domains.
- Standards for “concept” and “lesson” metadata (policy, industry consortia)
- Application: Interoperable schemas for concept descriptions (topic | pattern | technique), provenance, and outcome labels to enable tooling and auditing.
- Dependencies: Cross‑vendor coordination; mappings to existing taxonomies (e.g., SNOMED, NAICS).
- On‑device/edge concept trees (mobile/IoT)
- Application: Lightweight, private memory structures for personal and industrial agents operating with limited connectivity.
- Dependencies: Compact embeddings; incremental clustering; energy constraints.
- Regulatory and audit tooling (policy/compliance)
- Application: Supervisory dashboards that track relevance‑diversity trade‑offs, memory size “sweet spots,” and lesson provenance for compliance and risk management.
- Dependencies: Logging standards; explainability of retrieval and distillation decisions.
Cross‑cutting assumptions and dependencies
- Data availability and quality: Access to self‑experience logs with clear outcomes; noise filtering is crucial for accurate lesson distillation.
- Model and infra constraints:
- Base LLM must support tool use/CoT; reliable token‑probability outputs help estimate entropy proxies.
- High‑quality embeddings and re‑rankers are needed; k‑means/bi‑secting clustering can be run offline at scale.
- Context budget management: Must tune K and consolidate memory to avoid long, noisy prompts; monitor the relevance‑diversity proxy and output lengths.
- Governance and ethics: Privacy, IP, and safety oversight for storing and reusing experience; human‑in‑the‑loop for high‑stakes decisions.
- Generalization vs. specialization: Maintain diversity to avoid overfitting to narrow patterns; periodically reassess the “sweet spot” memory size as domains evolve.
These applications leverage the paper’s central insight: extracting, organizing, and retrieving the “essence” of past interactions as concise, diverse, and relevant lessons yields more effective and efficient LLM agent inference than naively scaling test‑time compute or dumping raw trajectories into context.
Glossary
- Agentic tasks: Tasks where an LLM acts as an agent interacting with tools/environments over multiple steps, requiring sequential decision-making. "for complex reasoning and agentic tasks,"
- Chain-of-thought (CoT) prompting: A prompting method that elicits step-by-step reasoning before answers to improve performance on reasoning tasks. "Chain-of-thought (CoT) prompting (Wei et al., 2022) further shows that multi-step reasoning can substantially improve performance on reasoning tasks."
- Concatenation-based construction: Building the model’s input context by directly concatenating the query with retrieved items. "we assume a simple concatenation-based construction."
- Conditional entropy: The uncertainty of the output distribution given specific inputs (e.g., query and context). "We empirically validate Proposition 4.1 by estimating the conditional entropy Ĥ(Y | x,c)"
- Context engineering: The practice of optimizing what information is fed to an LLM at inference time (e.g., retrieval, formatting), rather than changing model weights. "Context engineering can be viewed as optimizing the information fed into an LLM at inference time"
- Cosine similarity: A similarity metric between vectors based on the cosine of the angle between them, commonly used with embeddings. "using recursive bisecting k-means with cosine similarity."
- Decocted experience: Processed and condensed experience that captures essential, reusable strategies for building effective context. "We identify decocted experience as a key mechanism for effective context construction"
- Embedding model: A model that converts text into vector representations for similarity and retrieval. "We use Qwen3-Embedding-4B (Zhang et al., 2025d) as the embedding model Te."
- Experience-augmented agents: Agents that leverage stored past interactions (experience) to construct context for improved test-time inference. "We present a detailed analysis of experience-augmented agents,"
- Experience decoction: Transforming raw experience by extracting essence, organizing it, and retrieving salient information for context. "experience decoction through lesson distillation"
- Hierarchical concept tree: A tree-structured memory that clusters lessons by concepts at multiple granularities to support diverse-yet-relevant retrieval. "we propose a structured memory called a hierarchical concept tree"
- Information gain: The reduction in output uncertainty due to observing the context, often defined via entropy differences. "information gain of context c for query x defined as"
- k-means: A clustering algorithm that partitions data points (e.g., embeddings) into k clusters by minimizing within-cluster variance. "using standard k-means."
- Lesson distillation: Compressing detailed trajectories into succinct, reusable “lessons” that capture transferable reasoning patterns. "a compression mechanism that we refer to as lesson distillation."
- Memory consolidation: Compressing and organizing stored experience to reduce redundancy while preserving diversity for effective retrieval. "Memory consolidation achieves a sweet spot at intermediate sizes."
- Monte Carlo estimator: A statistical estimator that uses sampling—in this case token log-probabilities—to approximate quantities like entropy. "we compute a Monte Carlo estimator of the conditional entropy"
- Mutual information: An information-theoretic measure quantifying dependency between variables (e.g., outputs and context). "the per-query mutual information I(Y; C | X = x)"
- Pearson correlation: A statistic measuring linear correlation between two variables, ranging from -1 to 1. "The positive Pearson correlation (r = 0.25)"
- Retrieval-augmented generation (RAG): Enhancing generation by retrieving and injecting relevant external knowledge into the prompt. "retrieval-augmented generation (RAG) (Lewis et al., 2020; Gao et al., 2023; Qian et al., 2025)"
- Retrieval relevance: A metric assessing how semantically similar retrieved items are to the query. "we consider a new metric Avg. Retrieval Relevance"
- Semantic retrieval: Selecting context items based on semantic similarity in embedding space rather than lexical overlap. "using semantic retrieval."
- Stopping time: A (random) time at which a process stops; here, when generation emits the first EOS token. "let t denote the stopping time"
- Test-time scaling: Improving performance by allocating more computation during inference (e.g., longer reasoning, more samples, search). "One well-established direction is test-time scaling"
- Top-K retrieval: Selecting the K highest-scoring (e.g., most similar) items from memory for inclusion in the prompt. "increasing K in Top-K retrieval."