
ReasoningBank: Memory-Driven Reasoning Framework

Updated 5 October 2025
  • ReasoningBank is a memory-driven framework that abstracts high-level reasoning signals from both successful and failed agent interactions for improved adaptability.
  • It employs a dual-phase mechanism to distill strategies from agent outcomes, updating its repository continually for better task performance.
  • Empirical evaluations reveal up to an 8% success rate increase and reduced interaction steps, demonstrating the efficiency of its memory-aware test-time scaling.

ReasoningBank is a memory-driven framework for scaling the self-evolution of LLM agents through the distillation, retrieval, and continual consolidation of generalizable reasoning strategies derived from both successful and failed agent interactions. It is motivated by the observation that LLM agents operating in persistent real-world contexts should not simply replay or accumulate raw experience, but should abstract actionable, transferable reasoning knowledge that can be exploited and improved as they encounter continuous streams of diverse tasks.

1. Conceptual Framework

At its core, ReasoningBank departs from traditional memory mechanisms, which store either raw trajectories or successful routines, by extracting high-level reasoning signals (decision rationales, failure modes, and operational insights) from both agent successes and failures. Each unit of reasoning memory is structured as a triple:

  • Title: a concise summary of the underlying reasoning strategy.
  • Description: a brief statement of context or use case.
  • Content: the detailed distilled reasoning steps or operational procedures.

This format is machine-usable (enabling in-context retrieval for LLM inference) and human-interpretable (supporting model introspection, debugging, and refinement of reasoning strategies). The extraction pipeline explicitly targets abstractions that are not tied to specific surface forms but generalize across tasks. The memory bank thus acts as a repository that can be continually updated and referenced.
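
This triple maps naturally onto a small record type. The following is a minimal sketch in Python, not the paper's implementation; the field names come from the schema above, while the `to_prompt` helper is a hypothetical illustration of how an item might be rendered for in-context injection:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    """One distilled reasoning strategy in ReasoningBank."""
    title: str        # concise summary of the underlying strategy
    description: str  # brief statement of context or use case
    content: str      # detailed distilled reasoning steps or procedures

    def to_prompt(self) -> str:
        # Hypothetical rendering used to inject this item into an
        # agent's context window during retrieval.
        return f"## {self.title}\n{self.description}\n\n{self.content}"
```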

2. Mechanism and Memory Maintenance

The operational cycle of ReasoningBank involves several key phases:

  • Memory Retrieval: Upon encountering a new task, the agent queries ReasoningBank for previously distilled reasoning items using embedding-based similarity search (e.g., via Gemini embeddings). Relevant memory items are injected into the agent’s context to guide inference.
  • Self-Judgment and Extraction: Once task interaction is complete, an LLM-as-a-Judge pipeline determines the trajectory’s outcome (success or failure).
  • Distillation from Success: For successes, the system analyzes the trajectory to abstract the underlying reasoning strategies that led to the positive outcome, emphasizing transferability.
  • Distillation from Failure: For failures, the system conducts reflective analysis to diagnose the causes and derive lessons or prevention strategies—these are also incorporated as memory items.
  • Memory Update: New distilled reasoning items are appended to ReasoningBank for future retrieval. The process is formalized as

$M \leftarrow M \cup \{\text{extracted memory items}\}$

This closed-loop, dual-signal (success and failure) memory mechanism ensures that ReasoningBank’s contents evolve in both coverage and abstraction as agent experience accumulates.
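
The cycle can be summarized in a compact sketch. This is an illustration under assumptions, not the authors' code: `embed`, `run_agent`, `judge`, and `distill` are injected callables standing in for the embedding model (e.g., Gemini embeddings), the agent rollout, the LLM-as-a-Judge pipeline, and the distillation prompts, and `bank` is a list of (vector, MemoryItem) pairs:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_with_memory(task, bank, embed, run_agent, judge, distill, k=5):
    """One closed-loop pass: retrieve -> act -> judge -> distill -> update."""
    # 1. Memory retrieval: rank stored items by embedding similarity.
    q = embed(task)
    ranked = sorted(bank, key=lambda e: cosine(q, e[0]), reverse=True)
    context = [item for _, item in ranked[:k]]

    # 2. The agent acts with retrieved items injected into its context.
    trajectory = run_agent(task, context)

    # 3. Self-judgment: label the finished trajectory success or failure.
    success = judge(task, trajectory)

    # 4. Distill transferable strategies from either outcome sign.
    new_items = distill(trajectory, success)

    # 5. Memory update: M <- M ∪ {extracted memory items}.
    bank.extend((embed(item.content), item) for item in new_items)
    return trajectory, success
```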

3. Memory-Aware Test-Time Scaling (MaTTS)

To accelerate and diversify the learning of effective reasoning strategies, ReasoningBank introduces the MaTTS (Memory-aware Test-Time Scaling) methodology. MaTTS leverages additional computational budget during inference to generate abundant, diverse agent experiences per task (scaling factor $k$), which then provide strong contrastive signals for memory distillation:

  • Parallel Scaling: The agent generates multiple independent trajectories using varying memory contexts, facilitating cross-comparison and aggregation of common reasoning motifs.
  • Sequential Scaling: The agent performs iterative self-refinement within a trajectory, using intermediate signal checkpoints to further distill and consolidate reasoning procedures.

This iterative, memory-guided exploration provides a positive feedback loop—better and more diverse experience yields more informative reasoning memory, which in turn guides more effective scaling in subsequent inference. The result is substantial improvement in both effectiveness and efficiency compared to prior approaches that only record raw experience or successful paths.
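
The parallel variant can be sketched as follows, reusing `cosine` and the injected callables from the previous sketch; `distill_contrastive` is an assumed helper that compares successful and failed rollouts to isolate the reasoning motifs that made the difference:

```python
def matts_parallel(task, bank, embed, run_agent, judge,
                   distill_contrastive, k=4, top_m=5):
    """Memory-aware test-time scaling, parallel variant (a sketch)."""
    # Retrieve guiding memory for this task.
    q = embed(task)
    ranked = sorted(bank, key=lambda e: cosine(q, e[0]), reverse=True)
    context = [item for _, item in ranked[:top_m]]

    # Parallel scaling: k independent rollouts. Diversity here comes
    # from agent sampling; varying the retrieved memory context per
    # rollout, as described above, is an equally valid choice.
    rollouts = [run_agent(task, context) for _ in range(k)]
    labels = [judge(task, t) for t in rollouts]

    # Contrastive distillation across successes and failures.
    new_items = distill_contrastive(rollouts, labels)
    bank.extend((embed(item.content), item) for item in new_items)

    # Return a successful trajectory if one exists.
    for traj, ok in zip(rollouts, labels):
        if ok:
            return traj
    return rollouts[0]
```

Sequential scaling would instead refine a single trajectory across iterations, distilling from intermediate checkpoints rather than from cross-trajectory contrast.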

4. Empirical Evaluation and Performance

ReasoningBank and MaTTS have been systematically evaluated on web-based and software engineering benchmarks:

  • Web-based tasks (e.g., WebArena, Mind2Web) span domains such as shopping, site administration, coding (GitLab), forums (Reddit), and composite “Multi” settings.
  • Software engineering tasks include repository-level issue resolution in SWE-Bench-Verified.

Key metrics include:

  • Success Rate: the proportion of fully completed tasks (see the sketch below)
  • Efficiency: the average number of interaction steps per task
  • Element Accuracy, Action F1, Step Success Rate, and Task Success Rate (for Mind2Web)
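
The first two metrics reduce to simple aggregates over per-task records. A minimal sketch, with the record fields being hypothetical names:

```python
def summarize(results):
    """results: list of dicts like {"success": bool, "steps": int}."""
    n = len(results)
    success_rate = sum(r["success"] for r in results) / n
    avg_steps = sum(r["steps"] for r in results) / n
    return success_rate, avg_steps

# Example: 8 of 10 tasks solved, averaging 12.4 interaction steps.
sr, steps = summarize([{"success": True, "steps": 12}] * 8
                      + [{"success": False, "steps": 14}] * 2)
assert sr == 0.8 and steps == 12.4
```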

Quantitative results show that ReasoningBank increases success rates by up to 8% over baselines and reduces step counts. It outperforms state-of-the-art memory mechanisms that simply store raw trajectories or success-only routines, with consistent gains observed across diverse domains and in generalization settings.

5. Emergent Behaviors and Self-Evolving Reasoning

With continued operation, the strategies in ReasoningBank progressively shift from basic, procedural heuristics (“find navigation links”) to complex, compositional, and cross-task strategies (such as “reverify identifiers if initial attempt fails” or “cross-reference multiple requirements before next action”). As the memory bank expands, agent behavior exhibits:

  • Adaptive Self-Reflection: Incorporation of failure-derived lessons enables agents to avoid repeating past mistakes.
  • Compositional Reasoning: Later-stage strategies combine multiple memory items for novel scenarios.
  • Robust Decision-Making: Decision quality continually improves as more nuanced, transferable reasoning is consolidated.

These emergent behaviors indicate that ReasoningBank serves not just as storage but as a dynamic substrate for self-improvement.

6. Implications and Future Directions

ReasoningBank establishes a new scaling dimension, termed memory-driven experience scaling (Editor's term), which complements model parameter and compute scaling. Its mechanisms enable agents to:

  • Continuously evolve without external supervision by distilling and utilizing both successful and unsuccessful experience.
  • Generalize more effectively across domains, as lessons abstracted from one context can inform strategy in another.
  • Support compositional and hierarchy-aware reasoning by enabling dynamic combination of memory items.

The framework’s current limitations include its reliance on LLM-as-a-Judge for trajectory outcome determination, which points to further research on robust automatic verification and human-in-the-loop systems. Suggested future directions include integrating episodic or hierarchical memories and more advanced retrieval and consolidation strategies for scaling up automatic agent self-improvement.

7. Context within Broader Reasoning Evaluation

ReasoningBank can be situated among a new generation of benchmarks and frameworks (e.g., R3/TRMR (Wang et al., 2020), BaRDa (Clark et al., 2023), FINEREASON (Chen et al., 27 Feb 2025), JRDB-Reasoning (Jahangard et al., 14 Aug 2025)) that assess not only task completion but the explicit reasoning or error-correction process, promoting transparent and diagnostic evaluation. It uniquely targets the continual abstraction and reuse of reasoning strategies, a core requirement for persistent, high-autonomy LLM agents deployed across dynamic, real-world tasks.

