Neural Reasoning Memory
- Reasoning Memory is a system for managing and retrieving intermediate neural states, essential for multi-step inference in tasks like dialogue, storytelling, and algorithm execution.
- Modern approaches employ differentiable external memory, episodic and neuro-symbolic structures, and retrieval-augmented methods to enhance inference efficiency and accuracy.
- Challenges remain in balancing memory fidelity, mitigating interference, and scaling architectures for reliable performance on complex, real-world reasoning tasks.
Reasoning memory is the class of mechanisms by which a neural model or agent stores, updates, retrieves, and composes remembered state in support of multi-step inference. In the foundational formulation, reasoning is not merely computation but computation over remembered state: answering questions from stories, conducting dialogs, and executing multi-step algorithms all require a process that transforms representations and a memory that holds intermediate state and long-term context (Sahu, 2017). Across the literature, the term has expanded from differentiable external memory in Memory Networks and Neural Turing Machines to retrieval-augmented commonsense inference, executive and episodic memory for long-horizon agents, event-centric and neuro-symbolic memory graphs, and inference-time memory layers that reduce token and latency costs while preserving reasoning quality (Mahajan, 2018, Qian et al., 12 Jan 2026, Shu et al., 13 Feb 2026, Patel et al., 17 Nov 2025).
1. Foundational formulation
The earliest neural treatment of reasoning memory in this corpus is architectural rather than symbolic. A simple recurrent model updates a hidden state by

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b),$$

thereby compressing the entire past into a single state vector. The associated backpropagation signal multiplies Jacobians $\partial h_t / \partial h_{t-1}$ across time, so gradients tend to vanish or explode; this makes long-term credit assignment fragile, especially when the task requires remembering facts from many time steps earlier. LSTMs mitigate this by introducing gated cell dynamics,

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$

so that when $f_t = 1$ the cell accumulates gated input without decay (and, with $i_t = 0$, simply holds its contents), behaving like a “perfect integrator.” Even so, the model still compresses all needed facts into a fixed-size state, which becomes inadequate when the task demands precise, compartmentalized memory over many steps (Sahu, 2017).
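Attention reframes memory access as differentiable retrieval rather than compression alone. Given a query $q$, keys $k_i$, and values $v_i$, dot-product attention computes

$$\alpha_i = \frac{\exp(q^\top k_i)}{\sum_j \exp(q^\top k_j)}, \qquad r = \sum_i \alpha_i v_i.$$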
This turns retrieval into a smooth lookup, allowing gradients to flow into both the query and memory contents. The core conceptual shift is the separation between “where information is kept” and “how it is processed,” which made later memory architectures possible (Sahu, 2017).
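A minimal sketch of dot-product attention as a smooth lookup may make the idea concrete. The function names (`softmax`, `attend`) and the toy keys and values are illustrative assumptions, not taken from the survey.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, read vector) for dot-product attention."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, read

# Toy memory: three slots, hand-picked for illustration only.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, read = attend([4.0, 0.0], keys, values)
```

Slots whose keys align with the query receive most of the weight, and the read vector is a blend of their values rather than a hard selection, which is what lets gradients reach every slot.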
A later line of work makes the same point from a different angle: reasoning-in-a-haystack tasks fail when models must search long, distractor-heavy contexts with only implicit internal state. MemReasoner therefore places explicit episodic memory outside the decoder, learns temporal order over fact latents with a bidirectional GRU, and performs iterative hops that update the query representation, yielding strong generalization under hard distractors, soft distractors, answer remapping, and task transfer with none-to-weak supporting fact supervision (Das et al., 10 Mar 2025).
2. Differentiable external memory and retrieval-augmented reasoning
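Memory Networks and Neural Turing Machines established the canonical external-memory paradigms. Memory Networks were defined with four components: I (input mapping), G (generalization), O (output or inference), and R (response). End-to-End Memory Networks later replaced hard supporting-fact selection with differentiable attention. In the multi-hop variant, discrete inputs $x_i$ and a query $q$ are embedded as memory vectors $m_i$, output vectors $c_i$, and an initial internal state $u^1$; attention is computed by

$$p_i^k = \operatorname{softmax}_i\big((u^k)^\top m_i\big), \qquad o^k = \sum_i p_i^k c_i, \qquad u^{k+1} = u^k + o^k,$$

and after $K$ hops the answer is produced as

$$\hat{a} = \operatorname{softmax}\big(W(o^K + u^K)\big).$$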
Each hop re-queries memory, allowing chaining of facts, temporal reasoning, and relational joins; in the survey, “more hops give improved performance,” and strongly supervised MemNNs markedly outperform LSTMs on synthetic multi-task QA (Sahu, 2017).
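The chaining effect of extra hops can be illustrated with a small, hand-wired sketch. Everything here is an assumption made for illustration: the five-dimensional concept space, the hand-set key and output embeddings (a trained model would learn them), the tied embeddings across hops, and the function name `answer_distribution`. Each hop adds the read vector to the query state before re-querying.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def answer_distribution(u, keys, outputs, W, hops):
    """Run `hops` memory hops, then score answers with W(o + u)."""
    o = [0.0] * len(u)
    for hop in range(hops):
        if hop > 0:
            u = [ui + oi for ui, oi in zip(u, o)]  # next state = state + read
        p = softmax([dot(u, m) for m in keys])
        o = [sum(pi * c[d] for pi, c in zip(p, outputs)) for d in range(len(u))]
    final = [ui + oi for ui, oi in zip(u, o)]
    return softmax([dot(row, final) for row in W])

# Hypothetical dimensions: [ball, john, mary, kitchen, garden]
keys = [[1, 0, 0, 0, 0],       # "john has the ball"      (matched by: ball)
        [0, 1, 0, 0, 0],       # "john is in the kitchen" (matched by: john)
        [0, 0, 1, 0, 0]]       # "mary is in the garden"  (matched by: mary)
outputs = [[-5, 10, 0, 0, 0],  # hand-set: swap "ball" for "john"
           [0, 0, 0, 5, 0],    # emits "kitchen"
           [0, 0, 0, 0, 5]]    # emits "garden"
W = [[0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]  # answer vocabulary: kitchen, garden
query = [5, 0, 0, 0, 0]                  # "where is the ball?"

one_hop = answer_distribution(query, keys, outputs, W, hops=1)
two_hop = answer_distribution(query, keys, outputs, W, hops=2)
```

With one hop the model only finds who holds the ball and cannot separate the two locations; with two hops the second query lands on the fact about John and the probability mass moves to "kitchen".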
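Neural Turing Machines generalized the same idea to a controller plus a large addressable memory matrix $M_t$. Content-based addressing uses a key $k_t$ and strength $\beta_t$,

$$w_t(i) = \frac{\exp\big(\beta_t\, K[k_t, M_t(i)]\big)}{\sum_j \exp\big(\beta_t\, K[k_t, M_t(j)]\big)},$$

where $K[\cdot,\cdot]$ is cosine similarity; reading is

$$r_t = \sum_i w_t(i)\, M_t(i),$$

and writing uses erase/add operations,

$$M_t(i) = M_{t-1}(i) \odot \big[\mathbf{1} - w_t(i)\, e_t\big] + w_t(i)\, a_t.$$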
Because the operations are differentiable, the controller and addressing heads can be trained end-to-end, and the survey reports that NTMs learn copying and priority sorting more robustly than stand-alone LSTMs, including generalization from training sequences of length 20 to tests of length 100 (Sahu, 2017).
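As a rough sketch of the three operations just described (content addressing, reading, and erase/add writing), the following pure-Python code operates on a tiny memory. The function names and the toy memory are assumptions; a real NTM also has location-based addressing and a learned controller, both omitted here.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / (den + 1e-8)

def content_address(memory, key, beta):
    """Softmax over key/row cosine similarities, sharpened by beta."""
    scores = [beta * cosine(key, row) for row in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, w):
    """Weighted sum of memory rows."""
    width = len(memory[0])
    return [sum(w[i] * memory[i][j] for i in range(len(memory)))
            for j in range(width)]

def write(memory, w, erase, add):
    """Erase then add, both scaled by the addressing weight of each row."""
    return [[row[j] * (1 - w[i] * erase[j]) + w[i] * add[j]
             for j in range(len(row))]
            for i, row in enumerate(memory)]

memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = content_address(memory, key=[0.0, 1.0, 0.0], beta=20.0)
r = read(memory, w)
updated = write(memory, w, erase=[1.0, 1.0, 1.0], add=[0.5, 0.5, 0.0])
```

A large strength focuses almost all weight on the matching row, so the write overwrites that row and leaves the others nearly untouched; a small strength would smear both reads and writes across slots.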
A distinct extension replaces a fixed story with retrieval from an external corpus. In Top-$k$ Memory Candidates, a query $q$ is mapped to the $k$ highest-scoring documents from an indexed corpus $\mathcal{D}$,

$$\mathcal{M}_k(q) = \operatorname*{top\text{-}k}_{d \in \mathcal{D}} \; s(q, d),$$

where $s(q, d)$ is the retrieval score, and only the selected memories are passed to a two-hop End-to-End Memory Network. Attention within the selected set is standard,

$$p_i = \operatorname{softmax}_i\big(u^\top m_i\big),$$

with $u$ the embedded query and $m_i$ the embedded memories drawn from $\mathcal{M}_k(q)$, but the top-$k$ retrieval itself is a hard, non-differentiable pre-step. On a subset of 62 Winograd Schema Challenge problems mapped to cause–effect queries, the model answered 25/62 correctly, or approximately 40.3%; the paper also noted the absence of baselines, ablations on $k$, and statistical significance tests (Mahajan, 2018).
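The hard pre-selection step can be sketched as follows. The Jaccard word-overlap score is a stand-in chosen for illustration; the paper's actual retrieval scoring is not specified here, and the corpus sentences and function names are hypothetical.

```python
import heapq

def overlap_score(query, doc):
    """Assumed scoring function: Jaccard overlap of lowercase word sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def top_k_memories(query, corpus, k):
    """Hard, non-differentiable selection of the k best-scoring documents."""
    return heapq.nlargest(k, corpus, key=lambda doc: overlap_score(query, doc))

corpus = [
    "the trophy does not fit because the suitcase is small",
    "rain makes the road wet",
    "a large trophy needs a large suitcase",
    "cats sleep most of the day",
]
selected = top_k_memories("why does the trophy not fit in the suitcase",
                          corpus, k=2)

# The reported accuracy is simple arithmetic over the 62-problem subset.
accuracy_percent = round(100 * 25 / 62, 1)
```

Only `selected` would be embedded and attended over by the downstream memory network; because the selection is an argmax-style operation, no gradient reaches the scoring function.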
3. Executive, distributed, and narrative memory for long-horizon agents
As LLM agents began to operate over tens of steps, memory stopped being just an addressable matrix and became an explicit control substrate for trajectory management. Three representative designs illustrate this shift.
| Family | Representative system | Memory organization |
|---|---|---|
| Executive memory | MemoBrain | Dependency-aware DAG of task, subtask, evidence, and summary nodes |
| Distributed active memory | ActiveMem | Planner plus distributed shards of distilled semantic gists |
| Narrative working memory | Amory | Episodic narratives with subplots plus semantic triples |
MemoBrain treats memory as executive control rather than passive storage. Each reasoning episode pairs transient execution traces with a resolved semantic outcome. The episode is abstracted into a compact thought and added to a directed, dependency-aware memory graph. Under a fixed context budget, MemoBrain applies executive operations that collapse completed sub-trajectories into summaries or replace invalid and superseded nodes with structural placeholders. Empirically, MemoBrain-8B raised GAIA average Pass@1 from 63.1 to 71.8 when paired with GLM-4.6, and from 68.9 to 74.5 with DeepResearch-30B-A3B, while also improving BrowseComp-Plus and WebWalkerQA (Qian et al., 12 Jan 2026).
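The budget-driven behavior described above (collapsing finished sub-trajectories and replacing invalid nodes with placeholders) can be sketched on a toy dependency graph. The `Node` fields, the token counts, and the function names `collapse_subtask` and `placeholder_invalid` are all hypothetical and are not MemoBrain's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    kind: str                 # "task", "subtask", "evidence", or "summary"
    text: str
    tokens: int
    parents: list = field(default_factory=list)  # dependency edges
    status: str = "active"    # "active", "done", or "invalid"

def context_cost(graph):
    return sum(node.tokens for node in graph.values())

def collapse_subtask(graph, subtask_id, summary_text, summary_tokens):
    """Replace a finished subtask and its direct evidence with one summary."""
    removed = [nid for nid, node in graph.items()
               if nid == subtask_id or subtask_id in node.parents]
    parents = list(graph[subtask_id].parents)
    for nid in removed:
        del graph[nid]
    sid = f"sum:{subtask_id}"
    graph[sid] = Node(sid, "summary", summary_text, summary_tokens,
                      parents, "done")

def placeholder_invalid(graph):
    """Keep invalid nodes in the structure but shrink them to a stub."""
    for node in graph.values():
        if node.status == "invalid":
            node.text, node.tokens = "[superseded]", 1

graph = {n.node_id: n for n in [
    Node("t", "task", "Find the launch year.", 20),
    Node("s1", "subtask", "Search the archive.", 10, ["t"], "done"),
    Node("e1", "evidence", "long page dump", 400, ["s1"]),
    Node("e2", "evidence", "long page dump", 300, ["s1"]),
    Node("e3", "evidence", "stale result", 200, ["t"], "invalid"),
]}
before = context_cost(graph)
collapse_subtask(graph, "s1", "Archive says 1998.", 30)
placeholder_invalid(graph)
after = context_cost(graph)
```

In this toy run the context cost falls from 930 to 51 tokens while the task node, the resolved outcome, and the position of the superseded node are all retained.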
ActiveMem separates the core reasoner from memory management more radically. A high-level Planner reasons over compact context, while Memorizers distill query-conditioned gists into distributed Memory Shards, and an Operator handles routing, reuse, and asynchronous consolidation. The Planner maintains a compact state, emits retrieval tasks to the memory side, and explicitly trims its interaction history.
This decoupling is motivated by the claim that centralized memory forces a trade-off between context overflow and irreversible pruning-induced information loss. On BrowseComp-Plus, ActiveMem achieved LasJ 0.79 with PFLOPs 2,145 and ACT 0.785, outperforming Context-Folding at LasJ 0.72 and PFLOPs 3,920; on GAIA it reached LasJ 0.62 at PFLOPs 187, again the best reported result in its comparison set (Jiang et al., 9 Jun 2026).
Amory takes a narrative approach. It organizes conversation history into episodic narratives with headlines, subplots, characters, and timestamped leaf fragments, while peripheral non-plot facts are semanticized into triples stored in Neo4j. Retrieval is coherence-driven rather than embedding-only: an LLM selects the top-$k$ narratives by plot, actor continuity, causal connection, and temporal consistency. Its momentum-aware “inactive consolidation” is triggered only when a narrative has not received new bindings in the previous iteration, and the paper reports that this timing improves temporal reasoning relative to both no consolidation and rapid fixed-step consolidation. On LOCOMO, the combined episodic-plus-semantic configuration reached Overall J = 87.7 versus Full Context at 86.1, with p90 latency 3.21 s versus 6.08 s (Zhou et al., 9 Jan 2026).
4. Episodic, event-centric, neuro-symbolic, and multimodal memory
A second branch of the literature insists that memory for reasoning must be explicitly time-aware and event-grounded. REMem formalizes episodic memory as a typed multigraph in which gist nodes store concise episode summaries with normalized time scopes, phrase nodes store concept-level elements, relation edges encode time-scoped triples, context edges bind gists to extracted phrases, and synonymy edges connect semantically similar episodes. Online inference uses a ReAct-style agent with curated tools such as semantic_retrieve, lexical_retrieve, find_gist_contexts, and find_entity_contexts, allowing iterative retrieval under explicit temporal constraints. Across four episodic memory benchmarks, REMem reports absolute gains of 3.4% on episodic recollection and 13.4% on episodic reasoning over strong baselines, and it also shows the best refusal F1 of 64.0% on LoCoMo adversarial queries (Shu et al., 13 Feb 2026).
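To illustrate what time-scoped triples buy at query time, here is a small sketch of gist nodes, relation edges, and a retrieval filtered by a time window. The class names, fields, example facts, and the helper `relations_in_window` are assumptions for illustration; they are not REMem's schema or its tool signatures.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Gist:
    gist_id: str
    summary: str
    start: date
    end: date

@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str
    start: date
    end: date
    gist_id: str  # context link back to the episode summary

def overlaps(start, end, lo, hi):
    """True when [start, end] intersects the query window [lo, hi]."""
    return start <= hi and end >= lo

def relations_in_window(relations, entity, lo, hi):
    return [r for r in relations
            if entity in (r.subject, r.obj)
            and overlaps(r.start, r.end, lo, hi)]

gists = {
    "g1": Gist("g1", "Ana moved to Lisbon for a new job.",
               date(2023, 3, 1), date(2023, 3, 31)),
    "g2": Gist("g2", "Ana relocated to Porto.",
               date(2024, 1, 1), date(2024, 1, 31)),
}
relations = [
    Relation("Ana", "lives_in", "Lisbon",
             date(2023, 3, 1), date(2023, 12, 31), "g1"),
    Relation("Ana", "lives_in", "Porto",
             date(2024, 1, 1), date(2024, 12, 31), "g2"),
]
hits = relations_in_window(relations, "Ana",
                           date(2023, 6, 1), date(2023, 6, 30))
```

A question about June 2023 returns only the Lisbon triple, and its `gist_id` leads back to the episode summary that justifies it; a flat embedding search over both facts would have no principled way to prefer one.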
CompassMem proposes an event graph as a “logic map.” Each event node bundles observations, temporal information, a semantic summary, and participants. Relations are typed and explicit, including causal and temporal edges. Query-time navigation is not flat retrieval but subgoal-aware search: the Planner decomposes the query into subgoals, tracks which subgoals remain unsatisfied, and prioritizes candidate nodes accordingly. On LoCoMo with GPT-4o-mini, CompassMem reached average F1 52.18 versus 47.92 for HippoRAG, and on NarrativeQA it exceeded CAM by more than 5% F1 with GPT-4o-mini and more than 8% F1 with Qwen2.5-14B (Hu et al., 8 Jan 2026).
NS-Mem extends structured memory into multimodal agents by combining an episodic layer, a semantic layer, and a logic rule layer. Its logic nodes pair neural indices with procedural DAGs and expose deterministic symbolic query functions such as queryStepSequence(goal, C). SK-Gen builds these structures by extracting multimodal events, consolidating entity-centric semantics, mining sequential patterns with PrefixSpan, verifying them with an LLM, and incrementally updating both vector indices and symbolic transitions. On M3-Bench, NS-Mem reports an average 4.35% improvement in overall reasoning accuracy over pure neural memory systems, including gains of 12.5% on constrained queries (Jiang et al., 16 Mar 2026).
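A deterministic step-sequence query over a procedural DAG might look like the sketch below. The function `query_step_sequence`, its single-argument goal interface, and the tea-making graph are hypothetical; the signature of NS-Mem's own queryStepSequence is not reproduced here, and the input is assumed to be acyclic.

```python
def query_step_sequence(prerequisites, goal):
    """Return the steps needed for `goal` in a valid execution order.

    `prerequisites` maps each step to the steps that must precede it.
    Dependencies are visited in sorted order so the result is deterministic.
    """
    order, seen = [], set()

    def visit(step):
        if step in seen:
            return
        seen.add(step)
        for dep in sorted(prerequisites.get(step, [])):
            visit(dep)
        order.append(step)  # emitted only after all of its prerequisites

    visit(goal)
    return order

prerequisites = {
    "serve tea": ["pour water", "add tea bag"],
    "pour water": ["boil water"],
    "boil water": ["fill kettle"],
    "add tea bag": [],
    "wash cup": [],  # unrelated to the goal, so it should not appear
}
steps = query_step_sequence(prerequisites, "serve tea")
```

Because the answer is computed by graph traversal rather than generated, the same query always yields the same sequence, which is the property that makes symbolic queries attractive for constrained questions.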
EgoMemReason turns the multimodal case into an explicit benchmark. It decomposes week-long egocentric memory into entity memory, event memory, and behavior memory, over 500 questions spanning six challenges, with an average of 5.1 evidence video segments and 25.9 hours of memory backtracking per question. The best overall accuracy reported is 39.6%, and performance declines as temporal certification increases, showing that long-horizon multimodal reasoning remains far from solved (Wang et al., 11 May 2026).
5. Lifelong learning, inference-time efficiency, and scalable reasoning reuse
A major contemporary theme is that reasoning memory should not only improve correctness but also prevent recomputation. ENGRAM-R defines an inference-time memory layer external to model weights. Dialogue memory is typed into episodic, semantic, and procedural stores, retrieved under a fixed evidence budget, rendered into compact fact cards, and injected into a prompt that requires explicit citation of only valid card IDs. On LoCoMo, ENGRAM-R reduced input tokens from 28,371,703 to 3,293,478 and reasoning tokens from 1,335,988 to 378,424 while maintaining judge accuracy near full-context inference; on LongMemEval_S it improved judge accuracy from 38.0% to 59.8% while also sharply reducing input and reasoning tokens (Patel et al., 17 Nov 2025).
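The card pipeline and the reported token savings can be sketched together. The card layout (`[C1] store: text`), the memory fields, and the helper names are assumptions for illustration rather than ENGRAM-R's format; the token counts are the LoCoMo figures quoted above.

```python
import re

def select_cards(memories, budget):
    """Keep the `budget` highest-scoring memories and assign card IDs."""
    top = sorted(memories, key=lambda m: m["score"], reverse=True)[:budget]
    return {f"C{i + 1}": m for i, m in enumerate(top)}

def render(cards):
    """Render cards as compact one-line facts for the prompt."""
    return "\n".join(f"[{cid}] {m['store']}: {m['text']}"
                     for cid, m in cards.items())

def invalid_citations(answer, cards):
    """Card IDs cited in the answer that were never provided."""
    cited = set(re.findall(r"\[(C\d+)\]", answer))
    return sorted(cited - set(cards))

memories = [
    {"store": "episodic", "text": "User adopted a dog in May.", "score": 0.91},
    {"store": "semantic", "text": "User is allergic to cats.", "score": 0.74},
    {"store": "procedural", "text": "User prefers short answers.", "score": 0.40},
]
cards = select_cards(memories, budget=2)
prompt_block = render(cards)
bad = invalid_citations("They have a dog [C1] and like hiking [C3].", cards)

# Relative savings implied by the reported LoCoMo token counts.
input_reduction = round(100 * (1 - 3_293_478 / 28_371_703), 1)
reasoning_reduction = round(100 * (1 - 378_424 / 1_335_988), 1)
```

The reported counts correspond to roughly an 88.4% cut in input tokens and a 71.7% cut in reasoning tokens, and the citation check shows how a claim that cites a card outside the budget can be flagged mechanically.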
Lifelong memory systems push beyond per-instance reuse. ReasoningBank stores distilled strategy-level memories whose outcome label comes from self-judging. Retrieval is query-to-query cosine similarity using stored source queries, and the crucial claim is that failures are first-class memory sources rather than mere discarded rollouts. On WebArena with Gemini-2.5-Flash, ReasoningBank reached success rate 48.8 versus 40.5 for No Memory, while also reducing steps from 9.7 to 8.3; MaTTS then uses parallel or sequential test-time scaling so that richer experiences produce better memory, and better memory guides more effective scaling (Ouyang et al., 29 Sep 2025).
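Query-to-query retrieval over a bank that keeps both successes and failures can be sketched as follows. The item fields, the three-dimensional toy embeddings, and the `retrieve` helper are hypothetical; a real system would obtain `query_vec` from an embedding model.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def retrieve(bank, query_vec, k=1):
    """Rank stored items by similarity between source query and new query."""
    ranked = sorted(bank,
                    key=lambda item: cosine(item["query_vec"], query_vec),
                    reverse=True)
    return ranked[:k]

bank = [
    {"source_query": "find the cheapest flight",
     "query_vec": [1.0, 0.0, 0.2],
     "outcome": "success",
     "strategy": "Sort results by price before opening any listing."},
    {"source_query": "cancel an order",
     "query_vec": [0.0, 1.0, 0.1],
     "outcome": "failure",
     "strategy": "Check the order ID before pressing a Cancel button."},
]
# Toy embedding of a new query such as "cancel my latest order".
hits = retrieve(bank, [0.1, 0.9, 0.0], k=1)
```

Here the best match is a lesson distilled from a failed rollout, which illustrates why discarding failures would remove exactly the guidance the new task needs.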
ArcMemo sharpens the same idea into concept-level memory. Instead of storing instance-bound traces, it extracts reusable concepts in either open-ended situation–suggestion form or typed program-synthesis form. At inference, selected concepts are injected into the induction prompt, and memory can be updated at regular intervals during evaluation via a generic write–read loop. On ARC-AGI, ArcMemo-PS achieved official Oracle@2 of 59.33 versus 55.17 for the no-memory baseline, a 7.5% relative gain, and the paper reports that dynamically updating memory during test time outperforms an otherwise identical fixed-memory setting (Ho et al., 4 Sep 2025).
Memory also appears as a scaling mechanism in multi-agent inference. ReM-MoA defines a per-instance, cross-layer ranked reasoning memory in which proposer traces are comparatively scored by a Reviewer Agent. Curated diversified routing then exposes different agents to Top, Bottom, and Contrastive subsets of prior reasoning. In depth scaling on MATH at fixed width, Standard MoA declines from 70.0 to 61.0 as depth increases, whereas ReM-MoA rises from 68.0 to 81.0 and the distilled-reviewer variant rises to 84.0 (Ping et al., 23 Jun 2026).
At the training level, memory can serve as intrinsic motivation. In the sub-1B regime, Memory-R$^+$ maintains separate success and failure memories over response embeddings, defines exploitation as distance to the centroid of successful retrieved responses, and defines exploration as dissimilarity to retrieved failed responses. These normalized intrinsic rewards are combined with external correctness rewards under GRPO. The reported effect is improved sample efficiency and collapse avoidance for tiny LLMs on GSM8K and AI-MO, with Memory-R$^+$ consistently avoiding the reward-mode and response-length collapse modes observed under simpler reward shaping (Le et al., 3 Apr 2025).
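The two intrinsic signals can be sketched on toy embeddings. The specific forms used here are assumptions made for illustration: cosine similarity to the success centroid for exploitation, one minus the highest cosine similarity to a retrieved failure for exploration, and equal mixing weights. The paper's exact formulas and normalization are not reproduced.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def exploit_reward(response, successes):
    # Assumed form: closeness to the centroid of retrieved successes.
    return cosine(response, centroid(successes))

def explore_reward(response, failures):
    # Assumed form: dissimilarity to the most similar retrieved failure.
    return 1.0 - max(cosine(response, f) for f in failures)

def total_reward(correct, response, successes, failures,
                 w_exploit=0.5, w_explore=0.5):
    """External correctness reward plus weighted intrinsic terms."""
    return (float(correct)
            + w_exploit * exploit_reward(response, successes)
            + w_explore * explore_reward(response, failures))

successes = [[1.0, 0.0], [0.8, 0.2]]   # toy response embeddings
failures = [[0.0, 1.0]]
near_success = [0.9, 0.1]
repeats_failure = [0.0, 1.0]
```

Even when neither response earns the external reward, `total_reward(False, near_success, successes, failures)` exceeds the value for `repeats_failure`, which gives a small model a learning signal before it produces its first correct answer.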
6. Evaluation challenges, mechanistic issues, and controversies
Reasoning memory is not uniformly beneficial, and a recurrent theme is that memory quality and access policy matter at least as much as capacity. The foundational survey already emphasized several persistent problems: discrete memory access is hard to train, soft addressing can become blurry and interfere across slots, and long-term credit assignment remains difficult even with differentiable retrieval (Sahu, 2017). Later work on exemplar banks makes the same point empirically: in collaborative few-shot reasoning, random exemplar selection can often beat similarity-based retrieval, and in some tasks the inclusion of any exemplars distracts both weak and strong models rather than helping them (Michelman et al., 7 Mar 2025).
A separate mechanistic result suggests that the relation between reasoning and memorization is itself partially organized in representation space. The reasoning–memorization study identifies per-layer linear reasoning features in the residual stream, computes them by difference-in-means between reasoning-intensive and memory-intensive task sets, and shows that additive steering along this direction can improve both reasoning-heavy and memory-heavy benchmarks depending on the sign of the intervention. The paper’s claim is not that memory should be removed, but that the balance between rule-based generalization and recall-like behavior can be modulated through a single residual direction (Hong et al., 29 Mar 2025).
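Difference-in-means steering is simple enough to sketch directly. The activations below are made-up three-dimensional vectors standing in for residual-stream states at one layer, and the function names and the value of `alpha` are illustrative assumptions.

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def difference_in_means(reasoning_acts, memory_acts):
    """Direction pointing from memory-intensive toward reasoning-intensive."""
    mu_r = mean_vector(reasoning_acts)
    mu_m = mean_vector(memory_acts)
    return [r - m for r, m in zip(mu_r, mu_m)]

def steer(hidden, direction, alpha):
    """Additive intervention; the sign of alpha picks the side to favor."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations collected on the two task sets (hypothetical values).
reasoning_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
memory_acts = [[0.0, 2.0, 1.0], [0.0, 4.0, 1.0]]

direction = difference_in_means(reasoning_acts, memory_acts)
toward_reasoning = steer([1.0, 1.0, 1.0], direction, alpha=0.5)
toward_memory = steer([1.0, 1.0, 1.0], direction, alpha=-0.5)
```

The third coordinate, which is identical across both task sets, contributes nothing to the direction and is left unchanged by steering in either direction.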
Personalization introduces another failure mode: memory may change not just the answer but the justification trajectory. DRIFTLENS operationalizes this as memory-induced reasoning drift, maps each reasoning step to a value ontology, and compares the no-memory and memory-injected symbolic trajectories using normalized DTW and the Sequence Recurrence Index. Across four LLMs and 10 user-attribute categories, persona memory induces medium-to-large drift above the pragmatic-noise floor even when answers remain fluent and plausible, with Trans status and Disability among the highest-drift categories on most model-metric panels (Fang et al., 2 Jul 2026). This extends the notion of reasoning memory from capacity and retrieval into safety and fairness: a memory system can appear helpful while silently changing which values structure the reasoning.
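Comparing two symbolic trajectories with dynamic time warping can be sketched as below. The value labels, the 0/1 mismatch cost, and the normalization by the longer sequence length are assumptions for illustration; DRIFTLENS's ontology and exact normalization are not reproduced, and the Sequence Recurrence Index is omitted.

```python
def dtw_cost(a, b):
    """Classic DTW with a 0/1 mismatch cost between symbolic steps."""
    n, m = len(a), len(b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            table[i][j] = cost + min(table[i - 1][j],      # repeat b's step
                                     table[i][j - 1],      # repeat a's step
                                     table[i - 1][j - 1])  # advance both
    return table[n][m]

def normalized_dtw(a, b):
    """Assumed normalization: divide by the longer sequence length."""
    return dtw_cost(a, b) / max(len(a), len(b))

# Hypothetical value labels assigned to each reasoning step.
no_memory = ["fairness", "harm", "autonomy"]
with_memory = ["fairness", "care", "care", "autonomy"]

drift = normalized_dtw(no_memory, with_memory)
baseline = normalized_dtw(no_memory, no_memory)
```

In this example the final answers could be identical while the drift score is 0.5, because the memory-injected trajectory routes through different values in its middle steps.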
The evaluation problem therefore remains unsettled. Many strong results are obtained on synthetic QA, story reasoning, constrained dialogs, or task-specific long-horizon benchmarks; several papers explicitly note that realistic datasets with longer narratives, richer relations, partial observability, and real-world noise are still needed (Sahu, 2017, Shu et al., 13 Feb 2026, Wang et al., 11 May 2026). A plausible implication is that the field is converging on two criteria for mature reasoning memory: first, memory must be structurally aligned with the inferential operations demanded by the task, whether those are temporal ordering, procedural execution, constraint satisfaction, or reflective correction; second, memory must be evaluated not only by retrieval fidelity or end accuracy, but also by stability, interference, auditability, and the degree to which it genuinely reduces recomputation rather than merely relocating it.