Reasoning in Memory (RiM)

Updated 4 July 2026

RiM is a research paradigm where memory is actively used as a substrate for inference, rather than as passive storage.
It encompasses various methods, including episodic graphs, navigable conversational memory, and latent fixed blocks, to interleave memory operations with reasoning.
Empirical results demonstrate improved performance on tasks like math and logical reasoning, though challenges such as bias and scaling remain.

Reasoning in Memory (RiM) is a research paradigm in which memory is not treated as a passive store or a post hoc retrieval source, but as an active substrate for inference. Across recent work, RiM denotes a family of mechanisms in which models read from, write to, rank, traverse, compress, or manipulate structured memory while solving a problem. In this sense, RiM spans per-instance ranked reasoning traces in multi-agent pipelines, episodic memory graphs for language agents, navigable associative stores for conversational systems, dependency-aware executive memory for long-horizon tool use, and fixed latent “working-memory” blocks that replace generated intermediate thoughts (Ping et al., 23 Jun 2026, Shu et al., 13 Feb 2026, Li et al., 27 May 2026, Aichberger et al., 28 May 2026).

1. Conceptual scope and defining distinctions

RiM is best understood as a shift from retrieve-then-reason to reason through memory. In "ReM-MoA: Reasoning Memory Sustains Mixture-of-Agents Scaling" (Ping et al., 23 Jun 2026), RiM is the idea that collective reasoning among multiple LLM agents is explicitly mediated by a shared, structured memory of reasoning traces within a single problem instance, and that agents at deeper layers actively read from and write to this memory as they continue to reason. In that work, RiM is realized by the Ranked Reasoning Memory (ReM), and the paper explicitly distinguishes this from retrieval-augmented generation, which pulls external corpus evidence, and from scratchpad or chain-of-thought, which captures a single agent’s ephemeral internal reasoning for one pass.

Other papers generalize the same principle in different directions. "REMem: Reasoning with Episodic Memory in Language Agent" (Shu et al., 13 Feb 2026) frames the problem as reasoning over concrete past experiences along a spatiotemporal axis, rather than over de-contextualized semantic knowledge. "MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents" (Li et al., 27 May 2026) contrasts a Memory-as-Tool paradigm, where a single query produces a one-shot retrieval of a flat list of passages, with a Memory-as-Cognition paradigm, where retrieval and reasoning are interleaved and inseparable. "Unlocking the Working Memory of LLMs for Latent Reasoning" (Aichberger et al., 28 May 2026) pushes the term further by defining RiM as latent reasoning with fixed memory blocks that substitute for generated intermediate thoughts.

A useful unifying distinction is that RiM systems make memory operational during inference. This differs from static long-term storage, passive vector retrieval, or simple prompt replay. Earlier antecedents already exposed parts of this pattern. "Top k Memory Candidates in Memory Networks for Common Sense Reasoning" (Mahajan, 2018) dynamically selected top-$k$ external memory candidates and reasoned over them with multi-hop memory attention, while "Memory-Augmented Theory of Mind Network" (Nguyen et al., 2023) used key–value episodic memory and hierarchical attention to infer beliefs, intentions, and future actions over long horizons. These systems did not yet present the broader contemporary RiM vocabulary, but they already treated memory as a computational object rather than a mere archive.

The contemporary literature therefore uses RiM in a polysemous but coherent way. The common commitment is that the model’s reasoning state is externalized, structured, or constrained by memory, and that inference quality depends on how that memory is organized and queried.

2. Memory representations as substrates for reasoning

Recent RiM systems differ most sharply in how they represent memory. The representational choice determines what kinds of inference are expressible.

RiM instantiation	Representative paper	Core memory object
Per-instance reasoning memory	ReM-MoA (Ping et al., 23 Jun 2026)	Trace-score-rationale triples
Episodic graph memory	REMem (Shu et al., 13 Feb 2026)	Gist nodes, phrase nodes, typed edges
Navigable conversational memory	MemCog (Li et al., 27 May 2026)	Dimensions, pages, sections, typed links
Executive trajectory memory	MemoBrain (Qian et al., 12 Jan 2026)	Dependency-aware thought graph
Latent working memory	RiM latent method (Aichberger et al., 28 May 2026)	Fixed memory blocks of special tokens

In ReM-MoA, the memory is append-only and per-instance. After each layer $l$, the system commits

$M_l = \{(r_{l,j}, s_{l,j}, \rho_{l,j})\}_{j=1}^N,\qquad M_{\le l} = \bigcup_{k=1}^{l} M_k,$

so, with $N$ triples committed per layer, the layers after layer $l$ can access $N \cdot l$ trace-score-rationale triples (Ping et al., 23 Jun 2026). The stored object is not just a trace, but a judgment-augmented trace. This makes memory explicitly evaluative.
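As a minimal sketch of the append-only structure defined above, the following Python keeps one block of triples per layer and exposes $M_{\le l}$ as their union. The class and field names (`MemoryEntry`, `ReasoningMemory`, `commit`, `up_to`) are illustrative assumptions, not identifiers from the paper, and the scores and rationales are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    """One (trace, score, rationale) triple; names are illustrative."""
    layer: int
    agent: int
    trace: str       # r_{l,j}: the reasoning trace
    score: float     # s_{l,j}: the reviewer's score
    rationale: str   # rho_{l,j}: the reviewer's justification


class ReasoningMemory:
    """Append-only, per-instance memory: M_{<=l} is the union of M_1..M_l."""

    def __init__(self) -> None:
        self._layers: list[list[MemoryEntry]] = []

    def commit(self, entries: list[MemoryEntry]) -> None:
        # Blocks are only appended; earlier layers are never rewritten.
        self._layers.append(list(entries))

    def up_to(self, layer: int) -> list[MemoryEntry]:
        """Return M_{<=layer} as a flat list."""
        return [e for block in self._layers[:layer] for e in block]


N, depth = 3, 2
memory = ReasoningMemory()
for l in range(1, depth + 1):
    memory.commit([
        MemoryEntry(l, j, f"trace {l}.{j}", score=0.1 * j, rationale="placeholder")
        for j in range(1, N + 1)
    ])

visible = memory.up_to(depth)
print(len(visible))  # N * depth entries are readable after `depth` layers
```

With $N = 3$ and two committed layers, the pool holds $3 \cdot 2 = 6$ entries, matching the $N \cdot l$ count in the text.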

REMem uses a typed hybrid memory graph

$\mathcal{M}=(V,E),\quad V = V_{\text{gist}} \cup V_{\text{phrase}},\quad E = E_{\text{rel}} \cup E_{\text{ctx}} \cup E_{\text{syn}},$

where gist nodes encode time-aware episode summaries, phrase nodes encode fact-level units, relation edges carry temporally qualified triples, context edges bind gists to phrases from the same source chunk, and synonymy edges connect semantically similar episodes (Shu et al., 13 Feb 2026). This structure supports temporal filtering, ordinal reasoning, and cross-event composition.
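The typed graph above can be pictured with a small data structure. This is a sketch under stated assumptions: the container name `HybridMemoryGraph`, the tuple layouts, the ISO-date time qualifier, and the two query helpers are all hypothetical and only mirror the node and edge types named in the text.

```python
from dataclasses import dataclass, field


@dataclass
class HybridMemoryGraph:
    # V_gist and V_phrase: node id -> text
    gist_nodes: dict[str, str] = field(default_factory=dict)
    phrase_nodes: dict[str, str] = field(default_factory=dict)
    # E_rel: (subject id, relation, object id, time qualifier as ISO date)
    rel_edges: list[tuple[str, str, str, str]] = field(default_factory=list)
    # E_ctx: (gist id, phrase id) from the same source chunk
    ctx_edges: list[tuple[str, str]] = field(default_factory=list)
    # E_syn: pairs of semantically similar nodes
    syn_edges: list[tuple[str, str]] = field(default_factory=list)

    def phrases_in_context(self, gist_id: str) -> list[str]:
        """Phrase texts bound to a gist through context edges."""
        return [self.phrase_nodes[p] for g, p in self.ctx_edges if g == gist_id]

    def relations_between(self, start: str, end: str) -> list[tuple[str, str, str, str]]:
        """Temporal filtering: relation edges whose date lies in [start, end]."""
        return [e for e in self.rel_edges if start <= e[3] <= end]


g = HybridMemoryGraph()
g.gist_nodes["g1"] = "2024-03-02: Alice moved to Lisbon"
g.phrase_nodes.update({"p1": "Alice", "p2": "Lisbon"})
g.ctx_edges += [("g1", "p1"), ("g1", "p2")]
g.rel_edges += [
    ("p1", "moved_to", "p2", "2024-03-02"),
    ("p1", "visited", "p2", "2023-07-15"),
]

in_2024 = g.relations_between("2024-01-01", "2024-12-31")
print(g.phrases_in_context("g1"), in_2024)
```

Keeping the time qualifier on the relation edge is what lets a query filter by window before composing facts across events.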

CompassMem represents memory as an Event Graph

$G^{(t)}=(V^{(t)},E^{(t)}),$

with event nodes $e_{t_i}=\langle o_{t_i}, \tau_{t_i}, s_{t_i}, \pi_{t_i}\rangle$ and typed relations such as causal, motivation, enablement, follow_up, temporal_before, temporal_after, contrast, part_of, parallel, and elaboration (Hu et al., 8 Jan 2026). MRAgent instead builds a Cue–Tag–Content graph $M=(C,V,R)$ in which associative tags mediate the link between cues and content, enabling the system to reason about which relations are promising before opening expensive content nodes (Ji et al., 4 Jun 2026).

MemoBrain uses a dependency-aware memory graph over abstracted “thoughts,” not raw traces. Each episode $x_t=(\tau_t,\omega_t)$ is transformed into a thought $v_t=\phi(x_t,\mathcal{G}_{t-1})$, inserted into a directed graph with explicit dependencies, and then managed by folding or flushing operations under a fixed context budget (Qian et al., 12 Jan 2026). This makes the memory object trajectory-level and control-oriented.

The latent RiM method replaces textual intermediate thoughts with fixed memory blocks of special tokens, which are part of the input rather than generated. Their token identities are fixed, but their contextual states become task dependent, so the blocks function as a trainable working memory inside an otherwise standard Transformer (Aichberger et al., 28 May 2026).

These representations imply different memory semantics. Trace memories preserve candidate derivations and their judged quality; episodic graphs preserve who–what–where–when structure; associative graphs preserve bridges among cues, tags, and contents; executive graphs preserve dependency structure and control state; latent blocks preserve no explicit symbolic content at all, but provide a reusable computation substrate.

3. Inference patterns: reading, writing, traversal, and control

RiM methods also differ in the algorithmic pattern by which reasoning unfolds through memory. Several recurrent motifs appear across the literature.

In ReM-MoA, each layer writes traces, the Reviewer Agent performs comparative scoring, and later layers read curated subsets of the global memory pool. A routing operator selects different mixtures of highest-scoring, lowest-scoring, and contrastive traces via Top, Bot, and Con routes (Ping et al., 23 Jun 2026). The paper explicitly frames this with a linear model in which sustained Mixture-of-Agents scaling requires preserving both average proposer accuracy and output diversity. RiM is introduced there as the missing mechanism that simultaneously propagates high-quality non-local reasoning and prevents diversity collapse.
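The three routes can be illustrated with a short selection function over scored traces. This is a sketch, not the paper's operator: the function name `route`, the parameter `k`, and in particular the reading of the contrastive route as "half from the top, half from the bottom" are assumptions made for illustration.

```python
def route(memory, k, mode):
    """Select k (trace, score) pairs from the memory pool by route name."""
    if not 1 <= k <= len(memory):
        raise ValueError("k must be between 1 and len(memory)")
    ranked = sorted(memory, key=lambda entry: entry[1], reverse=True)
    if mode == "top":
        return ranked[:k]            # highest-scoring traces
    if mode == "bot":
        return ranked[-k:]           # lowest-scoring traces
    if mode == "con":
        # Assumed reading of "contrastive": mix both ends of the ranking.
        high = (k + 1) // 2
        low = k - high
        return ranked[:high] + (ranked[-low:] if low else [])
    raise ValueError(f"unknown route: {mode}")


pool = [("a", 0.9), ("b", 0.2), ("c", 0.7), ("d", 0.4)]
for mode in ("top", "bot", "con"):
    print(mode, route(pool, 2, mode))
```

Giving different agents different routes over the same pool is one way to keep strong traces in circulation while still exposing some agents to weak or conflicting ones, which is the diversity argument the text makes.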

Goal-Mem implements RiM as backward chaining over long-term conversational memory. A user utterance is parsed into a Natural Language Logic goal, decomposed into atomic subgoals, and each subgoal is issued as a targeted retrieval query. Retrieved facts are accepted only if unification succeeds under type consistency, equality with existing bindings, and logical entailment (Liang et al., 12 May 2026). If subgoals remain unresolved, refinement generates antecedents and the retrieval-unification loop repeats. This makes memory access proof-oriented rather than similarity-oriented.
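The retrieval-unification loop can be sketched over a toy memory of triples. Everything here is a simplification under stated assumptions: facts are `(subject, relation, object)` tuples, variables are strings starting with `?`, `retrieve` is a stand-in for a targeted retrieval query, and only the "equality with existing bindings" check is implemented. Type consistency, logical entailment, and the refinement step that generates antecedents are omitted.

```python
Fact = tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def unify(goal: Fact, fact: Fact, bindings: dict[str, str]):
    """Return extended bindings if the fact satisfies the goal, else None."""
    new = dict(bindings)
    for g, f in zip(goal, fact):
        if is_var(g):
            if g in new and new[g] != f:
                return None          # conflicts with an existing binding
            new[g] = f
        elif g != f:
            return None              # constant mismatch
    return new


def retrieve(memory: list[Fact], goal: Fact) -> list[Fact]:
    """Stand-in for targeted retrieval: facts sharing the goal's relation."""
    return [f for f in memory if f[1] == goal[1]]


def prove(subgoals: list[Fact], memory: list[Fact], bindings=None):
    """Resolve subgoals left to right, backtracking on failed unification."""
    bindings = {} if bindings is None else bindings
    if not subgoals:
        return bindings
    first, rest = subgoals[0], subgoals[1:]
    for fact in retrieve(memory, first):
        extended = unify(first, fact, bindings)
        if extended is not None:
            result = prove(rest, memory, extended)
            if result is not None:
                return result
    return None


memory = [
    ("alice", "sister_of", "bob"),
    ("carol", "sister_of", "dave"),
    ("bob", "lives_in", "paris"),
    ("dave", "lives_in", "rome"),
]
# "Where does Carol's brother live?" decomposed into two atomic subgoals.
result = prove([("carol", "sister_of", "?x"), ("?x", "lives_in", "?city")], memory)
print(result)
```

The fact about Bob living in Paris is retrieved for the second subgoal but rejected, because `?x` is already bound to `dave`. That rejection is the difference between proof-oriented and similarity-oriented access.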

REMem operationalizes inference with a ReAct-style agentic retriever. The agent alternates among semantic and lexical retrieval, graph exploration tools such as find_gist_contexts and find_entity_contexts, and output_answer, iteratively accumulating evidence until confidence is sufficient or a step cap is reached (Shu et al., 13 Feb 2026). MRAgent follows a related but more explicitly stateful pattern. Its reconstruction state pairs an active set of cues, tags, and contents eligible for expansion with the evidence accumulated so far. Forward and reverse operators expand from cues to tags, from cue–tag pairs to contents, and from contents back to new cue–tag pairs, with LLM-based selection and routing used to prevent combinatorial explosion (Ji et al., 4 Jun 2026).
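The cue-to-tag-to-content expansion can be sketched as a frontier search with a gate on tags. The dictionaries, the `keep_tag` predicate (a stand-in for LLM-based selection), and the step cap are hypothetical and chosen only to show how content nodes stay unopened when their tag is rejected.

```python
from collections import deque

# Toy Cue-Tag-Content graph; all entries are made up for illustration.
cue_tags = {"trip": ["travel", "budget"], "hotel": ["booking"]}
tag_contents = {
    ("trip", "travel"): ["c1"],
    ("trip", "budget"): ["c2"],
    ("hotel", "booking"): ["c3"],
}
content_text = {
    "c1": "Flew to Lisbon in May.",
    "c2": "Spent 900 EUR in total.",
    "c3": "Stayed at Hotel Azul.",
}
# Reverse links: content id -> new (cue, tag) pairs it suggests.
content_cues = {"c1": [("hotel", "booking")]}


def reconstruct(start_cue, keep_tag, max_steps=10):
    """Expand cue -> tag -> content -> new cue-tag pairs, gating on tags."""
    active = deque((start_cue, tag) for tag in cue_tags.get(start_cue, []))
    evidence, opened, seen = [], [], set()
    steps = 0
    while active and steps < max_steps:
        pair = active.popleft()
        steps += 1
        if pair in seen or not keep_tag(pair[1]):
            continue                      # tag rejected: content stays closed
        seen.add(pair)
        for cid in tag_contents.get(pair, []):
            opened.append(cid)
            evidence.append(content_text[cid])
            for nxt in content_cues.get(cid, []):
                if nxt not in seen:
                    active.append(nxt)    # reverse step back into the graph
    return evidence, opened


evidence, opened = reconstruct("trip", keep_tag=lambda tag: tag != "budget")
print(evidence, opened)
```

In this run the `budget` tag is filtered out, so `c2` is never read, while `c3` is reached only through the reverse link from `c1`.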

MemCog makes the navigation policy itself part of the cognitive loop. Its Cross-Dimensional Navigation Interface exposes list_dimensions, browse_dimension(dim), read_page(page_id), and follow_link(link) inside a ReAct action space, while the Proactive Reasoning Protocol instructs the agent to initiate navigation when mentions, temporal cues, or contradictions suggest that memory is relevant (Li et al., 27 May 2026). Retrieval is thus neither one-shot nor purely reactive.
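To make the four navigation actions concrete, here is a toy in-memory store exposing them. Only the action names come from the text; the class name `NavigableMemory`, the page layout, the return types, and the `mentioned_in` link type are assumptions, and the scripted walk at the end stands in for steps an agent would choose itself.

```python
class NavigableMemory:
    """Toy store exposing the four navigation actions named in the text."""

    def __init__(self, pages: dict[str, dict]) -> None:
        # page_id -> {"dimension": str, "text": str,
        #             "links": [{"type": str, "target": str}]}
        self._pages = pages

    def list_dimensions(self) -> list[str]:
        return sorted({p["dimension"] for p in self._pages.values()})

    def browse_dimension(self, dim: str) -> list[str]:
        return sorted(pid for pid, p in self._pages.items() if p["dimension"] == dim)

    def read_page(self, page_id: str) -> dict:
        return self._pages[page_id]

    def follow_link(self, link: dict) -> dict:
        return self.read_page(link["target"])


pages = {
    "people/alice": {
        "dimension": "people",
        "text": "Alice is vegetarian.",
        "links": [{"type": "mentioned_in", "target": "events/dinner"}],
    },
    "events/dinner": {
        "dimension": "events",
        "text": "Dinner planned for Friday.",
        "links": [],
    },
}
memory = NavigableMemory(pages)

# Scripted walk standing in for the agent's own action choices.
dims = memory.list_dimensions()
page = memory.read_page(memory.browse_dimension("people")[0])
linked = memory.follow_link(page["links"][0])
print(dims, "|", page["text"], "->", linked["text"])
```

The point of the design is that each call returns something the agent can reason about before deciding the next call, instead of a single flat list of passages.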

MemoBrain’s inference pattern is closer to executive control than retrieval. After each reasoning episode, the system abstracts a semantic outcome, updates the graph, and, when the budget is reached, applies a set of memory operations drawn from $\mathrm{Fold}(\cdot)$ and $\mathrm{Flush}(\cdot)$ (Qian et al., 12 Jan 2026). RiM here means that subsequent reasoning is projected from a compact backbone, not from the raw accumulated trace.
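A budget-driven use of folding and flushing might look like the sketch below. The semantics are assumptions, since the text does not spell them out: here `fold` compresses a resolved thought to its first sentence, `flush` drops a resolved thought that nothing depends on, and `enforce_budget` is a hand-written rule standing in for the learned policy. The `cost` field is an assumed unit of context tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Thought:
    text: str
    cost: int                                  # assumed unit: context tokens
    deps: list[str] = field(default_factory=list)
    resolved: bool = False                     # has this sub-task finished?


def context_cost(graph: dict[str, Thought]) -> int:
    return sum(t.cost for t in graph.values())


def fold(graph: dict[str, Thought], tid: str) -> None:
    """Assumed semantics: compress a thought to its first sentence."""
    thought = graph[tid]
    thought.text = thought.text.split(".")[0] + "."
    thought.cost = 1


def flush(graph: dict[str, Thought], tid: str) -> None:
    """Assumed semantics: remove a thought from the graph."""
    del graph[tid]


def enforce_budget(graph: dict[str, Thought], budget: int) -> list[tuple[str, str]]:
    """Hand-written rule: fold resolved thoughts first, then flush unused ones."""
    ops = []
    for tid in list(graph):                    # oldest first
        if context_cost(graph) <= budget:
            break
        if graph[tid].resolved and graph[tid].cost > 1:
            fold(graph, tid)
            ops.append(("fold", tid))
    for tid in list(graph):
        if context_cost(graph) <= budget:
            break
        has_dependents = any(tid in t.deps for t in graph.values())
        if graph[tid].resolved and not has_dependents:
            flush(graph, tid)
            ops.append(("flush", tid))
    return ops


graph = {
    "t1": Thought("Searched for the release date. Found 2019 on the vendor page.", 40, resolved=True),
    "t2": Thought("Checked a second source. It was unrelated.", 30, resolved=True),
    "t3": Thought("Compare the release date with the competitor.", 20, deps=["t1"]),
}
ops = enforce_budget(graph, budget=21)
print(ops, context_cost(graph), list(graph))
```

Here `t1` survives in folded form because the open thought `t3` depends on it, while `t2` is dropped. That is the sense in which later reasoning runs over a compact backbone.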

The latent RiM method takes the strongest possible stance against externalized intermediate reasoning. Memory blocks are processed in one forward pass, and only the final answer is decoded autoregressively. Inference thus becomes iterative answer refinement over progressively larger prefixes of memory blocks, rather than token-by-token chain-of-thought generation (Aichberger et al., 28 May 2026).

Across these systems, RiM replaces a single retrieval decision with a memory process: ranking, selective exposure, traversal, refinement, compression, or latent-state evolution.

4. Learning, supervision, and optimization in RiM systems

Some RiM systems are primarily inference-time overlays, but many rely on specialized training procedures to make memory computationally useful rather than merely accessible.

ReM-MoA adds an optional Reviewer distillation pipeline. A frontier teacher labels proposer traces with scores and rationales, producing a distillation set of scored and explained traces, and a smaller Reviewer is then LoRA-fine-tuned with supervised cross-entropy over the tokenized score-rationale output (Ping et al., 23 Jun 2026). The teacher sees ground truth during labeling, but its rationales are constrained not to restate answers, so the student learns to judge reasoning quality rather than memorize labels.

MemoBrain uses a two-stage training regime. Stage I performs supervised fine-tuning for memory construction, while Stage II uses Direct Preference Optimization over operation sets to learn folding and flushing decisions under budget (Qian et al., 12 Jan 2026). This makes memory management itself a learned policy.

The latent RiM method also uses a two-stage curriculum, but with a different target. Stage 1 grounds each memory block by predicting the next explicit reasoning step after that block, and Stage 2 drops step supervision and trains the model to iteratively refine the final answer, thereby converting textual reasoning scaffolds into latent working memory (Aichberger et al., 28 May 2026).

Memory-augmented reinforcement learning for tiny LLMs introduces another form of RiM. Successful and failed chains of thought are stored in separate episodic buffers, and intrinsic rewards are computed by pulling the current response embedding toward the centroid of successful responses and away from the most similar failed responses. The total objective includes this memory-derived term, so policy updates optimize reasoning relative to memory rather than outcome rewards alone (Le et al., 3 Apr 2025). This suggests a broader interpretation of RiM in which memory acts not only as inference context but also as an intrinsic reward landscape.
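The pull-and-push idea can be sketched with plain cosine similarity. This is an illustrative form, not the paper's formula: the use of cosine similarity, the choice of only the single nearest failure, the additive combination with the outcome reward, and the weight `beta` are all assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: list[list[float]]) -> list[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def intrinsic_reward(embedding, successes, failures) -> float:
    """Similarity to the success centroid minus similarity to the nearest failure."""
    pull = cosine(embedding, centroid(successes)) if successes else 0.0
    push = max((cosine(embedding, f) for f in failures), default=0.0)
    return pull - push


def total_reward(outcome: float, intrinsic: float, beta: float = 0.5) -> float:
    # Assumed additive combination; beta is a hypothetical weight.
    return outcome + beta * intrinsic


successes = [[1.0, 0.0], [1.0, 0.0]]   # embeddings of stored successful chains
failures = [[0.0, 1.0]]                # embeddings of stored failed chains

good = intrinsic_reward([1.0, 0.0], successes, failures)
bad = intrinsic_reward([0.0, 1.0], successes, failures)
print(good, bad, total_reward(1.0, good))
```

A response that resembles stored successes and no stored failure earns a positive bonus, and one that resembles a stored failure is penalized, even before the outcome reward is known.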

A different training concern appears in work on answer attribution. "Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models" argues that large reasoning models answer through concurrent chain-of-thought and memory-retrieval pathways, and introduces FARL, which interleaves GRPO-style RL with Negative Preference Optimization to suppress retrieval shortcuts during training (Wang et al., 29 Sep 2025). The motivation is that retrieval can “hack” the reward signal, so a reasoning-oriented RiM system may need explicit memory unlearning. In a related mechanistic direction, "The Reasoning-Memorization Interplay in LLMs Is Mediated by a Single Direction" identifies layerwise Linear Reasoning Features and shows that adding or suppressing a single residual-stream direction can bias an LLM toward generalizable reasoning or memory-based recall (Hong et al., 29 Mar 2025). These papers do not build explicit external memory modules, but they treat the reasoning–memory balance itself as an object of optimization and control.
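The idea of adding or suppressing a single residual-stream direction can be shown on plain vectors. This is a generic activation-steering sketch, not the paper's procedure: the function names, the scale `alpha`, and the reading of "suppressing" as projecting the direction out are assumptions.

```python
import math


def unit(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def steer(hidden: list[float], direction: list[float], alpha: float) -> list[float]:
    """Add alpha times the unit direction to a hidden state."""
    u = unit(direction)
    return [h + alpha * d for h, d in zip(hidden, u)]


def ablate(hidden: list[float], direction: list[float]) -> list[float]:
    """Remove the component of the hidden state along the direction."""
    u = unit(direction)
    projection = sum(h * d for h, d in zip(hidden, u))
    return [h - projection * d for h, d in zip(hidden, u)]


hidden = [1.0, 2.0]
direction = [0.0, 3.0]          # hypothetical feature direction
print(steer(hidden, direction, alpha=2.0), ablate(hidden, direction))
```

In a real model the same arithmetic would be applied to a layer's residual-stream activations during the forward pass. The toy vectors only show that the intervention is a one-dimensional edit.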

The literature therefore contains at least four training regimes for RiM: supervision of memory judgments, supervision of memory management, curriculum-based grounding of latent workspace, and optimization of the reasoning–memory trade-off itself.

5. Empirical performance and scaling behavior

The empirical case for RiM is strongest where naive context accumulation, one-shot retrieval, or shallow multi-agent interaction saturate.

ReM-MoA reports consistent gains across five reasoning benchmarks spanning math, formal logic, code, knowledge, and commonsense. Under depth scaling, the advantage widens with depth. In one reported configuration, ReM-MoA* reaches 84.0 on MATH, 70.2 on Formal Logic, 73.0 on CRUX, 70.2 on MMLU-redux, and 84.3 on HellaSwag, all above AttentionMoA in the same settings (Ping et al., 23 Jun 2026). The same paper reports that removing cross-layer access causes the largest ablation drop and destroys sustained scaling, while removing routing diversification also collapses sustained scaling. This directly supports the claim that structured cross-layer memory plus diversified routing, rather than sheer agent count, is what sustains MoA depth.

REMem reports a 3.4% absolute improvement on episodic recollection and a 13.4% absolute improvement on episodic reasoning relative to strong baselines, and achieves 93.1 exact match on Test of Time, the only method exceeding 90% EM on that benchmark (Shu et al., 13 Feb 2026). The ablations are diagnostically important: removing gists causes the largest drop, showing that situational context, not just fact extraction, is central to episodic RiM.

MemCog reports 92.98 on LoCoMo and 95.8 on LongMemEval, while on ProactiveMemBench the full system reaches Recall@5 = 59.51, LLM-judged Precision = 87.58, and human-judged Precision = 91.02 (Li et al., 27 May 2026). The ablations show that removing the Proactive Reasoning Protocol sharply degrades proactive recall and precision, whereas removing the graph overlay also causes clear losses. This indicates that proactive triggering and navigable structure are complementary, not interchangeable.

MRAgent reports significant improvements on LoCoMo and LongMemEval, with up to 23% relative improvement over strong baselines and markedly reduced token and runtime cost. On LongMemEval, under the Gemini backbone, MRAgent achieves overall LLM-Judge 72.95 versus 54.92 for MemoryOS, and on the same benchmark MRAgent consumes 118k tokens versus 632k for A-Mem and 3,268k for LangMem (Ji et al., 4 Jun 2026). The result is noteworthy because it shows that active reconstruction can improve both accuracy and efficiency when content retrieval is deferred until tags pass lightweight gating.

CompassMem reports consistent gains on LoCoMo and NarrativeQA across GPT-4o-mini and Qwen2.5-14B-Instruct backbones. For example, on LoCoMo with GPT-4o-mini, average F1 rises to 52.18 versus 47.92 for HippoRAG, while temporal F1 reaches 57.96 (Hu et al., 8 Jan 2026). The ablations show that removing topic clustering, event modeling, explicit edges, query refinement, or subgoal generation causes consistent performance drops, especially on multi-hop and temporal QA.

Goal-Mem improves a wide range of backbones and storage schemes, with particularly strong gains on multi-hop questions. On LoCoMo with Gemma-4-26B, Dense-RAG accuracy rises from 62.39 to 79.44 and BM25-RAG from 60.51 to 77.69 under Goal-Mem’s reasoning layer (Liang et al., 12 May 2026). Because Goal-Mem is structure-agnostic, these gains support the claim that reasoning policy over memory can matter as much as the storage backend.

RiM as latent working memory shows a different empirical profile. On GSM8K, the Llama-3.2-1B RiM model reaches 42.1% greedy accuracy versus 23.9% for SFT without CoT and 36.9% for Coconut, while time to first token is 16.1 ms, compared with 108.3 ms for Coconut and 420.3 ms for SFT with CoT (Aichberger et al., 28 May 2026). This is evidence for a specific RiM thesis: latent memory can recover part of the accuracy benefits of reasoning tokens without paying their autoregressive latency.

A parallel efficiency-oriented line appears in ENGRAM-R. On LoCoMo, it reduces input tokens from 28,371,703 to 3,293,478 and reasoning tokens from 1,335,988 to 378,424, with p50 total latency dropping from 7.89 s to 2.56 s (Patel et al., 17 Nov 2025). This system is not primarily about richer reasoning algorithms, but it supports the broader RiM claim that structured memory reuse can replace expensive recomputation.
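The relative savings implied by those figures are easy to check. The helper name `percent_reduction` is introduced here for illustration, and the inputs are exactly the numbers quoted above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


input_tokens = percent_reduction(28_371_703, 3_293_478)
reasoning_tokens = percent_reduction(1_335_988, 378_424)
p50_latency = percent_reduction(7.89, 2.56)

print(f"input tokens:     {input_tokens:.1f}% fewer")
print(f"reasoning tokens: {reasoning_tokens:.1f}% fewer")
print(f"p50 latency:      {p50_latency:.1f}% lower")
```

This works out to roughly 88.4% fewer input tokens, 71.7% fewer reasoning tokens, and 67.6% lower median latency.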

6. Limitations, controversies, and open problems

RiM systems inherit the limitations of memory itself: bias, staleness, over-conditioning, contamination, and opacity. The literature increasingly treats these not as implementation details but as first-order research problems.

In ReM-MoA, the Reviewer Agent introduces one extra LLM call per layer, and the authors explicitly note judge biases, ranking errors, append-only memory contamination, and possible overfitting to top-ranked traces (Ping et al., 23 Jun 2026). REMem reports failure modes such as selection and grounding errors, temporal window mismatches, incomplete multi-entity lists, and abstentions despite evidence retrieval (Shu et al., 13 Feb 2026). MemCog highlights token overhead, instruction-following variance, data-quality brittleness, sparse early histories, over-triggering or under-triggering, hallucination risks under proactive recall, and privacy or consent concerns for sensitive memories (Li et al., 27 May 2026). These are not peripheral caveats; they reflect the fact that RiM strengthens the causal role of memory in inference, so memory defects become reasoning defects.

A more structural concern is that memory can reshape reasoning even when the final answer remains plausible. "DRIFTLENS: Measuring Memory-Induced Reasoning Drift in Personalized LLMs" introduces a ground-truth-free framework that compares no-memory and memory-injected reasoning trajectories through symbolic value categories and sequence divergence metrics. Across four LLMs and 10 user-attribute categories, user-attribute memory induces medium-to-large drift, measured by Cohen's effect sizes, above each model’s pragmatic-noise floor, with Trans status and Disability ranking among the top three drift-inducing categories on seven of eight model×metric panels (Fang et al., 2 Jul 2026). This is a RiM-specific reliability issue: even irrelevant memory can silently change the justification path.
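One way to picture a sequence divergence between two reasoning trajectories is a normalized edit distance over their value-category labels. This is a stand-in chosen for illustration: the paper's actual metrics are not specified here, and the category names below are hypothetical.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def drift(no_memory: list[str], with_memory: list[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer sequence."""
    longest = max(len(no_memory), len(with_memory))
    return edit_distance(no_memory, with_memory) / longest if longest else 0.0


baseline = ["fairness", "autonomy", "safety"]
injected = ["fairness", "care", "safety", "care"]
print(drift(baseline, injected))
```

Because the comparison is between two runs of the same model, no ground-truth answer is needed. A noise floor would be estimated the same way from repeated no-memory runs.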

Another controversy concerns whether contemporary “reasoning” is actually reasoning or a competition between reasoning and retrieval. The answer-attribution study reports non-zero Reasoning Perturbation Success Rate and Retrieval Perturbation Success Rate across datasets and models, arguing that both pathways operate simultaneously. Distillation-based models show higher post-hoc explanation rates, while FARL reduces perturbation susceptibility by suppressing retrieval shortcuts during RL (Wang et al., 29 Sep 2025). The mechanistic LiReF paper reaches a compatible conclusion from the inside of the model: a single layerwise residual-stream direction can steer the system toward reasoning mode or memorization mode, and the strongest separation appears in middle layers (Hong et al., 29 Mar 2025). Together, these findings suggest that RiM cannot be studied solely as external memory engineering; it also interacts with internal memorization circuits and inference-time mode selection.

Open problems therefore cluster around four themes. First, governance of memory: relevance gating, privacy controls, sensitivity-aware surfacing, and stale-memory management remain under-specified. Second, calibration of memory-mediated judgments: reviewer models, entailment checks, and ontology labelers can all inject systematic error. Third, scaling laws: many systems have only been tested up to limited widths, depths, model sizes, or graph scales. Fourth, faithfulness: latent working memory and compressed executive memory improve efficiency, but they also reduce the visibility of intermediate computation, creating a tension between performance and interpretability.

RiM has consequently evolved from a memory-augmentation trick into a broader research program. The central question is no longer whether a model should have memory, but how memory should be structured, traversed, judged, and constrained so that reasoning remains scalable, verifiable, and robust.