
Runtime Agent Memory Techniques

Updated 8 February 2026
  • Runtime agent memory is a dynamic system that enables agents to continuously store, retrieve, and update contextual information without altering model weights.
  • It leverages non-parametric reinforcement learning, dual-stream architectures, and query-aware retrieval to balance performance with bounded resources.
  • These approaches address challenges like stability-plasticity trade-offs, context drift, and security in both single and multi-agent deployments.

Runtime agent memory denotes the set of architectures, algorithms, and strategies by which autonomous agents dynamically store, retrieve, and update information during interaction with their environment, tasks, or users. Unlike static precomputed memory or offline stores, runtime memory mechanisms operate continuously, enabling agents to adapt to new situations, reinforce learned behaviors, avoid catastrophic forgetting, maintain context across long horizons, and optimize resource efficiency—all without modifying model weights. A central challenge is reconciling the need for high-plasticity learning and context-consistency under bounded compute, storage, and latency constraints. Approaches span reinforcement learning over non-parametric experience stores, dual-stream data representations decoupling semantic reasoning from execution state, cost-aware and query-adaptive retrieval, memory CRUD policy optimization, and robust mechanisms for security and drift control.

1. Non-Parametric Episodic Memory and Reinforcement Learning Controllers

MemRL is a canonical framework in which a frozen LLM is augmented with a non-parametric memory bank of (intent embedding, experience, Q-value) triplets. Each runtime episode encodes the current query as an intent embedding and performs a two-phase retrieval: first, semantic top-k filtering by cosine similarity to the intent; second, re-ranking by a convex combination of semantic similarity and a z-score–normalized utility (Q-value) reflecting past reward for similar experiences. The frozen LLM reasons over the retrieved experiences and generates its response; environmental feedback then updates the Q-values online, solely in memory, via a terminal form of the Bellman equation:

Q_{\text{new}} = Q_{\text{old}} + \alpha\,(r - Q_{\text{old}})

where α is a learning rate and r is the observed reward. The LLM weights are strictly frozen; only the utility assignments of memory items evolve. This protocol enables continuous, stable self-improvement and reconciliation of the stability–plasticity dilemma, demonstrating substantial gains across long-horizon agent benchmarks relative to traditional retrieval-based and parametric fine-tuning approaches (Zhang et al., 6 Jan 2026).
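The utility update and the phase-two re-ranking can be sketched as follows. This is a minimal illustration of the mechanism described above, not MemRL's implementation; the mixing weight `beta` and the candidate pairs are assumptions:

```python
import math

def q_update(q_old, reward, alpha=0.1):
    # Terminal Bellman update: nudge the stored utility toward the observed reward.
    return q_old + alpha * (reward - q_old)

def rerank(candidates, beta=0.5):
    """Phase-two re-ranking: convex combination of semantic similarity and
    z-score-normalized Q-value. `candidates` is a list of (similarity, q) pairs."""
    qs = [q for _, q in candidates]
    mean = sum(qs) / len(qs)
    std = math.sqrt(sum((q - mean) ** 2 for q in qs) / len(qs)) or 1.0
    return sorted(
        ((beta * sim + (1 - beta) * (q - mean) / std, sim, q) for sim, q in candidates),
        reverse=True,
    )
```

With `beta = 0.5`, a memory with lower semantic similarity but a strong reward history can outrank a semantically closer but historically unhelpful one, which is exactly the behavior the utility term is meant to provide.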

2. Dual-Stream and Persistent State Architectures

Runtime agent memory benefits significantly from explicit separation between reasoning context (semantic) and execution state (runtime). CaveAgent exemplifies this by introducing a dual-stream model: an in-prompt semantic history (used for deliberation, function/variable schemas, and high-level reasoning steps) and a persistent, deterministic runtime namespace in a long-lived Python kernel or execution environment. All objects—variables, DataFrames, database handles—persist in this namespace and are referenced in semantic turns via symbolic handles. Only brief referential signatures appear in the prompt; full objects stay outside, avoiding repeated (de)serialization and maintaining contextual integrity even as the prompt size remains nearly flat over long dialogues. State changes are orchestrated through code generation and execution in the persistent namespace, with security and observation shaping enforced on each execution (Ran et al., 4 Jan 2026). This approach yields dramatic reductions in context token consumption—a 28.4% drop over traditional JSON or code-only agents—and robustly eliminates context drift and catastrophic forgetting for data-intensive tasks.
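The semantic/runtime decoupling can be illustrated with a minimal sketch. The class, method names, and handle format here are assumptions for illustration, not CaveAgent's actual interface:

```python
class RuntimeNamespace:
    """Dual-stream sketch: full objects live in a persistent namespace;
    only short referential signatures ever enter the prompt."""
    def __init__(self):
        self._objects = {}

    def bind(self, handle, obj):
        # Store the full object outside the context window.
        self._objects[handle] = obj
        # Only this brief signature is surfaced to the semantic stream.
        return f"<{handle}: {type(obj).__name__}, ~{len(repr(obj))} chars>"

    def resolve(self, handle):
        # Generated code dereferences the symbolic handle at execution time.
        return self._objects[handle]

ns = RuntimeNamespace()
sig = ns.bind("df_sales", list(range(10_000)))  # stand-in for a large DataFrame
# The prompt carries only `sig`; the full object never re-enters the context.
```

Because later turns reference `df_sales` symbolically, the prompt stays nearly flat in size no matter how large the underlying objects grow.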

3. Query-Aware, Cost- and Performance-Controlled Retrieval

Conventional memory models process all possible past information regardless of the specifics of the downstream query, wasting compute and risking loss of query-critical details. BudgetMem reorganizes the architecture into modular memory pipelines, where each module (filtering, entity extraction, summarization, etc.) is instantiated at runtime in one of several "budget tiers" (LOW/MID/HIGH). Tier selection is governed by an actor-critic RL router that adapts tier choices per module, conditioned on query complexity, intermediate results, and a configurable trade-off parameter λ balancing performance (e.g., F1 accuracy or LLM-Judge scores) against normalized API/token cost. Modules realize tiers through algorithmic complexity choice, model capacity, or reasoning depth (e.g., reflection versus direct retrieval). By learning query- and module-specific cost policies, BudgetMem consistently outperforms baselines in both high-accuracy and low-cost regimes and exposes accuracy–cost Pareto frontiers (Zhang et al., 5 Feb 2026). This approach enables agents to maintain tight runtime control over memory extraction and inference budget in practical deployments.

4. Memory Management Mechanisms in Multi-Agent and Large-Scale Settings

Scalable multi-agent deployments face additional runtime memory challenges—especially GPU memory pressure as the number of concurrent agents grows. Warp-Cortex introduces Singleton Weight Sharing (only one model instance loaded), Topological Synapse–inspired context landmarking (compressing each agent's context history to k ≪ L tokens via witness-complex principles), and non-intrusive KV-cache injection, reducing weight memory to O(1) and context to O(Nk). Asynchronous CUDA-stream execution lets hundreds to thousands of agents share the same model instance, with each agent's persistent state stored as sparse key/value caches. Scalability is empirically demonstrated, e.g., over 100 active agents on 2.2 GB VRAM (Williams, 3 Jan 2026). ScaleSim generalizes this at the LLM serving layer by formalizing "invocation distance"—an estimate of how soon an agent will next invoke the LLM—and uses proactive prefetching and priority-based GPU memory eviction strategies, supporting large swarms of agents whose persistent memory states are loaded/unloaded in anticipation rather than on demand, achieving significant throughput and latency gains (Pan et al., 29 Jan 2026).
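The invocation-distance idea reduces to a priority ordering over agents. The sketch below uses hypothetical step estimates and is a simplification of ScaleSim's actual scheduler:

```python
import heapq

def plan_eviction(agents, capacity):
    """Keep the `capacity` agents with the smallest invocation distance
    (soonest expected LLM call) resident on the GPU; evict the rest.
    `agents` maps agent_id -> estimated steps until next LLM invocation."""
    resident = heapq.nsmallest(capacity, agents, key=agents.get)
    evicted = [a for a in agents if a not in set(resident)]
    return resident, evicted

# Agents "c" and "a" will invoke the LLM soonest, so their KV state stays resident.
resident, evicted = plan_eviction({"a": 2, "b": 15, "c": 1, "d": 40}, capacity=2)
```

Running this plan ahead of time is what turns on-demand loading into the proactive prefetching the text describes: an evicted agent's state can be fetched back just before its estimated invocation.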

5. Learning-Based and Policy-Driven CRUD Memory Control

Runtime agent memory is increasingly formulated as a sequence of learnable, atomic memory operations. AtomMem decomposes memory interaction into CRUD (Create, Read, Update, Delete) primitives, casting the problem as a POMDP—each step emitting a (possibly empty) chain of atomic ops over a persistent vector-store memory. A hybrid of supervised pre-training and RL fine-tuning yields policies that learn to accumulate, update, and prune memories dynamically, conditioned on the task and current memory snapshot. The agent's memory state thus evolves as

\mathcal{M}_{t+1} = a_t^{k}(\mathcal{M}_{t}), \quad a_t^{k} \in \{\mathrm{Create}, \mathrm{Update}, \mathrm{Delete}\}

with Reads fetching context and Updates made only as newly discovered evidence requires. Empirical results indicate that RL-trained controllers discover structured memory management strategies—favoring proactive "create," selective "update," and judicious "delete" over static routines—yielding significantly improved performance on multi-hop and long-context QA (Huo et al., 13 Jan 2026).
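The atomic transition above can be made concrete with a minimal sketch. The op tuple format and the dict-backed store are assumptions for illustration; AtomMem operates over a persistent vector store:

```python
def apply_ops(memory, ops):
    """Apply a (possibly empty) chain of atomic CRUD ops to a memory snapshot
    (dict of id -> text). Read is omitted: it fetches context without mutating M."""
    m = dict(memory)  # each step yields a new snapshot M_{t+1}
    for kind, key, value in ops:
        if kind == "create":
            m[key] = value
        elif kind == "update" and key in m:
            m[key] = value  # revise only an existing entry
        elif kind == "delete":
            m.pop(key, None)  # prune; tolerate already-absent keys
    return m

m1 = apply_ops({}, [("create", "fact1", "Paris is in France")])
m2 = apply_ops(m1, [("update", "fact1", "Paris is the capital of France"),
                    ("create", "fact2", "The Seine flows through Paris")])
m3 = apply_ops(m2, [("delete", "fact2", None)])
```

The learned policy's job is to choose which chain of such ops to emit at each step, conditioned on the task and the current snapshot.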

6. Robustness, Drift Control, and Security in Runtime Memory

Security and drift pose acute risks in persistent runtime memory. A-MemGuard addresses adversarial memory injection and self-reinforcing error cycles by layering consensus-based validation and a lesson-distillation/recall dual-memory mechanism over standard agent memory. Each retrieval is cross-checked through parallel reasoning path comparison (via LLM-as-judge, embeddings, or density clustering) to filter anomalous or malicious entries. When detected, failures are distilled as structured "lessons" and re-injected into the planning loop to break the error cycle. This paradigm yields >95% reduction in attack success, minimal utility loss on benign queries, and adaptation over time as memory-driven defenses evolve (Wei et al., 29 Sep 2025). Drift and hallucination are further controlled through mechanisms like the Agent Cognitive Compressor (ACC), which replaces transcript replay or naive retrieval with a bounded, schema-governed state that is rebuilt each turn via explicit recall, gating, and compression steps. Only qualified, verifiable artifacts are committed to the persistent state, enforcing memory boundedness and suppressing drift to near zero in long-horizon scenarios (Bousetouane, 15 Jan 2026).
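The consensus-based validation step can be sketched as a simple agreement check over parallel reasoning paths. This majority-vote stand-in is an assumption; A-MemGuard's actual comparison uses LLM-as-judge, embeddings, or density clustering:

```python
from collections import Counter

def consensus_check(path_answers, threshold=0.5):
    """Flag a retrieval as suspect when independent reasoning paths disagree:
    accept only if the modal answer's agreement exceeds the threshold."""
    counts = Counter(path_answers)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(path_answers)
    return answer, agreement > threshold

ans, trusted = consensus_check(["approve", "approve", "approve", "deny"])
# A poisoned memory that sways only one path cannot flip the consensus;
# the disagreeing path would instead be distilled into a "lesson".
```

The key property is that an injected entry must corrupt a majority of paths to pass, which is what drives the reported drop in attack success.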

7. Advanced Hierarchical, Structured, and Adaptive Memory Models

Recent runtime memory systems move beyond flat stores to hierarchical or compositional structures. xMemory organizes streaming agent experience into a four-level hierarchy (raw blocks, episodes, semantic facts, themes), with split-merge dynamics guided by balanced cluster sparsity and semantic coherence objectives. At retrieval, theme and semantic nodes are selected to maximize relevance and diversity, with incremental episode expansion gated by LLM uncertainty reduction. Compared to fixed-top-k RAG, xMemory improves answer quality and token efficiency by systematically avoiding redundancy, maintaining prerequisites, and only expanding memory where it reduces LLM entropy on the task (Hu et al., 2 Feb 2026). Hybrid episodic–semantic memory with intelligent decay (e.g., Xu, 27 Sep 2025) implements composite scoring over recency, relevance, and user utility to control pruning and consolidation, supporting continual agent operation under bounded resource constraints.
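A composite decay score of the kind described can be sketched as follows. The weights, half-life, and item schema are illustrative assumptions, not values from the cited work:

```python
import math

def retention_score(item, now, w_rec=0.4, w_rel=0.4, w_util=0.2, half_life=86400.0):
    # Composite scoring over recency, relevance, and utility; entries whose
    # score falls below a pruning threshold would be consolidated or dropped.
    age = now - item["last_access"]
    recency = math.exp(-math.log(2) * age / half_life)  # halves every half_life seconds
    return w_rec * recency + w_rel * item["relevance"] + w_util * item["utility"]

now = 1_000_000.0
fresh = {"last_access": now, "relevance": 0.5, "utility": 0.5}
stale = {"last_access": now - 10 * 86400, "relevance": 0.5, "utility": 0.5}
# `fresh` scores near 0.7; ten-day-old `stale` decays toward 0.3 and becomes
# a pruning candidate despite identical relevance and utility.
```

Because the recency term decays exponentially while relevance and utility persist, frequently useful memories survive consolidation even when old, which is the intended bounded-resource behavior.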

Summary Table: Representative Runtime Memory Mechanisms

| Mechanism | Key Innovation | Representative Paper |
| --- | --- | --- |
| Episodic Q-learning Memory | Semantic + utility RL re-ranking | MemRL (Zhang et al., 6 Jan 2026) |
| Dual-stream Architecture | Semantic/execution decoupling | CaveAgent (Ran et al., 4 Jan 2026) |
| Budget-Aware Tier Routing | RL module-tier performance/cost control | BudgetMem (Zhang et al., 5 Feb 2026) |
| Sparse Context Topology | Topological synapse/witness compression | Warp-Cortex (Williams, 3 Jan 2026) |
| Policy-driven CRUD | Learnable atomic memory operations | AtomMem (Huo et al., 13 Jan 2026) |
| Security Consensus & Lessons | Self-correcting anomaly defense | A-MemGuard (Wei et al., 29 Sep 2025) |
| Hierarchical Decoupled Retrieval | Structured theme/semantic expansion | xMemory (Hu et al., 2 Feb 2026) |
| Schema-Bounded Compression | Stable, bounded cognitive state | ACC (Bousetouane, 15 Jan 2026) |

Runtime agent memory research thus encompasses a spectrum from non-parametric RL controllers, dual-stream persistent object management, modular and query-adaptive cost routing, multi-agent scale-optimized architectures, atomic policy-driven CRUD controllers, to adversary-resistant and drift-controlled memory compression. All approaches emphasize online plasticity, bounded context, security, and efficiency to meet the demands of autonomous decision-making and stable long-term operation across agent classes and environments.
