Composite & Long-Range Memory Tasks
- Composite and long-range memory tasks are paradigms that require systems to retain, manipulate, and utilize extended information across interdependent subtasks.
- They drive the development of dual-memory and unified agentic architectures, which combine short-term and long-term memory mechanisms to overcome bounded-context limitations.
- Transformer-based models like Compressive Transformer and CoMeT leverage dynamic compression and retrieval policies, achieving significant performance gains on multi-modal and sequential benchmarks.
Composite and long-range memory tasks are paradigms for evaluating and enabling sequential decision-making systems to perform reasoning or action sequences that require remembering, manipulating, and utilizing information over extended temporal spans and across interdependent subtasks. These tasks have catalyzed the development of neural architectures, agentic policies, and memory-augmented mechanisms that address the fundamental limitations of bounded-context processing, especially in LLMs, multimodal systems, and embodied agents.
1. Formal Definition and Task Characterization
Composite tasks are defined as action or reasoning sequences composed of multiple dependent or independent subgoals, often requiring hierarchical decomposition $\mathcal{T} = (g_1, g_2, \ldots, g_n)$, with each subtask $g_i$ potentially conditioned on outcomes or observations from prior subtasks. Long-range memory tasks are characterized by their memory horizon $H$, the maximal temporal lag between when a key observation or intermediate result is first encountered and when it must be recalled to solve a downstream subgoal. In 3DMem-Bench, memory horizons for simple, medium, and hard tasks span approximately 20, 40, and 70 actions, respectively; similar horizons appear in agentic QA and dialogue tasks spanning very long token contexts (Hu et al., 28 May 2025; Tavakoli et al., 31 Oct 2025).
Key properties:
- Temporal causality: The solution at step $t$ often depends on inputs or results received at step $t - \Delta$, with $\Delta$ arbitrary and potentially much longer than the model's context window $L$.
- Subgoal chaining: Task structure requires retention and composition of multiple intermediate results, often across heterogeneous modalities, applications, or semantic domains.
- Cross-context dependencies: Decisions in one context (e.g., application, room, dialogue turn) require memory or summarization of facts or events from distant, possibly unrelated contexts.
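The memory-horizon notion above can be made concrete with a small sketch. The trace format and helper below are illustrative assumptions, not drawn from any cited benchmark: a task is a list of (step, dependency-steps) pairs, and the horizon is the largest lag between when information appears and when a later step must recall it.

```python
# Sketch: computing the memory horizon of a composite task trace.
# A trace is a list of (step, depends_on_steps) pairs; the horizon is the
# largest lag between a step and its earliest dependency. Format is
# illustrative, not from any paper.

def memory_horizon(trace):
    """Return the max lag between a step and its earliest dependency."""
    horizon = 0
    for step, deps in trace:
        for d in deps:
            horizon = max(horizon, step - d)
    return horizon

# A hard 3DMem-Bench-style task would show horizons around 70 actions:
trace = [(10, []), (30, [10]), (75, [10, 30])]
print(memory_horizon(trace))  # lag 75 - 10 = 65
```

A model whose context window is shorter than this horizon cannot solve the task without an external or compressed memory, which motivates the architectures surveyed below.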
2. Memory System Architectures for Composite and Long-Range Tasks
Approaches to composite and long-range memory fall into two core families: dual-memory systems (separating short- and long-term memory) and unified agentic memory models.
Dual-Memory Models
- Chain-of-Memory (CoM): Maintains explicit Short-Term Memory (STM) as a fixed-size FIFO window of recent action-result summaries, and Long-Term Memory (LTM) as a dynamically filtered list of distilled facts. Insertions to LTM and STM are governed by task-conditioned policies, with each memory summarized and fused via transformer cross-attention at each decision step. The fusion pipeline supplies both recent and persistent information, with update policies ensuring efficient state representation and limited information overload (Gao et al., 22 Jun 2025).
- KARMA: Incorporates 3D scene graphs for LTM (persistent static structure) and STM as a cache of recent dynamic object states. Adaptive retrieval and replacement strategies—FIFO, W-TinyLFU, and frequency-based retention—ensure efficient real-time planning in embodied agents (Wang et al., 2024).
- 3DLLM-Mem: Uses working memory (WM) tokens for present observations and an episodic memory bank (EM) for temporally indexed past features. Dynamic attention-based fusion between WM and EM supports spatial-temporal reasoning in 3D environments (Hu et al., 28 May 2025).
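A minimal sketch of the dual-memory pattern these systems share, assuming a CoM-style FIFO short-term window and a boolean-gated long-term list; the boolean gate is a stand-in for the learned, task-conditioned insertion policy, and the class name is hypothetical.

```python
# Sketch of a dual memory: fixed-size FIFO short-term window plus a
# gate-filtered long-term fact list. The `keep_long_term` flag stands in
# for the task-conditioned policy the papers describe.
from collections import deque

class DualMemory:
    def __init__(self, stm_size=4):
        self.stm = deque(maxlen=stm_size)  # oldest entry dropped automatically
        self.ltm = []                      # persistent distilled facts

    def observe(self, summary, keep_long_term):
        self.stm.append(summary)           # FIFO insertion
        if keep_long_term:                 # boolean gate (policy stand-in)
            self.ltm.append(summary)

    def state(self):
        # In the full models both streams are fused via cross-attention;
        # here we simply expose them for inspection.
        return list(self.stm), list(self.ltm)

mem = DualMemory(stm_size=2)
for i, keep in enumerate([False, True, False]):
    mem.observe(f"step-{i}", keep)
print(mem.state())  # (['step-1', 'step-2'], ['step-1'])
```

The key design point is that the two stores age differently: STM forgets by position (oldest out), while LTM forgets only by explicit policy decision.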
Unified and Agentic Memory Models
- Agentic Memory (AgeMem): Exposes both STM and LTM as structured “memory tools” within the agent action space, integrating add, update, delete, retrieve, summarize, and filter operations. The policy autonomously invokes memory actions as needed under context constraints, trained via a three-stage curriculum with stepwise group-normalized reinforcement learning (Yu et al., 5 Jan 2026).
- Memory-as-Action (MemAct): Reframes working memory management as an RL policy, where memory edit actions (prune, summarize, insert) are explicit elements of the agent’s action space. DCPO segments fractured trajectories to propagate learning signals across memory and task actions (Zhang et al., 14 Oct 2025).
- LIGHT framework: Implements tri-level memory: long-term episodic retrieval (dense vector index), working memory window, and a compressed scratchpad of salient facts, integrating all memory streams for answer generation in long-form dialogue and QA tasks (Tavakoli et al., 31 Oct 2025).
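The memory-as-action idea can be sketched as follows, assuming a toy action vocabulary (insert, prune, summarize, task) that stands in for the learned tool interfaces in AgeMem and MemAct; the class and method names are illustrative, not the papers' actual APIs.

```python
# Sketch of a unified action space where memory edits are first-class
# agent actions: the same policy that acts on the environment also
# manages its own working context.

class Agent:
    def __init__(self):
        self.context = []  # working context the policy manages directly

    def act(self, action, payload=None):
        if action == "insert":
            self.context.append(payload)
        elif action == "prune":
            self.context = [c for c in self.context if c != payload]
        elif action == "summarize":
            merged = " | ".join(self.context)
            self.context = [merged]          # collapse context to one entry
        elif action == "task":
            return f"acting on: {self.context}"  # ordinary env action
        return None

agent = Agent()
agent.act("insert", "fact A")
agent.act("insert", "fact B")
agent.act("summarize")
print(agent.context)  # ['fact A | fact B']
```

In the RL framing, the reward for "task" actions must propagate back through the memory-edit actions that shaped the context, which is exactly the credit-assignment problem DCPO and the curriculum training address.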
3. Transformer-Based and Neural Architectures
Several advances address the quadratic compute cost of standard transformers and the gradient instability of RNNs in the long-memory regime:
- Compressive Transformer: Augments Transformer-XL with a dual memory per layer: a short-term FIFO memory and a compressed long-term memory. Evicted short-term memories are lossily compressed (e.g., mean-pooling, convolutions) before enqueuing into long-term memory, allowing gradient-based optimization to scale to 100× longer effective context windows. Compression-specific auxiliary losses (autoencoding, attention reconstruction) target optimal information retention (Rae et al., 2019).
- CoMeT (Collaborative Memory Transformer): Uses a small, gated global memory that resists catastrophic forgetting, plus a FIFO temporary queue for recent details. At each chunk, both memories act as soft prompts for the next step; gating ensures long-term retention with linear-time, constant-space scaling. Demonstrates competitive retrieval and summarization at million-token scales, validated on agentic and summarization benchmarks (Zhao et al., 2 Feb 2026).
- Scene Memory Transformer (SMT): For embodied agents in partially observable environments, all observations are embedded and appended to an unbounded memory. Temporal embeddings and self-attention over the entire memory (optionally factorized for computational efficiency) permit fine-grained spatio-temporal credit assignment with no subgoal bottlenecks (Fang et al., 2019).
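The evict-and-compress step described for the Compressive Transformer can be sketched as below. Scalar "activations" and a rate-2 mean-pool stand in for the layer's hidden vectors and learned compression operators; the function name and signature are illustrative.

```python
# Sketch of Compressive Transformer-style eviction: overflow from the
# short-term FIFO is mean-pooled in groups of `rate` before entering the
# long-term compressed store. This is the fixed mean-pool variant; the
# paper also uses learned convolutional compressors.

def compress_evicted(stm, stm_capacity, compressed, rate=2):
    """Evict STM overflow, mean-pool groups of `rate`, append to LTM."""
    overflow = len(stm) - stm_capacity
    if overflow > 0:
        evicted, stm = stm[:overflow], stm[overflow:]
        for i in range(0, len(evicted), rate):
            group = evicted[i:i + rate]
            compressed.append(sum(group) / len(group))  # lossy mean-pool
    return stm, compressed

stm, ltm = compress_evicted([1.0, 3.0, 5.0, 7.0], stm_capacity=2,
                            compressed=[])
print(stm, ltm)  # [5.0, 7.0] [2.0]
```

With compression rate $c$, a long-term store of $m$ slots summarizes $c \cdot m$ evicted steps, which is the source of the extended effective context.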
4. Algorithmic Memory Management Protocols and Policies
Explicit memory policies are common across systems, with crucial implications for both competence and computational efficiency:
| System | STM Mechanism | LTM Mechanism | Update/Retention |
|---|---|---|---|
| CoM (Gao et al., 22 Jun 2025) | FIFO queue (size 4) | Boolean-gated facts | Oldest dropped (STM); context switch triggers append (LTM) |
| KARMA (Wang et al., 2024) | LRU/LFU or merge | 3D scene graph | Adaptive with counting Bloom filters (STM); incremental updates (LTM) |
| 3DLLM-Mem (Hu et al., 28 May 2025) | Working memory tokens | Episodic memory bank | EM updated on revisit; WM always current step |
| AgeMem (Yu et al., 5 Jan 2026) | Ordered utterance list | Key-value store (encoding) | Add, update, delete tools (policy-learned) |
| CoMeT (Zhao et al., 2 Feb 2026) | FIFO queue | Gated global state | Layer-specific, chunk-by-chunk, parameterized gating |
| Compressive (Rae et al., 2019) | Primary FIFO | Pool/compressed memory | Learned/fixed operator, auxiliary loss drives utility |
In each case, the timing of retention, eviction, and prioritization, and the distinction between task-relevant and spurious information, are policy-driven or learned.
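The "parameterized gating" retention entries in the table can be sketched as a sigmoid blend of old memory and candidate update; the scalar state and gate logits below are illustrative stand-ins for the per-slot vectors and learned gate networks.

```python
# Sketch of a gated global-memory update (CoMeT-style): a sigmoid gate
# blends the previous state with a candidate update, so a constant-size
# memory can retain old content instead of being overwritten.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_update(memory, candidate, gate_logit):
    g = sigmoid(gate_logit)                # g near 1 => keep old memory
    return g * memory + (1.0 - g) * candidate

kept = gated_update(10.0, 0.0, gate_logit=8.0)   # strong retention
mixed = gated_update(10.0, 0.0, gate_logit=0.0)  # even blend
print(round(kept, 2), round(mixed, 2))  # 10.0 5.0
```

Because the gate is learned per layer and per chunk, the model itself decides which slots persist and which absorb new information.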
5. Empirical Results and Benchmarking
Multiple frameworks have been validated on composite and long-range tasks from language modeling, embodied control, multi-hop QA, book/code summarization, and navigation:
- On GUI Odyssey-CoM (cross-app navigation), CoM augments both zero-shot and 7B fine-tuned models, increasing Task-Switch Score from 12.67% to 30.36% and AMS from 33.72% to 39.38% (zero-shot); STM+LTM consistently outperforms either alone (Gao et al., 22 Jun 2025).
- KARMA achieves marked improvements in composite-task success rate (SR) and substantial reductions in execution time for complex long-range tasks in AI2-THOR; MRA also improves in complex settings (Wang et al., 2024).
- 3DLLM-Mem achieves the highest SR and Sub-SR on the hardest in-the-wild setting, a +16pp gain over the best alternatives; EQA and captioning scores also rise by 10–20pp (Hu et al., 28 May 2025).
- On BEAM (long multi-session dialogues), LIGHT yields mean F1 lifts of 3.5 points or more over baselines, with the largest ablation drops upon removal of episodic retrieval or filtering (Tavakoli et al., 31 Oct 2025).
- CoMeT matches or exceeds full-attention summarization on GovReport/SummScreenFD and maintains 100% passkey retrieval at million-token scales (Zhao et al., 2 Feb 2026).
- Compressive Transformer achieves 17.1 PPL on WikiText-103 and 0.97 bpc on Enwik8, outperforming Transformer-XL; compression-rate tuning is crucial for the optimal trade-off (Rae et al., 2019).
- SMT yields superior reward and coverage metrics (e.g., 3.69 found classes in search task, vs 3.14/3.07/3.53 for baselines) (Fang et al., 2019).
6. Analytical Insights, Limitations, and Future Directions
Several principles and bottlenecks have been identified through systematic ablation and analysis:
- Effective memory management requires balancing immediate state tracking (STM) with persistent fact retention (LTM), with empirically observed optimal STM window sizes (e.g., a window of 4 in CoM).
- Compression (as in Compressive Transformer, CoMeT) enables quadratic-to-linear reduction in computation and storage, but the correct granularity and adaptation protocols (e.g., gated updates, auxiliary losses preserving attention footprint) are essential to avoid catastrophic forgetting or information bottlenecking.
- Unified agentic memory policies that allow tool-based, context-sensitive memory edits—trained end-to-end via RL—demonstrate superior context efficiency and compositional task performance compared to heuristic separation of memory systems (Yu et al., 5 Jan 2026, Zhang et al., 14 Oct 2025).
- Limitations include capacity and retrieval latency scaling (in embodied settings), the need for semantic memory quality evaluation, and the challenge of dynamically adapting memory schemas and policies to unstructured or evolving task distributions.
- Future directions emphasize multi-level or hierarchical memory schemes, hybridization with external/episodic retrieval, closed-loop verification of memory accuracy, and meta-learning for lifelong adaptation.
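The quadratic-to-linear reduction noted above for compressed memories can be sanity-checked with a back-of-envelope count; the sequence length and memory-slot count below are illustrative, not taken from any of the cited systems.

```python
# Back-of-envelope: full attention over a length-n history costs O(n^2)
# pairwise score computations across the sequence, while attending over a
# fixed-size compressed memory of m slots costs O(n * m).

def full_attention_scores(n):
    return n * n          # every position attends to every position

def compressed_attention_scores(n, m):
    return n * m          # every position attends to m memory slots

n, m = 100_000, 512
print(full_attention_scores(n) // compressed_attention_scores(n, m))
# full attention performs ~195x more score computations at this scale
```

The ratio grows linearly with $n$, which is why fixed-size compressed or gated memories are the only of these designs that remain tractable at million-token scales.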
7. Generalization and Cross-Domain Applicability
The core dual-memory and agentic memory principles are task- and modality-agnostic, applying to GUI agents, LLM-based dialog, code or document summarization, physical robotics, and structured data reasoning:
- PRISM demonstrates that schema-constrained in-context memory can be automatically adapted to new tasks via LLM-based schema synthesis and achieves almost flat inference costs as chunk size shrinks, with negligible degradation in task performance (Jayalath et al., 2024).
- Biological inspiration, as in the Expressive Leaky Memory neuron, suggests that explicit, learned timescales for memory decay and nonlinearity yield substantial improvements in long-horizon dynamical tasks, supporting modular stacking for sequence modeling (Spieler et al., 2023).
- MAES leverages multi-task, transfer-learned encoders to achieve perfect long-range generalization by decoupling memory encoding from solution, emphasizing the value of compositional, position-robust separation of working-memory primitives (Jayram et al., 2018).
In summary, composite and long-range memory tasks provide a rigorous, multi-domain testbed for evaluating and advancing neural and agent architectures that integrate multi-timescale, policy- and task-adaptive memory representations. These advances collectively underpin progress toward sequential, context-rich reasoning and planning in both artificial and embodied intelligence.