
Recurrent Memory Transformers

Updated 3 February 2026
  • Recurrent Memory Transformers are neural architectures that augment standard Transformers with explicit, persistent memory tokens and recurrent update strategies.
  • They employ innovative memory update rules—such as gated recurrence, associative updates, and cache compression—to efficiently manage long sequences.
  • These models demonstrate high performance in language, vision, and reinforcement learning by reducing computational complexity and enabling long-context processing.

A Recurrent Memory Transformer is a neural sequence model architecture that augments the standard Transformer with explicit, persistent memory structures that propagate information recurrently across time, segments, or layers. This class of models addresses the context-length, efficiency, and information bottleneck limitations of standard attention mechanisms by introducing explicit recurrent connections, memory tokens/vectors, or hidden-state compression and retention strategies. Recent advancements span gated recursion, segment-wise memory tokens, fast associative update rules, and adaptive memory compression, yielding models capable of long-context reasoning across diverse domains.

1. Architectural Foundations and Core Mechanisms

Recurrent Memory Transformers (RMTs) generalize segment-level recurrence by introducing explicit memory pathways connecting different time steps or segments in the sequence. The canonical design splits a long input sequence into fixed-length segments, augmenting each segment with a compact set of memory vectors (e.g., tokens, caches, or associative matrices) that serve as a cross-segment, persistent state. These memory blocks are updated in a recurrent fashion—either at each forward pass or via specialized update rules—enabling information to propagate beyond the local attention window.

In "Event-based Monocular Dense Depth Estimation with Recurrent Transformers," the EReFormer instantiates this paradigm with a Gate Recurrent Vision Transformer (GRViT) unit: at each time step tt, a hidden-state ht−1h_{t-1} (of the same shape as the feature map ftf_t) is propagated forward and updated via an attention-based and gated residual mechanism:

$$Q_t = f_t W^f_Q + h_{t-1} W^h_Q, \quad K_t = f_t W^f_K + h_{t-1} W^h_K, \quad V_t = f_t W^f_V + h_{t-1} W^h_V$$

Attention is computed using an ELU-based linear attention operator, then combined with the previous hidden state by an element-wise gate:

$$U_t = \sigma([f_t; h_{t-1}] W_p), \quad h_t = (1 - U_t) \circ h_{t-1} + U_t \circ A_t$$

where $A_t$ is the multi-head attention output and $U_t$ is a channel-wise gate. This mechanism allows the recurrent state to summarize past temporal dynamics while maintaining computational tractability (Liu et al., 2022).
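
A minimal PyTorch-style sketch of this gated recurrent update is given below, using a single head for brevity and treating the feature map as a flat sequence of tokens. The module name, the exact ELU linear-attention normalization, and the layer shapes are illustrative assumptions rather than the EReFormer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedRecurrentAttention(nn.Module):
    """Sketch of a GRViT-style unit: linear attention over f_t and h_{t-1},
    followed by an element-wise gate that mixes the attention output into h_t."""

    def __init__(self, dim):
        super().__init__()
        # Separate projections for the current features f_t and hidden state h_{t-1}
        self.q_f, self.q_h = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k_f, self.k_h = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_f, self.v_h = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)  # W_p acting on [f_t; h_{t-1}]

    def forward(self, f_t, h_prev):
        # Q_t, K_t, V_t mix current features with the propagated hidden state
        q = self.q_f(f_t) + self.q_h(h_prev)
        k = self.k_f(f_t) + self.k_h(h_prev)
        v = self.v_f(f_t) + self.v_h(h_prev)

        # ELU-based linear attention: phi(q) (phi(k)^T v), normalized per query
        phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bnd,bne->bde", phi_k, v)
        norm = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)).clamp(min=1e-6)
        a_t = torch.einsum("bnd,bde->bne", phi_q, kv) / norm.unsqueeze(-1)

        # Channel-wise gate U_t decides how much of A_t replaces h_{t-1}
        u_t = torch.sigmoid(self.gate(torch.cat([f_t, h_prev], dim=-1)))
        return (1 - u_t) * h_prev + u_t * a_t  # h_t
```

Calling such a module once per time step with the new feature map and the previous hidden state yields the recurrent summary $h_t$ used downstream.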

Segment-level recurrence is also the basis of the RMT architecture:

  • The long sequence is divided into segments $S_1, S_2, \ldots$ of length $L$.
  • Each segment $\tau$ receives as input a set of $M$ memory vectors $m_{\tau}$ from the prior segment's output memory.
  • The standard Transformer layers operate over the concatenation $[m_{\tau}; \text{segment tokens}; m_{\tau}]$.
  • After processing, the output memory for the next segment is extracted as the last $M$ rows of the output (Bulatov et al., 2022).

This architecture provides a direct, differentiable pathway for transferring information across arbitrarily long sequences via bounded, recurrent memory.
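
The segment-level recurrence itself is a short processing loop. The sketch below is schematic, assuming an arbitrary `transformer` module that maps an embedded sequence to same-shape outputs; the read/write memory layout mirrors the concatenation described above, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

def rmt_forward(transformer: nn.Module, segments, memory):
    """Schematic RMT pass.

    segments: list of (batch, L, d) token-embedding tensors.
    memory:   (batch, M, d) tensor of memory-token embeddings.
    """
    outputs = []
    for seg in segments:
        # Concatenate [read memory; segment tokens; write memory]
        x = transformer(torch.cat([memory, seg, memory], dim=1))
        m = memory.shape[1]
        outputs.append(x[:, m:-m, :])   # token outputs for this segment
        memory = x[:, -m:, :]           # last M rows become the next segment's memory
    return outputs, memory

# Toy usage with a generic encoder (hypothetical sizes)
d, M, L = 64, 4, 16
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2
)
segments = [torch.randn(2, L, d) for _ in range(3)]
outputs, memory = rmt_forward(encoder, segments, torch.zeros(2, M, d))
```

Because the memory tensor is carried through the loop, information (and, during training, gradients) can propagate across segment boundaries through a bounded recurrent state.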

2. Memory Update Rules and Compression Strategies

The efficiency and scalability of RMTs are largely determined by how memory is updated and compressed over time. Several strategies have emerged:

  • Simple Overwrite: In early RMT variants, the output memory is a full overwrite of the previous state, i.e., $m_{\tau+1} :=$ the write memory produced by segment $\tau$ (Bulatov et al., 2022). This approach is simple but can lead to forgetting over very long histories.
  • Gated/GRU-like Update: Gated updates, as in the MART model and EReFormer GRViT, allow adaptive mixing of old and new memory. For MART:

$$M_t^\ell = Z_t^\ell \odot M_{t-1}^\ell + (1 - Z_t^\ell) \odot C_t^\ell$$

with $Z_t^\ell$ a learned gate determined by both the old memory and the new summary (Lei et al., 2020).

  • Associative and Delta-Rule Updates: In ARMT, each layer maintains a key–value associative matrix $A_s^l$; memory is written by a fast-weight $v \otimes \phi(k)$ delta-rule with a correction term to prevent catastrophic forgetting (Rodkin et al., 2024). Reads apply content-based attention recall at fixed computational cost, enabling scaling to millions of tokens without loss of information.
  • Cache Compression Policies: "Transformers are Multi-State RNNs" models the autoregressive Transformer as an unbounded multi-state RNN, where the "hidden state" is the key–value cache growing linearly with time. To bound memory, cache compression policies discard or select which token states to retain. The TOVA (Token Omission via Attention) policy retains the $k$ most-attended tokens, as measured by the current query's softmax attention over all candidates, and consistently outperforms baseline FIFO and fixed-window schemes while using as little as $\frac{1}{8}$ of the original cache size (Oren et al., 2024); a minimal sketch of this policy appears after this list.
  • Biologically Inspired Compression: RMAAT uses astrocyte-inspired adaptive compression, where a retention factor $r_t$ derived from simulated long-term potentiation determines how much of the recurrent memory is preserved per segment. The update is

$$\mathrm{mem}_{t+1} = r_t(T)\, \widetilde{\mathrm{mem}}_{t+1}$$

with $r_t$ varying systematically to enforce long-term compression and prevent unbounded growth (Mia et al., 1 Jan 2026).
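
To make the cache-compression idea concrete, the following single-head sketch implements a TOVA-like pruning step as described in the list above: only the $k$ cache entries that receive the highest softmax attention from the current query are retained. The function name, tensor shapes, and scaling are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tova_prune(keys, values, query, k):
    """Keep the k most-attended key-value cache entries (TOVA-style policy sketch).

    keys, values: (n_cache, d) tensors; query: (d,) current query vector.
    """
    n, d = keys.shape
    if n <= k:                      # nothing to prune yet
        return keys, values
    # Softmax attention of the current query over all cached token states
    scores = F.softmax(keys @ query / d ** 0.5, dim=0)
    keep = torch.topk(scores, k).indices.sort().values   # preserve original order
    return keys[keep], values[keep]
```

In a decoding loop this would be applied after appending the newest key–value pair, bounding the cache at $k$ entries regardless of sequence length.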

3. Integration with Transformer Architectures

Several integration strategies have been developed to merge explicit recurrent memory with Transformer-style self-attention:

  • Memory Tokens in Input/Output: RMT (Bulatov et al., 2022), RATE (Cherepanov et al., 2023), and ARMT (Rodkin et al., 2024) introduce explicit memory tokens as part of each segment’s input and output, enabling both write and read access via self-attention. In multi-head self-attention, memory tokens function as regular attention participants, for example (see the sketch after this list):

$$\mathrm{Attention}\big(Q,\ [K_{\text{seq}}; K_{\text{mem}}],\ [V_{\text{seq}}; V_{\text{mem}}]\big)$$

  • Hierarchical and Layerwise Memory Propagation: ARMT maintains an associative memory in each layer, updated in parallel across layers and segments, which allows layer-specific specialization and efficient aggregation of context (Rodkin et al., 2024). The original RMT propagates memory only vertically (segment-to-segment), but ARMT extends this to all layers (horizontal propagation), interfacing with advanced scheduling (Diagonal Batching, see below) (Sivtsov et al., 5 Jun 2025).
  • Hybrid Fusion and Cross-Attention: The EReFormer model fuses the outputs of Swin-Transformer blocks at each stage with spatial transformer fusion (STF), employing two-stage cross-attention blocks that act as high-bandwidth skip connections between encoder and decoder blocks, coupled with temporally recurrent GRViT units to summarize sequence history (Liu et al., 2022).
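
Following the first bullet above, here is a minimal single-head sketch of memory tokens participating in attention alongside ordinary sequence tokens; the concatenation mirrors the formula given there, and all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def attend_with_memory(q, k_seq, v_seq, k_mem, v_mem):
    """Scaled dot-product attention where queries read both sequence and memory states.

    q: (n_q, d); k_seq, v_seq: (n_seq, d); k_mem, v_mem: (n_mem, d).
    """
    k = torch.cat([k_seq, k_mem], dim=0)   # [K_seq; K_mem]
    v = torch.cat([v_seq, v_mem], dim=0)   # [V_seq; V_mem]
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```

Writes happen symmetrically: the memory-token positions in the output carry the updated memory passed to the next segment.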

4. Memory Efficiency, Complexity, and Scaling

Substituting self-attention over the full context with recurrent memory reduces computational cost from quadratic time and memory in sequence length to linear, or even $O(1)$ per token for fixed-size memories:

| Model | Memory Complexity | Time Complexity | Comments |
| --- | --- | --- | --- |
| Transformer (full) | $O(N \cdot d)$ | $O(N^2 \cdot d)$ | Scales poorly with long sequences |
| RMT/ARMT (segmental) | $O(K \cdot d + m \cdot d)$ | $O(N \cdot L \cdot d)$ | $K$ = segment size, $m$ = memory size |
| ARMT (constant memory) | $O(1)$ per token | $O(1)$ per token | Holds with fixed segment and memory sizes |
| CRT (single-vector) | $O(1)$ per segment | $O(n^2 \cdot L + n \cdot d_m)$ | Only a single memory vector is carried forward |

Diagonal Batching (Sivtsov et al., 5 Jun 2025) further unlocks parallelism in RMTs, eliminating the segment-wise sequential bottleneck entirely. It reorders computation so that all layer–segment pairs with the same sum index can be processed in parallel, reducing the number of effective synchronization steps over an $S \times L$ grid from $S \times L$ to $S + L - 1$. This enables a $3.3\times$ speedup over full attention and constant memory cost even at a 131,072-token context (Sivtsov et al., 5 Jun 2025).
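
A schematic of the diagonal schedule follows, under the assumption that a (segment, layer) cell depends only on the previous layer of the same segment and the same layer's memory from the previous segment, consistent with the layerwise propagation described above; `diagonal_schedule` and the toy grid size are illustrative, not the released implementation.

```python
def diagonal_schedule(num_segments, num_layers):
    """Group (segment, layer) cells by their sum index (anti-diagonals).

    Cell (s, l) depends on (s, l - 1) and (s - 1, l), so cells within one
    diagonal are mutually independent; S * L sequential steps collapse into
    S + L - 1 parallel waves.
    """
    waves = []
    for d in range(num_segments + num_layers - 1):
        waves.append([(s, d - s) for s in range(num_segments) if 0 <= d - s < num_layers])
    return waves

# Example: a 4-segment x 3-layer grid yields 6 waves instead of 12 sequential steps
for wave in diagonal_schedule(4, 3):
    print(wave)   # each wave could be dispatched as one batched kernel
```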

5. Empirical Performance Across Domains

Extensive empirical evaluation demonstrates the efficacy of recurrent-memory mechanisms in diverse tasks:

  • Algorithmic Reasoning: RMT achieves perfect generalization on Copy, Reverse, and Quadratic tasks with up to 9 input segments, where baselines and Transformer-XL fail (Bulatov et al., 2022).
  • Language Modeling: CRT achieves state-of-the-art perplexity on WordPTB and WikiText-103, with lower latency and parameter count relative to Transformer-XL (Mucllari et al., 2 May 2025). In PG-19 and WikiText-103, adding simple windowed recurrence to GPT-2 yields significant perplexity reduction at no extra FLOPs (Yoshida et al., 2020).
  • Vision: Cached Transformers with gated recurrent cache attention improve ImageNet-1k top-1 accuracy from 79.9% to 81.3% (ViT-S), as well as achieving gains in detection and segmentation (Zhang et al., 2023). EReFormer achieves leading dense event-based depth estimation on synthetic and real-world datasets (Liu et al., 2022).
  • Long-Range Reasoning: ARMT attains state-of-the-art results on BABILong multi-task long-context retrieval, maintaining 79.9% accuracy at 50 million tokens and surpassing RMT and Mamba (Rodkin et al., 2024).
  • Temporal Sequence Tasks: RMAAT with adaptive memory retention and astromorphic attention achieves top LRA performance with lower peak memory and higher throughput than all prior RMTs (Mia et al., 1 Jan 2026).
  • Dialogue, Summarization, RL: Recurrent memory mechanisms enable significant gains in dialogue modeling (perplexity ~15.2 vs. ~18.0), document understanding (ROUGE-L +2.3), and offline RL in POMDPs (Kashyap, 1 Jul 2025; Cherepanov et al., 2023).

6. Domain-Specific RMT Variants

A variety of RMT specializations address domain constraints:

  • Vision: EReFormer (Liu et al., 2022) processes asynchronous event camera streams, fusing spatial and temporal context via GRViT and multi-level transformer encoding.
  • Reinforcement Learning: RATE (Cherepanov et al., 2023) propagates memory tokens for each segment, yielding robust policy learning in POMDPs by extending effective context.
  • Language and Document Tasks: PARA-COMET (Gabriel et al., 2020) recurrently buffers past inferences as external memory to enforce discourse-wide coherence in narrative commonsense generation. "Learn To Remember" introduces explicit memory "read" and "write" attentions for document-level machine translation, consistently outperforming context-free and simple cache baselines (Feng et al., 2022).

7. Limitations and Open Research Problems

Despite their strengths, RMTs face several practical constraints: segment-wise recurrence imposes a sequential dependency that limits parallelism (only partially alleviated by scheduling approaches such as Diagonal Batching), fixed-size memories can forget information over very long histories, and memory capacity must be chosen in advance.

Open problems include adaptive memory sizing, hierarchical or multi-scale recurrence, optimally learnable memory update gates, and combining RMTs with efficient linear/sparse attention kernels for further scaling.


In sum, Recurrent Memory Transformers encompass a heterogeneous family of architectures that extend the Transformer paradigm with explicit, adaptive, and compressible memory mechanisms. Through innovations in memory formulation (tokens, vectors, associative or biologically inspired models), update rules (gating, caching, fast weights), and scheduling (diagonal batching, chunk-wise recurrence), these models unlock efficient, scalable, and high-capacity sequence processing across language, vision, and reinforcement learning domains (Liu et al., 2022, Bulatov et al., 2022, Rodkin et al., 2024, Mia et al., 1 Jan 2026, Sivtsov et al., 5 Jun 2025, Mucllari et al., 2 May 2025).
