
Recurrent Memory Transformers

Updated 3 February 2026
  • Recurrent Memory Transformers are neural architectures that augment standard Transformers with explicit, persistent memory tokens and recurrent update strategies.
  • They employ innovative memory update rules—such as gated recurrence, associative updates, and cache compression—to efficiently manage long sequences.
  • These models demonstrate high performance in language, vision, and reinforcement learning by reducing computational complexity and enabling long-context processing.

A Recurrent Memory Transformer is a neural sequence model architecture that augments the standard Transformer with explicit, persistent memory structures that propagate information recurrently across time, segments, or layers. This class of models addresses the context-length, efficiency, and information bottleneck limitations of standard attention mechanisms by introducing explicit recurrent connections, memory tokens/vectors, or hidden-state compression and retention strategies. Recent advancements span gated recursion, segment-wise memory tokens, fast associative update rules, and adaptive memory compression, yielding models capable of long-context reasoning across diverse domains.

1. Architectural Foundations and Core Mechanisms

Recurrent Memory Transformers (RMTs) generalize segment-level recurrence by introducing explicit memory pathways connecting different time steps or segments in the sequence. The canonical design splits a long input sequence into fixed-length segments, augmenting each segment with a compact set of memory vectors (e.g., tokens, caches, or associative matrices) that serve as a cross-segment, persistent state. These memory blocks are updated in a recurrent fashion—either at each forward pass or via specialized update rules—enabling information to propagate beyond the local attention window.

In "Event-based Monocular Dense Depth Estimation with Recurrent Transformers," the EReFormer instantiates this paradigm with a Gate Recurrent Vision Transformer (GRViT) unit: at each time step tt, a hidden-state ht−1h_{t-1} (of the same shape as the feature map ftf_t) is propagated forward and updated via an attention-based and gated residual mechanism:

$$Q_t = f_t W^f_Q + h_{t-1} W^h_Q, \quad K_t = f_t W^f_K + h_{t-1} W^h_K, \quad V_t = f_t W^f_V + h_{t-1} W^h_V$$

Attention is computed using an ELU-based linear attention operator, then combined with the previous hidden state by an element-wise gate:

$$U_t = \sigma([f_t; h_{t-1}] W_p), \quad h_t = (1 - U_t) \circ h_{t-1} + U_t \circ A_t$$

where $A_t$ is the multi-head attention output and $U_t$ is a channel-wise gate. This mechanism allows the recurrent state to summarize past temporal dynamics while maintaining computational tractability (Liu et al., 2022).
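
A minimal PyTorch-style sketch of this gated recurrent update is given below, using a single head for brevity and treating the feature map as a flat sequence of tokens. The module name, the exact ELU linear-attention normalization, and the layer shapes are illustrative assumptions rather than the EReFormer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedRecurrentAttention(nn.Module):
    """Sketch of a GRViT-style unit: linear attention over f_t and h_{t-1},
    followed by an element-wise gate that mixes the attention output into h_t."""

    def __init__(self, dim):
        super().__init__()
        # Separate projections for the current features f_t and hidden state h_{t-1}
        self.q_f, self.q_h = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k_f, self.k_h = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_f, self.v_h = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)  # W_p acting on [f_t; h_{t-1}]

    def forward(self, f_t, h_prev):
        # Q_t, K_t, V_t mix current features with the propagated hidden state
        q = self.q_f(f_t) + self.q_h(h_prev)
        k = self.k_f(f_t) + self.k_h(h_prev)
        v = self.v_f(f_t) + self.v_h(h_prev)

        # ELU-based linear attention: phi(q) (phi(k)^T v), normalized per query
        phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bnd,bne->bde", phi_k, v)
        norm = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)).clamp(min=1e-6)
        a_t = torch.einsum("bnd,bde->bne", phi_q, kv) / norm.unsqueeze(-1)

        # Channel-wise gate U_t decides how much of A_t replaces h_{t-1}
        u_t = torch.sigmoid(self.gate(torch.cat([f_t, h_prev], dim=-1)))
        return (1 - u_t) * h_prev + u_t * a_t  # h_t
```

Calling such a module once per time step with the new feature map and the previous hidden state yields the recurrent summary $h_t$ used downstream.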

Segment-level recurrence is also the basis of the RMT architecture:

  • The long sequence is divided into segments $S_1, S_2, \ldots$ of length $L$.
  • Each segment $\tau$ receives as input a set of $M$ memory vectors $m_{\tau}$ from the prior segment's output memory.
  • The standard Transformer layers operate over the concatenation $[m_{\tau}; \text{segment tokens}; m_{\tau}]$.
  • After processing, the output memory for the next segment is extracted as the last $M$ rows of the output (Bulatov et al., 2022).

This architecture provides a direct, differentiable pathway for transferring information across arbitrarily long sequences via bounded, recurrent memory.
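
The segment-level recurrence itself is a short processing loop. The sketch below is schematic, assuming an arbitrary `transformer` module that maps an embedded sequence to same-shape outputs; the read/write memory layout mirrors the concatenation described above, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

def rmt_forward(transformer: nn.Module, segments, memory):
    """Schematic RMT pass.

    segments: list of (batch, L, d) token-embedding tensors.
    memory:   (batch, M, d) tensor of memory-token embeddings.
    """
    outputs = []
    for seg in segments:
        # Concatenate [read memory; segment tokens; write memory]
        x = transformer(torch.cat([memory, seg, memory], dim=1))
        m = memory.shape[1]
        outputs.append(x[:, m:-m, :])   # token outputs for this segment
        memory = x[:, -m:, :]           # last M rows become the next segment's memory
    return outputs, memory

# Toy usage with a generic encoder (hypothetical sizes)
d, M, L = 64, 4, 16
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2
)
segments = [torch.randn(2, L, d) for _ in range(3)]
outputs, memory = rmt_forward(encoder, segments, torch.zeros(2, M, d))
```

Because the memory tensor is carried through the loop, information (and, during training, gradients) can propagate across segment boundaries through a bounded recurrent state.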

2. Memory Update Rules and Compression Strategies

The efficiency and scalability of RMTs are largely determined by how memory is updated and compressed over time. Several strategies have emerged:

  • Simple Overwrite: In early RMT variants, the output memory is a full overwrite of the previous state, i.e., $m_{\tau+1} :=$ the write memory produced by segment $\tau$ (Bulatov et al., 2022). This approach is simple but can lead to forgetting over very long histories.
  • Gated/GRU-like Update: Gated updates, as in the MART model and EReFormer GRViT, allow adaptive mixing of old and new memory. For MART:

$$M_t^\ell = Z_t^\ell \odot M_{t-1}^\ell + (1 - Z_t^\ell) \odot C_t^\ell$$

with $Z_t^\ell$ a learned gate determined by both the old memory and the new summary (Lei et al., 2020).

  • Associative and Delta-Rule Updates: In ARMT, each layer maintains a key–value associative matrix $A_s^l$; memory is written by a fast-weight $v \otimes \phi(k)$ delta-rule with a correction term to prevent catastrophic forgetting (Rodkin et al., 2024). Reads apply content-based attention recall at fixed computational cost, enabling scaling to millions of tokens without loss of information.
  • Cache Compression Policies: "Transformers are Multi-State RNNs" models the autoregressive Transformer as an unbounded multi-state RNN, where the "hidden state" is the key–value cache growing linearly with time. To bound memory, cache compression policies discard or select which token states to retain. The TOVA (Token Omission via Attention) policy retains the $k$ most-attended tokens, as measured by the current query's softmax attention over all candidates, and consistently outperforms baseline FIFO and fixed-window schemes while using as little as $\frac{1}{8}$ of the original cache size (Oren et al., 2024); a minimal sketch of this policy appears after this list.
  • Biologically Inspired Compression: RMAAT uses astrocyte-inspired adaptive compression, where a retention factor $r_t$ derived from simulated long-term potentiation determines how much of the recurrent memory is preserved per segment. The update is

$$\mathrm{mem}_{t+1} = r_t(T)\, \widetilde{\mathrm{mem}}_{t+1}$$

with $r_t$ varying systematically to enforce long-term compression and prevent unbounded growth (Mia et al., 1 Jan 2026).
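
To make the cache-compression idea concrete, the following single-head sketch implements a TOVA-like pruning step as described in the list above: only the $k$ cache entries that receive the highest softmax attention from the current query are retained. The function name, tensor shapes, and scaling are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def tova_prune(keys, values, query, k):
    """Keep the k most-attended key-value cache entries (TOVA-style policy sketch).

    keys, values: (n_cache, d) tensors; query: (d,) current query vector.
    """
    n, d = keys.shape
    if n <= k:                      # nothing to prune yet
        return keys, values
    # Softmax attention of the current query over all cached token states
    scores = F.softmax(keys @ query / d ** 0.5, dim=0)
    keep = torch.topk(scores, k).indices.sort().values   # preserve original order
    return keys[keep], values[keep]
```

In a decoding loop this would be applied after appending the newest key–value pair, bounding the cache at $k$ entries regardless of sequence length.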

3. Integration with Transformer Architectures

Several integration strategies have been developed to merge explicit recurrent memory with Transformer-style self-attention:

  • Memory Tokens in Input/Output: RMT (Bulatov et al., 2022), RATE (Cherepanov et al., 2023), and ARMT (Rodkin et al., 2024) introduce explicit memory tokens as part of each segment’s input and output, enabling both write and read access via self-attention. In multi-head self-attention, memory tokens function as regular attention participants, for example (see the sketch after this list):

$$\mathrm{Attention}\big(Q,\ [K_{\text{seq}}; K_{\text{mem}}],\ [V_{\text{seq}}; V_{\text{mem}}]\big)$$

  • Hierarchical and Layerwise Memory Propagation: ARMT maintains an associative memory in each layer, updated in parallel across layers and segments, which allows layer-specific specialization and efficient aggregation of context (Rodkin et al., 2024). The original RMT propagates memory only vertically (segment-to-segment), but ARMT extends this to all layers (horizontal propagation), interfacing with advanced scheduling (Diagonal Batching, see below) (Sivtsov et al., 5 Jun 2025).
  • Hybrid Fusion and Cross-Attention: The EReFormer model fuses the outputs of Swin-Transformer blocks at each stage with spatial transformer fusion (STF), employing two-stage cross-attention blocks that act as high-bandwidth skip connections between encoder and decoder blocks, coupled with temporally recurrent GRViT units to summarize sequence history (Liu et al., 2022).
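
Following the first bullet above, here is a minimal single-head sketch of memory tokens participating in attention alongside ordinary sequence tokens; the concatenation mirrors the formula given there, and all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def attend_with_memory(q, k_seq, v_seq, k_mem, v_mem):
    """Scaled dot-product attention where queries read both sequence and memory states.

    q: (n_q, d); k_seq, v_seq: (n_seq, d); k_mem, v_mem: (n_mem, d).
    """
    k = torch.cat([k_seq, k_mem], dim=0)   # [K_seq; K_mem]
    v = torch.cat([v_seq, v_mem], dim=0)   # [V_seq; V_mem]
    attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```

Writes happen symmetrically: the memory-token positions in the output carry the updated memory passed to the next segment.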

4. Memory Efficiency, Complexity, and Scaling

Substituting self-attention over the full context with recurrent memory reduces computational cost from quadratic time and memory in sequence length to linear, or even $O(1)$ per token for fixed-size memories:

| Model | Memory Complexity | Time Complexity | Comments |
| --- | --- | --- | --- |
| Transformer (full) | $O(N \cdot d)$ | $O(N^2 \cdot d)$ | Scales poorly with long sequences |
| RMT/ARMT (segmental) | $O(K \cdot d + m \cdot d)$ | $O(N \cdot L \cdot d)$ | $K$ = segment size, $m$ = memory size |
| ARMT (constant memory) | $O(1)$ per token | $O(1)$ per token | Holds with fixed segment and memory sizes |
| CRT (single-vector) | $O(1)$ per segment | $O(n^2 \cdot L + n \cdot d_m)$ | Only a single memory vector is carried forward |

Diagonal Batching (Sivtsov et al., 5 Jun 2025) further unlocks parallelism in RMTs, eliminating the segment-wise sequential bottleneck entirely. It reorders computation so that all layer–segment pairs with the same sum index can be processed in parallel, reducing the number of effective synchronization steps over an $S \times L$ grid from $S \times L$ to $S + L - 1$. This enables a $3.3\times$ speedup over full attention and constant memory cost even at a 131,072-token context (Sivtsov et al., 5 Jun 2025).
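
A schematic of the diagonal schedule follows, under the assumption that a (segment, layer) cell depends only on the previous layer of the same segment and the same layer's memory from the previous segment, consistent with the layerwise propagation described above; `diagonal_schedule` and the toy grid size are illustrative, not the released implementation.

```python
def diagonal_schedule(num_segments, num_layers):
    """Group (segment, layer) cells by their sum index (anti-diagonals).

    Cell (s, l) depends on (s, l - 1) and (s - 1, l), so cells within one
    diagonal are mutually independent; S * L sequential steps collapse into
    S + L - 1 parallel waves.
    """
    waves = []
    for d in range(num_segments + num_layers - 1):
        waves.append([(s, d - s) for s in range(num_segments) if 0 <= d - s < num_layers])
    return waves

# Example: a 4-segment x 3-layer grid yields 6 waves instead of 12 sequential steps
for wave in diagonal_schedule(4, 3):
    print(wave)   # each wave could be dispatched as one batched kernel
```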

5. Empirical Performance Across Domains

Extensive empirical evaluation demonstrates the efficacy of recurrent-memory mechanisms in diverse tasks:

  • Algorithmic Reasoning: RMT achieves perfect generalization on Copy, Reverse, and Quadratic tasks with up to 9 input segments, where baselines and Transformer-XL fail (Bulatov et al., 2022).
  • Language Modeling: CRT achieves state-of-the-art perplexity on WordPTB and WikiText-103, with lower latency and parameter count relative to Transformer-XL (Mucllari et al., 2 May 2025). In PG-19 and WikiText-103, adding simple windowed recurrence to GPT-2 yields significant perplexity reduction at no extra FLOPs (Yoshida et al., 2020).
  • Vision: Cached Transformers with gated recurrent cache attention improve ImageNet-1k top-1 accuracy from 79.9% to 81.3% (ViT-S), as well as achieving gains in detection and segmentation (Zhang et al., 2023). EReFormer achieves leading dense event-based depth estimation on synthetic and real-world datasets (Liu et al., 2022).
  • Long-Range Reasoning: ARMT attains state-of-the-art results on BABILong multi-task long-context retrieval, maintaining 79.9% accuracy at 50 million tokens and surpassing RMT and Mamba (Rodkin et al., 2024).
  • Temporal Sequence Tasks: RMAAT with adaptive memory retention and astromorphic attention achieves top LRA performance with lower peak memory and higher throughput than all prior RMTs (Mia et al., 1 Jan 2026).
  • Dialogue, Summarization, RL: Recurrent memory mechanisms enable significant gains in dialogue modeling (perplexity ~15.2 vs. ~18.0), document understanding (ROUGE-L +2.3), and offline RL in POMDPs (Kashyap, 1 Jul 2025; Cherepanov et al., 2023).

6. Domain-Specific RMT Variants

A variety of RMT specializations address domain constraints:

  • Vision: EReFormer (Liu et al., 2022) processes asynchronous event camera streams, fusing spatial and temporal context via GRViT and multi-level transformer encoding.
  • Reinforcement Learning: RATE (Cherepanov et al., 2023) propagates memory tokens for each segment, yielding robust policy learning in POMDPs by extending effective context.
  • Language and Document Tasks: PARA-COMET (Gabriel et al., 2020) recurrently buffers past inferences as external memory to enforce discourse-wide coherence in narrative commonsense generation. "Learn To Remember" introduces explicit memory "read" and "write" attentions for document-level machine translation, consistently outperforming context-free and simple cache baselines (Feng et al., 2022).

7. Limitations and Open Research Problems

Despite their strengths, RMTs face several practical constraints: segment-wise recurrence imposes a sequential dependency that limits parallelism (only partially alleviated by scheduling approaches such as Diagonal Batching), fixed-size memories can forget information over very long histories, and memory capacity must be chosen in advance.

Open problems include adaptive memory sizing, hierarchical or multi-scale recurrence, optimally learnable memory update gates, and combining RMTs with efficient linear/sparse attention kernels for further scaling.


In sum, Recurrent Memory Transformers encompass a heterogeneous family of architectures that extend the Transformer paradigm with explicit, adaptive, and compressible memory mechanisms. Through innovations in memory formulation (tokens, vectors, associative or biologically inspired models), update rules (gating, caching, fast weights), and scheduling (diagonal batching, chunk-wise recurrence), these models unlock efficient, scalable, and high-capacity sequence processing across language, vision, and reinforcement learning domains (Liu et al., 2022, Bulatov et al., 2022, Rodkin et al., 2024, Mia et al., 1 Jan 2026, Sivtsov et al., 5 Jun 2025, Mucllari et al., 2 May 2025).
