
RNNFormer Block Architecture

Updated 4 March 2026
  • RNNFormer Block is a neural module that combines transformer self-attention with recurrent gating, enabling block-parallel processing with linear computational complexity.
  • It processes blocks of tokens in parallel using vertical self- and cross-attention while employing horizontal recurrence over state vectors to capture long-range dependencies.
  • Empirical results demonstrate improved language modeling performance and reduced bits-per-token on benchmarks, making it a scalable alternative to standard transformers.

The RNNFormer block, also known as the Block-Recurrent Transformer cell, is a neural architecture that combines the strengths of transformer layers and recurrent architectures for sequence modeling. Conventional transformers have quadratic complexity in sequence length, and standard LSTMs process one token at a time; the RNNFormer instead processes blocks of tokens in parallel while maintaining a recurrent state across blocks. This yields linear complexity in sequence length and makes efficient use of accelerator hardware. The block applies self-attention and cross-attention within and across token blocks, and uses LSTM-style or highway-style gating for state updates, which improves modeling capacity over long contexts and empirical performance on language modeling tasks (Hutchins et al., 2022).

1. RNNFormer Block Architecture

The RNNFormer block operates on non-overlapping blocks of $W$ tokens. At each block index $t$, the model maintains a set of $S$ state vectors $s_t \in \mathbb{R}^{S \times d}$ and receives block token embeddings $e_t \in \mathbb{R}^{W \times d}$. The architecture executes two principal computational passes within every block:

  • Vertical pass (within-block, parallel over tokens):
    • Self-attention among the $W$ token embeddings.
    • Cross-attention from tokens to the persistent state vectors.
    • Concatenation of both attention outputs, followed by projection and a small MLP.
    • (Optionally) A vertical gate and residual connection to update the token outputs.
  • Horizontal pass (across blocks, recurrent over state):
    • Self-attention among the $S$ state vectors.
    • Cross-attention from states to the current token block embeddings.
    • Concatenation, projection, MLP, and an LSTM-style or highway-style gate to produce the next block’s state $s_{t+1}$.

This block is structurally derived by rotating the transformer layer architecture 90°, replacing residuals in the horizontal direction with gates, and unrolling across blocks to propagate long-term dependencies.

2. Precise Formulation and Gating Mechanisms

2.1 Attention Computations

Let $d$ denote the hidden dimension. The model shares key and value projections between the vertical and horizontal passes but maintains separate query projections. At each block $t$, the following operations are performed (see Eqs. (1)–(4) in Hutchins et al., 2022):

  • Token self-attention (vertical):

$$A_e^v = \mathrm{Softmax}\left(\frac{Q_e^v (K_e^v)^\top + \mathrm{RelPosBias}}{\sqrt{d}}\right) V_e^v \in \mathbb{R}^{W \times d}$$

  • Token → State cross-attention (vertical):

$$C_e^v = \mathrm{Softmax}\left(\frac{Q_e^v (K_s^v)^\top}{\sqrt{d}}\right) V_s^v \in \mathbb{R}^{W \times d}$$

  • State self-attention (horizontal):

$$A_s^h = \mathrm{Softmax}\left(\frac{Q_s^h (K_s^h)^\top}{\sqrt{d}}\right) V_s^h \in \mathbb{R}^{S \times d}$$

  • State → Token cross-attention (horizontal):

$$C_s^h = \mathrm{Softmax}\left(\frac{Q_s^h (K_e^h)^\top}{\sqrt{d}}\right) V_e^h \in \mathbb{R}^{S \times d}$$

The outputs are concatenated into $[A_e^v, C_e^v] \in \mathbb{R}^{W \times 2d}$ and $[A_s^h, C_s^h] \in \mathbb{R}^{S \times 2d}$, then projected and passed to MLPs.
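The four attention computations for a single block can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the helper names (`softmax`, `attention`, `concat_rows`), the identity key/value projections, the toy query scalings, and all numeric values are assumptions made for the example. Keys and values derived from the tokens and from the states are each computed once and reused in both passes, and the relative position bias is left at its default of none.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def attention(Q, K, V, bias=None):
    """Softmax((Q K^T + bias) / sqrt(d)) V for lists of row vectors.

    Q is n x d, K and V are m x d, bias (optional) is n x m; returns n x d.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            s = sum(a * b for a, b in zip(q, k))
            if bias is not None:
                s += bias[i][j]
            scores.append(s / math.sqrt(d))
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

def concat_rows(A, C):
    """Concatenate two n x d matrices along the feature axis (n x 2d)."""
    return [a + c for a, c in zip(A, C)]

# Toy block: W = 3 tokens, S = 2 states, d = 4 (all values are made up).
W, S, d = 3, 2, 4
e_t = [[0.1 * (i + j) for j in range(d)] for i in range(W)]
s_t = [[0.2 * (i - j) for j in range(d)] for i in range(S)]

# Assumption: identity key/value projections, shared by both passes.
K_e, V_e = e_t, e_t
K_s, V_s = s_t, s_t
# Assumption: separate (toy) query projections for the two passes.
Q_e_v = [[2.0 * x for x in row] for row in e_t]   # vertical queries
Q_s_h = [[0.5 * x for x in row] for row in s_t]   # horizontal queries

A_e_v = attention(Q_e_v, K_e, V_e)   # token self-attention, W x d
C_e_v = attention(Q_e_v, K_s, V_s)   # token -> state,       W x d
A_s_h = attention(Q_s_h, K_s, V_s)   # state self-attention, S x d
C_s_h = attention(Q_s_h, K_e, V_e)   # state -> token,       S x d

vertical_in = concat_rows(A_e_v, C_e_v)     # W x 2d
horizontal_in = concat_rows(A_s_h, C_s_h)   # S x 2d
print(len(vertical_in), len(vertical_in[0]))      # 3 8
print(len(horizontal_in), len(horizontal_in[0]))  # 2 8
```

In a real model the projections would be learned matrices and the whole block would be computed with batched matrix multiplications; the loops here only make the shapes explicit.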

2.2 Gating Updates

The core state update applies either an LSTM-style gating mechanism or a fixed highway-gate:

  • LSTM-style gating (Eq. 5):

$$\begin{aligned}
\tilde{u}_t &= \tanh(W_z u_t + b_z) \\
i_t &= \sigma(W_i u_t + b_i - 1) \\
f_t &= \sigma(W_f u_t + b_f + 1) \\
c_{t+1} &= f_t \odot c_t + i_t \odot \tilde{u}_t \\
s_{t+1} &= c_{t+1}
\end{aligned}$$

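where $\sigma$ is the sigmoid function, $\odot$ is elementwise multiplication, $u_t$ is the output of the horizontal-pass MLP, and the cell state doubles as the recurrent state ($c_t = s_t$).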

  • Fixed-gate (highway-style, Eq. 6):

$$\begin{aligned}
z_t &= W_z u_t + b_z \\
g_t &= \sigma(W_g u_t + b_g) \\
s_{t+1} &= g_t \odot s_t + (1 - g_t) \odot z_t
\end{aligned}$$
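Both gates can be sketched for a single state vector as follows, following the equations as written in this section. The function names, the `params` dictionary layout, and the all-zero weights are assumptions chosen so the result is easy to check by hand; in the block, the same update would be applied to each of the $S$ state vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(Wm, u, b):
    """Return Wm @ u + b for a d x d matrix (list of rows) and d-vectors."""
    return [sum(w * x for w, x in zip(row, u)) + bi for row, bi in zip(Wm, b)]

def lstm_gate(u, c, p):
    """LSTM-style update: c_next = f * c + i * tanh(W_z u + b_z)."""
    z = [math.tanh(x) for x in affine(p["W_z"], u, p["b_z"])]
    i = [sigmoid(x - 1.0) for x in affine(p["W_i"], u, p["b_i"])]  # offset -1
    f = [sigmoid(x + 1.0) for x in affine(p["W_f"], u, p["b_f"])]  # offset +1
    return [fk * ck + ik * zk for fk, ck, ik, zk in zip(f, c, i, z)]

def highway_gate(u, s, p):
    """Highway-style update: s_next = g * s + (1 - g) * (W_z u + b_z)."""
    z = affine(p["W_z"], u, p["b_z"])
    g = [sigmoid(x) for x in affine(p["W_g"], u, p["b_g"])]
    return [gk * sk + (1.0 - gk) * zk for gk, sk, zk in zip(g, s, z)]

# Assumption: all-zero weights and biases, so only the fixed offsets act.
d = 2
zeros = [[0.0] * d for _ in range(d)]
params = {name: zeros for name in ("W_z", "W_i", "W_f", "W_g")}
params.update({name: [0.0] * d for name in ("b_z", "b_i", "b_f", "b_g")})

state = [1.0, -2.0]
u = [0.3, 0.7]
print(lstm_gate(u, state, params))     # each entry is sigmoid(1) * state entry
print(highway_gate(u, state, params))  # each entry is 0.5 * state entry
```

With zero weights the forget gate equals $\sigma(1) \approx 0.73$ and the input gate equals $\sigma(-1) \approx 0.27$, so the $\pm 1$ offsets bias the LSTM-style cell toward retaining its previous state.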

2.3 Recurrence Relation

The overall cell recurrence is:

$$s_{t+1} = \mathrm{RNNCell}(s_t, e_t), \quad e_{t+1} \overset{\text{vertical pass}}{\longleftarrow} e_{t+1} + \mathrm{MLP}([A_e^v, C_e^v])$$

3. Algorithmic Workflow

The RNNFormer cell is applied to a sequence of length $N$ partitioned into $N/W$ blocks. In pseudocode:

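Inputs: token blocks e₁, …, e_{N/W} ∈ ℝ^{W×d}; initial state s₁ ∈ ℝ^{S×d}
For t = 1 … N/W do
  # Vertical pass (parallel over the W tokens of block t)
  Qev, Kev, Vev ← Proj_e^v(eₜ)
  Ksv, Vsv ← Proj_s^v(sₜ)                        # no state queries are needed here
  Aev = Softmax((Qev·Kevᵀ + RelPosBias)/√d)·Vev  # token self-attention
  Cev = Softmax((Qev·Ksvᵀ)/√d)·Vsv               # token → state cross-attention
  yₜ = eₜ + Gate_v( MLP_v(concat[Aev, Cev]) )    # block output

  # Horizontal (recurrent) pass; per Section 2.1 the keys and values
  # may be shared with the vertical pass, with only the queries separate
  Qsh, Ksh, Vsh ← Proj_s^h(sₜ)
  Keh, Veh ← Proj_e^h(eₜ)                        # no token queries are needed here
  Ash = Softmax((Qsh·Kshᵀ)/√d)·Vsh               # state self-attention
  Csh = Softmax((Qsh·Kehᵀ)/√d)·Veh               # state → token cross-attention
  uₜ = MLP_h( concat[Ash, Csh] )
  s_{t+1} = LSTM_Gate(uₜ, sₜ)                    # cₜ = sₜ; or the fixed gate of Section 2.2
EndFor
Cache s_{N/W+1} for the next segment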
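The control flow of this loop, separated from the cell internals, can be sketched as below. The sketch shows only block partitioning and state carry-over between segments; `split_into_blocks`, `run_segment`, and `toy_cell` are hypothetical names, and `toy_cell` is a deliberately trivial stand-in (a moving average over scalar tokens), not the attention-based cell described above.

```python
def split_into_blocks(tokens, W):
    """Partition a token list into non-overlapping blocks of length W."""
    if len(tokens) % W != 0:
        raise ValueError("segment length must be a multiple of W")
    return [tokens[i:i + W] for i in range(0, len(tokens), W)]

def run_segment(tokens, state, W, cell):
    """Apply `cell` block by block; return per-block outputs and final state.

    `cell(state, block)` must return (block_output, next_state).
    """
    outputs = []
    for block in split_into_blocks(tokens, W):
        y, state = cell(state, block)
        outputs.append(y)
    return outputs, state

def toy_cell(state, block):
    """Stand-in cell (NOT the attention-based cell): the state is an
    exponential moving average of block means, and each output adds the
    incoming state to the token value."""
    mean = sum(block) / len(block)
    next_state = 0.5 * state + 0.5 * mean
    return [x + state for x in block], next_state

tokens = [float(i) for i in range(16)]
W = 4

# One pass over the full sequence...
full_out, full_state = run_segment(tokens, 0.0, W, toy_cell)

# ...matches two half-length segments when the state is cached in between.
out_a, cached = run_segment(tokens[:8], 0.0, W, toy_cell)
out_b, final_state = run_segment(tokens[8:], cached, W, toy_cell)
print(final_state == full_state, out_a + out_b == full_out)  # True True
```

Because the only information passed between blocks is the state, caching the final state of one segment is enough to continue processing the next segment as if the sequence had never been split.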

At inference, block-level processing enables autoregressive decoding with cached recurrent and transformer key–value states.

4. Computational Complexity

For a sequence of length $N$, block size $W$, and $S$ state vectors:

  • Within each block:
    • Token self-attention: $O(W^2 d)$
    • State self-attention: $O(S^2 d)$
    • Cross-attention (token $\leftrightarrow$ state): $O(2SWd)$
  • Total for $N/W$ blocks:

$$\mathcal{O}\bigl((W^2 + S^2 + 2SW)\, d \cdot (N/W)\bigr)$$

With $S \approx W$, the per-block cost is $(W + S)^2 d \approx 4W^2 d$, so the total over $N/W$ blocks is about $4WNd$, i.e. $O(WNd)$, which is linear in $N$. For comparison, a standard transformer layer incurs $O(N^2 d)$ cost.
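The scaling can be checked numerically with a short sketch that evaluates the total-cost formula above against the $N^2 d$ cost of full attention. The function names and the sizes used are hypothetical, and the counts are proportional operation counts rather than measured FLOPs.

```python
def block_recurrent_cost(N, W, S, d):
    """Attention cost (W^2 + S^2 + 2SW) * d per block, times N/W blocks."""
    if N % W != 0:
        raise ValueError("N must be a multiple of W")
    per_block = (W * W + S * S + 2 * S * W) * d
    return per_block * (N // W)

def full_attention_cost(N, d):
    """Cost N^2 * d of full self-attention over the whole sequence."""
    return N * N * d

W, S, d = 512, 512, 1024   # hypothetical sizes, chosen only for illustration
for N in (4096, 8192, 16384):
    rec = block_recurrent_cost(N, W, S, d)
    full = full_attention_cost(N, d)
    print(N, rec, full, full / rec)
```

With $S = W$ the per-block term is $(W + S)^2 d = 4W^2 d$, so the ratio of full-attention cost to block-recurrent cost is $N / (4W)$: doubling $N$ doubles the block-recurrent cost but quadruples the full-attention cost.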

5. Comparison to Transformer and LSTM Architectures

| Model | Parallelism | Memory/State | Complexity |
|---|---|---|---|
| Transformer | Fully parallel | Key–value cache only | $O(N^2 d)$ |
| LSTM | Sequential | Single $d$-dim vector | Linear in $N$, but compresses all context into one vector |
| RNNFormer Block | Block-parallel | $S \gg 1$ vectors | $O(WNd)$, linear in $N$ |

Parameter count and FLOPs for a 12-layer Block-Recurrent model are nearly equivalent to a 13-layer Transformer-XL, as the block-recurrent cell adds the same number of projection/MLP parameters as an additional transformer layer.

The RNNFormer block enables:

  • Persistent state vectors carried across blocks (much larger than LSTM memory).
  • Parallel token processing within blocks.
  • Long-context modeling with computational cost linear in sequence length.

6. Empirical Performance and Scaling

On language modeling benchmarks (PG19, arXiv, GitHub), the "Rec:fixed:skip" Block-Recurrent configuration achieves lower bits-per-token ($\log_2$ perplexity) than Transformer-XL baselines at equal step time and parameter budget.

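Selected results (bits-per-token, lower is better):

| Model | Seg. len | Window | PG19 tok. | arXiv tok. | GitHub tok. | Rel. step time |
|---|---|---|---|---|---|---|
| XL:512 | 512 | 512 | 3.62 | 1.45 | 1.21 | - |
| XL:2048 | 2048 | 2048 | 3.58 | 1.31 | 1.01 | - |
| Slide:13L | 4096 | 512 | 3.58 | 1.42 | 1.17 | 1.00 |
| Rec:fixed:skip | 4096 | 512 | 3.53 | 1.24 | 0.976 | 1.00 |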

Relative to the Slide:13L baseline, adding recurrence reduces bits-per-token from 3.58 to 3.53 on PG19, from 1.42 to 1.24 on arXiv, and from 1.17 to 0.976 on GitHub, all at comparable computation and parameter count. Scaling studies indicate that recurrence yields improvements comparable to doubling the model’s parameter count across multiple model sizes (Hutchins et al., 2022).
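Since bits-per-token is the base-2 logarithm of perplexity, the table values can be converted to perplexities and relative reductions with a few lines of arithmetic. The sketch below uses the Slide:13L and Rec:fixed:skip rows from the table above; the helper names are illustrative.

```python
def perplexity(bits_per_token):
    """Per-token perplexity implied by a bits-per-token score."""
    return 2.0 ** bits_per_token

def relative_reduction(baseline, model):
    """Fractional reduction of `model` relative to `baseline`."""
    return (baseline - model) / baseline

# (Slide:13L, Rec:fixed:skip) bits-per-token pairs from the table above.
results = {
    "PG19": (3.58, 3.53),
    "arXiv": (1.42, 1.24),
    "GitHub": (1.17, 0.976),
}
for name, (slide, rec) in results.items():
    print(f"{name}: bits/token -{relative_reduction(slide, rec):.1%}, "
          f"perplexity {perplexity(slide):.2f} -> {perplexity(rec):.2f}")
```

A small absolute change in bits-per-token corresponds to a multiplicative change in perplexity, which is why the two measures should not be quoted interchangeably.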

7. Contextual Impact and Implementation

The RNNFormer block achieves efficient long-context modeling without the quadratic overhead of full-sequence attention and can be implemented by modifying conventional transformer layer code, which suggests a practical adoption route for large-scale sequence modeling tasks that require persistent memory and linear scaling. The approach was shown to be effective on book, scientific paper, and source code corpora, and an open-source implementation is available (Hutchins et al., 2022).

References (1)
