
Reformer: Efficient Transformer Variant

Updated 19 December 2025
  • Reformer architecture is a Transformer variant that reduces memory and computation by employing locality-sensitive hashing and reversible residual layers.
  • It replaces quadratic self-attention with a bucketed sparse mechanism, enabling efficient processing of long input sequences while balancing performance tradeoffs.
  • Its design shows promising scalability in vision and sequence tasks, though practical benefits may vary compared to dense attention models.

The Reformer architecture is a Transformer variant designed to reduce the computational and memory overhead associated with standard self-attention and deep residual stacking, especially on long input sequences. It incorporates locality-sensitive hashing (LSH) to sparsify attention computation and reversible residual layers to minimize activation storage. These innovations enable efficient scaling to input lengths and model depths that are prohibitive for conventional Transformer designs (Kitaev et al., 2020).

1. Architectural Foundations and Pipeline

Reformer processes input sequences by first embedding tokens and adding positional encodings. For vision tasks, the input image of size $H \times W$ is divided into non-overlapping patches of size $P \times P$, producing $n = (H/P) \cdot (W/P)$ tokens. Each patch is linearly projected into an embedding space of dimension $D$, and lightweight convolutional stem modules (e.g., 3×3 convolution layers) may be used prior to patching to inject local inductive bias (Bellaj et al., 12 Dec 2025). In natural language settings, learnable token embeddings and fixed sinusoidal positional encodings are employed (Kitaev et al., 2020).
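
As a concrete illustration of the patch-tokenization step above, the following is a minimal NumPy sketch; the function name, the random projection `W_proj`, and the positional encodings `pos` are illustrative placeholders rather than code from either paper.

```python
import numpy as np

def patchify(image, P, W_proj, pos):
    """Split an (H, W, C) image into non-overlapping P x P patches, flatten each,
    project to dimension D, and add positional encodings."""
    H, W, C = image.shape
    n_h, n_w = H // P, W // P                        # n = n_h * n_w tokens
    patches = (image[:n_h * P, :n_w * P]
               .reshape(n_h, P, n_w, P, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(n_h * n_w, P * P * C))       # (n, P*P*C) flattened patches
    return patches @ W_proj + pos                    # (n, D) token embeddings

# Example: 224x224 RGB image, 16x16 patches, D = 256 -> n = 196 tokens.
D, P = 256, 16
img = np.random.rand(224, 224, 3)
W_proj = np.random.randn(P * P * 3, D) * 0.02
pos = np.random.randn((224 // P) ** 2, D) * 0.02
tokens = patchify(img, P, W_proj, pos)               # shape (196, 256)
```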

The architecture consists of stacked blocks, each executing attention and feed-forward sublayers. In the original Reformer, blocks are organized in a reversible fashion to enable memory-efficient training. Vision Reformer variants may omit reversible layers, focusing on LSH attention (Bellaj et al., 12 Dec 2025).

2. Locality-Sensitive Hashing Attention

Traditional self-attention has quadratic complexity in the sequence length, $O(n^2)$, restricting scalability for long input sequences. Reformer replaces global attention with a bucketed, sparse attention mechanism based on LSH (Kitaev et al., 2020, Bellaj et al., 12 Dec 2025).

Mechanism

  • Hashing and Bucket Assignment: For each token embedding $x_i \in \mathbb{R}^D$, a random projection matrix $R \in \mathbb{R}^{D \times B}$ generates $u_i = x_i R$. Bucket IDs may be computed as $h(x_i) = \operatorname{argmax}_j |u_{ij}|$ or as the sign vector $h(x_i) = \operatorname{sign}(u_i) \in \{\pm 1\}^B$. Tokens with identical hashes are grouped into the same bucket (Bellaj et al., 12 Dec 2025). The original Reformer uses “angular” LSH: $h(x) = \operatorname{argmax}_{1 \leq i \leq b} [xR; -xR]_i$ (Kitaev et al., 2020).
  • Sorting and Chunking: The sequence is sorted by bucket ID and split into $M \approx n / \text{bucket size}$ contiguous buckets. Optionally, each bucket can also attend to neighboring buckets to mitigate boundary errors (Bellaj et al., 12 Dec 2025).
  • Bucketed Attention: Within each bucket $m$, attention is computed among its tokens $Q^m, K^m, V^m \in \mathbb{R}^{b \times d}$:

$$\operatorname{Attention}^m(Q^m, K^m, V^m) = \operatorname{softmax}\!\left(\frac{Q^m (K^m)^\top}{\sqrt{d}}\right) V^m$$

Outputs are re-stitched to restore the original token ordering (Bellaj et al., 12 Dec 2025). Multiple rounds of hashing can be used to reduce missed similarities at the cost of increased compute (Kitaev et al., 2020).
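
The following single-round, single-head NumPy sketch ties the hashing, sorting, and bucketed-attention steps together; the shared Q = K = V, the fixed chunk size, and the absence of neighboring-bucket attention and multi-round hashing are simplifying assumptions rather than the reference implementation.

```python
import numpy as np

def lsh_attention(x, n_buckets=8, chunk=16):
    """Single-round, single-head LSH attention sketch (shared Q = K = V, no masking)."""
    n, d = x.shape

    # 1) Angular LSH: project onto random directions; bucket = argmax over [xR; -xR].
    R = np.random.randn(d, n_buckets // 2)
    u = x @ R                                               # (n, n_buckets/2)
    buckets = np.argmax(np.concatenate([u, -u], axis=-1), axis=-1)

    # 2) Sort tokens by bucket ID and split the sorted sequence into contiguous chunks.
    order = np.argsort(buckets, kind="stable")
    xs = x[order]

    out_sorted = np.zeros_like(xs)
    for start in range(0, n, chunk):
        blk = xs[start:start + chunk]                       # tokens in one chunk
        scores = blk @ blk.T / np.sqrt(d)                   # bucketed attention scores
        scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out_sorted[start:start + chunk] = probs @ blk       # values = tokens, for brevity

    # 3) Undo the sort to restore the original token ordering.
    out = np.empty_like(out_sorted)
    out[order] = out_sorted
    return out

y = lsh_attention(np.random.randn(256, 64))                 # y has shape (256, 64)
```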

Complexity

  • Hashing and projection: $O(nDB)$ (constant for fixed $D, B$).
  • Sorting buckets: $O(n \log n)$.
  • Attention per bucket: $O(nb)$ for bucket size $b$.
  • Overall time: $O(n \log n)$ with constant bucket size.
  • Attention memory per layer: $O(nb)$ instead of $O(n^2)$ (Bellaj et al., 12 Dec 2025).
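
As a rough, back-of-the-envelope illustration of the per-head attention-memory gap (the values of $n$ and $b$ here are arbitrary):

```python
n, b = 4096, 64
dense_entries = n * n      # full n x n attention matrix: ~16.8M score entries
lsh_entries = n * b        # bucketed attention: ~262K score entries
print(dense_entries / lsh_entries)   # 64.0, i.e. a factor of n / b
```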

3. Reversible Residual Layers and Memory Efficiency

Reformer leverages reversible blocks to limit activation storage. Each block operates on two streams $(x_1, x_2)$:

$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1)$$

Backpropagation inverts these equations to recover intermediate activations on the fly:

$$x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2)$$

Standard Transformers must store the activations of every layer, costing $O(N \cdot L \cdot d_{\text{model}})$ for $N$ layers and sequence length $L$, whereas Reformer only retains the final outputs, costing $O(L \cdot d_{\text{model}})$ (Kitaev et al., 2020).
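A minimal NumPy sketch of the reversible update and its inversion, with random matrices standing in for the attention sublayer $F$ and the feed-forward sublayer $G$ (the shapes and the tanh nonlinearity are arbitrary choices for the demonstration):

```python
import numpy as np

def rev_block_forward(x1, x2, F, G):
    """Reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_block_inverse(y1, y2, F, G):
    """Recover the inputs from the outputs, so activations need not be stored."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

rng = np.random.default_rng(0)
W_f, W_g = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
F = lambda h: np.tanh(h @ W_f)     # stand-in for the attention sublayer
G = lambda h: np.tanh(h @ W_g)     # stand-in for the feed-forward sublayer

x1, x2 = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
y1, y2 = rev_block_forward(x1, x2, F, G)
r1, r2 = rev_block_inverse(y1, y2, F, G)
assert np.allclose(x1, r1) and np.allclose(x2, r2)   # inputs recovered (up to FP error)
```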

Feed-forward chunking further reduces memory: the feed-forward sublayer processes the sequence in $c$ chunks, limiting active storage to $O((L/c) \, d_{ff})$ per chunk (Kitaev et al., 2020).
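
A sketch of feed-forward chunking, assuming a plain two-layer ReLU feed-forward network; because the layer is position-wise, processing the sequence in chunks produces the same output as the unchunked computation while keeping only $(L/c) \times d_{ff}$ intermediate activations live at a time.

```python
import numpy as np

def chunked_ffn(x, W1, b1, W2, b2, n_chunks=4):
    """Position-wise feed-forward layer applied chunk by chunk along the sequence."""
    outs = []
    for blk in np.array_split(x, n_chunks, axis=0):   # split along sequence length L
        h = np.maximum(blk @ W1 + b1, 0.0)            # (L/c, d_ff) intermediate activation
        outs.append(h @ W2 + b2)                      # project back to d_model
    return np.concatenate(outs, axis=0)

# Example: L = 1024 tokens, d_model = 64, d_ff = 256.
L, d_model, d_ff = 1024, 64, 256
x = np.random.randn(L, d_model)
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
y = chunked_ffn(x, W1, b1, W2, b2)                    # identical to the unchunked FFN output
```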

4. Complexity Analysis and Tradeoffs

Comparison of self-attention types is summarized as follows:

Attention Type | Time Complexity | Memory Complexity
Vanilla (ViT) | $O(n^2 d)$ | $O(n^2)$
LSH (Reformer) | $O(n \log n + nd)$ | $O(nb)$ per head

For fixed $d$ and $b$, LSH’s complexity simplifies to $O(n \log n)$. The efficiency benefits become significant only for extremely long sequences; GPU implementations of dense attention are generally superior in short- to mid-length regimes (e.g., $n \approx 256$) (Bellaj et al., 12 Dec 2025). In such cases, hashing, sorting, and bucket-assignment overheads can outweigh the theoretical advantage (Bellaj et al., 12 Dec 2025). For example, Reformer matches ViT’s epoch time only at $n = 784$ (high-resolution medical images), with persistent accuracy deficits (Bellaj et al., 12 Dec 2025).

5. Experimental Findings in Vision and Sequence Modeling

Vision-specific evaluations (Bellaj et al., 12 Dec 2025) and general sequence modeling results (Kitaev et al., 2020) report:

Vision Tasks (Image Patch Tokenization)

  • CIFAR-10 (n=256):
    • Reformer accuracy: 86.95%; ViT: 87.90%
    • Epoch time: Reformer 165s; ViT 130s
  • ImageNet-100 (n=256):
    • Reformer top-1: 74.20%; ViT: 76.70%
    • Epoch time: Reformer 517s; ViT 363s
  • Diabetic Retinopathy (n=784):
    • Reformer top-1: 70.96% (AUC 88.7%); ViT: 74.56% (AUC 90.9%)
    • Epoch time: Reformer 1430s; ViT 1439s
    • Macro-precision/recall/F1: lower for Reformer, especially on minority classes

ViT consistently outperforms Reformer in accuracy and efficiency up to moderately long input lengths. A plausible implication is that the theoretical $O(n \log n)$ advantage is mostly realized at sequence lengths far larger than typical 2D image patch counts ($n \gg 784$).

Sequence Modeling

  • On synthetic 1024-token duplicate-sequence tasks, 4–8 rounds of LSH attention achieve accuracy indistinguishable from full attention (Kitaev et al., 2020).
  • Large-scale models (e.g., 12-20 layers, 64K tokens) can be trained on a single accelerator, yielding competitive bits/dim metrics on enwik8, which are unattainable for standard Transformers due to memory constraints (Kitaev et al., 2020).

6. Approximation Errors, Limitations, and Prospects

LSH attention’s bucketization can misassign semantically similar tokens, undermining performance on tasks requiring fine global context (e.g., minority class recognition or subtle global dependencies in vision) (Bellaj et al., 12 Dec 2025). Increasing LSH rounds decreases these misses, at linearly increased computational cost (Kitaev et al., 2020).

Even when Reformer matches ViT’s runtime at higher $n$, its accuracy often lags, especially for imbalanced or fine-grained recognition. The effectiveness of Reformer is thus contingent upon sufficiently large $n$ for the complexity savings to surpass dense attention’s GPU optimization (Bellaj et al., 12 Dec 2025).

Future directions are proposed: learnable or data-dependent hashing to improve bucketization, hybrid hierarchical tokenization, token pruning, and enhanced strategies for class imbalance (e.g., focal loss adaptation) (Bellaj et al., 12 Dec 2025).

7. Impact and Research Trajectory

Reformer demonstrates that memory- and time-efficient Transformer designs are feasible for long sequences in textual, genomic, or extremely high-resolution visual domains. Its primary impact lies in scaling deep models where quadratic attention is a bottleneck and enabling training of models that surpass hardware memory limitations. However, for vision tasks within currently standard patch-token regimes or moderately long sequences, the practical benefits can be marginal or even negative relative to dense attention baselines (Bellaj et al., 12 Dec 2025).

Current research interest centers on developing hybrid mechanisms that mitigate accuracy tradeoffs, learning data-dependent hash functions, integration with hierarchical or multi-scale tokenization methods, and extending Reformer-type models to video, volumetric imaging, and ultra-long document tasks. These directions target the reconciliation of theoretical scalability with empirical accuracy across diverse domains (Kitaev et al., 2020, Bellaj et al., 12 Dec 2025).
