
Linformer: Efficient Linear Attention

Updated 20 April 2026
  • Linformer is a Transformer variant that approximates quadratic self-attention with low-rank, learnable projections to achieve linear complexity.
  • It uses projection matrices to reduce compute and memory from O(n²) to O(nk), maintaining accuracy in tasks like language modeling and classification.
  • Empirical evaluations demonstrate that Linformer achieves near-parity with standard Transformers while offering significant speed and memory improvements for long sequences.

Linformer is a linear-complexity variant of the Transformer architecture designed to address the computational and memory bottlenecks of the standard self-attention mechanism. Traditional self-attention incurs O(n^2) time and space with respect to sequence length n, making it inefficient for long sequences. Linformer approximates full self-attention with a low-rank factorization, introducing learned projection matrices that reduce both time and space complexity to O(nk) per layer, where k ≪ n. Empirical studies demonstrate that Linformer achieves accuracy on par with standard Transformers on language modeling and a range of downstream classification tasks while dramatically improving efficiency (Wang et al., 2020). This shift, replacing the full-rank quadratic attention map with provably accurate low-rank approximations, has positioned Linformer as a canonical approach in the landscape of efficient Transformers.

1. Self-Attention and Low-Rank Structure

In standard Transformers, the attention mechanism computes, for each input sequence X \in \mathbb{R}^{n \times d} (sequence length n, hidden size d), query, key, and value matrices:

Q = XW^Q,\quad K = XW^K,\quad V = XW^V,\quad W^{*} \in \mathbb{R}^{d \times d}.

It forms an n \times n attention matrix:

P = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt d}\right)

and outputs PV \in \mathbb{R}^{n \times d}, which requires O(n^2) time and O(n^2) memory due to the dense n \times n matrix P.
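
The following minimal PyTorch sketch makes the n \times n bottleneck concrete; the sizes and weight initialization are illustrative placeholders, not values from the paper:

```python
# Minimal sketch of standard single-head self-attention (illustrative shapes only;
# not the authors' code). The n x n matrix P is what dominates cost for large n.
import math
import torch

n, d = 1024, 64                      # sequence length, head dimension (placeholders)
X = torch.randn(n, d)
Wq, Wk, Wv = (torch.randn(d, d) / math.sqrt(d) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv     # each n x d
P = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)   # n x n -> O(n^2) memory
out = P @ V                          # n x d
print(P.shape, out.shape)            # torch.Size([1024, 1024]) torch.Size([1024, 64])
```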

Linformer is motivated by the empirical observation that P is numerically low-rank: its spectral mass is concentrated in the top k singular values. This can be formalized via the Eckart–Young–Mirsky theorem, which states that the rank-k truncation of the SVD is the best possible rank-k approximation; a rapidly decaying spectrum therefore implies that a small number of components captures the majority of the variance in P. Empirical spectrum analysis of attention matrices from trained Transformers substantiates this low-rank property across practical tasks (Wang et al., 2020).
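
A quick way to probe this observation is to inspect the singular-value spectrum of P. The sketch below builds P from random inputs purely as a stand-in; the analysis in Wang et al. (2020) performs the same computation on attention matrices extracted from trained models.

```python
# Illustrative spectrum check: how much spectral energy do the top-k singular
# values of an attention matrix P carry? (P here is a random stand-in.)
import math
import torch

n, d, k = 1024, 64, 128
Q, K = torch.randn(n, d), torch.randn(n, d)
P = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)   # n x n attention matrix

S = torch.linalg.svdvals(P)                          # singular values, descending
energy = torch.cumsum(S**2, dim=0) / (S**2).sum()    # cumulative spectral energy
print(f"top-{k} singular values carry {energy[k-1].item():.1%} of the spectral energy")
```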

2. Linearization via Learned Projections

Rather than perform explicit SVD (which would be prohibitively expensive per layer), Linformer applies two learned projection matrices per attention head:

E,\; F \in \mathbb{R}^{k \times n}

to reduce the effective sequence dimension:

\bar{K} = EK \in \mathbb{R}^{k \times d}, \qquad \bar{V} = FV \in \mathbb{R}^{k \times d}.

Attention is then computed as:

\bar{P} = \mathrm{softmax}\left(\frac{Q\bar{K}^\top}{\sqrt d}\right) \in \mathbb{R}^{n \times k}

\mathrm{Attention}(Q, K, V) \approx \bar{P}\,\bar{V} \in \mathbb{R}^{n \times d}

This transforms both time and memory cost from O(n^2) to O(nk) per layer. When k is held fixed and much smaller than n (e.g., k = 256 even for sequences of several thousand tokens), this yields practical linear complexity in n.
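
A minimal sketch of this projected attention (variable names, sizes, and the random initialization of E and F are ours for illustration; in the model E and F are trained) shows that no n \times n matrix is ever materialized:

```python
# Minimal sketch of Linformer-style projected attention (illustrative, not the
# reference implementation). E and F compress the length dimension n -> k before
# the softmax, so only an n x k attention matrix is formed.
import math
import torch

n, d, k = 4096, 64, 256
X = torch.randn(n, d)
Wq, Wk, Wv = (torch.randn(d, d) / math.sqrt(d) for _ in range(3))
E = torch.randn(k, n) / math.sqrt(n)       # learned in the actual model
F = torch.randn(k, n) / math.sqrt(n)

Q, K, V = X @ Wq, X @ Wk, X @ Wv           # n x d each
K_bar, V_bar = E @ K, F @ V                # k x d each: projected keys/values
P_bar = torch.softmax(Q @ K_bar.T / math.sqrt(d), dim=-1)   # n x k, O(nk)
out = P_bar @ V_bar                        # n x d
print(P_bar.shape, out.shape)              # torch.Size([4096, 256]) torch.Size([4096, 64])
```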

The projection matrices E and F are treated as learnable parameters. Empirically, sharing E and F across all heads, or even across all layers, incurs negligible loss, simplifying the overall parameterization (Fournier et al., 2021).
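
Packaged as a module, the same computation might look like the sketch below, where a single learnable (E, F) pair is shared across all heads. This is an illustrative implementation under our own naming and initialization choices, not the reference code:

```python
# Sketch of multi-head Linformer-style attention with one (E, F) pair shared
# across heads (headwise sharing). Hyperparameters are illustrative.
import math
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    def __init__(self, n=512, d=512, heads=8, k=128):
        super().__init__()
        assert d % heads == 0
        self.h, self.dh = heads, d // heads
        self.to_qkv = nn.Linear(d, 3 * d, bias=False)
        self.out = nn.Linear(d, d, bias=False)
        # One (k x n) projection pair shared by every head.
        self.E = nn.Parameter(torch.randn(k, n) / math.sqrt(n))
        self.F = nn.Parameter(torch.randn(k, n) / math.sqrt(n))

    def forward(self, x):                              # x: (batch, n, d); n is fixed
        b, n, _ = x.shape
        q, k_, v = self.to_qkv(x).chunk(3, dim=-1)
        split = lambda t: t.reshape(b, n, self.h, self.dh).transpose(1, 2)
        q, k_, v = map(split, (q, k_, v))              # (batch, heads, n, dh)
        k_bar = self.E @ k_                            # (batch, heads, k, dh)
        v_bar = self.F @ v
        attn = torch.softmax(q @ k_bar.transpose(-2, -1) / math.sqrt(self.dh), dim=-1)
        out = attn @ v_bar                             # (batch, heads, n, dh)
        out = out.transpose(1, 2).reshape(b, n, self.h * self.dh)
        return self.out(out)
```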

3. Theoretical Guarantees and Approximation Properties

Linformer’s central theoretical contribution is the justification that, for appropriate k, projecting K and V suffices to approximate standard attention with arbitrarily small error. The main theorems, rooted in randomized linear algebra, establish that with high probability, for

k = O\left(\min\{d \log d,\ \log n\}/\varepsilon^2\right),

there exist (learnable or random) projections E, F \in \mathbb{R}^{k \times n} such that, for every row w of Q:

\left\| \mathrm{softmax}\left(\frac{w(EK)^\top}{\sqrt d}\right) FV - \mathrm{softmax}\left(\frac{wK^\top}{\sqrt d}\right) V \right\| \le \varepsilon \left\| \mathrm{softmax}\left(\frac{wK^\top}{\sqrt d}\right) \right\| \left\| V \right\|,

where E and F can be built from a Johnson–Lindenstrauss (JL)-type random sketch (Wang et al., 2020). Thus the projection dimension k can be taken logarithmic in n, or nearly linear in d and independent of n, for any fixed error tolerance \varepsilon.

The implications are twofold:

  • The natively quadratic attention map is well-approximated by a rank-k structure with O(nk) cost.
  • Learned or random projections are sufficient; explicit SVD is unnecessary (illustrated in the sketch below).
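
The second point can be illustrated directly: instead of an SVD, one can fit E and F by gradient descent to mimic exact attention on a fixed input. The sizes, optimizer, and step count below are arbitrary illustrative choices, not a reproduction of the paper's experiments:

```python
# Small illustrative experiment: learn length projections E, F so that projected
# attention mimics the exact attention output for one random (Q, K, V).
import math
import torch

torch.manual_seed(0)
n, d, k = 256, 32, 32
Q, K, V = (torch.randn(n, d) for _ in range(3))
target = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1) @ V   # exact attention output

E = torch.nn.Parameter(torch.randn(k, n) / n)
F = torch.nn.Parameter(torch.randn(k, n) / n)
opt = torch.optim.Adam([E, F], lr=1e-2)

for step in range(1001):
    approx = torch.softmax(Q @ (E @ K).T / math.sqrt(d), dim=-1) @ (F @ V)
    loss = (approx - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 250 == 0:
        rel = (approx - target).norm() / target.norm()
        print(f"step {step:4d}  relative error {rel.item():.3f}")
```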

4. Empirical Evaluation and Practical Impact

Linformer’s empirical evaluation spans:

  • Masked language modeling (Wiki + BookCorpus): at sequence length n = 512, Linformer with projection dimension k = 128 reaches validation perplexity essentially on par with the standard Transformer, and the gap narrows further as k increases.
  • Downstream tasks (GLUE benchmarks): average dev accuracy closely matches, and in some configurations slightly exceeds, RoBERTa-base, with headwise and even layerwise sharing of the projections incurring essentially no additional loss.
  • Speed and memory: Linformer runs noticeably faster and supports substantially larger batches than the vanilla Transformer at moderate sequence lengths, and the advantage grows with n, reaching order-of-magnitude speed and memory gains for very long inputs (Wang et al., 2020).

Use cases are predominantly in encoder-only transformers for text/document classification, question answering, and any context where sequence length prohibits quadratic attention. Linformer is a straightforward drop-in: no changes to architectural components such as residuals, layer norms, or feed-forward layers are required (Fournier et al., 2021, Tay et al., 2020).
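
As a sketch of this drop-in property, the encoder block below swaps only the attention sublayer for the LinformerSelfAttention module sketched in Section 2, leaving residual connections, layer norms, and the feed-forward network untouched; all names and sizes are illustrative:

```python
# Sketch of the "drop-in" usage: a pre-norm encoder block where only the
# attention sublayer is replaced (reuses LinformerSelfAttention from the
# earlier sketch); everything else is a standard Transformer block.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, n=512, d=512, heads=8, k=128, ffn=2048):
        super().__init__()
        self.attn = LinformerSelfAttention(n=n, d=d, heads=heads, k=k)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, ffn), nn.GELU(), nn.Linear(ffn, d))

    def forward(self, x):                      # x: (batch, n, d); n fixed at construction
        x = x + self.attn(self.norm1(x))       # only this sublayer changed
        return x + self.ffn(self.norm2(x))

x = torch.randn(2, 512, 512)                   # inputs padded/truncated to the fixed n
print(EncoderBlock()(x).shape)                 # torch.Size([2, 512, 512])
```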

5. Strengths, Limitations, and Comparative Landscape

Strengths:

  • Linear per-layer compute and memory in sequence length, enabling efficient scaling to long sequences (thousands of tokens and beyond).
  • Simplicity of implementation—a two-projection modification to the original architecture.
  • Minimal empirical tradeoff in accuracy for typical projection dimensions (k on the order of 128–256) at sequence lengths up to several thousand tokens.

Limitations:

  • The entire approach relies on the low-rank hypothesis: if attention matrices are high-rank, Linformer degrades in representational power.
  • The method requires a fixed maximum sequence length, since E and F must be sized for the target n; all inputs are padded or truncated accordingly.
  • Linformer does not natively support local (sliding window) or content-adaptive sparsity; its approximation is global and static.
  • Causal masking is not trivial, as length projections may mix positions, making Linformer less natural for decoder/generative contexts (Tay et al., 2020, Fournier et al., 2021).

Comparison to Alternatives:

  • Versus BigBird or Longformer, which use local attention and a few global tokens as proxies for content-adaptive sparsity, Linformer is strictly global but achieves greater memory efficiency for fixed-length contexts.
  • Compared to kernel-based linear transformers and SSMs, Linformer’s reliance on static projections limits its dynamic memory capabilities. Recent theoretical work (MetaLA) argues that Linformer cannot selectively forget, since its fixed, input-independent projections admit no dynamic decay or gating, which impacts performance on tasks requiring robust dynamic memory (Chou et al., 2024).

6. Variants, Theory Extensions, and Current Developments

Subsequent literature has explored variants that relax or adapt Linformer’s projection scheme:

  • The projection dimension k may be chosen via data-dependent or random approaches; choosing a conservatively large k removes tuning at the cost of higher compute and memory (Verma, 2020).
  • Fixed vs. learned projections, and projection sharing across heads/layers, affect the parameter/memory tradeoff with typically minimal impact on accuracy.
  • Recent theory unifies Linformer with other linear-complexity attention mechanisms (e.g., kernel-based, state-space models), situating it as a case without dynamic state decay—a property now recognized as important for robust memory integration and universal approximation (Chou et al., 2024).
  • On memory-intensive synthetic benchmarks (e.g., Multi-Query Associative Recall), Linformer and its immediate descendants collapse, while models with learnable decay (e.g., MetaLA) succeed.

7. Summary Table: Linformer in Context

| Model Class | Dynamic Memory | Static Approx. | Parameter Efficiency | Practical Efficiency | Task Example |
| --- | --- | --- | --- | --- | --- |
| Linformer | No | Yes | Moderate | Very high | Document classification |
| State Space Model (S4) | Yes | No | High | Very high | Long-context modeling |
| MetaLA | Yes | Yes | Maximal | Very high | MQAR, LRA, GLUE |

Linformer remains a canonical instance of low-rank, linear-complexity self-attention, distinguished by its simplicity, efficiency, and theoretical assurances under the low-rank attention regime. Its limitations have become focal points for subsequent advances, particularly regarding selective memory and universal function approximation (Chou et al., 2024).
