
Linear Self-Attention Approximation

Updated 21 September 2025
  • Linear self-attention approximation is a set of techniques that reformulate standard quadratic self-attention to achieve linear scaling for long sequences.
  • It leverages methods such as low-rank projections, kernel feature maps, and randomized features to balance computational efficiency with model accuracy.
  • Practical implementations of these techniques enable efficient transformer deployments in diverse domains including NLP, vision, and time series analysis.

Linear self-attention approximation refers to a class of techniques and mathematical frameworks that reformulate or approximate the quadratic-complexity self-attention mechanism in transformers to achieve time and/or space complexity that scales linearly (or near-linearly) with the sequence length. The fundamental motivation underlying these approaches is to enable efficient modeling and deployment of transformer architectures on long sequences in domains such as natural language, vision, audio, time series, and beyond, where quadratic scaling would otherwise be prohibitive.

1. Motivation and Problem Statement

Standard self-attention in transformers computes attention weights for every pair of tokens in the input sequence, resulting in $\mathcal{O}(n^2)$ time and memory complexity for a sequence of length $n$. While this quadratic scaling is manageable for moderate sequence lengths, modern applications in language modeling, document processing, vision, or genomics routinely require handling thousands to tens of thousands of tokens. This cost limits not only training and inference efficiency but also the model’s ability to scale to longer contexts. Linear self-attention approximation seeks to address this bottleneck by leveraging mathematical or empirical properties—such as low-rank structure, kernelization, quantization, local-global decompositions, or randomized projections—to reduce the complexity to $\mathcal{O}(n)$ or $\mathcal{O}(n d^k)$, $k \ll n$, while ideally preserving the representational power and predictive performance of the original attention mechanism (Wang et al., 2020, Verma, 2020, Verma, 2021).

2. Foundational Methods and Mathematical Principles

Linear self-attention approximations are grounded in several key mathematical strategies:

  • Low-rank approximation: Based on empirical spectral analysis, the self-attention matrix is often observed to be numerically low-rank (Wang et al., 2020), enabling factorization into lower-dimensional projections. The Linformer, for instance, introduces learned projections $E$ and $F$ to map the $n \times d$ key and value matrices into $k \times d$ spaces, where $k \ll n$, justifying this by tools such as the Johnson–Lindenstrauss lemma.

| Approach  | Core Principle             | Complexity          |
|-----------|----------------------------|---------------------|
| Linformer | Low-rank projection        | $\mathcal{O}(nk)$   |
| Nyström   | Column-based matrix approx | $\mathcal{O}(nc^2)$ |

  • Kernel feature maps: Softmax attention can be linearized by replacing the exponential similarity with a feature map $\phi$ such that

$$\exp(q^\top k) \approx \phi(q)^\top \phi(k)$$

This allows reordering of the computation so that sums over the sequence can be shared between all queries, yielding $\mathcal{O}(n)$ scaling for each kernelizable attention operation, as in the sketch below.
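
A minimal PyTorch sketch of this reordering, assuming a single head, no masking, and a simple ELU-plus-one feature map (the choice of $\phi$ varies across published methods; this is an illustration of the linearization trick, not any paper's reference implementation):

```python
# Kernelized linear attention: compute phi(Q) (phi(K)^T V) instead of softmax(Q K^T) V.
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    # Any non-negative feature map works; ELU + 1 keeps the normalizer positive.
    return F.elu(x) + 1.0

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, dim); cost is O(n * d^2) rather than O(n^2 * d).
    q, k = phi(q), phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)              # sum over keys, shared by all queries
    norm = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))   # per-query normalizer
    return torch.einsum("bnd,bde->bne", q, kv) / (norm.unsqueeze(-1) + eps)

# Example usage:
# out = linear_attention(torch.randn(2, 4096, 64),
#                        torch.randn(2, 4096, 64),
#                        torch.randn(2, 4096, 64))
```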

  • Random projection and randomized features: Approximating the exponential kernel via random Fourier or positive random features enables linear attention computations with controlled error (Zheng et al., 2022). LARA (Linear Randomized Attention) extends random feature attention (RFA) by using multiple, input-adaptive proposals for improved expressiveness while remaining linear in complexity.
  • Polynomial (Taylor) approximation: The exponential Softmax kernel can be replaced by a truncated Taylor series expansion (Keles et al., 2022, Nauen et al., 5 Mar 2024), resulting in a self-attention mechanism amenable to linear-time computation by leveraging tensor product and factorization tricks. For instance:

$$e^{q^\top k} \approx 1 + q^\top k + \tfrac{1}{2}(q^\top k)^2 + \cdots$$

The TaylorShift approach (Nauen et al., 5 Mar 2024) demonstrates that under this approximation, full token-to-token interactions can be computed in linear time for long sequences; the feature-map sketch below illustrates the underlying identity.
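
As a sketch (assumed single-head setting, omitting the $1/\sqrt{d}$ scaling; this is not the TaylorShift reference code), the truncated second-order expansion can be realized by the explicit feature map $\phi(x) = [1,\; x,\; \mathrm{vec}(xx^\top)/\sqrt{2}]$, since then $\phi(q)^\top \phi(k) = 1 + q^\top k + \tfrac{1}{2}(q^\top k)^2$, and attention is computed with the same linear-time reordering as above:

```python
# Second-order Taylor feature map: phi(q)·phi(k) = 1 + q·k + (q·k)^2 / 2.
# The expanded feature dimension is 1 + d + d^2, so all contractions remain linear in n.
import torch

def taylor_features(x):
    # x: (batch, n, d) -> (batch, n, 1 + d + d*d)
    b, n, d = x.shape
    ones = torch.ones(b, n, 1, dtype=x.dtype, device=x.device)
    outer = torch.einsum("bni,bnj->bnij", x, x).reshape(b, n, d * d) / (2 ** 0.5)
    return torch.cat([ones, x, outer], dim=-1)

def taylor_attention(q, k, v):
    fq, fk = taylor_features(q), taylor_features(k)
    kv = torch.einsum("bnf,bne->bfe", fk, v)
    # 1 + t + t^2/2 > 0 for all t, so the normalizer is strictly positive.
    norm = torch.einsum("bnf,bf->bn", fq, fk.sum(dim=1))
    return torch.einsum("bnf,bfe->bne", fq, kv) / norm.unsqueeze(-1)
```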

  • Quantization and histograms: Approaches such as LISA (Wu et al., 2021) utilize vector quantization and codeword histograms to cluster tokens and aggregate attention interactions, achieving linear complexity via lookup and histogram accumulation rather than per-token computation.
  • Statistical distribution matching: Linear Log-Normal Attention (LLN) (Nahshan et al., 2023) explicitly matches the log-normal distributional properties and entropy/concentration of the attention matrix, leading to exponential feature maps parameterized to fit the variance structure of softmax attention.

3. Representative Algorithms and Architectural Modifications

Several representative linear self-attention mechanisms exemplify these approaches:

  • Linformer (Wang et al., 2020): Projects keys and values using learned matrices $E, F$ so self-attention becomes $\mathrm{softmax}\big(Q(EK)^\top\big) F V$, reducing time and memory to $\mathcal{O}(nk)$ for sequences of length $n$ (a minimal module sketch appears after this list).
  • Spectral Shifting (Verma, 2021): Refines Nyström approximations by representing the (softmax) attention matrix as $K \approx \tilde{C} U^{SS} \tilde{C}^\top + \delta^{SS} I$, providing stronger error guarantees on the linearized operation.
  • LISA (Wu et al., 2021): Assigns each token to codebook entries, aggregates values as histograms, and replaces token-level interactions by codeword-histogram-level computations.
  • Random Feature Attention and LARA (Zheng et al., 2022): Use random features to linearize the kernel, with LARA leveraging multiple region-adaptive proposals to overcome bias and achieve accuracy close to true softmax attention.
  • TaylorShift (Nauen et al., 5 Mar 2024): Implements a second-order (or higher) Taylor expansion of the softmax, rewrites the quadratic term using tensor products (as $[(QK^\top)^{\odot 2}]_{ij} = [Q^{\otimes 2}]_i\,[K^{\otimes 2}]_j^\top$), and organizes the attention computation as a series of contractions for $\mathcal{O}(n)$ scaling.
  • LLN Attention (Nahshan et al., 2023): Uses exponential feature maps $\exp(\alpha q)$, $\exp(\beta k)$ with tunable parameters $\alpha, \beta$ determined by moment matching to ensure that the variance and concentration of LLN attention mimic those of canonical softmax.
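
A minimal single-head sketch of the Linformer-style projection referenced above (hyperparameters, initialization, and the absence of multi-head or dropout machinery are illustrative assumptions, not the authors' reference implementation):

```python
# Linformer-style attention: learned projections E, F compress the length axis
# from n down to a fixed k, so the attention map is (n x k) instead of (n x n).
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    def __init__(self, dim: int, seq_len: int, k: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)  # key projection
        self.F = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)  # value projection
        self.scale = dim ** -0.5

    def forward(self, x):                                  # x: (batch, n, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        k = torch.einsum("kn,bnd->bkd", self.E, k)         # (batch, k, dim)
        v = torch.einsum("kn,bnd->bkd", self.F, v)         # (batch, k, dim)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)  # (batch, n, k)
        return attn @ v                                    # O(n k) time and memory
```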

4. Theoretical Guarantees and Trade-offs

The deployment of linear self-attention approximations comes with theoretical trade-offs:

  • Approximation error: There is a fundamental trade-off between computational efficiency and fidelity to full self-attention. Low-rank, random-feature, and polynomial schemes can control error by selecting the rank $k$, the number of random features, or the Taylor order $p$, respectively (see the small numerical illustration after this list). However, rigorous lower bounds under the Strong Exponential Time Hypothesis (SETH) suggest that any highly accurate attention computation will necessarily have quadratic complexity in the worst case (Keles et al., 2022).
  • Expressivity: Some approximations, especially naive kernelizations or low-rank projections with small $k$, may dilute sharp (spiky) attention distributions, potentially degrading model expressivity for tasks requiring precise token-to-token interactions (Feng, 10 Jan 2025).
  • Statistical matching: Mechanisms such as LLN Attention attempt to address degradation by matching not only the pointwise moments but also the entropy and spectral gap of the attention distribution, thus better emulating the original attention’s concentration properties (Nahshan et al., 2023).
  • Universality: Deep theoretical work demonstrates that even shallow attention layers (with suitable parameterizations and grid-like interpolation) suffice for universal function approximation, both for continuous and Lebesgue integrable functions (Hu et al., 22 Apr 2025, Liu et al., 28 Apr 2025). These findings clarify that the expressive power of transformers can be retained even with minimal or modified attention mechanisms.
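
As a small, self-contained illustration of the rank/error trade-off (a toy setup with random queries and keys; attention matrices from trained models are typically even closer to low rank), one can measure how the best rank-$k$ truncation of an exact softmax attention matrix behaves as $k$ grows:

```python
# Approximation-error vs. rank demo: truncated SVD of an exact softmax attention matrix.
import torch

torch.manual_seed(0)
n, d = 512, 64
q, k = torch.randn(n, d), torch.randn(n, d)
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)          # exact (n x n) attention matrix

U, S, Vh = torch.linalg.svd(attn)
for rank in (8, 32, 128):
    approx = (U[:, :rank] * S[:rank]) @ Vh[:rank]          # best rank-`rank` approximation
    rel_err = (torch.linalg.norm(attn - approx) / torch.linalg.norm(attn)).item()
    print(f"rank {rank:4d}: relative Frobenius error {rel_err:.3f}")
```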

5. Practical Implementations and Empirical Evaluations

Practical evaluations consistently show that linear self-attention methods can deliver:

  • Substantial reductions in runtime and memory consumption: Across tasks such as masked language modeling, natural language inference, recommendation, and image and speech classification, methods such as Linformer (Wang et al., 2020) and TaylorShift (Nauen et al., 5 Mar 2024) achieve 1.5×–20× (or higher) speedups and up to 78× memory savings versus standard transformers for long sequences.
  • Competitive predictive performance: When hyperparameters (e.g., projection dimension $k$, Taylor order, codebook size) are suitably chosen, models using approximated attention match or slightly exceed the performance of standard (quadratic) transformers on NLP, vision, and recommendation tasks (Wang et al., 2020, Wu et al., 2021, Nahshan et al., 2023, Nauen et al., 5 Mar 2024). Performance gaps typically appear only when extreme compression or aggressive approximation is used.
  • Efficient adaptation to various domains: Linear self-attention variants are implemented in diverse domains: speech recognition via global summaries (SummaryMixing) (Parcollet et al., 2023), efficient joint classification for hyperspectral and LiDAR data with plug-and-play linear attention (Feng et al., 2021), and vision transformers with linear-angular kernels (Castling-ViT) (You et al., 2022).

Empirical benchmarks underline the importance of architectural flexibility: parameter sharing, adaptive projection/kernels, and hybrid integration (e.g., combining global and local operators) can further boost both accuracy and efficiency.

6. Extensions, Limitations, and Future Directions

Linear self-attention research continues to evolve in several promising directions:

  • Input-dependent and adaptive projections: Empirical studies reveal that different attention heads and layers may require distinct ranks or projection dimensions (Wang et al., 2020); adaptively determining projection parameters based on data or layer depth may improve accuracy/compression trade-offs.
  • Generalized and trainable kernels: Universal approximating kernels instantiated by neural networks (e.g., feedforward or GLU layers) extend linear attention beyond pre-specified functional forms (Yorsh et al., 2022, Nahshan et al., 2023); a minimal sketch of a learned feature map follows this list.
  • Extensions to in-context and algorithmic learning: Recent work elucidates how even simple modifications (e.g., addition of biases or mask-and-move operations in attention) allow transformer blocks to act as flexible algorithmic primitives, e.g., implementing skip connections or batch gradient descent steps (Hagiwara, 31 Mar 2025). These insights help explain in-context learning abilities in transformers (Hu et al., 22 Apr 2025, Liu et al., 28 Apr 2025).
  • Robustness and integration with new modalities: Hybrid schemes combine linear global modules with local refinement (e.g., block-diagonal augmentation in LLN+Diag (Nahshan et al., 2023) or depthwise convolution and masked attention in Castling-ViT (You et al., 2022)) to preserve both long-range and local dependencies.
  • Lower bounds and ultimate limitations: Theoretical results clarify that, absent additional structural assumptions, no linear (in $n$) attention scheme can guarantee arbitrarily low error in general—there is an exponential dependence on approximation parameters (polynomial order, rank, etc.) if high accuracy is required (Keles et al., 2022).
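
A minimal sketch of the learned-kernel idea (the MLP width, GELU activation, and softplus positivity constraint are illustrative assumptions): a small network produces non-negative features that could replace the fixed $\phi$ in the kernelized routine sketched in Section 2.

```python
# Trainable kernel feature map: a small MLP followed by softplus, so the implied
# similarity phi(q)·phi(k) stays positive and the linear-attention normalizer is well defined.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedFeatureMap(nn.Module):
    def __init__(self, dim: int, feature_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, feature_dim),
            nn.GELU(),
            nn.Linear(feature_dim, feature_dim),
        )

    def forward(self, x):
        return F.softplus(self.net(x))   # non-negative features of shape (..., feature_dim)
```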

7. Applications and Impact

Linear self-attention approximations have significantly extended the practical reach of transformer models:

  • Scalability: Linear complexity allows transformers to be deployed on previously intractable tasks—very long document modeling, long-horizon video and audio analysis, high-resolution vision, and real-time systems.
  • Green AI: Reducing memory and computation translates directly to lower energy consumption and environmental impact, which has become a consideration in the deployment and scaling of language and vision models.
  • Algorithmic flexibility and universality: The minimal architectural requirements for universality (as established in recent work) open the door to efficient, interpretable, and highly flexible network designs, both for standard machine learning tasks and emerging areas like algorithmic reasoning or modular transformers.

In summary, linear self-attention approximation encompasses a rich set of mathematical ideas and practical algorithms for making transformers efficient on long sequences without conceding substantial predictive power. Ongoing research is sharpening the theoretical understanding and further enlarging the toolkit for high-performance, scalable sequence modeling across scientific and industrial domains.
