Linearized Attention Mechanisms

Updated 7 October 2025
  • Linearized attention mechanisms are methods that replace the quadratic softmax computation with feature map approximations for linear scaling in long-sequence tasks.
  • They utilize kernel methods, polynomial approximations, and random feature techniques to efficiently summarize and process large input sequences.
  • While offering significant speed and memory benefits, these methods may incur trade-offs in accuracy and selectivity compared to traditional softmax attention.

Linearized attention mechanisms refer to a diverse class of methods that seek to reduce the quadratic time and memory complexity of standard softmax-based attention by restructuring the underlying computations to have linear dependence on the sequence length. These techniques are primarily motivated by the scalability challenges posed by conventional attention in modeling long sequences or supporting high-throughput, resource-constrained scenarios. The development of linearized attention builds upon insights from kernel methods, polynomial approximations, incremental statistics, and efficient hardware-aware design. While their computational advantages are clear, linearized attention mechanisms present nuanced trade-offs in terms of accuracy, representation quality, and applicability across domains.

1. Fundamental Principles and Mathematical Formulations

The foundational premise of linearized attention is to approximate the softmax kernel $\exp(q^\top k)$ with feature maps or alternative similarity functions that admit efficient summation and distributive properties. In the standard (softmax) attention setting, for a query $q \in \mathbb{R}^d$, keys $k_j \in \mathbb{R}^d$, and values $v_j \in \mathbb{R}^d$, the output is

$$\text{Attn}(q, \{k_j\}, \{v_j\}) = \frac{\sum_j \exp(q^\top k_j)\, v_j}{\sum_j \exp(q^\top k_j)}$$

which, for a document of $n$ keys/events, incurs $O(nd)$ computation and storage per query.

In linearized attention (Brébisson et al., 2016), the softmax non-linearity is removed or replaced with a feature map $\phi(\cdot)$, so the computation becomes

$$\text{LinearAttn}(q, \{k_j\}, \{v_j\}) = \frac{\phi(q)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q)^\top \sum_j \phi(k_j)}$$

This factors out the expensive pairwise similarity matrix in favor of two summations, one over $\phi(k_j)\, v_j^\top$ and another over $\phi(k_j)$, each computed in $O(nd)$ time and reused across different queries. For especially efficient encoding, linearized attention often accumulates statistics of the input via incremental updates, e.g., using covariance matrices $C = H^\top H$ (Brébisson et al., 2016) or kernel-weighted sums in various feature spaces.
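
A minimal sketch of this factorization in NumPy, assuming an illustrative ELU-plus-one feature map (one common choice, not the specific map of any cited paper): the two summaries are formed once in $O(nd)$ and then shared across all queries.

```python
import numpy as np

def phi(x):
    """Illustrative positive feature map (ELU + 1); other choices are possible."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Linearized attention for queries Q (m, d), keys K (n, d), values V (n, d_v)."""
    S = phi(K).T @ V              # (d, d_v): sum_j phi(k_j) v_j^T, computed once
    z = phi(K).sum(axis=0)        # (d,):     sum_j phi(k_j)
    num = phi(Q) @ S              # (m, d_v): phi(q)^T S for every query
    den = phi(Q) @ z              # (m,):     phi(q)^T z
    return num / den[:, None]

# The summaries S and z cost O(n d d_v); each additional query costs O(d d_v),
# independent of the sequence length n.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
print(linear_attention(Q, K, V).shape)  # (4, 8)
```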

Approximations to the exponential kernel are also developed using Taylor expansions, as in (Li et al., 2020) and (Mercat, 2020), leveraging expansions such as

$$\exp(x) \approx 1 + x + \tfrac{1}{2} x^2$$

to trade off fidelity versus efficiency. More sophisticated random feature approaches, for example Performers and LARA (Zheng et al., 2022), directly approximate $\exp(q^\top k)$ via Monte Carlo sums over randomized positive features. Methods such as Local Linear Attention (Zuo et al., 1 Oct 2025) reinterpret attention as a local linear regression, affording improved bias–variance properties and boundary adaptation compared to the Nadaraya–Watson estimator implicit in softmax attention.
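
To make the two approximation routes concrete, the sketch below compares a second-order Taylor feature map, whose inner product reproduces $1 + q^\top k + \tfrac{1}{2}(q^\top k)^2$ exactly, with positive random features in the spirit of Performer-style estimators; the feature dimension, scaling, and sampling choices here are illustrative assumptions rather than the settings of the cited papers.

```python
import numpy as np

def taylor_features(x):
    """phi(x) such that phi(q) @ phi(k) = 1 + q.k + (q.k)^2 / 2."""
    return np.concatenate([[1.0], x, np.outer(x, x).ravel() / np.sqrt(2.0)])

def random_features(x, W):
    """Positive random features: E[phi(q) @ phi(k)] = exp(q.k) when W ~ N(0, I)."""
    m = W.shape[0]
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 4096
q, k = 0.3 * rng.normal(size=d), 0.3 * rng.normal(size=d)
W = rng.normal(size=(m, d))

exact = np.exp(q @ k)
taylor = taylor_features(q) @ taylor_features(k)
randomized = random_features(q, W) @ random_features(k, W)
print(exact, taylor, randomized)  # both approximations should land near the exact value
```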

2. Computational Advantages and Scaling Properties

The principal computational benefits of linearized attention mechanisms are twofold:

  • Time Complexity: Lookup or update cost per query is reduced from $O(nd)$ or $O(n^2 d)$ to $O(d^2)$ or even $O(d)$ in many cases, independent of the sequence/document length $n$ (Brébisson et al., 2016, Li et al., 2020). When applied to deep neural architectures (e.g., transformers), these properties drastically reduce memory use and wall-clock time for long-context inference and batch processing.
  • Memory Efficiency: By projecting or summarizing sequence statistics into fixed-size state or intermediate moments, it is possible to encode a document or sequence using $O(d^2)$ or $O(d)$ memory rather than $O(nd)$. This fixed-size representation is critical for deployment in environments with stringent memory constraints; a streaming sketch of this idea follows this list.
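
The fixed-memory claim can be made concrete with a streaming formulation. The following schematic NumPy sketch (an illustrative implementation, not that of any cited paper) processes tokens one at a time while holding only a $d \times d_v$ matrix and a $d$-vector of state, so memory stays $O(d^2)$ no matter how many tokens have been seen.

```python
import numpy as np

def phi(x):
    """Illustrative positive feature map (ELU + 1)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_stream(keys, values, queries):
    """Yield one output per step using a fixed-size running state (S, z)."""
    d, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d, d_v))   # running sum of phi(k_t) v_t^T
    z = np.zeros(d)          # running sum of phi(k_t)
    outputs = []
    for k_t, v_t, q_t in zip(keys, values, queries):
        f_k = phi(k_t)
        S += np.outer(f_k, v_t)                 # O(d * d_v) update
        z += f_k
        f_q = phi(q_t)
        outputs.append((f_q @ S) / (f_q @ z))   # O(d * d_v) readout
    return np.stack(outputs)

rng = np.random.default_rng(0)
n, d = 1000, 16
K, V, Q = (rng.normal(size=(n, d)) for _ in range(3))
print(causal_linear_attention_stream(K, V, Q).shape)  # (1000, 16); state never grows with n
```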

These effects are especially pronounced in settings such as large-scale question answering, information retrieval, and high-resolution image/video modeling, where either (a) the context window far exceeds the embedding dimension, or (b) real-time servicing requires amortized computation per query (Brébisson et al., 2016, Li et al., 2020, Zuo et al., 1 Oct 2025).

3. Variants, Extensions, and Methodological Innovations

Linearized attention models encompass a growing family of architectural flavors:

  • Gated Linear Attention: integrates learned gates to improve selectivity and partially restore the nonlinear softmax effect (Brébisson et al., 2016).
  • Random Feature Attention: approximates the softmax kernel with randomized positive feature maps (e.g., Performer, LARA, RA) (Zheng et al., 2022).
  • Higher-Order Approximations: use higher-order Taylor expansions to more faithfully approximate the exponential kernel (Mercat, 2020).
  • Agglomerative / Grouped Attention: clusters elements or uses soft class assignment to summarize sequence content efficiently (Spellings, 2019).
  • Local Linear Regression: fits non-constant local models to key–value pairs to reduce boundary bias and improve bias–variance trade-offs (Zuo et al., 1 Oct 2025).
  • Modular and Hybrid Schemes: dynamically mix linearized and softmax blocks for improved quality–efficiency trade-offs (Agostinelli et al., 2023).
  • Element-wise Attention: replaces dot-product similarity with element-wise (channel-wise) distance and approximates the exponential via a Taylor polynomial (Feng, 10 Jan 2025).
  • Interactive Cross-Head Attention: decomposes attention maps and introduces cross-head connections for higher expressivity at linear cost (Kang et al., 27 Feb 2024).

Some methods extend the kernel feature space (e.g., exponential, ELU, log-normal (Nahshan et al., 2023)), employ dynamically adapted proposals for importance sampling (Zheng et al., 2022), or interpolate between linear and softmax attention via statistical regression principles (Zuo et al., 1 Oct 2025).
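
As one concrete illustration of the gated variants listed above, the following schematic sketch multiplies the running accumulators by a data-dependent forget gate before each update, letting the model discount stale context; the scalar sigmoid gate and its parameterization are simplifying assumptions, and the cited papers use richer gating schemes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phi(x):
    """Illustrative positive feature map (ELU + 1)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def gated_linear_attention(keys, values, queries, W_gate):
    """Recurrent linear attention with a scalar forget gate per step (illustrative)."""
    d, d_v = keys.shape[1], values.shape[1]
    S, z = np.zeros((d, d_v)), np.zeros(d)
    outputs = []
    for k_t, v_t, q_t in zip(keys, values, queries):
        g_t = sigmoid(k_t @ W_gate)            # data-dependent decay in (0, 1)
        S = g_t * S + np.outer(phi(k_t), v_t)  # discount old context, add new pair
        z = g_t * z + phi(k_t)
        f_q = phi(q_t)
        outputs.append((f_q @ S) / (f_q @ z))
    return np.stack(outputs)

rng = np.random.default_rng(1)
n, d = 64, 8
K, V, Q = (rng.normal(size=(n, d)) for _ in range(3))
print(gated_linear_attention(K, V, Q, W_gate=rng.normal(size=d)).shape)  # (64, 8)
```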

4. Empirical Performance and Applications

Linearized attention has been empirically validated across a range of domains:

  • Natural Language Processing: On benchmarks such as CNN QA, WikiText-2, and GLUE, linearized attention mechanisms (augmented by gates or higher-order approximations where necessary) improve significantly over models with no attention, offer considerable speedup, and retain competitive accuracy when compared with softmax baselines—though a performance gap remains (Brébisson et al., 2016, Nahshan et al., 2023, Agostinelli et al., 2023).
  • Vision and Segmentation: In high-resolution semantic segmentation (e.g., with GID dataset), linear attention modules integrated into various CNN backbones (e.g., U-Net, DeepLab) consistently outperform their non-attention counterparts and offer substantial efficiency gains (Li et al., 2020).
  • Long-Sequence Modeling and LLMs: Surveys document the adoption of linear (kernelized, recurrent, fast-weight) attention in modern LLMs and hybrid configurations, with scalability to much longer contexts and efficient test-time adaptation (Sun et al., 25 Jul 2025). Recent work further demonstrates practical gains in autoregressive tasks when paired with suitable augmentations that respect causal constraints (You et al., 11 Jun 2024).
  • Associative Memory and In-Context Learning: Tasks requiring adaptation to non-stationary or piecewise linear context (e.g. permutation state tracking, associative recall) benefit from the flexible interpolation capacity of mechanisms like local linear attention, which outperform global linear and classical softmax estimators in certain non-stationary regimes (Zuo et al., 1 Oct 2025).

5. Limitations and Trade-Offs

While advantageous in efficiency, linearized attention mechanisms are not universally superior to softmax attention:

  • Accuracy Gap: Empirical and theoretical results indicate that the removal or replacement of the softmax non-linearity—while offering significant speedups—generally produces somewhat inferior accuracy compared to the standard softmax, especially for tasks requiring highly selective (spiky) attention distributions (Brébisson et al., 2016, Duranthon et al., 26 Sep 2025).
  • Representation Compression: Compact summaries (e.g., $k \times k$ matrices) may fail to preserve detailed information about very long sequences, especially when context-specific interactions are crucial (Brébisson et al., 2016, Spellings, 2019).
  • Selectivity and Spikiness: Linear attention tends toward more diffuse, less “selective” association between queries and keys. Theoretical analysis shows that only the softmax operator guarantees Bayes-optimality in high-dimensional selection problems and that the performance of linear attention fundamentally degrades with sequence length or insufficient “signal” (Duranthon et al., 26 Sep 2025).
  • Applicability Regimes: Linearized attention is preferable when the sequence length $n \gg k$, or in scenarios with extreme query loads and memory restrictions. For short sequences or low-query regimes, the savings may not offset the marginal loss in accuracy.

6. Theoretical Insights and Universality

The expressivity and universality of linearized attention mechanisms have been studied from multiple perspectives:

  • Capacity and Circuit Depth: Additive and multiplicative “quark” primitives allow shallow (linear or single-layer) circuits to efficiently simulate sparse quadratic or piecewise linear functions, reducing the need for deeper architectures (Baldi et al., 2022).
  • Universal Approximation: The ability of the (softmax) attention module to perform near-argmax “interpolation” allows even two-layer attention-only networks to universally approximate continuous sequence-to-sequence functions. For linearized attention mechanisms, it is necessary to ensure that their selectivity properties sufficiently match those of softmax, or else universal approximation is lost (Hu et al., 22 Apr 2025).
  • Statistical Optimality: Only softmax (and a narrow class of “argmax-equivalent” nonlinearities) achieve the Nishimori condition necessary for explicit Bayes-optimality in regression selection models, with linear attention shown to be suboptimal, especially under high-dimensional scaling (Duranthon et al., 26 Sep 2025).
  • In-Context Learning and Weight Conversion: In linearized attention models, the context provided by in-context learning tokens can be exactly “converted” into model weights via bias terms in the key–value accumulators, allowing explicit storage of transient context in model parameters (Chen et al., 5 Jun 2024). This conversion is only approximate for softmax-based transformers but is formally exact in the linearized (kernel) setting.
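
Because a linearized layer interacts with its history only through the accumulators $S$ and $z$, a block of in-context tokens can be summarized once and stored as persistent offsets. The sketch below is a schematic illustration of this exactness (not the construction of the cited paper): querying with the stored offsets reproduces the output obtained when the context tokens are processed explicitly.

```python
import numpy as np

def phi(x):
    """Illustrative positive feature map (ELU + 1)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_with_state(Q, K, V, S0=None, z0=None):
    """Linear attention whose history enters only through the (S, z) accumulators."""
    d, d_v = K.shape[1], V.shape[1]
    S = (S0 if S0 is not None else np.zeros((d, d_v))) + phi(K).T @ V
    z = (z0 if z0 is not None else np.zeros(d)) + phi(K).sum(axis=0)
    return (phi(Q) @ S) / (phi(Q) @ z)[:, None]

rng = np.random.default_rng(0)
d = 8
K_ctx, V_ctx = rng.normal(size=(16, d)), rng.normal(size=(16, d))   # in-context tokens
K_new, V_new = rng.normal(size=(8, d)), rng.normal(size=(8, d))     # new input
Q = rng.normal(size=(4, d))

# (a) process the context tokens explicitly alongside the new input
out_explicit = linear_attention_with_state(
    Q, np.vstack([K_ctx, K_new]), np.vstack([V_ctx, V_new]))

# (b) fold the context into fixed "weight-like" offsets, then drop the context tokens
S_bias, z_bias = phi(K_ctx).T @ V_ctx, phi(K_ctx).sum(axis=0)
out_folded = linear_attention_with_state(Q, K_new, V_new, S0=S_bias, z0=z_bias)

print(np.allclose(out_explicit, out_folded))  # True: the conversion is exact here
```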

7. Hardware, Implementation, and Future Directions

A core advantage of linearized attention lies in its amenability to hardware acceleration and parallelization:

  • Chunkwise, Blockwise, and Layerwise Primitives: Techniques such as blockwise updates, custom kernels (e.g., FlashLLA (Zuo et al., 1 Oct 2025)), recurrent and chunkwise representations, and reordering of computation are critical for harnessing GPU/TPU accelerators and reducing practical memory bottlenecks (Sun et al., 25 Jul 2025, Zuo et al., 1 Oct 2025); a chunkwise sketch appears after this list.
  • Hybrid and Modular Architectures: By modularizing the attention mechanism at the block or head level, it becomes possible to partition efficient attention variants according to utility (e.g., linear attention for global context, softmax or local convolution for local and detailed patterns) (Agostinelli et al., 2023).
  • Ongoing Research: There is focus on bridging the accuracy–efficiency gap by developing higher-fidelity kernel approximations, adaptive proposal distributions, dynamic sparsity, and more flexible bias–variance interpolations. The applicability of such methods is also broadening, with recent progress in test-time regression, time series, reinforcement learning, and non-stationary sequence domains (Zuo et al., 1 Oct 2025, Sun et al., 25 Jul 2025).
  • Limitations: The performance of linearized mechanisms remains bounded by their approximation quality to the original softmax, and by the challenge of encoding/decompressing long-range and context-specific dependencies.
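
As an illustration of the chunkwise primitive mentioned in the first item of this list, the sketch below computes causal linear attention chunk by chunk: within each chunk a small masked matrix product handles local interactions, while all earlier chunks enter only through the running state. The feature map and chunk size are illustrative assumptions, and production kernels fuse these steps on the accelerator.

```python
import numpy as np

def phi(x):
    """Illustrative positive feature map (ELU + 1)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def chunkwise_causal_linear_attention(Q, K, V, chunk=64):
    """Causal linear attention computed chunk by chunk: intra-chunk terms use a
    small masked matmul, inter-chunk history enters through the running state."""
    n, d = K.shape
    d_v = V.shape[1]
    S, z = np.zeros((d, d_v)), np.zeros(d)
    out = np.empty((n, d_v))
    for s in range(0, n, chunk):
        e = min(s + chunk, n)
        fQ, fK, Vc = phi(Q[s:e]), phi(K[s:e]), V[s:e]
        A = np.tril(fQ @ fK.T)        # causal intra-chunk interactions
        num = fQ @ S + A @ Vc         # history + within-chunk contributions
        den = fQ @ z + A.sum(axis=1)
        out[s:e] = num / den[:, None]
        S += fK.T @ Vc                # fold the finished chunk into the running state
        z += fK.sum(axis=0)
    return out

# Sanity check against the token-by-token recurrence (same phi, same inputs).
rng = np.random.default_rng(0)
n, d = 130, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
ref_S, ref_z, ref = np.zeros((d, d)), np.zeros(d), np.empty((n, d))
for t in range(n):
    ref_S += np.outer(phi(K[t]), V[t]); ref_z += phi(K[t])
    ref[t] = (phi(Q[t]) @ ref_S) / (phi(Q[t]) @ ref_z)
print(np.allclose(chunkwise_causal_linear_attention(Q, K, V, chunk=32), ref))  # True
```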

Linearized attention mechanisms represent a rich and rapidly evolving area at the intersection of algorithmic efficiency, functional approximation, and large-scale deployment. While offering compelling solutions to the bottlenecks of quadratic attention, their use requires careful consideration of task requirements, data properties, and hardware constraints, as well as an understanding of the core trade-offs between efficiency, selectivity, and expressive power.
