
Linear Attention Reformulation

Updated 4 October 2025
  • Linear attention reformulation is a set of techniques that reorders matrix computations using associativity, kernel mappings, or Taylor approximations to reduce quadratic complexity.
  • It restructures the standard dot-product attention to aggregate key–value products before interacting with queries, significantly lowering memory and compute demands.
  • Practical implementations show substantial speedups and reduced resource usage in applications ranging from high-resolution vision to large-scale language models.

Linear attention reformulation encompasses a family of techniques designed to reduce the computational and memory complexity of attention mechanisms in neural networks, particularly transformers, from quadratic to linear (or near-linear) with respect to input sequence length or resolution. Unlike conventional dot-product or softmax attention, which scales as O(N^2) for N positions, linear attention reformulations exploit mathematical properties such as associativity, kernel feature mappings, and normalization reordering. The result is algorithms that either exactly or approximately preserve global context modeling at substantially reduced resource cost, enabling practical deployment for long sequences, high-resolution vision, and other resource-intensive scenarios.

1. Mathematical Principles of Linear Attention Reformulation

The foundational concept in linear attention reformulation is the reordering of matrix operations enabled by associativity. The classical attention can be written as

D(Q, K, V) = \rho(Q K^\top) V,

where Q, K, V are the query, key, and value matrices and ρ is a normalization function, often softmax or scaling. In standard form, computing the N × N similarity matrix QK^T leads to O(N^2) complexity.
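
For concreteness, here is a minimal NumPy sketch of the standard formulation (illustrative only, not drawn from any cited paper's code), in which the full N × N similarity matrix is materialized and ρ is a row-wise softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention(Q, K, V):
    """Standard dot-product attention: O(N^2) time and memory.

    Q, K: (N, d_k), V: (N, d_v).
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (N, N) similarity matrix: the quadratic bottleneck
    return softmax(scores, axis=-1) @ V       # (N, d_v)
```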

Efficient attention (Shen et al., 2018) reformulates this as

E(Q, K, V) = \rho_q(Q) [\rho_k(K)^\top V],

where ρ_q and ρ_k are normalization functions applied independently to the queries and keys. With scaling normalization, E(Q, K, V) = (1/n) Q (K^T V) is mathematically equivalent to standard attention, reducing complexity to O(N) and bypassing explicit computation of the affinity matrix.
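
A sketch of the reordered computation under scaling normalization (again illustrative rather than the authors' code): with ρ taken as a 1/N scaling, the two orderings agree to machine precision, but the reordered form never materializes an N × N matrix:

```python
import numpy as np

def scaled_attention_quadratic(Q, K, V):
    """rho(Q K^T) V with rho = scaling by 1/N: materializes an (N, N) intermediate."""
    N = Q.shape[0]
    return (Q @ K.T / N) @ V

def scaled_attention_linear(Q, K, V):
    """Associativity-reordered form (1/N) Q (K^T V): only a (d_k, d_v) intermediate."""
    N = Q.shape[0]
    return Q @ (K.T @ V) / N

rng = np.random.default_rng(0)
N, dk, dv = 512, 64, 64
Q, K, V = rng.normal(size=(N, dk)), rng.normal(size=(N, dk)), rng.normal(size=(N, dv))
assert np.allclose(scaled_attention_quadratic(Q, K, V),
                   scaled_attention_linear(Q, K, V))
```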

Kernel-based approaches (Katharopoulos et al., 2020) generalize this by using a similarity function sim(q, k) = φ(q)^T φ(k), where φ(·) is a non-negative feature map. This allows the numerator and denominator of the attention computation to be factored into aggregated sums:

V_i' = \frac{\phi(Q_i)^\top S}{\phi(Q_i)^\top Z}, \quad S = \sum_{j=1}^N \phi(K_j) V_j, \quad Z = \sum_{j=1}^N \phi(K_j).
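
A non-causal NumPy sketch of this kernelized form, using the φ(x) = elu(x) + 1 feature map proposed by Katharopoulos et al. (2020); the autoregressive case would instead maintain running prefix sums of S and Z:

```python
import numpy as np

def phi(x):
    """Non-negative feature map: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def kernel_linear_attention(Q, K, V):
    """V'_i = phi(Q_i)^T S / (phi(Q_i)^T Z), with S and Z aggregated over all positions.

    Q, K: (N, d_k), V: (N, d_v); cost is O(N d_k d_v) instead of O(N^2).
    """
    Qf, Kf = phi(Q), phi(K)                  # (N, d_k), non-negative
    S = Kf.T @ V                             # (d_k, d_v) = sum_j phi(K_j) V_j
    Z = Kf.sum(axis=0)                       # (d_k,)     = sum_j phi(K_j)
    return (Qf @ S) / (Qf @ Z)[:, None]      # (N, d_v)
```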

Other reformulations employ Taylor expansions of the exponential in softmax normalization, e.g.,

\exp(x) \approx 1 + x \quad \text{(first-order)}, \qquad \exp(x) \approx 1 + x + \frac{x^2}{2} \quad \text{(second-order)}.

Higher-order expansions can closely mimic softmax while maintaining linear or log-linear cost, provided the tensor contractions are properly reordered (Mercat, 2020; Nauen et al., 5 Mar 2024; Guo et al., 5 Jun 2025).
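
As an illustration of the factorization (a generic sketch, not the implementation of any cited method), a first-order variant replaces exp(q·k) with 1 + q·k, so the output again reduces to aggregated key–value sums queried per position; it assumes inputs are scaled so the approximate weights stay positive:

```python
import numpy as np

def taylor1_attention(Q, K, V):
    """First-order Taylor attention: weights proportional to 1 + Q_i . K_j.

    Numerator_i   = sum_j V_j + Q_i^T (K^T V)
    Denominator_i = N + Q_i^T sum_j K_j
    Assumes inputs are scaled/normalized so that 1 + Q_i . K_j > 0.
    """
    N, dk = Q.shape
    Qs = Q / np.sqrt(dk)                     # scale queries as in softmax attention
    KV = K.T @ V                             # (d_k, d_v) aggregated key-value products
    Ksum = K.sum(axis=0)                     # (d_k,)
    Vsum = V.sum(axis=0)                     # (d_v,)
    num = Vsum[None, :] + Qs @ KV            # (N, d_v)
    den = N + Qs @ Ksum                      # (N,)
    return num / den[:, None]
```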

2. Algorithmic Efficiency and Scaling Behavior

Linear attention reformulations fundamentally alter algorithmic scaling:

| Method | Memory Complexity | Compute Complexity | Equivalent to Softmax |
|---|---|---|---|
| Dot-product (vanilla) | O(N^2) | O(N^2) | Yes |
| Efficient attention | O(N) | O(N) | Yes (with scaling normalization) |
| Kernel-feature | O(N) | O(N) | Approximate (depends on feature map) |
| Taylor expansion | O(N)* | O(N)* | Approximate (depends on expansion order) |
| Log-linear attention | O(log N) | O(N log N) | More expressive |

* Efficient implementation requires careful tensor reordering and feature-dimension management.

At quadratic cost (O(N^2)), increasing the input length quickly exhausts memory. Linear reformulations avoid storing N × N intermediates by (a) reordering the left/right matrix multiplications and (b) summarizing key–value products before they interact with the queries. In Taylor-based variants (Nauen et al., 5 Mar 2024; Mercat, 2020), expansion and contraction of higher-order terms are performed via tensor operations amenable to batching and hardware acceleration.
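
A back-of-the-envelope comparison (illustrative sizes only: float32, per attention head) of the dominant intermediates: the N × N similarity matrix of vanilla attention versus the fixed d_k × d_v key–value summary kept by linear variants:

```python
def mib(num_floats, bytes_per_float=4):
    """Convert a float count to mebibytes."""
    return num_floats * bytes_per_float / 2**20

d_k = d_v = 64
for N in (1_024, 16_384, 262_144):
    quad = mib(N * N)          # N x N attention matrix
    lin = mib(d_k * d_v)       # fixed-size K^T V state
    print(f"N={N:>7}: vanilla intermediate ~{quad:10.1f} MiB, linear state ~{lin:.3f} MiB")
```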

3. Theoretical Equivalence and Approximation Quality

Certain reformulations (e.g., scaling normalization in Shen et al., 2018; kernelized feature maps in Katharopoulos et al., 2020; second-order Taylor expansion in Mercat, 2020) yield mathematically exact results for specific normalization choices. For softmax, most linear analogues are approximate: first-order Taylor approximations and kernelized feature maps do not capture the full nonlinearity of softmax, but in empirical studies the degradation is minimal across tested tasks.

Higher-order expansions (Mercat, 2020; Nauen et al., 5 Mar 2024) provide a closer fit to the exponential, mitigating the expressivity gap at a moderate increase in constant factors (dependent on feature dimension and expansion order). Agent Attention (Han et al., 2023) demonstrates a two-stage aggregation/broadcast formulation using agent tokens, shown to be mathematically equivalent to a generalized linear attention formulation.

MetaLA (Chou et al., 16 Nov 2024) offers a unified view, showing that optimal linear attention requires three properties: dynamic memory (via learnable decay), static approximation ability (the capacity to match arbitrary distributions), and least-parameter approximation (eliminating redundancy such as unnecessary key matrices).

4. Practical Implementations and Empirical Impact

Linear attention reformulations enable integration of attention modules into architectures previously constrained by resource limits.

  • Efficient attention modules in Mask R-CNN yielded significant boosts in box AP and mask AP with negligible increase in memory (Shen et al., 2018).
  • Linear transformers achieved up to 4000× faster autoregressive inference on image generation tasks with competitive bits/dim scores (Katharopoulos et al., 2020).
  • Semantic segmentation networks with linear attention modules outperformed vanilla architectures on large-scale remote sensing datasets, with higher overall accuracy (OA), mIoU, and F1 scores (Li et al., 2020).
  • TaylorShift (Nauen et al., 5 Mar 2024) demonstrated competitive classification accuracy, with faster inference and less memory for sequences beyond 1700 tokens.
  • Agent Attention (Han et al., 2023) was empirically validated across ViT, segmentation, object detection, and diffusion models, delivering comparable or improved accuracy and faster image generation.
  • SEA (Lee et al., 2023) matched or improved upon the perplexity of the quadratic OPT-1.3B baseline while halving memory requirements and maintaining interpretability.

Applications now include stereo depth estimation at scale, recall-intensive language modeling, autoregressive speech recognition, and real-time/embedded AI.

5. Distributed and Hybrid Training Paradigms

LASP-2 (Sun et al., 11 Feb 2025) recasts sequence parallelism for linear attention, organizing inter-device communication so that only fixed-size memory states are transferred via AllGather, decoupling communication cost from sequence length. This yields a 15.2%–36.6% training-speed improvement on models such as Linear-Llama3 (2048K sequence length, 64 GPUs) compared with previous ring-based or pure point-to-point sequence-parallelism (SP) approaches.
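
The communication pattern can be illustrated with a single-process NumPy simulation (a conceptual sketch, not the LASP-2 implementation): each "device" owns one sequence chunk, computes a fixed-size d_k × d_v memory state locally, and only these small states are exchanged; the AllGather step is simulated here by a Python list, and the unnormalized, non-causal case is shown (the causal case would combine states as prefix sums):

```python
import numpy as np

rng = np.random.default_rng(0)
num_devices, chunk_len, dk, dv = 4, 1024, 64, 64

# Each "device" owns one chunk of the sequence (simulated locally here).
chunks = [
    {name: rng.normal(size=(chunk_len, d)) / np.sqrt(dk)
     for name, d in (("Q", dk), ("K", dk), ("V", dv))}
    for _ in range(num_devices)
]

# Step 1: each device computes its local memory state K_i^T V_i, shape (dk, dv),
# independent of chunk_len.
local_states = [c["K"].T @ c["V"] for c in chunks]

# Step 2: exchange of the fixed-size states (in a real system: AllGather over GPUs).
gathered = list(local_states)

# Step 3: each device combines the gathered states and computes its own outputs.
global_state = sum(gathered)           # non-causal case: sum of all chunk states
outputs = [c["Q"] @ global_state for c in chunks]

# Sanity check: identical to computing Q (K^T V) on the full, unsharded sequence.
Q_all = np.concatenate([c["Q"] for c in chunks])
K_all = np.concatenate([c["K"] for c in chunks])
V_all = np.concatenate([c["V"] for c in chunks])
assert np.allclose(np.concatenate(outputs), Q_all @ (K_all.T @ V_all))
```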

Hybrid extensions (LASP-2H) combine linear attention and conventional softmax layers, aggregating memory and output states with single-pass parallel communication, facilitating scalable training for heterogeneous large models with competitive convergence performance.

6. Advances in Memory and Expressivity: Gated and Log-Linear Variants

Mechanisms such as Gated Slot Attention (GSA) (Zhang et al., 11 Sep 2024) and log-linear attention (Guo et al., 5 Jun 2025) extend linear attention reformulations to address limitations in long-context recall and expressiveness.

  • GSA introduces adaptive, data-dependent gates on recurrent memory slots, mitigating attention dilution and enabling efficient finetuning from pretrained transformers by retaining softmax in the output computation (a minimal gated-recurrence sketch follows this list).
  • Log-linear attention replaces the fixed-size context with O(log N) hidden states using Fenwick-tree partitioning. The attention mask becomes a hierarchical composition, allowing models such as Mamba-2 and Gated DeltaNet to improve long-range associative recall with O(N log N) compute and O(log N) space.
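
For intuition, here is a minimal recurrent sketch of a gated linear-attention update (illustrative only; GSA's bounded-slot memory and softmax readout differ in detail, and the gate projection Wg is a hypothetical parameter introduced for this sketch): a data-dependent forget gate controls how quickly the fixed-size state discards old key–value associations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_attention(Q, K, V, Wg):
    """Recurrent linear attention with a data-dependent scalar forget gate per step.

    State: S_t = g_t * S_{t-1} + phi(k_t) v_t^T;  output: o_t = phi(q_t)^T S_t.
    Q, K: (N, d_k), V: (N, d_v), Wg: (d_k,) gate projection (hypothetical).
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # non-negative feature map
    N, dk = Q.shape
    dv = V.shape[1]
    S = np.zeros((dk, dv))
    out = np.empty((N, dv))
    for t in range(N):
        g = sigmoid(K[t] @ Wg)                 # data-dependent forget gate in (0, 1)
        S = g * S + np.outer(phi(K[t]), V[t])  # decay old memory, write new association
        out[t] = phi(Q[t]) @ S
    return out

rng = np.random.default_rng(0)
N, dk, dv = 256, 32, 32
Q, K, V = rng.normal(size=(N, dk)), rng.normal(size=(N, dk)), rng.normal(size=(N, dv))
out = gated_linear_attention(Q, K, V, Wg=0.1 * rng.normal(size=dk))
```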

These techniques help bridge the remaining performance gap between linear and softmax attention in demanding benchmarks.

7. In-Context Discovery of Numerical Algorithms

Training linear transformers on in-context masked-block completion (Lutz et al., 24 Sep 2025) yields emergent, unified numerical solvers ("EAGLE") that implement resource-adaptive iterative updates valid across centralized, distributed, and rank-limited regimes. After training, algebraic unrolling reveals a parameter-free update rule that achieves second-order convergence and is competitive with direct solvers for matrix prediction and Nyström extrapolation.

The learned update takes the form

D_{l+1} = D_l + \gamma B_l A_l^\top C_l, \qquad C_{l+1} = C_l - \eta A_l A_l^\top C_l

(with suitable scaling) unifies prediction, estimation, and kernel extrapolation, demonstrating deep algorithmic adaptation via linear attention and further highlighting the functional scope of reformulated mechanisms.
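
One consistent reading of the displayed rule (an interpretation for illustration, not a claim about the paper's exact setup) is as a fixed-point iteration on matrix blocks. The sketch below iterates the two update lines literally with A_l and B_l held fixed and γ = η; under those assumptions, and with a small enough step size, the accumulated block D converges to B A^{-1} C_0, the rank-consistent completion of a missing block in a 2×2 block matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
A = np.eye(n) + 0.05 * rng.normal(size=(n, n))   # well-conditioned, invertible block
B = rng.normal(size=(n, n))
C0 = rng.normal(size=(n, n))

eta = 1.0 / np.linalg.norm(A, 2) ** 2            # step size ensuring contraction (assumption)
gamma = eta                                       # gamma = eta chosen for illustration
C, D = C0.copy(), np.zeros((n, n))
for _ in range(300):                              # literal iteration of the displayed rule
    D = D + gamma * B @ (A.T @ C)
    C = C - eta * A @ (A.T @ C)

# Under these assumptions, D approaches the missing-block prediction B A^{-1} C0.
err = np.linalg.norm(D - B @ np.linalg.solve(A, C0)) / np.linalg.norm(D)
print(f"relative error vs. B A^-1 C0: {err:.2e}")
```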


In summary, linear attention reformulation transitions attention computation from direct quadratic similarity to efficient associativity-, kernel-, or expansion-motivated variants. These advances support both theoretical equivalence (in certain cases) and strong empirical performance at scale. They have democratized attention mechanisms for long-sequence, high-resolution, and resource-constrained tasks, fostered scalable training and deployment strategies, and even enabled neural models to learn adaptive numerical algorithms from context, indicating a broad and impactful spectrum of future research.
