Linear Transformers: Efficient Attention Models

Updated 29 August 2025
  • Linear transformers are neural sequence models that use kernel feature maps to transform quadratic self-attention into a linear complexity operation.
  • They employ iterative and recurrent update rules, supporting constant-time per-token inference and efficient long-context processing.
  • Extensions such as delta rule updates and higher-order approximations enhance memory capacity and enable applications from language modeling to operator learning.

Linear transformers are a class of neural sequence models that replace the standard quadratic-complexity self-attention mechanism of transformers with linearly scaling attention. This is achieved by representing the similarity function in self-attention as a dot product of kernel feature maps, allowing a computational reordering that reduces time and memory complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ in the sequence length $N$. This design enables efficient processing of long contexts with resource usage comparable to recurrent neural networks (RNNs) while maintaining key benefits of the transformer architecture, including parallel training and competitive empirical accuracy.

1. Reformulation of Self-Attention in Linear Transformers

Conventional transformers compute self-attention using a scaled dot product and a softmax normalization:

$$V'_i = \frac{\sum_{j=1}^N \exp(Q_i^T K_j / \sqrt{D})\, V_j}{\sum_{j=1}^N \exp(Q_i^T K_j / \sqrt{D})}$$

This operation incurs a quadratic cost due to interactions between every pair of sequence positions.

Linear transformers, as introduced by Katharopoulos et al., reinterpret the similarity function as a dot product in a feature space:

$$\mathrm{sim}(q, k) = \phi(q)^T \phi(k)$$

With this kernelization, self-attention becomes:

$$V'_i = \frac{\phi(Q_i)^T \left( \sum_{j=1}^N \phi(K_j) V_j^T \right)}{\phi(Q_i)^T \left( \sum_{j=1}^N \phi(K_j) \right)}$$

By the associativity of matrix multiplication, the cumulative states $S = \sum_{j=1}^N \phi(K_j) V_j^T$ and $Z = \sum_{j=1}^N \phi(K_j)$ are shared across all positions and can be precomputed or incrementally updated, thereby reducing computational complexity to linear in $N$ (Katharopoulos et al., 2020).
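To make the reordering concrete, the following is a minimal NumPy sketch (not code from the cited papers) contrasting the quadratic softmax form with the linear form; the $\mathrm{elu}(x)+1$ feature map and the non-causal setting are illustrative assumptions.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard softmax attention: O(N^2) pairwise scores."""
    D = Q.shape[-1]
    scores = np.exp(Q @ K.T / np.sqrt(D))          # (N, N)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V, phi=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    """Kernelized attention: shared states S and Z give O(N) cost.

    phi defaults to elu(x) + 1, an illustrative choice of feature map."""
    Qf, Kf = phi(Q), phi(K)                        # (N, d_phi)
    S = Kf.T @ V                                   # sum_j phi(K_j) V_j^T -> (d_phi, d_v)
    Z = Kf.sum(axis=0)                             # sum_j phi(K_j)      -> (d_phi,)
    return (Qf @ S) / (Qf @ Z)[:, None]

# Usage: both produce (N, d_v) outputs; they agree only approximately,
# since phi(q)^T phi(k) merely approximates the exponential kernel.
N, D = 16, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, N, D))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # (16, 8) (16, 8)
```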

2. Iterative and Recurrent Implementations

This kernelized attention admits a sequential, RNN-like recurrence. In the autoregressive (causal) setting, the state updates for token $i$ are:

$$S_i = S_{i-1} + \phi(K_i) V_i^T, \qquad Z_i = Z_{i-1} + \phi(K_i), \qquad V'_i = \frac{\phi(Q_i)^T S_i}{\phi(Q_i)^T Z_i}$$

This stateful computation enables constant-time inference per token and supports long sequences with fixed memory (Katharopoulos et al., 2020).
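The recurrence can be read directly as a decoding loop. Below is a minimal sketch of constant-memory causal decoding under the same illustrative $\mathrm{elu}(x)+1$ feature map; it is a schematic of the idea rather than an implementation from the cited work.

```python
import numpy as np

def phi(x):
    """elu(x) + 1 feature map (illustrative choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

class LinearAttentionState:
    """Constant-memory causal decoding: S is (d_phi, d_v), Z is (d_phi,)."""
    def __init__(self, d_phi, d_v):
        self.S = np.zeros((d_phi, d_v))
        self.Z = np.zeros(d_phi)

    def step(self, q, k, v):
        """Consume one (q, k, v) triple and return the attention output V'_i."""
        fk = phi(k)
        self.S += np.outer(fk, v)   # S_i = S_{i-1} + phi(K_i) V_i^T
        self.Z += fk                # Z_i = Z_{i-1} + phi(K_i)
        fq = phi(q)
        return (fq @ self.S) / (fq @ self.Z)

# Usage: per-token cost is O(d_phi * d_v), independent of how many
# tokens have been processed so far.
rng = np.random.default_rng(1)
state = LinearAttentionState(d_phi=8, d_v=8)
for q, k, v in rng.normal(size=(5, 3, 8)):
    out = state.step(q, k, v)
print(out.shape)  # (8,)
```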

Casting the linear transformer as a fast weight programmer makes explicit the connection to RNNs with matrix-valued hidden states, where outer product updates maintain an associative memory. This perspective underlies the conceptual and practical unification between linearized transformers and RNNs (Schlag et al., 2021).

3. Feature Map Choices and Higher-Order Approximations

The expressivity of kernelized attention depends critically on the choice of the feature map $\phi(\cdot)$. Initial designs use elementwise activation functions such as $\mathrm{elu}(x) + 1$, but alternative projections such as the deterministic parameter-free projection (DPFP) improve memory capacity by enforcing approximate orthogonality in the feature space (Schlag et al., 2021).
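For concreteness, the sketch below contrasts the original $\mathrm{elu}(x)+1$ map with a simplified DPFP-style expansion; the exact DPFP construction in Schlag et al. (2021) differs in detail, so this should be read as an illustrative approximation only.

```python
import numpy as np

def elu_plus_one(x):
    """Original linear-transformer feature map: elu(x) + 1 (always positive)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def dpfp_like(x, nu=1):
    """Simplified DPFP-style map: products of a ReLU-expanded vector with
    rolled copies of itself.  Expands d dims to 2*d*nu dims, which tends to
    produce sparser, more nearly orthogonal features.  This is a sketch of
    the idea, not the exact DPFP definition from Schlag et al. (2021)."""
    r = np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)
    feats = [r * np.roll(r, shift=j, axis=-1) for j in range(1, nu + 1)]
    return np.concatenate(feats, axis=-1)

# Usage: both maps keep attention weights non-negative; the expanded map
# trades extra feature dimensions for reduced interference between keys.
x = np.random.default_rng(2).normal(size=(4, 8))
print(elu_plus_one(x).shape, dpfp_like(x, nu=2).shape)  # (4, 8) (4, 32)
```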

The approximation quality relative to softmax-based attention can be further improved by Taylor expanding the exponential kernel, $\exp(x) \approx 1 + x + \frac{1}{2} x^2$, and rewriting the attention mechanism and its normalization via a higher-order (e.g., second-order) approximation. This maintains linear complexity but introduces additional higher-degree terms, computed efficiently via multinomial expansions (Mercat, 2020). While theoretically promising, the impact of higher-order terms must be weighed against computational overhead and sensitivity to normalization.
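As a worked check of why the expansion preserves linear complexity: the second-order kernel factorizes exactly as a dot product of explicit feature maps of dimension $1 + d + d^2$. The sketch below verifies the identity numerically (illustrative code, not from Mercat, 2020).

```python
import numpy as np

def taylor2_features(x):
    """Second-order Taylor feature map so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2  (exactly).
    Dimension grows from d to 1 + d + d^2."""
    d = x.shape[-1]
    second = np.einsum('...i,...j->...ij', x, x).reshape(*x.shape[:-1], d * d)
    return np.concatenate(
        [np.ones((*x.shape[:-1], 1)), x, second / np.sqrt(2.0)], axis=-1)

# Check the identity against a direct Taylor evaluation of exp(q.k).
rng = np.random.default_rng(3)
q, k = rng.normal(size=(2, 6))
lhs = taylor2_features(q) @ taylor2_features(k)
rhs = 1 + q @ k + 0.5 * (q @ k) ** 2
print(np.allclose(lhs, rhs))  # True
```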

4. Memory Capacity, Update Rules, and Variants

A key limitation of the canonical (additive) linear transformer is its fixed-size associative memory, bounded by the dimension of the mapped key space. If more than $d_{\text{dot}}$ (the dimension of $\phi(\cdot)$) mutually orthogonal keys are present, retrieval becomes unreliable due to memory interference (Schlag et al., 2021). The introduction of a delta rule, analogous to online error-correcting updates in classical learning, enables correction and partial overwriting of stored associations:

$$A^{(i)} = A^{(i-1)} + \beta^{(i)} \left( v^{(i)} - \bar{v}^{(i)} \right) \otimes \phi(k^{(i)})$$

where $\bar{v}^{(i)}$ is the memory's current prediction for key $k^{(i)}$ and $\beta^{(i)}$ is a learned or adaptive scalar (Schlag et al., 2021, Irie et al., 2021). This modification, as implemented in DeltaNet and related architectures, improves both learning stability and associative recall and is particularly effective under conditions of repeated or colliding associations (Yang et al., 10 Jun 2024).
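A minimal sketch of the delta-rule write is shown below, assuming an $\mathrm{elu}(x)+1$ feature map with unit-norm keys for clarity; the cited architectures use learned $\beta^{(i)}$ values and different normalization.

```python
import numpy as np

def phi(x):
    """Illustrative feature map: elu(x) + 1, followed by L2 normalization
    (the unit-norm key is an assumption of this sketch)."""
    f = np.where(x > 0, x + 1.0, np.exp(x))
    return f / np.linalg.norm(f)

class DeltaRuleMemory:
    """Fast-weight associative memory with delta-rule (error-correcting) writes.

    Each write first reads the memory's current prediction v_bar for the key
    and stores only the residual, so colliding keys overwrite rather than
    accumulate, unlike the purely additive update S += phi(k) v^T."""
    def __init__(self, d_phi, d_v):
        self.A = np.zeros((d_v, d_phi))

    def read(self, k):
        return self.A @ phi(k)                       # v_bar = A phi(k)

    def write(self, k, v, beta=1.0):
        fk = phi(k)
        v_bar = self.A @ fk                          # current prediction
        self.A += beta * np.outer(v - v_bar, fk)     # delta-rule update

# Usage: writing two different values under the same key with beta = 1
# leaves only the most recent value retrievable.
mem = DeltaRuleMemory(d_phi=8, d_v=4)
k = np.random.default_rng(4).normal(size=8)
mem.write(k, np.array([1.0, 0.0, 0.0, 0.0]))
mem.write(k, np.array([0.0, 1.0, 0.0, 0.0]))
print(np.round(mem.read(k), 3))                      # [0. 1. 0. 0.]
```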

Extensions that further generalize the fast weight mechanism include recurrence in the slow or fast networks, as well as self-referential memory updates. These extensions overcome theoretical limitations (such as the inability of vanilla architectures to recognize parity or star-free formal languages), improving both expressive power and generalization on structured sequence tasks (Irie et al., 2023, Irie et al., 2021).

5. Empirical Performance, Hardware Efficiency, and Scalability

Linear transformers have demonstrated up to 4000$\times$ speedup in autoregressive image generation tasks (e.g., CIFAR-10) compared to softmax attention, while achieving comparable accuracy in bits per dimension or perplexity (Katharopoulos et al., 2020). This is attributed to their constant-time per-token inference and fixed-size memory requirements.

In language modeling, models such as DeltaNet trained at billion-parameter scale have outperformed Mamba and GLA in perplexity and zero-shot evaluation, especially in hybrid configurations combining linear and global attention mechanisms (Yang et al., 10 Jun 2024, Yang et al., 2023).

To maximize throughput and reduce memory bottlenecks, hardware-aware algorithms such as FlashLinearAttention segment the sequence into chunks amenable to high parallelization and efficient GPU memory access. These chunkwise algorithms, often using tiling strategies and recomputation–materialization trade-offs, yield faster training and inference than even highly optimized softmax attention variants (e.g., FlashAttention-2), especially for long sequences or modest batch sizes (Yang et al., 2023).
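The chunking idea can be sketched schematically as follows (identity feature map, no normalization or gating, and none of the GPU-level tiling that makes the real kernels fast); it only illustrates how inter-chunk recurrence and intra-chunk parallelism combine.

```python
import numpy as np

def chunkwise_causal_linear_attention(Q, K, V, chunk=4):
    """Chunkwise (block-parallel) causal linear attention, unnormalized.

    Inter-chunk contributions go through the running state S (recurrent form);
    intra-chunk contributions use a small masked attention matrix (parallel
    form).  This is a schematic NumPy version of the chunking idea only."""
    N, d = Q.shape
    d_v = V.shape[-1]
    S = np.zeros((d, d_v))                    # running state: sum_j K_j V_j^T
    out = np.zeros((N, d_v))
    mask = np.tril(np.ones((chunk, chunk)))   # causal mask within a chunk
    for start in range(0, N, chunk):
        q = Q[start:start + chunk]
        k = K[start:start + chunk]
        v = V[start:start + chunk]
        c = len(q)
        inter = q @ S                                    # tokens before this chunk
        intra = ((q @ k.T) * mask[:c, :c]) @ v           # tokens inside this chunk
        out[start:start + chunk] = inter + intra
        S += k.T @ v                                     # advance state by one chunk
    return out

# The result matches the fully sequential recurrence S_i = S_{i-1} + K_i V_i^T,
# o_i = Q_i^T S_i (identity feature map, for simplicity).
rng = np.random.default_rng(5)
Q, K, V = rng.normal(size=(3, 10, 6))
ref = np.stack([Q[i] @ (K[:i + 1].T @ V[:i + 1]) for i in range(10)])
print(np.allclose(chunkwise_causal_linear_attention(Q, K, V), ref))  # True
```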

6. Applications and Theoretical Insights

The linearization of attention and its connection to fast weight programming provides a principled basis for a range of applications beyond classical sequence modeling:

  • In-context learning: Linear transformers can, in principle, implement in-context least-squares solvers for linear regression and perform as provably efficient in-context learners for families of linear systems, achieving explicit risk bounds as functions of the number of tasks, training/inference prompt lengths, and problem dimension (Zhang et al., 2023, Cole et al., 18 Sep 2024); see the sketch after this list.
  • Operator learning: Applied to operator approximation (e.g., implicit solution of PDEs via linear system inversion), linear transformers satisfy sharp neural scaling laws for prediction error, and prompt length at inference can effectively mitigate out-of-domain task shift, provided sufficient diversity during pre-training (Cole et al., 18 Sep 2024).
  • Numerical linear algebra: Architectures such as NLAFormer demonstrate that appropriately configured transformers (using attention and FFN submodules) can express and learn basic linear algebraic operations—including pointwise arithmetic, shift, matrix multiplication, and conjugate gradient iteration—by exploiting the universal approximation properties of feedforward modules (Ma et al., 27 Aug 2025).
  • Graph algorithms: With explicit weight matrix constructions, linear transformers can simulate iterative methods for electric flow (gradient descent for Laplacian potentials) and eigenvector extraction (subspace iteration with QR orthogonalization) using only the graph incidence matrix as input (Cheng et al., 22 Oct 2024).
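As a toy numerical illustration of the in-context learning bullet above, the sketch below shows one way unnormalized linear attention with an identity feature map can reproduce the prediction of a single gradient-descent step on in-context least squares; the learning rate, shapes, and data are illustrative assumptions, not constructions taken from the cited papers.

```python
import numpy as np

def one_gd_step_prediction(X, y, x_query, lr):
    """Prediction after one gradient step on 0.5 * ||X w - y||^2 from w = 0:
    w_1 = lr * X^T y, so yhat = x_query . w_1."""
    return float(x_query @ (lr * X.T @ y))

def linear_attention_icl(X, y, x_query, lr):
    """The same prediction as unnormalized linear attention with an identity
    feature map: keys are the x_i, values are the scalars lr * y_i, and the
    query position reads the accumulated state S = sum_i phi(k_i) v_i."""
    S = np.zeros(X.shape[1])
    for x_i, y_i in zip(X, y):
        S += x_i * (lr * y_i)        # state update: S += phi(k_i) v_i
    return float(x_query @ S)        # read-out at the query token

rng = np.random.default_rng(6)
X = rng.normal(size=(12, 5))         # in-context examples
y = X @ rng.normal(size=5)           # noiseless linear targets
x_query = rng.normal(size=5)
print(np.isclose(one_gd_step_prediction(X, y, x_query, 0.1),
                 linear_attention_icl(X, y, x_query, 0.1)))  # True
```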

These diverse applications are enabled by the ability to represent and update structured associative memory within a recurrent or parallelizable framework, sidestepping the quadratic scaling that previously limited the deployment of transformer models on long sequences or non-default domains.

7. Theoretical Limitations, Robustness, and Future Directions

Despite their efficiency and flexibility, linear transformers have intrinsic limitations:

  • Memory capacity is bounded by the dimension of the feature space $\phi(\cdot)$; enhancements such as the delta rule or deterministic higher-dimensional kernels (e.g., DPFP) partially alleviate but do not remove this bound (Schlag et al., 2021).
  • Depth separation: For non-IID (e.g., dynamical system) data, single-layer linear transformers are provably suboptimal, exhibiting a positive error floor that can only be reduced by increasing architectural depth (logarithmic in the sequence length for matching least-squares rates) (Cole et al., 12 Feb 2025).
  • Adversarial robustness is limited: single-layer linear transformers that implement gradient descent in-context are vulnerable to hijacking attacks, in which a single prompt modification can drive the output to arbitrary values. Defenses based on adversarial training mitigate but do not entirely remove this vulnerability, and the transferability of attacks across model scales is limited (Anwar et al., 7 Nov 2024).
  • Covariate or task distribution shifts: In-context learners using linear attention are brittle to input (covariate) distribution changes. In contrast, models built with broader memory mechanisms or additional recurrence demonstrate improved, but not complete, robustness (Zhang et al., 2023).

Analytical results for linear transformers—such as explicit characterization of SGD dynamics and their separation of timescales—provide macroscopic diagnostics (rank dynamics, subspace stabilization) for understanding the development of in-context learning and generalization in both linear and non-linear transformer models (Mainali et al., 17 Apr 2025).

Continued research directions involve:

  • Further developing hybrid architectures combining linear attention, global attention, and memory-augmented components for improved scalability and performance.
  • Extending deterministic kernel methods and conformal, gated recurrent extensions to enable scalable long-context extrapolation and address network capacity constraints (Kumar et al., 5 Mar 2025).
  • Deepening theoretical understanding of symmetry classes and their role in parameter space geometry, as well as practical model merging and transfer (Theus et al., 28 Jun 2025).

In conclusion, linear transformers offer a computationally efficient, theoretically tractable, and empirically competitive framework for long-sequence modeling and in-context algorithm learning, with a growing ecosystem of memory, recurrence, and kernelized extensions pushing the frontier of scalability and adaptability in modern neural architectures.