Linear Self-Attention (LSA) Models
- Linear Self-Attention (LSA) models approximate standard Transformer attention at linear complexity through low-rank, kernel-based, and sparsity-based methods.
- They enable efficient processing of long sequences in NLP, vision, and multimodal systems while retaining much of the representational power of dense attention.
- LSA implementations—such as Linformer and LISA—balance expressiveness and efficiency by employing structural assumptions and external memory to reduce computational bottlenecks.
Linear Self-Attention (LSA) models encompass a spectrum of mechanisms designed to approximate or reformulate the traditional self-attention operation from Transformers, achieving linear time and space complexity with respect to sequence length. These methods address the quadratic bottleneck of standard attention by leveraging structural assumptions (e.g., low-rank structure, kernel factorization, randomized mappings, histogram aggregation, or external memory) and, in some cases, kernelized or graph signal processing perspectives. LSA models have been instantiated both as primitives in NLP and vision backbones and as domain-specialized modules for recommendation, multimodal fusion, and efficient in-context learning.
1. Foundational Principles and Motivations
Standard self-attention as in Transformers computes an $n \times n$ attention matrix for a sequence of length $n$, requiring $O(n^2)$ computation and memory. LSA models circumvent this via assumptions such as:
- Low-rank approximation: The attention matrix is assumed or empirically found to be low-rank in many practical settings (Wang et al., 2020). This allows projection into a lower-dimensional space.
- Kernel factorization: The exponential similarity kernel (e.g., $\exp(q^\top k)$) is factorized as a product of feature mappings $\phi(q)^\top \phi(k)$ and approximated using random features or trainable kernels (Yorsh et al., 2022, Zheng et al., 2022).
- Sparsity and structured aggregation: Structural bias, such as codebook quantization or local attention, enables compressive representations and efficient aggregation (Wu et al., 2021, Hechen et al., 2022).
- Randomized/sampling-based approximation: The attention weights are sampled or estimated using Locality Sensitive Hashing or randomized hash collision mechanisms (Zeng et al., 2021).
- External attention and memory: Lightweight external memory matrices allow for attentive computation that avoids the full quadratic affinity matrix (Guo et al., 2021).
The central motivation across LSA variants is to maintain the representation power of dense attention while achieving substantial efficiency gains, making attention practical for very long sequences or for fast inference in real-time and resource-limited scenarios.
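To make the contrast concrete, the following minimal NumPy sketch (illustrative only, not drawn from any specific paper above; function names and the particular feature map are assumptions) compares standard softmax attention with a generic kernel feature-map linearization: once $\exp(q^\top k)$ is replaced by $\phi(q)^\top \phi(k)$, the key/value statistics $\phi(K)^\top V$ can be aggregated once and reused for every query, reducing cost from $O(n^2 d)$ to $O(n d r)$ for an $r$-dimensional feature map.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes the n x n score matrix (O(n^2 d))."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized linear attention with a positive feature map phi
    (a simple ReLU-based map here, chosen purely for illustration).
    phi(K)^T V and phi(K)^T 1 are computed once and reused per query: O(n d r)."""
    Qf, Kf = phi(Q), phi(K)                  # (n, r) feature representations
    kv = Kf.T @ V                            # (r, d) aggregated key-value statistics
    normalizer = Qf @ Kf.sum(axis=0)         # (n,) per-query normalization
    return (Qf @ kv) / normalizer[:, None]

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_soft.shape, out_lin.shape)   # both (512, 64); values differ (approximation)
```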
2. Linearization Methodologies in LSA Mechanisms
Several core methodologies underpin LSA models:
Approach | Computational Principle | Key Examples |
---|---|---|
Low-rank projections | Keys/values projected via learned matrices $E, F \in \mathbb{R}^{k \times n}$ | Linformer (Wang et al., 2020), MLSA4Rec (Su et al., 18 Jul 2024) |
Random feature kernel | Exponential kernels factorized as $\exp(q^\top k) \approx \phi(q)^\top \phi(k)$ with random or trainable mappings $\phi$ | Performer, RFA (Zheng et al., 2022), LARA (Zheng et al., 2022, Yorsh et al., 2022) |
Histogram aggregation | Codeword histogram replaces full interactions; prefix sums/histograms over quantized codebooks | LISA (Wu et al., 2021) |
Sampling-based | Bernoulli sampling via Locality Sensitive Hashing collisions estimates attention contributions | YOSO (Zeng et al., 2021) |
External/global memory | Attention computed via two learnable memories, affording $O(n)$ complexity | External Attention (Guo et al., 2021) |
Trainable feedforward | Self-attention kernel approximated by shallow or GLU-based FFNs ensuring positivity | (Yorsh et al., 2022) |
Singular value domain | Attentive graph filter with polynomial filtering in learned singular value domain | AGF (Wi et al., 13 May 2025) |
Element-wise expansion | Exponential per-channel Taylor expansion replaces dense dot-products | Element-wise Attention (Feng, 10 Jan 2025) |
For instance, Linformer approximates the attention output as
$$\mathrm{Attn}(Q, K, V) \approx \mathrm{softmax}\!\left(\frac{Q\,(EK)^{\top}}{\sqrt{d}}\right) FV,$$
where the projection matrices $E, F \in \mathbb{R}^{k \times n}$ compress the $n$ key and value rows down to $k \ll n$. This reduces the cost from $O(n^2)$ to $O(nk)$.
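A minimal sketch of this projection follows; for illustration it uses fixed Gaussian matrices $E$ and $F$ where Linformer would learn them, and the function name is an assumption.

```python
import numpy as np

def linformer_attention(Q, K, V, k=64, rng=None):
    """Linformer-style attention sketch: project the n key/value rows down to k
    rows before the softmax, so the score matrix is n x k instead of n x n.
    E and F are random Gaussian projections here for illustration;
    in Linformer they are learned parameters."""
    rng = rng or np.random.default_rng(0)
    n, d = K.shape
    E = rng.standard_normal((k, n)) / np.sqrt(k)   # projects keys:   (k, n) @ (n, d) -> (k, d)
    F = rng.standard_normal((k, n)) / np.sqrt(k)   # projects values: (k, n) @ (n, d) -> (k, d)
    scores = Q @ (E @ K).T / np.sqrt(d)            # (n, k) instead of (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ (F @ V)                       # (n, k) @ (k, d) -> (n, d)

rng = np.random.default_rng(1)
Q, K, V = rng.standard_normal((3, 1024, 64))
print(linformer_attention(Q, K, V, k=64).shape)    # (1024, 64)
```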
In histogram-based approaches such as LISA, the input is quantized into a fixed number of codewords across multiple codebooks. Each token's representation is encoded as a histogram over these codewords, and attention is aggregated by computing context-dependent weighted averages over codewords, yielding fixed attention complexity that is independent of sequence length.
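The sketch below is a schematic, non-causal rendering of this idea with a single codebook and hard assignments (LISA's full formulation uses multiple codebooks and differs in detail; the function name and codebook setup are illustrative assumptions). It shows why the per-query cost depends on the number of codewords $C$ rather than on the sequence length.

```python
import numpy as np

def histogram_attention(Q, K, V, codebook):
    """Schematic codeword-histogram attention (single codebook, hard assignment;
    not LISA's exact multi-codebook scheme). Keys are snapped to their nearest
    codeword; attention is then computed over the C codewords, weighted by how
    many keys fell into each one."""
    C, d = codebook.shape
    # Hard-assign every key to its nearest codeword (O(n * C)).
    assign = np.argmin(((K[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
    counts = np.bincount(assign, minlength=C)                    # codeword histogram
    value_sums = np.zeros((C, d))
    np.add.at(value_sums, assign, V)                             # per-codeword value totals
    # Per-query softmax over C codewords, reweighted by the histogram counts.
    scores = Q @ codebook.T / np.sqrt(d)                         # (n, C)
    weights = np.exp(scores - scores.max(-1, keepdims=True)) * counts
    weights /= weights.sum(-1, keepdims=True)
    # Each codeword contributes the mean of the values assigned to it.
    codeword_values = value_sums / np.maximum(counts, 1)[:, None]
    return weights @ codeword_values

rng = np.random.default_rng(2)
Q, K, V = rng.standard_normal((3, 2048, 32))
codebook = rng.standard_normal((64, 32))                         # C = 64 codewords
print(histogram_attention(Q, K, V, codebook).shape)              # (2048, 32)
```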
3. Theoretical Properties and Trade-offs
Low-Rank and Approximation Guarantees
- For methods based on low-rank projections (Linformer, Revisiting Linformer (Verma, 2020)), theoretical results guarantee that with a projected dimension $k = O(d/\epsilon^2)$, chosen independently of the sequence length $n$, the low-rank projection retains the effect of full self-attention up to error $\epsilon$, by a Johnson–Lindenstrauss-type argument (Wang et al., 2020).
- Bias-variance trade-offs arise in random feature and sampling-based approaches (Zheng et al., 2022, Zeng et al., 2021). LARA (Zheng et al., 2022) reduces bias relative to random feature attention by employing multiple importance sampling and self-normalization, balancing the expressiveness of softmax attention with linear resource requirements.
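For concreteness, the sketch below implements the positive random feature map popularized by Performer-style estimators, $\phi(x) = \exp(\omega^\top x - \|x\|^2/2)/\sqrt{m}$ with $\omega \sim \mathcal{N}(0, I)$, which gives an unbiased Monte Carlo estimate of the softmax kernel $\exp(q^\top k)$. It is a generic random-feature baseline rather than the LARA estimator itself, and the function names are illustrative.

```python
import numpy as np

def positive_random_features(X, omega):
    """Positive random feature map: phi(x) = exp(omega^T x - ||x||^2 / 2) / sqrt(m).
    For omega ~ N(0, I), E[phi(q)^T phi(k)] = exp(q^T k)."""
    m = omega.shape[1]
    return np.exp(X @ omega - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

def rfa_attention(Q, K, V, num_features=256, rng=None):
    """Random-feature attention sketch: approximate softmax attention in O(n * m * d)."""
    rng = rng or np.random.default_rng(0)
    d = Q.shape[-1]
    omega = rng.standard_normal((d, num_features))
    Qf = positive_random_features(Q / d ** 0.25, omega)   # the 1/sqrt(d) temperature is
    Kf = positive_random_features(K / d ** 0.25, omega)   # split across queries and keys
    kv = Kf.T @ V                                          # (m, d) summary statistics
    normalizer = Qf @ Kf.sum(axis=0)                       # (n,) normalization estimate
    return (Qf @ kv) / normalizer[:, None]

rng = np.random.default_rng(3)
Q, K, V = rng.standard_normal((3, 256, 32))
print(rfa_attention(Q, K, V).shape)                        # (256, 32)
```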
Flexibility Versus Approximation Quality
- Trainable kernels (Yorsh et al., 2022) and element-wise Taylor expansions (Feng, 10 Jan 2025) increase the representational flexibility but introduce extra parameterization and tuning requirements. The order of the polynomial in element-wise attention controls the "spikiness" and expressiveness, trading off efficiency against fidelity to the original exponential attention distribution.
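As a generic illustration of this trade-off (a plausible per-channel reading, not necessarily the exact formulation of (Feng, 10 Jan 2025); the function name and the `order` parameter, corresponding to the polynomial order $P$ above, are assumptions): truncating $\exp(q_c k_c)$ at order $P$ in each channel $c$ reduces attention to $P+1$ running moment sums over keys and values, which can be accumulated recurrently in linear time, and raising $P$ sharpens the approximation at the cost of more accumulators.

```python
import numpy as np
from math import factorial

def elementwise_taylor_attention(Q, K, V, order=4):
    """Generic per-channel attention with exp(q_c * k_c) truncated at `order`
    (illustrative, not the exact formulation of the cited paper).
    For each power p, accumulate sum_i k_{i,c}^p * v_{i,c} and sum_i k_{i,c}^p;
    every query then needs only these (order + 1) per-channel moments."""
    n, d = Q.shape
    out_num = np.zeros((n, d))
    out_den = np.zeros((n, d))
    for p in range(order + 1):
        Kp = K ** p                                   # (n, d) per-channel key powers
        coeff = (Q ** p) / factorial(p)               # (n, d) per-channel query powers / p!
        out_num += coeff * (Kp * V).sum(axis=0)       # broadcast per-channel moment sums
        out_den += coeff * Kp.sum(axis=0)
    return out_num / out_den

rng = np.random.default_rng(4)
Q, K, V = 0.1 * rng.standard_normal((3, 1000, 16))    # small scale keeps the series stable
print(elementwise_taylor_attention(Q, K, V).shape)     # (1000, 16)
```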
Limiting Cases and Efficiency
- Recent analyses show that for some variants, e.g., codebook histogram-based LISA (Wu et al., 2021) and external attention (Guo et al., 2021), the limiting computational complexity is $O(n)$ (linear in sequence length), provided the codebook/memory size is held fixed.
- Methods that remove explicit dependencies on hyperparameters or matrix rank (via rearrangement of computations) offer further simplification and robustness across tasks (Verma, 2020).
4. Practical Applications and Empirical Results
NLP and Language Modeling
- LSA models such as Linformer and LLN Attention (Nahshan et al., 2023) deliver comparable performance to standard softmax attention on masked language modeling, GLUE, and long-sequence reading comprehension, at a fraction of time/memory consumption.
- Sampling-based and histogram-based approaches enable long document encoding, sequence tagging, and high-throughput generation (Wu et al., 2021, Zeng et al., 2021).
Computer Vision
- Plug-and-play LSA modules have been successfully integrated into vision transformers, object detection, and image segmentation backbones (Feng et al., 2021, Kang et al., 27 Feb 2024, Hechen et al., 2022). The ELSA module (Zhou et al., 2021) demonstrates improved local feature extraction via Hadamard attention and ghost head augmentation.
- AGF (Wi et al., 13 May 2025) mitigates over-smoothing in deep vision transformers, yielding state-of-the-art results on Long Range Arena and UEA time series benchmarks.
Multimodal and Recommendation Systems
- LSA architectures have been adapted for multimodal fusion of hyperspectral and LiDAR inputs (Feng et al., 2021), multimodal plug-in modules (channel and spatial attention), and codebook histogram-based sequence representations in recommendation (Wu et al., 2021, Su et al., 18 Jul 2024).
- MLSA4Rec (Su et al., 18 Jul 2024) illustrates linear-complexity hybrid recommender models by combining a low-rank LSA module (latent interest aggregation) with a selective state-space model (Mamba), outperforming both isolated LSA- and SSM-based recommenders.
5. Model Properties, Extensions, and Theoretical Insights
Decomposition and Parametrization
- The in-context learning (ICL) ability of linear attention models is highly sensitive to parametrization. Models with merged (shared) key/query matrices acquire ICL abruptly, through a single large drop in the loss as training escapes an unstable fixed point, and implement least-squares regression over cubic features (Zhang et al., 27 Jan 2025).
- In contrast, models with separate parametrization for keys and queries exhibit saddle-to-saddle training dynamics, corresponding to progressive learning of principal components of the input covariance—a mechanism interpretable as in-context principal component regression.
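The sketch below shows the two parametrizations side by side for a single linear-attention head on a stack of in-context tokens; the token layout, matrices, and function names are illustrative (not the exact setup of (Zhang et al., 27 Jan 2025)). The point is that both parametrizations compute the same function while factorizing the score matrix differently, and it is this factorization that drives the differing training dynamics.

```python
import numpy as np

def linear_attention_merged(Z, W_A, W_V):
    """Merged parametrization: a single matrix A replaces W_Q W_K^T,
    i.e. score(z_i, z_j) = z_i^T A z_j (no softmax)."""
    scores = Z @ W_A @ Z.T                      # (n, n) bilinear scores
    return scores @ Z @ W_V

def linear_attention_separate(Z, W_Q, W_K, W_V):
    """Separate parametrization: independent query and key maps.
    It expresses the same bilinear forms (A = W_Q W_K^T), but the
    factorized parametrization trains differently."""
    scores = (Z @ W_Q) @ (Z @ W_K).T
    return scores @ Z @ W_V

rng = np.random.default_rng(5)
n, d = 32, 8
Z = rng.standard_normal((n, d))                 # in-context tokens, e.g. stacked (x_i, y_i)
W_Q, W_K, W_V = rng.standard_normal((3, d, d))
out_sep = linear_attention_separate(Z, W_Q, W_K, W_V)
out_merged = linear_attention_merged(Z, W_Q @ W_K.T, W_V)
print(np.allclose(out_sep, out_merged))         # True: same function, different factorization
```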
Algorithmic Expressivity
- The introduction of bias terms in extended linear self-attention (ELSA) modules enables arbitrary constant outputs, skip connections, and general matrix multiplications (Hagiwara, 31 Mar 2025). This supports heuristic implementations of computational algorithms—such as unrolled gradient descent for ridge regression—by directly manipulating input and context in a sequence of ELSA blocks.
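The numeric sketch below illustrates the simplest version of this idea: the well-known construction in which one linear self-attention update with hand-set weights reproduces a single gradient step (from zero initialization) on an in-context least-squares/ridge objective. It is not the specific ELSA block of (Hagiwara, 31 Mar 2025); the token layout and weight choices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
d, N, eta = 4, 32, 0.05

# In-context regression data: y_i = w_true^T x_i + noise, plus one query point.
w_true = rng.standard_normal(d)
X = rng.standard_normal((N, d))
y = X @ w_true + 0.01 * rng.standard_normal(N)
x_q = rng.standard_normal(d)

# Tokens z_i = [x_i ; y_i]; the query token carries y = 0.
Z = np.concatenate([X, y[:, None]], axis=1)            # (N, d+1)
z_q = np.concatenate([x_q, [0.0]])                     # (d+1,)

# Hand-set weights for one linear-attention head (no softmax, illustrative):
#   score(z_i, z_q) = z_i^T W_K^T W_Q z_q = x_i^T x_q
#   value(z_i)      = W_V z_i = eta * [0, ..., 0, y_i]
W_Q = np.zeros((d + 1, d + 1))
W_Q[:d, :d] = np.eye(d)
W_K = W_Q.copy()
W_V = np.zeros((d + 1, d + 1))
W_V[d, d] = eta

scores = (Z @ W_K.T) @ (W_Q @ z_q)                     # (N,) = x_i^T x_q
delta = (Z @ W_V.T).T @ scores                         # (d+1,) attention update of z_q

# One gradient step on 0.5 * sum_i (w^T x_i - y_i)^2 + 0.5 * lam * ||w||^2 from w = 0:
# the gradient at w = 0 is -X^T y (the ridge term vanishes), so w_1 = eta * X^T y.
w_one_step = eta * X.T @ y
print(np.allclose(delta[d], w_one_step @ x_q))         # True: y-slot = one-GD-step prediction
```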
Graph Signal Processing and Spectral Properties
- AGF (Wi et al., 13 May 2025) provides a principled connection between self-attention and graph signal processing, revealing that standard attention acts as a low-pass graph filter. Learning advanced graph filters in the singular value domain allows LSA models to adaptively capture high-frequency information, thereby increasing expressive power while retaining linear complexity.
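A toy spectral illustration of this perspective (generic, not AGF's actual parametrization; the function name and filter coefficients are assumptions): applying a polynomial $p(\sigma)$ to the singular values of a row-stochastic attention matrix re-weights its frequency components. The full SVD is materialized here only because the example is tiny; linear-complexity methods operate on learned low-rank factors without forming the matrix.

```python
import numpy as np

def polynomial_spectral_filter(A, X, theta):
    """Apply a polynomial filter p(sigma) = sum_k theta_k * sigma^k to the
    singular values of the attention/affinity matrix A, then filter the signal X.
    Toy illustration only: the full SVD is O(n^3) and is avoided by linear methods."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    p_s = sum(t * s ** k for k, t in enumerate(theta))   # polynomial of singular values
    return (U * p_s) @ Vt @ X                            # U diag(p(s)) V^T X

rng = np.random.default_rng(7)
n, d = 64, 16
scores = rng.standard_normal((n, n))
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-stochastic attention
X = rng.standard_normal((n, d))

identity_filter = polynomial_spectral_filter(A, X, theta=[0.0, 1.0])        # p(s) = s recovers A @ X
reweighted      = polynomial_spectral_filter(A, X, theta=[0.0, -1.0, 2.0])  # p(s) = 2s^2 - s re-weights spectrum
print(identity_filter.shape, reweighted.shape)    # (64, 16) (64, 16)
```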
6. Limitations, Challenges, and Future Directions
- The assumption of attention-matrix low-rankness may fail on certain classes of tasks or on highly entropic/disordered inputs, warranting further empirical and theoretical study (Verma, 2020).
- Bias and approximation errors introduced by histogram quantization, random feature mapping, or sampling may degrade performance in regimes requiring fine-grained or rare-event modeling (Wu et al., 2021, Zheng et al., 2022).
- Architectural and training choices—including the rank of projections, size of codebooks, choice of kernel basis (e.g., Jacobi for stability), and the interplay of local/global attention—remain active research areas with significant impact on scalability and generalization.
- The integration of LSA mechanisms with other efficient architectures—such as SSMs (Mamba), deep recurrent mechanisms, or attention sharing across layers (LiSA)—offers new opportunities for both empirical gains and deeper theoretical understanding of what efficient, scalable attention should entail in modern large-scale models (Mu et al., 4 Aug 2024, Su et al., 18 Jul 2024).
- Further progress is anticipated in hybrid models that dynamically blend local and global LSA modules, data-adaptive graph filters, and task-aware parametrization for robust, high-throughput training and inference across modalities.
7. Summary Table of Representative LSA Models
Name/Ref. | Core Linearization Method | Domain(s) | Notable Properties |
---|---|---|---|
Linformer (Wang et al., 2020) | Low-rank projection via learned matrices | NLP, vision | $O(n)$ time/memory, theoretical guarantees |
Revisiting Linformer (Verma, 2020) | Rearranged low-rank, $k$-independent | NLP, vision, audio | Eliminates the projection-rank hyperparameter |
LISA (Wu et al., 2021) | Codebook histogram aggregation | Recommendation | $O(n)$, supports causal/noncausal attention |
External Attention (Guo et al., 2021) | Shared external memories, linear layers | Vision, MLP | Dataset-level priors, $O(n)$ |
LARA (Zheng et al., 2022) | Multiple randomized proposals, kernel | CV, NLP, video | Reduces RFA bias, linear complexity |
ViT-LSLA (Hechen et al., 2022) | Q projection only (replace K, V with input) | Vision | Reduces parameters, adds position |
AGF (Wi et al., 13 May 2025) | Polynomial graph filter in singular value | LRA, CV, time series | Graph spectrum, mitigates smoothing |
Element-wise Attention (Feng, 10 Jan 2025) | Per-channel Taylor expansion, RNN reform | Time series, generic | $O(n)$ training, $O(1)$ per-step inference |
ELSA (Zhou et al., 2021) | Hadamard attention, ghost heads | Vision, ViT | Drop-in, high-order local mapping |
Interactive MHSA (Kang et al., 27 Feb 2024) | Landmarking, cross-head layer | Vision | Decomposed, interactive, linear complexity |
MLSA4Rec (Su et al., 18 Jul 2024) | Low-rank interest, hybrid with Mamba | Recommendation | Local-global, linear, SSM hybrid |
LiSA (Mu et al., 4 Aug 2024) | Cross-layer sharing, low-rank diff. | LLMs | 6× Q/K compression, per-layer fusion |
The continuous development and refinement of LSA models reflect a fundamental trend in deep learning: the pursuit of expressive global context modeling with favorable resource scaling, facilitated by principled architecture and algorithm design grounded in approximation theory, optimization, and domain structure.