Factorized Self-Attention
- Factorized self-attention is a family of methods that decompose standard dot-product attention to reduce computation, memory overhead, and parameter count.
- Techniques include low-rank approximations, block decompositions, sparse sampling, and covariance-driven reconstructions to efficiently model long sequences.
- Empirical results show these methods achieve competitive performance in tasks like translation, video understanding, and classification while enhancing interpretability.
Factorized self-attention encompasses a family of architectural and algorithmic strategies for decomposing, approximating, or structurally modifying the standard dot-product self-attention mechanism in order to reduce computational and memory overhead, control parameter footprint, or enforce inductive priors. Across diverse neural models—including Transformers, convolutional recurrent cells, and hybrid architectures—factorization techniques have been exploited to enable efficient long-sequence modeling, enhance spatial-temporal expressiveness, and provide interpretability advantages. These approaches can involve low-rank factorizations of the attention (alignment) matrix, sequential or blockwise decomposition, structured sparsification, or analytic disentanglement of positional and content-related communication.
1. Mathematical Principles of Factorized Self-Attention
Factorized self-attention emerges from the observation that the canonical attention alignment matrix, $A = \operatorname{softmax}\!\left(QK^\top/\sqrt{d}\right)$, is often highly redundant: either intrinsically low-rank or amenable to sparse or structured factor representations in practical data regimes.
- Low-rank (bilinear) factorization: The alignment matrix may be approximated by $A \approx R_1 R_2^\top$, where $R_1, R_2 \in \mathbb{R}^{n \times k}$ for $k \ll n$, as in the Factorized Random Synthesizer and LAMA attention (Tay et al., 2020, Mehta et al., 2019).
- Structured block factorization: Decompose attention along spatial, temporal, or head/channel axes, performing attention within sub-blocks, then fusing outputs. Examples include spatial-temporal factorizations for video (ConViViT, ViViT) and blockwise sparse affinity products as in interlaced sparse self-attention (Dokkar et al., 2023, Huang et al., 2019).
- Algorithmic low-rank approximation: Use partial eigen-decomposition or covariance-driven schemes to reconstruct full attention from a small set of computed affinities (Bhojanapalli et al., 2021).
- Synthesis via learned factors: Replace the query-key product $QK^\top$ entirely with a parameterized or random factorized construction, as in Synthesizers, decoupling attention from explicit query-key token interactions (Tay et al., 2020).
These factorizations reduce the quadratic cost in sequence/spatial length to linear or subquadratic, with modest trade-offs in expressivity.
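The low-rank premise above can be checked directly. The following NumPy sketch (all sizes illustrative, not taken from any cited paper) builds an attention matrix from smoothly correlated token embeddings, a regime in which attention is often redundant, and measures how well a truncated SVD approximates it:

```python
# Illustrative check of the low-rank redundancy observation behind
# factorized self-attention; n, d, k are arbitrary assumed sizes.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 128, 16, 8

# Smooth embeddings: nearby positions are correlated.
t = np.linspace(0.0, 1.0, n)[:, None]
X = np.sin(t * np.arange(1, d + 1) * np.pi) + 0.1 * rng.normal(size=(n, d))
Q, K = X @ rng.normal(size=(d, d)), X @ rng.normal(size=(d, d))

S = Q @ K.T / np.sqrt(d)                      # raw alignment logits
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)             # row-softmax attention matrix

U, s, Vt = np.linalg.svd(A)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]             # best rank-k approximation

rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"rank-{k} relative Frobenius error: {rel_err:.3f}")
```

The relative error of the rank-$k$ truncation is exactly the tail singular-value energy, so it quantifies how much of the attention matrix a rank-$k$ factorization can capture on this data.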
2. Taxonomy of Factorization Techniques
A broad division of factorized self-attention approaches can be made as follows:
| Approach | Method | Core Formula or Strategy |
|---|---|---|
| Low-rank factor synthesis | Random or learnable factors | $A = R_1 R_2^\top$, $R_1, R_2 \in \mathbb{R}^{n \times k}$ |
| Dense MLP-based factorization | Per-token parameterization | Logits predicted per token from $x_i$, assembled from tiled low-rank factors |
| Bilinear compact models | Global context, low-rank maps | Single global context vector with low-rank alignment maps |
| Eigen/covariance reconstruction | Subset computation + SVD/$\Sigma$ | Compute partial scores, reconstruct $A$ in a low-dimensional eigenbasis |
| Block-structural decomposition | Spatial/temporal/focal blocks | Attend within sub-blocks, then fuse across axes |
| Sparse/dilated window factor | Window select, fusion | Split input, sample/sparse keys, fuse |
Methods such as the Factorized Random Synthesizer (Tay et al., 2020), LAMA (Mehta et al., 2019), and Linformer employ low-rank parameterizations. Structured variants for spatiotemporal or spatial arrangements (FaViT (Qin et al., 2023), ConViViT (Dokkar et al., 2023), interlaced sparse (Huang et al., 2019)) use block decomposition or interleaving. Reconstruction-based methods exploit the empirical low effective rank of real attention matrices (Bhojanapalli et al., 2021).
3. Algorithmic Workflows and Complexity
Factorized Random Synthesizer (Tay et al., 2020):
- Parameter initialization: $R_1 \in \mathbb{R}^{n \times k}$, $R_2 \in \mathbb{R}^{n \times k}$.
- Alignment logits: $A = R_1 R_2^\top$ ($A \in \mathbb{R}^{n \times n}$).
- Row-softmax: $P = \operatorname{softmax}(A)$, applied per row.
- Value projection & output: $Y = P V$ with $V = X W_V$.
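The steps above can be sketched end-to-end in NumPy; the snippet below is an illustrative implementation, with the sizes $n$, $k$, $d$ and the initialization scale chosen arbitrarily rather than drawn from the original paper:

```python
# Illustrative factorized random Synthesizer forward pass: alignment
# logits come from learned factors R1, R2, not from Q K^T.
import numpy as np

def factorized_random_synthesizer(X, R1, R2, W_v):
    """Attention output whose logits are R1 @ R2.T (truncated to len(X))."""
    n = X.shape[0]
    S = R1[:n] @ R2[:n].T                     # (n, n) alignment logits
    S = S - S.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(S)
    P /= P.sum(axis=1, keepdims=True)         # row-softmax
    return P @ (X @ W_v)                      # value projection and mixing

rng = np.random.default_rng(0)
n_max, k, d, n = 64, 8, 32, 50
R1 = rng.normal(size=(n_max, k)) * 0.1       # O(n k) parameters each,
R2 = rng.normal(size=(n_max, k)) * 0.1       # vs O(n^2) unfactorized
W_v = rng.normal(size=(d, d)) / np.sqrt(d)

X = rng.normal(size=(n, d))
Y = factorized_random_synthesizer(X, R1, R2, W_v)
print(Y.shape)  # (50, 32)
```

Note that the logits depend only on position, not content, so the same attention pattern is applied to every input of a given length; this is exactly the decoupling described in Section 1.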
Factorized Dense Synthesizer:
- Per-position factor extraction: $a_i = F_A(x_i) \in \mathbb{R}^{k_a}$, $b_i = F_B(x_i) \in \mathbb{R}^{k_b}$.
- Matrix assembly: $A_i = H_A(a_i) \odot H_B(b_i)$, where $H_A, H_B$ tile the factors to length $n = k_a k_b$.
- Softmax, output as above.
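A hedged sketch of the factorized dense step: each token produces two small factor vectors whose tiled elementwise combination yields its length-$n$ row of alignment logits. The factor extractors, nonlinearity, and sizes below are illustrative assumptions (with $k_a k_b = n$), not the paper's exact configuration:

```python
# Illustrative factorized dense Synthesizer logits: per-token factors
# a_i (size ka) and b_i (size kb) are tiled to cover n = ka * kb keys.
import numpy as np

def factorized_dense_logits(X, W_a, W_b, ka, kb):
    """Assemble an (n, n) logits matrix from small per-token factors."""
    n = X.shape[0]
    assert ka * kb == n, "tiled factors must cover the sequence length"
    A = np.tanh(X @ W_a)                      # (n, ka) per-position factor a_i
    B = np.tanh(X @ W_b)                      # (n, kb) per-position factor b_i
    Ha = np.repeat(A, kb, axis=1)             # (n, n): each a-entry kb times
    Hb = np.tile(B, (1, ka))                  # (n, n): b repeated ka times
    return Ha * Hb                            # elementwise-combined logits

rng = np.random.default_rng(1)
n, d, ka, kb = 48, 32, 6, 8
X = rng.normal(size=(n, d))
W_a = rng.normal(size=(d, ka)) / np.sqrt(d)
W_b = rng.normal(size=(d, kb)) / np.sqrt(d)
S = factorized_dense_logits(X, W_a, W_b, ka, kb)
print(S.shape)  # (48, 48)
```

Each token thus needs only $k_a + k_b$ predicted values instead of $n$, which is the source of the parameter savings.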
Bilinear/LAMA Compact Form (Mehta et al., 2019):
- Replace standard per-position queries with a single global context vector and utilize a low-rank factorization of the resulting alignment weights.
Covariance/Eigendecomposition (Bhojanapalli et al., 2021):
- Compute a small subset of attention scores, then reconstruct the remainder via trained maps or projection onto a low-dimensional eigenbasis.
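To illustrate why partial computation can suffice, the sketch below exploits the fact that the logit matrix $QK^\top$ has rank at most $d \ll n$: a few fully computed probe rows supply a basis for the score space, and every remaining row is recovered from only $m$ sampled entries by least squares. This is a simplified stand-in for the covariance-driven scheme, not the cited paper's algorithm; all sizes are assumed:

```python
# Illustrative partial-computation reconstruction of attention logits.
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 128, 16, 32
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
S = Q @ K.T / np.sqrt(d)                      # ground-truth logits, rank <= d

probe = slice(0, 16)                          # rows computed in full
_, _, Vt = np.linalg.svd(S[probe], full_matrices=False)
basis = Vt                                    # (d, n) spans the score row space

cols = rng.choice(n, size=m, replace=False)   # m entries computed per row
coef, *_ = np.linalg.lstsq(basis[:, cols].T, S[16:, cols].T, rcond=None)
S_rec = coef.T @ basis                        # reconstructed remaining rows

err = np.linalg.norm(S[16:] - S_rec) / np.linalg.norm(S[16:])
# Near-zero: since rank(S) <= d, m >= d sampled entries per row suffice.
print(f"relative reconstruction error: {err:.2e}")
```

Real attention matrices are only approximately low-rank after the softmax, which is why the practical method trades a small accuracy drop for the FLOP savings.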
Block-Sparse/Interlaced (Huang et al., 2019, Qin et al., 2023):
- Partition feature/function space, perform local sparse attention, then combine outputs via permutation or aggregation to achieve global mixing.
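The interlaced idea can be sketched as two cheap block-attention passes connected by a permutation: a long-range pass attends among tokens sharing the same within-block offset, then a short-range pass attends within contiguous blocks, so information mixes globally without any full $n \times n$ computation. Block size and dimensions below are illustrative assumptions:

```python
# Illustrative interlaced block-sparse attention in NumPy.
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def block_attention(X, W_q, W_k, W_v):
    """Full attention inside each leading-axis group of X: (groups, b, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    P = softmax_rows(Q @ K.transpose(0, 2, 1) / np.sqrt(X.shape[-1]))
    return P @ V

def interlaced_attention(X, params, block):
    n, d = X.shape
    g = n // block
    # Long-range: group tokens that share the same offset across blocks.
    Xl = X.reshape(g, block, d).transpose(1, 0, 2)       # (block, g, d)
    Xl = block_attention(Xl, *params)
    # Short-range: undo the permutation, attend within contiguous blocks.
    Xs = Xl.transpose(1, 0, 2)                           # (g, block, d)
    return block_attention(Xs, *params).reshape(n, d)

rng = np.random.default_rng(2)
n, d, block = 64, 16, 8
params = tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Y = interlaced_attention(rng.normal(size=(n, d)), params, block)
print(Y.shape)  # (64, 16)
```

Each pass costs only $O(n \cdot \text{block} \cdot d)$, and after the two passes every token has a (two-hop) path to every other token.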
Complexity Reduction:
- Standard self-attention: $O(n^2 d)$ time (memory: $O(n^2)$).
- Factorized random: $O(nk)$ parameters and factor memory, versus $O(n^2)$ for unfactorized random logits.
- Windowed/block: $O(nwd)$ for window size $w$.
- Eigen/covariance: $O(nmd)$ for $m$ computed entries per row.
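The asymptotic claims above translate into concrete ratios; the arithmetic below uses illustrative sizes (not benchmark settings) and counts only the score/parameter terms, ignoring constants:

```python
# Back-of-the-envelope complexity comparison for the bounds above.
# n = tokens, d = model dim, k = factor rank, w = window, m = entries/row.
n, d, k, w, m = 4096, 64, 32, 256, 128

standard_time = n * n * d      # Q K^T over all position pairs
standard_mem  = n * n          # dense attention matrix
factor_params = 2 * n * k      # R1 and R2, vs n*n free random logits
windowed_time = n * w * d      # scores only inside width-w windows
partial_time  = n * m * d      # m explicitly computed entries per row

print(f"standard/windowed time ratio: {standard_time / windowed_time:.0f}x")
print(f"unfactorized/factorized parameter ratio: {n * n / factor_params:.0f}x")
```

For these sizes the windowed scheme cuts score FLOPs by $n/w = 16\times$ and the factorization cuts the random-logit parameter count by $n/2k = 64\times$, matching the bounds listed above.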
4. Applications, Benchmarks, and Empirical Findings
Factorized self-attention has demonstrated competitive or superior performance across machine translation, language modeling, classification, semantic segmentation, time series, and video understanding:
- Synthesizer variants: Nearly match or marginally trail vanilla Transformers on WMT'14 MT (27.30 BLEU vs 27.67, at reduced parameter cost) and LM1B (40.6 PPL vs 38.1), but as a hybrid significantly outperform on GLUE/SuperGLUE (+0.6, +1.9 over T5 base) (Tay et al., 2020).
- FaSA/FaViT: Matches Swin-T's efficiency but surpasses it in both accuracy (+1%) and robustness (+6.6pp on ImageNet-C), and further improves instance/semantic segmentation (Qin et al., 2023).
- LAMA: Provides up to a 65% parameter reduction on text tasks (News, Reuters, IMDB), with similar or slightly better accuracy than BERT or CNN/GRU models (Mehta et al., 2019).
- Eigen/covariance reconstruction: Yields FLOP reductions of $25\%$ or more at the cost of only a $2$pp drop in MNLI downstream accuracy (Bhojanapalli et al., 2021).
- Spatiotemporal factorization: In video, factorized SA (spatial then temporal) outperforms both full and parallel (dot-product) attention, as evidenced by ConViViT’s state-of-the-art results on HMDB51 (90.05%) and others (Dokkar et al., 2023).
Across these tasks, pure factorized modules alone typically trail full dot-product SA by a small margin, but hybridization (composition) yields best-in-class results, indicating strong complementarity (Tay et al., 2020).
5. Implementation Strategies and Practical Caveats
- Parameter dependency on sequence length: Most factorized schemes (especially explicit ones) scale parameter count with the maximum sequence length $n$, requiring either truncation or tiling for variable-length data (Tay et al., 2020).
- Factor size/rank selection: The factor rank $k$ must balance expressivity and compactness to avoid underfitting or overfitting; practical values are often small, on the order of $32$ or below (Tay et al., 2020, Mehta et al., 2019).
- Stability/regularization: Lower parameter count aids overfitting control; no special stabilization tricks are needed beyond standard training recipes (Tay et al., 2020).
- Sparse/global context: Some factorized blocks allow information propagation across the global context by permutation, fusion, or occasional global steps (e.g., an axial-attention step inserted at fixed intervals in FAConvLSTM (Nji et al., 16 Jan 2026)).
- Implementation details: Fast variants exploit modern deep learning primitives (unfold/sampling/aggregation), and can support dynamic cropping or windowing at inference (Qin et al., 2023, Huang et al., 2019).
6. Variants and Theoretical Analysis
- Statistical/analytic factorization: Recent analytic approaches such as Bi-Orthogonal Factor Decomposition (BFD) do not modify the architecture but instead provide decompositional insight into the separation of positional and content effects in learned attention matrices (Doshi et al., 8 Jan 2026).
- Partial/approximate computation: Covariance-driven selection schemes offer theoretical guarantees on error vs. cost and exploit the empirical concentration of attention in low-dimensional eigenspaces (Bhojanapalli et al., 2021).
- Interpretability: Attention heads or singular modes induced by factorization acquire specialized roles (content-content vs. content-position), which correlate with robust shape/semantic sensitivity in self-supervised vision transformers (Doshi et al., 8 Jan 2026).
7. Limitations and Future Extensions
- Granularity and information loss: Factorization imposes structural biases and may discard subtle correlations (e.g., very sparse keys in FaSA might miss certain dependencies) (Qin et al., 2023).
- Sequence length scaling remains nontrivial, as most schemes retain quadratic computation for value mixing unless the projection dimension or window size is aggressively minimized.
- Potential enhancements: Learnable or adaptive fusion across blocks, hybridization with global tokens, and integration with efficient patch-embedding or MLP architectures constitute active directions. Statistical diagnostic methods may further refine or inform new factorization strategies (Qin et al., 2023, Doshi et al., 8 Jan 2026).
Factorized self-attention subsumes a diverse set of methods for reducing the computational and memory burden of self-attention, exploiting empirical rank structure or modularizing attention for spatial, temporal, or hybrid architectures. These techniques offer strong empirical utility on large-scale tasks and serve as a foundation for further theoretical and practical advances in efficient neural sequence modeling (Tay et al., 2020, Qin et al., 2023, Huang et al., 2019, Mehta et al., 2019, Bhojanapalli et al., 2021, Dokkar et al., 2023, Doshi et al., 8 Jan 2026, Nji et al., 16 Jan 2026).