Factorized Self-Attention in Transformers

Updated 4 June 2026

Factorized self-attention is a method for decomposing Transformer attention into distinct channels to enhance efficiency, interpretability, and modularity.
It employs matrix algebra, axis-wise partitioning, and low-rank approximations, either to reduce computational complexity, as in FactoFormer and LAMA, or to analyze what attention computes, as in BFD.
Empirical results demonstrate that these techniques achieve semantic-positional disentanglement and significant runtime savings while maintaining model accuracy.

Factorized self-attention refers broadly to families of architectures, analytical techniques, and parameterizations in which the standard dense attention operations in Transformers are decomposed or factorized, algebraically or structurally, along one or more axes. This factorization targets one or more of (a) greater efficiency (reducing computational and memory complexity), (b) interpretability (disentangling semantic and positional interactions), and (c) architectural modularity (explicitly capturing structure such as spatial, temporal, or spectral separations). Factorization may be realized at the matrix algebra level (e.g., low-rank factorization or Kronecker structure), along data axes (e.g., separable spatiotemporal, spectral-spatial, or interlaced blocks), or at the level of the learned attention weights themselves. This article surveys methodologies, theoretical frameworks, and empirical outcomes connected to factorized self-attention mechanisms.

1. Analytical Frameworks: Bi-Orthogonal Factor Decomposition in Vision Transformers

The Bi-Orthogonal Factor Decomposition (BFD) framework provides a principled methodology for parsing the informational contents exchanged via self-attention in Vision Transformers (ViT) (Doshi et al., 8 Jan 2026). BFD proceeds in two complementary steps:

ANOVA-style Decomposition of Token Embeddings: For each token embedding $f_\ell^{(p)}(x) \in \mathbb{R}^d$ at layer $\ell$ (patch $p$, image $x$), embeddings are decomposed into three mutually orthogonal components (all expectations are over data/image and patch):
- Global mean: $\mu_L = \mathbb{E}_{x,p}[f_\ell^{(p)}(x)]$
- Positional bias: $\mu_P^{(p)} = \mathbb{E}_x[f_\ell^{(p)}(x)] - \mu_L$
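- Content residual: $\mu_C^{(p)}(x) = f_\ell^{(p)}(x) - \mu_L - \mu_P^{(p)}$

By construction, these components are exactly orthogonal under the expectation inner product: $\mathbb{E}_p[\mu_P^{(p)}] = 0$ and $\mathbb{E}_x[\mu_C^{(p)}(x)] = 0$, so every cross term vanishes on average.
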
Spectral (Bi-Orthogonal) Decomposition of Query-Key Matrix: For a single attention head,
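- Form $Q = A W_Q$ and $K = A W_K$ (with $A$ the token embedding matrix), giving $QK^\top = A W_{QK} A^\top$, where $W_{QK} = W_Q W_K^\top$.
- Singular-value-decompose $W_{QK} = U \Sigma V^\top = \sum_k \sigma_k u_k v_k^\top$, with $(u_k, \sigma_k, v_k)$ defining bi-orthogonal mode triplets.
- The raw query–key interaction matrix can be expressed as $QK^\top = \sum_k \sigma_k (A u_k)(A v_k)^\top$.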
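
The two steps above can be sketched in NumPy. This is a minimal illustration on synthetic embeddings, not the authors' implementation: the array `f`, the sizes, and the random `W_Q`, `W_K` are all hypothetical stand-ins for real ViT activations and weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_patches, d, d_head = 64, 16, 32, 8

# Hypothetical layer embeddings f[x, p, :] (synthetic, with a per-patch offset).
f = rng.normal(size=(n_images, n_patches, d)) + rng.normal(size=(1, n_patches, d))

# ANOVA-style split: global mean, positional bias, content residual.
mu_L = f.mean(axis=(0, 1))                 # (d,)
mu_P = f.mean(axis=0) - mu_L               # (n_patches, d)
mu_C = f - mu_L - mu_P                     # (n_images, n_patches, d)

# Orthogonality holds in expectation: each cross term averages to zero.
cross_LP = (mu_P @ mu_L).mean()
cross_LC = (mu_C @ mu_L).mean()
cross_PC = np.einsum("pd,xpd->xp", mu_P, mu_C).mean()

# Bi-orthogonal modes of one head's query-key form W_Q W_K^T (rank <= d_head).
W_Q = rng.normal(size=(d, d_head))
W_K = rng.normal(size=(d, d_head))
U, S, Vt = np.linalg.svd(W_Q @ W_K.T)

# Raw logits for one image, rebuilt as a sum over singular modes.
A = f[0]                                   # (n_patches, d)
logits = (A @ W_Q) @ (A @ W_K).T
modes = sum(S[k] * np.outer(A @ U[:, k], A @ Vt[k]) for k in range(d_head))
```

Only the first `d_head` singular values are non-zero here, which is why the sum over modes stops at `d_head`.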

Further factorizing the token embedding components, BFD projects each component along these singular directions, enabling precise ANOVA decomposition of attention "energy" into content–content (C–C), content–position (C–P), and position–position (P–P) contributions. Empirically, across ViTs and self-supervised DINOv2 models, the C–C channel dominates, with DINOv2 displaying richer and more distributed mode spectra, stronger C–P coupling, and tighter functional specialization by head and mode (Doshi et al., 8 Jan 2026).
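
A rough sketch of how such an energy split could be computed follows. The "energy" used here, the mean squared logit contribution of each term, is an illustrative assumption and may differ from BFD's exact definition; the data, sizes, and the matrix `W` are synthetic and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_images, n_patches, d, d_head = 32, 9, 16, 4
f = rng.normal(size=(n_images, n_patches, d)) + 0.5 * rng.normal(size=(1, n_patches, d))

mu_L = f.mean(axis=(0, 1))
P = f.mean(axis=0) - mu_L                  # positional component, (n_patches, d)
C = f - mu_L - P                           # content component, (n_images, n_patches, d)

# Stand-in for one head's query-key form W_Q W_K^T.
W = rng.normal(size=(d, d_head)) @ rng.normal(size=(d_head, d))

def bilinear(q, k):
    """Logit contribution q_i^T W k_j for every image and patch pair (i, j)."""
    return np.einsum("xid,de,xje->xij", q, W, k)

Pb = np.broadcast_to(P, C.shape)           # positional part, repeated per image
Lb = np.broadcast_to(mu_L, C.shape)        # global mean, repeated per token

terms = {
    "C-C": bilinear(C, C),
    "C-P": bilinear(C, Pb) + bilinear(Pb, C),
    "P-P": bilinear(Pb, Pb),
    "mean": bilinear(Lb, C + Pb) + bilinear(C + Pb, Lb) + bilinear(Lb, Lb),
}

full = bilinear(f, f)                      # equals the sum of all terms by bilinearity
energy = {name: float((t ** 2).mean()) for name, t in terms.items()}
total = sum(energy.values())
shares = {name: e / total for name, e in energy.items()}
```

The shares are normalized by the sum of per-term energies, so they describe relative channel magnitude rather than an exact partition of the full logit energy.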

2. Structured and Axis-Wise Factorizations: Spatiotemporal and Spectral-Spatial Decompositions

Factorization by data axis is exemplified by models such as FactoFormer (Mohamed et al., 2023) and spatiotemporal transformers in activity recognition (Dokkar et al., 2023). These architectures exploit the natural separability of input domains (e.g., space vs. time; spectral vs. spatial) to construct independent transformer branches:

FactoFormer: For hyperspectral data, tokens are constructed in two ways: one token per spectral band, covering the whole spatial patch ("spectral tokens"), and one token per spatial position, carrying that pixel's full spectrum ("spatial tokens"). The spectral and spatial branches are each processed by independent transformer layers, and their outputs are fused at the classification token ("CLS") level. This enables separate pretraining of each branch (self-supervised masking strategies) and reduces attention cost from $O((BS^2)^2)$, for joint attention over all $BS^2$ band-pixel elements, to $O(B^2 + S^4)$ per cube, with $B$ the number of bands and $S$ the patch dimension (Mohamed et al., 2023).
Spatiotemporal Factorization (ConViViT): For video, input is $X \in \mathbb{R}^{T \times N \times d}$ ($T$ frames, $N$ patches per frame, embedding dimension $d$), and two separate attention steps are performed: frame-wise spatial self-attention, followed by patch-wise temporal self-attention (or vice versa). Each step applies standard multi-head self-attention along its axis. The overall cost drops from $O((TN)^2)$ to $O(TN^2 + NT^2)$, a reduction by a factor of $TN/(T+N)$, alongside improved performance on activity recognition (Dokkar et al., 2023).
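
The spatial-then-temporal pattern can be sketched as follows. This is a simplified single-head version under stated assumptions: the residual connections, the random weights, and the helper names are hypothetical, and layer normalization, MLP blocks, and multi-head splitting are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head attention over the second-to-last axis of x, shape (..., n, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def factorized_spatiotemporal(x, spatial_w, temporal_w):
    """x: (T, N, d). Spatial attention per frame, then temporal attention per patch."""
    x = x + self_attention(x, *spatial_w)          # attends over N within each frame
    xt = np.swapaxes(x, 0, 1)                      # (N, T, d)
    xt = xt + self_attention(xt, *temporal_w)      # attends over T for each patch
    return np.swapaxes(xt, 0, 1)                   # back to (T, N, d)

def make_weights(rng, d):
    """Random query, key, and value projections (illustrative only)."""
    return tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

rng = np.random.default_rng(0)
T, N, d = 4, 6, 8
x = rng.normal(size=(T, N, d))
out = factorized_spatiotemporal(x, make_weights(rng, d), make_weights(rng, d))
```

Each step only ever builds score matrices of size $N \times N$ (one per frame) or $T \times T$ (one per patch), never the joint $TN \times TN$ matrix.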

Model/Method	Factorization Type	Main Benefit
BFD (ViT)	Statistical factor + SVD	Interpretable C–C, C–P, P–P decomposition
FactoFormer	Spectral/Spatial axis	$O(B^2 + S^4)$ vs $O((BS^2)^2)$
ConViViT/ViViT	Temporal/Spatial axis	$O(TN^2 + NT^2)$ vs $O((TN)^2)$
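
A quick count of query-key pairs makes the scaling of the two axis-wise rows concrete. The functions below are a hypothetical sketch: they count pairs only (ignoring the embedding dimension and constant factors), and the hyperspectral joint baseline of one token per band-pixel element is an assumption made for illustration. The example sizes are arbitrary.

```python
def video_pairs(T, N):
    """Query-key pairs for joint vs. spatial-then-temporal attention."""
    joint = (T * N) ** 2                 # every token attends to every token
    factorized = T * N**2 + N * T**2     # N x N per frame, plus T x T per patch
    return joint, factorized

def hyperspectral_pairs(B, S):
    """Query-key pairs for joint vs. dual-branch spectral/spatial attention."""
    joint = (B * S * S) ** 2             # assumed baseline: one token per band-pixel
    factorized = B**2 + (S * S) ** 2     # B spectral tokens, S*S spatial tokens
    return joint, factorized

# Arbitrary example sizes: 16 frames of 196 patches; 200 bands on a 7x7 patch.
video_joint, video_fact = video_pairs(16, 196)
hsi_joint, hsi_fact = hyperspectral_pairs(200, 7)

video_ratio = video_joint / video_fact   # equals T*N / (T + N)
hsi_ratio = hsi_joint / hsi_fact
```

For the video case the saving is exactly $TN/(T+N)$, so it grows with clip length and resolution rather than being a fixed constant.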

3. Algebraic Factorization: Low-Rank, Sparse, and Hybrid Attention Mechanisms

Direct algebraic factorization of the attention operation targets memory and compute savings, and includes both low-rank and sparse approaches.

Low-Rank Multi-Head Factorization: In LAMA (Mehta et al., 2019), each head's bilinear weight matrix $W$ is factorized as $W = P Q^\top$ (with $P$ and $Q$ thin factor matrices whose rank is far below the embedding dimension), yielding linear-in-sequence ($O(n)$) rather than quadratic complexity. Multiple heads are bundled via reshaping $P$ and $Q$, and attention is bilinear with respect to a single global context vector.
Factorized Synthesizer: Dot-product attention is replaced by a learned low-rank alignment matrix $R = R_1 R_2^\top$ ($R_1, R_2 \in \mathbb{R}^{n \times k}$, $k \ll n$). This head is input-agnostic and the softmax is row-wise over $R$. Despite being divorced from the explicit query-key mechanism, in encoding scenarios the factorized Synthesizer matches or outperforms Linformer and Transformer baselines at only $O(nk)$ parameter cost (Tay et al., 2020).
Interlaced Sparse Self-Attention: The affinity matrix is factorized multiplicatively: $A \approx A^S A^L$, with $A^L$ and $A^S$ each block-diagonal and constructed over long-range and short-range groupings, respectively. The composition ensures any two positions in a feature map communicate in two steps, achieving $O(N^{3/2} C)$ compute vs $O(N^2 C)$ for $N$ positions, $C$ channels (Huang et al., 2019).
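
The two low-rank variants can be illustrated with a short NumPy sketch. Part (1) is a generic low-rank bilinear score against a global context vector in the spirit of LAMA, not the exact formulation of Mehta et al.; part (2) follows the factorized alignment $R_1 R_2^\top$ described above. All sizes, names, and random weights are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, k = 10, 16, 3                       # sequence length, embedding dim, rank

X = rng.normal(size=(n, d))               # token embeddings
V = X @ rng.normal(size=(d, d))           # value projection (illustrative)

# (1) Low-rank bilinear scoring against one global context vector c.
P = rng.normal(size=(d, k))
Q = rng.normal(size=(d, k))
c = rng.normal(size=d)
scores_lowrank = (X @ P) @ (Q.T @ c)      # O(n d k); the d x d matrix is never formed
scores_full = X @ (P @ Q.T) @ c           # same values via the explicit bilinear form
weights = softmax(scores_lowrank)         # one attention weight per token
pooled = weights @ X                      # (d,) attended summary

# (2) Factorized Synthesizer: input-agnostic alignment R = R1 R2^T.
R1 = rng.normal(size=(n, k))
R2 = rng.normal(size=(n, k))
align = softmax(R1 @ R2.T, axis=-1)       # row-wise softmax, (n, n)
Y = align @ V                             # (n, d) outputs; align never looks at X
```

Note that the Synthesizer factorization saves parameters ($2nk$ instead of $n^2$); the $n \times n$ alignment matrix is still materialized before the softmax.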

Variant	Main Factorization	Expressiveness/Trade-off
LAMA	Low-rank bilinear	Context-dependent, linear
Factorized Synthesizer	Static low-rank align.	Input-agnostic, very efficient
Interlaced Sparse SA	Block-diag $A^S A^L$	Full coverage, $O(N^{3/2} C)$
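
The "full coverage" property of the interlaced scheme can be checked with a small sketch. It assumes $N = G \cdot M$ positions indexed as $n = gM + m$, uses the raw features as queries, keys, and values (no learned projections), and builds dense copies of the two sparse affinity matrices purely for inspection; all of these choices and names are illustrative simplifications.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention(x):
    """Dot-product self-attention inside each group; x: (groups, size, d)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    w = softmax(scores)
    return w @ x, w

def interlaced_attention(x, G, M):
    """x: (G*M, d), position n = g*M + m. Long-range step, then short-range step."""
    N, d = x.shape
    grid = x.reshape(G, M, d)
    # Long-range: positions sharing offset m, one from each block (stride M apart).
    long_out, w_long = grouped_attention(np.swapaxes(grid, 0, 1))   # (M, G, d)
    grid = np.swapaxes(long_out, 0, 1)                              # (G, M, d)
    # Short-range: the M neighbouring positions inside each block.
    short_out, w_short = grouped_attention(grid)
    # Dense N x N copies of the two sparse affinity matrices, for inspection only.
    A_long, A_short = np.zeros((N, N)), np.zeros((N, N))
    for m in range(M):
        idx = np.arange(G) * M + m
        A_long[np.ix_(idx, idx)] = w_long[m]
    for g in range(G):
        idx = g * M + np.arange(M)
        A_short[np.ix_(idx, idx)] = w_short[g]
    return short_out.reshape(N, d), A_short, A_long

rng = np.random.default_rng(0)
G, M, d = 4, 4, 8
x = rng.normal(size=(G * M, d))
out, A_short, A_long = interlaced_attention(x, G, M)
dense_equivalent = A_short @ A_long       # every entry is strictly positive
```

Each position attends to only $G + M$ others across the two steps instead of $N$; with $G = M = \sqrt{N}$ this gives the $N^{3/2}$ scaling.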

4. Eigenanalysis and Reconstruction: Low-Rank Subspace Structure in Attention

Attention mechanisms in standard Transformers have been empirically observed to inhabit low-dimensional subspaces, as shown by global eigenanalysis of the attention logit matrix over large data distributions (Bhojanapalli et al., 2021). The principal findings are:

The covariance of flattened attention logits (the vectorized query-key score matrices) is nearly low-rank, with a small number of leading eigencomponents capturing most of the variance.
Attention scores for new inputs can be reconstructed from a small sampled subset of directly computed scores via linear regression using the empirical covariance. This reduces the number of scores that must be computed directly per layer, with tolerable error.
Subspace structure is highly shared across layers and models; thus, pre-computation of reconstruction weights is feasible.

This supports data-driven, empirical low-rank factorization schemes for efficient and accurate self-attention (Bhojanapalli et al., 2021).
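
The idea can be illustrated on synthetic data. The sketch below is a simplified least-squares variant, not the procedure of the cited work: the logits are generated to lie near a known low-dimensional subspace, and the sizes, noise level, and sampling scheme are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, n_train = 8, 5, 400          # sequence length, subspace rank, training samples
D = n * n                          # length of one flattened logit matrix

# Synthetic stand-in: flattened logits lying near an r-dimensional subspace.
basis = rng.normal(size=(D, r))

def sample_logits(count):
    coeffs = rng.normal(size=(count, r))
    return coeffs @ basis.T + 0.01 * rng.normal(size=(count, D))

# Eigenanalysis of the empirical covariance over the training distribution.
train = sample_logits(n_train)
mean = train.mean(axis=0)
cov = np.cov(train, rowvar=False)                  # (D, D)
eigvals, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
top = eigvecs[:, -r:]                              # leading r eigenvectors
explained = eigvals[-r:].sum() / eigvals.sum()     # variance captured by the top r

# New input: compute only m of the D logits, then regress onto the subspace.
m = 16
observed = rng.choice(D, size=m, replace=False)
target = sample_logits(1)[0]
coef, *_ = np.linalg.lstsq(top[observed], target[observed] - mean[observed], rcond=None)
reconstruction = mean + top @ coef
rel_error = np.linalg.norm(reconstruction - target) / np.linalg.norm(target)
```

Because `top` and `mean` depend only on the training distribution, they can be precomputed once, which mirrors the point above about sharing reconstruction weights.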

5. Empirical Outcomes, Design, and Interpretability Implications

Factorized self-attention contributes both to greater interpretability and to practical performance gains:

Semantic-Positional Disentanglement: BFD reveals that most attention energy in both supervised and self-supervised ViTs is allocated to content–content channels. Content–position and position–position channels are nonetheless crucial for specific functional properties, with DINOv2 dedicating systematically more energy to content–position coupling and exhibiting richer, less mode-aligned spectra (Doshi et al., 8 Jan 2026).
Head and Mode Specialization: Singular modes within attention heads cluster into distinguishable functional classes (C–C, C–P, P–P). This specialization is tighter in self-supervised models.
Efficient Factorizations: Low-rank and axis-structured variants (e.g., LAMA, FactoFormer, Interlaced SA) achieve substantial savings in runtime and parameter count without sacrificing, and in some cases while improving, downstream accuracy on classification and segmentation benchmarks (Mehta et al., 2019, Mohamed et al., 2023, Huang et al., 2019).
Interpretability: Mechanistic explanations are enabled: e.g., BFD provides head/mode-level traces to specific (semantic or spatial) factor pairs, interlaced sparse SA covers all positions by design, and low-rank factor models elucidate principal subspaces underlying contextual dependencies.
Design Levers: Factorization introduces architectural and training levers, such as increasing the rank/capacity of the query-key interaction, explicit cross-modal gating, and spectrum-regularizing penalties for attention diversification (Doshi et al., 8 Jan 2026).

6. Generalizations and Application Domains

Factorized self-attention is broadly applicable across data modalities and use cases:

Multi-axis Data: FactoFormer’s dual-branch paradigm extends directly to video (temporal vs. spatial), multivariate time series (feature vs. time), and volumetric imaging (slice vs. region) (Mohamed et al., 2023).
Efficient Long-Range Modeling: Sparse and low-rank factorized models enable tractable self-attention for high-resolution imagery, long documents, videos, and large-scale multimodal fusion.
Self-Supervised and Hybrid Training: Self-supervised, axis-specific pretraining (e.g., masked band or patch prediction) is enabled by axis factorization. Hybrid losses leveraging both positional alignment and semantic contrast are supported (Mohamed et al., 2023, Doshi et al., 8 Jan 2026).

7. Limitations, Open Problems, and Future Directions

Present factorization schemes involve trade-offs in model expressiveness, input-dependency, and coverage. Static low-rank models (e.g., factorized Synthesizer) are highly efficient but may not capture context-specific dependencies. Sparse/factorized models reconstruct full affinity structure with fewer resources but may require careful groupings to cover all position pairs. Open problems include:

Systematic evaluation of expressiveness vs. efficiency among algebraic factorization families in high-capacity models.
Automated selection of optimal rank/cutoffs or permutation groupings for block-sparse compositions.
Integration of interpretability tools such as BFD with efficient axis-wise or algebraic factorization schemes to unify mechanistic transparency with operational efficiency.
Exploration of hybrid and dynamically-adaptive factorization strategies, potentially conditioned on input modality or content.

Factorized self-attention provides both an epistemic lens for analyzing arbitrary attention mechanisms and a constructive toolkit for engineering next-generation, efficient, and interpretable transformer architectures across modalities and tasks.