Self-Attention in Transformers

Updated 7 November 2025
  • Self-attention in Transformers is a mechanism that computes contextual representations by modeling pairwise token interactions through learned query-key dot products.
  • It enables multi-head parallel processing and deep contextual reasoning by dynamically adjusting attention weights based on input content.
  • Limitations such as quadratic complexity and dispersed attention prompt research into efficient approximations and architectural enhancements.

Self-attention is a core mechanism in the Transformer architecture, enabling models to compute contextual representations of each sequence element by explicitly modeling all pairwise dependencies via a learned, content-dependent affinity matrix. The canonical self-attention formulation, as originally introduced in "Attention Is All You Need" (Vaswani et al., 2017), computes, for each token, a weighted sum of all other tokens, with the weights determined by the similarity of learned query and key projections. This mechanism underlies the scalability, modeling power, and parallelization advantages of Transformers, and its mathematical characteristics, representational structure, and practical consequences have been studied from perspectives including graph filtering, inductive bias, efficiency, generalization, and inherent computational limitations.

1. Mathematical Structure and Core Computation

The standard self-attention module operates on an input sequence $X \in \mathbb{R}^{n \times d}$ by first projecting each token embedding via learned, independent linear transformations to queries $Q \in \mathbb{R}^{n \times d_k}$, keys $K \in \mathbb{R}^{n \times d_k}$, and values $V \in \mathbb{R}^{n \times d_v}$:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V.$$

The attention affinity matrix $A \in \mathbb{R}^{n \times n}$ is then computed as

$$A_{ij} = \mathrm{softmax}_j\left( \frac{q_i^\top k_j}{\sqrt{d_k}} \right),$$

where each row sums to one. The output at each position is a content- and position-weighted mixture over value vectors:

$$Z = AV.$$

Multi-head attention concatenates $h$ parallel copies of this operation, each with its own projection weights.
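
As a concrete illustration, the following numpy sketch implements exactly this computation for a toy sequence: scaled dot-product attention per head, with independently drawn projection matrices standing in for learned parameters and the conventional choice $d_k = d_v = d/h$ for the head dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention: Z = softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # n x n affinity matrix, rows sum to 1
    return A @ V, A

# Toy example: n = 5 tokens, model width d = 16, h = 4 heads (d_k = d_v = 4 per head).
rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
X = rng.normal(size=(n, d))
heads = []
for _ in range(h):
    W_q, W_k, W_v = (rng.normal(size=(d, d // h)) for _ in range(3))
    Z, A = self_attention(X, W_q, W_k, W_v)
    heads.append(Z)
Z_multi = np.concatenate(heads, axis=-1)           # concatenate heads: back to shape (n, d)
print(Z_multi.shape, A.sum(axis=-1))               # (5, 16); each attention row sums to 1.0
```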

This mechanism defines a fully-connected, directed, content-sensitive graph over the sequence, with learning and generalization driven by the structure of $A$. The process is repeated and stacked, yielding deep contextual representations.

2. Representational Dynamics and Inductive Bias

Self-attention's reliance on the affinity matrix structurally aligns with the broad paradigm of affinity-based computation frameworks, such as Infinite Feature Selection (Inf-FS) (Roffo, 19 Jul 2025), spectral clustering, and nonlocal image filtering (Abdullaev et al., 12 Jun 2025). The Transformer implementation is distinguished by constructing $A$ dynamically as a content-dependent, input-specific quantity, as opposed to the handcrafted or statistics-based affinities used in prior models.

In feature selection or nonlocal means filtering, importance or denoised values are computed by aggregation over potentially infinite multi-hop relationships via a power series on $A$:

$$S = \sum_{k=1}^{\infty} \alpha^k A^k = (I - \alpha A)^{-1} - I.$$

Self-attention restricts this to a single (layer-wise) hop and composes multi-hop reasoning via stacking. This difference impacts both expressivity and inductive bias: the stacking mechanism enables the model to perform deep contextual reasoning, but the reliance on the dot-product for $A$ encodes specific similarities, and the softmax normalization imposes locality and focus.
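
As a sanity check on the closed form, the short sketch below compares the truncated power series with $(I - \alpha A)^{-1} - I$ for an arbitrary nonnegative affinity matrix; $\alpha$ is scaled so that the spectral radius of $\alpha A$ stays below one and the series converges. This illustrates the aggregation formula itself rather than any particular Inf-FS or nonlocal-filtering implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.random((n, n))                                   # illustrative nonnegative affinity matrix
alpha = 0.5 / np.abs(np.linalg.eigvals(A)).max()         # spectral radius of alpha*A is 0.5 < 1

# Closed form of the infinite multi-hop aggregation: S = (I - alpha*A)^{-1} - I
S_closed = np.linalg.inv(np.eye(n) - alpha * A) - np.eye(n)

# Truncated power series sum_{k=1..K} (alpha*A)^k converges to the same matrix
S_series = sum(np.linalg.matrix_power(alpha * A, k) for k in range(1, 60))

print(np.allclose(S_closed, S_series))                   # True: series matches the closed form
```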

Self-attention has an inherent tendency to concentrate "attentional resources" and can exhibit "explaining away" effects, whereby only a subset of input tokens significantly influence each output (see (Ding et al., 2020) and the discussion of doubly-normalized attention schemes). This mechanism does not guarantee that all input information is propagated forward, and may require variants such as doubly-stochastic normalization (Sinkhorn normalization) to ensure information flow from all tokens.
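
A minimal sketch of the doubly-stochastic alternative is given below: Sinkhorn iteration alternately normalizes the rows and columns of the exponentiated scores, so that every token retains a fixed total amount of outgoing attention and cannot be fully explained away. This is an illustrative rendering of the general idea rather than the exact scheme analyzed by Ding et al. (2020).

```python
import numpy as np

def sinkhorn_attention(scores, n_iters=20):
    """Normalize exp(scores) to an (approximately) doubly stochastic matrix.

    Unlike row-wise softmax, the result has both row sums and column sums near 1,
    so no token's contribution can be entirely 'explained away'.
    """
    P = np.exp(scores - scores.max())          # positive kernel, numerically stable
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)      # row normalization
        P /= P.sum(axis=0, keepdims=True)      # column normalization
    return P

rng = np.random.default_rng(2)
scores = rng.normal(size=(4, 4))
P = sinkhorn_attention(scores)
print(P.sum(axis=1), P.sum(axis=0))            # both close to 1 for every row and column
```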

3. Limitations, Capacity, and Architectural Constraints

Self-attention, despite its theoretical ability to model arbitrary pairwise dependencies, is subject to intrinsic architectural limitations. Empirical investigations demonstrate that Transformer-based models exhibit sharply limited working memory capacity on N-back tasks (Gong et al., 16 Sep 2024). Training vanilla, decoder-only Transformers on N-back benchmarks reveals a logarithmic decline in accuracy as $N$ increases, paralleling human cognitive limitations.

The mechanistic deficit is traced to dispersion in the attention weight matrix. While the model learns to focus attention on position $i-N$ to solve the N-back task, as $N$ increases the attention entropy,

$$H_N(A) = -\sum_{i=1}^{T} \sum_{j=1}^{i} A_{ij} \log A_{ij},$$

increases correspondingly, reflecting a failure to sharply focus on the relevant information. As total entropy rises, accuracy falls, indicating that the self-attention mechanism's ability to allocate selective, focused attention is fundamentally constrained by interference and dispersion. These bottlenecks are not alleviated by scaling context length or parameter count, and persist even in small models where the context window is not a limiting factor.
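
The entropy diagnostic is straightforward to compute from an attention map. The sketch below follows the formula above for a causally masked map (the inner sum runs over $j \le i$) and confirms that a sharply peaked map scores lower than a uniformly dispersed one; the matrices are synthetic examples rather than outputs of a trained model.

```python
import numpy as np

def attention_entropy(A):
    """Total entropy H(A) = -sum_i sum_{j<=i} A_ij log A_ij over a causal attention map."""
    T = A.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))    # causal: position i attends only to j <= i
    P = np.where(mask, A, 0.0)
    P = P / P.sum(axis=1, keepdims=True)           # renormalize rows over the allowed positions
    logP = np.where(P > 0, np.log(P), 0.0)         # convention: 0 * log 0 = 0
    return -(P * logP).sum()

T = 8
sharp = np.eye(T) * 0.94 + 0.06 / T                # nearly one-hot rows: focused attention
diffuse = np.ones((T, T)) / T                      # uniform rows: maximal dispersion
print(attention_entropy(sharp) < attention_entropy(diffuse))   # True
```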

A plausible implication is that vanilla self-attention mechanisms lack the architectural means for resource-adaptive or structured information retrieval required for scalable working memory and algorithmic reasoning tasks.

4. Effects on Representation Geometry: Anisotropy and Information Flow

An inherent property of self-attention in Transformers is the development of anisotropic geometry in the hidden representations (Godey et al., 22 Jan 2024). Empirical studies show that, regardless of input modality and training objective, Transformer representations exhibit high pairwise cosine similarity—unlike random high-dimensional vectors, which are nearly orthogonal. This anisotropy emerges from learning dynamics that "drift" queries and keys (mean representations) together over training, resulting in large values of $QK^\top$ and facilitation of sharp (low entropy) attention distributions.
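
Anisotropy is commonly quantified as the mean pairwise cosine similarity of hidden states. The sketch below uses random vectors as stand-ins for Transformer representations and reproduces the qualitative contrast reported in the cited studies: isotropic random vectors are nearly orthogonal on average, while vectors sharing a drifted common component exhibit high similarity.

```python
import numpy as np

def mean_cosine_similarity(H):
    """Average pairwise cosine similarity of the rows of H (a common anisotropy measure)."""
    H = H / np.linalg.norm(H, axis=1, keepdims=True)
    sims = H @ H.T
    n = H.shape[0]
    return (sims.sum() - n) / (n * (n - 1))        # exclude the diagonal (self-similarity)

rng = np.random.default_rng(3)
n, d = 200, 512
isotropic = rng.normal(size=(n, d))                # random vectors: nearly orthogonal
drift = rng.normal(size=d) * 5.0                   # shared mean component ("drift")
anisotropic = rng.normal(size=(n, d)) + drift      # simulated drifted representations
print(mean_cosine_similarity(isotropic))           # close to 0
print(mean_cosine_similarity(anisotropic))         # large positive value
```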

Self-attention thus enforces a trade-off: sharpening attention for focusing purposes necessarily degrades the isotropy (expressive geometry) of the embedding space; centering or whitening the representations post hoc can mitigate this, but may degrade attention's selection fidelity. This effect is absent in architectures without self-attention, such as CNNs, which retain isotropic representations.

The implication is that controlling or mitigating anisotropy requires architectural changes to self-attention itself, not merely representational post-processing.

5. Interpretability, Application to Different Modalities, and Causal Perspectives

Self-attention mechanisms admit detailed empirical and theoretical interpretability. In vision, self-attention in Vision Transformers (ViTs) can be analyzed as semantic grouping, not classic attention: the mechanism acts as a global relaxation labeling process based on similarity, not selective amplification (Mehrani et al., 2023). This perceptual grouping property emerges naturally from the dot-product kernel and feed-forward architecture.

In self-supervised audio transformers, attention heads can be quantitatively categorized (global, diagonal, vertical) with distinct functionalities: diagonal heads track local and phonemic structure, vertical heads focus on speaker identity, and global heads often provide redundant or less useful structure (Yang et al., 2020). Visualization, pruning, and attention span diagnostics enable detailed functional interpretations and model refinement.
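
Simple diagnostics suffice to separate these head types from their attention maps; the sketch below computes illustrative proxies for diagonality, verticality, and global dispersion. These are assumed, simplified measures rather than the exact metrics used by Yang et al. (2020).

```python
import numpy as np

def head_diagnostics(A):
    """Illustrative diagnostics for categorizing an attention head from its map A (T x T)."""
    T = A.shape[0]
    idx = np.arange(T)
    offset = np.abs(idx[:, None] - idx[None, :])
    diagonality = (A * offset).sum() / T           # small => attention hugs the diagonal
    verticality = A.mean(axis=0).max()             # large => one column (token) dominates
    dispersion = -(A * np.log(A + 1e-12)).sum() / T  # large => diffuse, "global" head
    return diagonality, verticality, dispersion

T = 10
diagonal_head = np.eye(T) * 0.9 + 0.1 / T          # synthetic map concentrated on the diagonal
diagonal_head /= diagonal_head.sum(axis=1, keepdims=True)
print(head_diagnostics(diagonal_head))
```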

Recent work frames self-attention as implicitly estimating a structural causal model (SCM) over input sequences, with the attention matrix serving as an estimator of the total-effect matrix in linear-Gaussian SCMs (Rohekar et al., 2023). This correspondence enables zero-shot, input-specific causal graph discovery via partial correlation analysis over final-layer representations, even in the presence of latent confounders.
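
The partial-correlation step at the heart of this procedure can be sketched directly: treating each token's final-layer representation as a variable and each embedding dimension as a sample, partial correlations are read off the (pseudo-)inverse of the token correlation matrix. The code below shows only this statistical step, not the full zero-shot causal discovery pipeline of Rohekar et al. (2023).

```python
import numpy as np

def partial_correlations(H):
    """Partial correlations between tokens, treating each embedding dimension as a sample.

    H has shape (n_tokens, d). Entry (i, j) of the result is the correlation between
    tokens i and j after controlling for all other tokens.
    """
    C = np.corrcoef(H)                      # n_tokens x n_tokens correlation matrix
    P = np.linalg.pinv(C)                   # precision matrix (pseudo-inverse for stability)
    d = np.sqrt(np.diag(P))
    R = -P / np.outer(d, d)                 # partial correlation from precision entries
    np.fill_diagonal(R, 1.0)
    return R

rng = np.random.default_rng(4)
H = rng.normal(size=(6, 768))               # stand-in for final-layer token representations
print(partial_correlations(H).round(2))
```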

6. Efficiency, Parameterization, and Alternative Mechanisms

Self-attention's quadratic complexity in sequence length has motivated numerous structured approximations and reparameterizations. Clustering-based approaches such as CAST (Engelenhoven et al., 6 Feb 2024) partition tokens via surrogate tokens and compute intra- and inter-cluster attention, reducing complexity from $O(N^2)$ to approximately $O(\alpha N)$ with empirically competitive accuracy. Linear transformers, learnable graph filters in the singular value domain (Wi et al., 13 May 2025), and graph filter-based self-attention (GFSA) (Choi et al., 2023) generalize the attention kernel beyond simple low-pass (averaging) filters, enabling high-frequency preservation and mitigating oversmoothing.
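
The graph-filter generalization can be illustrated by replacing the single application of $A$ with a short matrix polynomial; the sketch below applies a GFSA-style filter $w_0 I + w_1 A + w_2 A^2$ to the values, with hand-picked coefficients standing in for learned ones. The identity term retains high-frequency, token-specific components that repeated averaging by $A$ alone would smooth away.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_filter_attention(Q, K, V, w=(0.3, 1.0, 0.2)):
    """Polynomial graph filter over the attention matrix: Z = (w0*I + w1*A + w2*A^2) V.

    Fixed coefficients w are used here for illustration; in practice they are learned per head.
    """
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    n = A.shape[0]
    filt = w[0] * np.eye(n) + w[1] * A + w[2] * (A @ A)
    return filt @ V

rng = np.random.default_rng(5)
n, d_k = 8, 16
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
print(graph_filter_attention(Q, K, V).shape)    # (8, 16)
```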

Parameter-sharing strategies, such as using a single shared projection matrix for query, key, and value (with per-role scaling) (Kowsher et al., 30 Nov 2024), can reduce attention block parameter count by over two-thirds with no loss in GLUE benchmark accuracy and increased robustness to noise. Studies demonstrate that explicit modeling of logical or algorithmic reasoning steps may be achieved entirely within self-attention, not just the feed-forward layers (Shin et al., 20 Jan 2025, Hagiwara, 31 Mar 2025), given appropriate architectural design.
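
A minimal sketch of the shared-projection idea appears below: a single weight matrix produces one projection of the input, and cheap per-role scalings differentiate queries, keys, and values. The diagonal-scaling parameterization and random weights are illustrative assumptions; the exact scheme in Kowsher et al. (30 Nov 2024) may differ in detail.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_projection_attention(X, W_shared, s_q, s_k, s_v):
    """Attention with one shared projection and per-role elementwise scalings.

    Parameter count: d*d_k + 3*d_k instead of 3*d*d_k for separate Q/K/V projections.
    """
    P = X @ W_shared                     # single shared projection
    Q, K, V = P * s_q, P * s_k, P * s_v  # role-specific scalings differentiate Q, K, V
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V

rng = np.random.default_rng(6)
n, d, d_k = 8, 32, 16
X = rng.normal(size=(n, d))
W_shared = rng.normal(size=(d, d_k))
s_q, s_k, s_v = (rng.normal(size=d_k) for _ in range(3))
print(shared_projection_attention(X, W_shared, s_q, s_k, s_v).shape)   # (8, 16)
```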

Empirically, dot-product based content attention can be replaced or supplemented with synthetic attention (Synthesizer) mechanisms that learn or sample attention patterns independent of token-token interactions, with hybrid models recovering or surpassing baseline performance (Tay et al., 2020). This challenges the necessity of token-token content attention for many tasks.
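
The sketch below follows the dense-Synthesizer variant of this idea: each token predicts its own row of attention logits from its embedding alone, with no query-key interaction, and the logits are truncated from a maximum length to the current sequence length. The two-layer projection and random weights are placeholders for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, W1, W2, W_v):
    """Dense Synthesizer-style head: attention logits predicted from each token alone.

    No query-key dot products: each token's logits are an MLP of its own embedding,
    truncated from the maximum length (W2's output width) to the current sequence length.
    """
    n = X.shape[0]
    hidden = np.maximum(X @ W1, 0.0)       # per-token ReLU MLP, no token-token interaction
    logits = (hidden @ W2)[:, :n]          # truncate logits to the sequence length
    A = softmax(logits, axis=-1)
    return A @ (X @ W_v)

rng = np.random.default_rng(7)
n, d, max_len = 6, 32, 64
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, max_len))
W_v = rng.normal(size=(d, d))
print(dense_synthesizer(X, W1, W2, W_v).shape)    # (6, 32)
```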

7. Implications for Future Model Design and Theoretical Synthesis

The self-attention mechanism in Transformers structurally embodies the affinity-based computation paradigm, unifying approaches across feature selection, graph theory, nonparametric regression, and associative memory (Roffo, 19 Jul 2025). Its distinctive strength lies in dynamically learning and applying input-dependent, content-sensitive affinity matrices in a fully differentiable, end-to-end framework.

However, working memory limitations, representational anisotropy, inefficiencies, and modality-specific behaviors reflect both the power and the boundaries of this approach. Promising directions include:

  • Enhancing attention focus via entropy control or targeted allocation mechanisms,
  • Reducing or controlling anisotropy by decoupling selection sharpness from representation geometry,
  • Embedding attention into generalized filtering kernels for high-frequency preservation and oversmoothing mitigation,
  • Exploring parameter-efficient and inductive-bias-flexible variants,
  • Theoretically grounding attention via primal-dual, optimization, and causal inference interpretations (Nguyen et al., 19 Jun 2024, Rohekar et al., 2023).

Self-attention thus occupies a mathematically principled, empirically validated, and continuously evolving position at the center of modern sequence modeling, with ongoing research addressing its limitations and extending its computational and representational scope.
