Self-Attention in Transformers

Updated 29 March 2026

Self-attention is a mechanism that projects tokens into queries, keys, and values to compute dynamic, content-driven interactions across sequences.
The multi-head design allows parallel processing and integrates both local and global information, enhancing long-range dependency modeling.
Architectural variants and causal interpretability analyses reveal self-attention’s memory limitations and inspire hybrid approaches in NLP, vision, and graph applications.

Self-attention is the core computational primitive in Transformer models, facilitating dynamic, content-driven interactions between all positions in a sequence. By leveraging parallelizable, permutation-equivariant mechanisms, self-attention has supplanted recurrence and convolution as the default paradigm for large-scale sequence modeling across modalities. Recent studies have further elucidated both the fundamental limitations and the architectural extensions stemming from the self-attention mechanism.

1. Formal Definition and Computational Principles

Given an input sequence $X \in \mathbb{R}^{n \times d}$, self-attention operates by projecting each token representation into Query ($Q$), Key ($K$), and Value ($V$) vectors via learned matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}$: $Q = XW^Q,\quad K = XW^K,\quad V = XW^V.$ The canonical (scaled dot-product) attention computes the pairwise similarity scores $S = QK^\top/\sqrt{d_k}$, to which an optional mask $M$ (e.g., causal or local) may be added. Attention weights $A$ are produced by applying the softmax function row-wise: $A_{ij} = \frac{\exp(S_{ij} + M_{ij})}{\sum_{k=1}^n \exp(S_{ik} + M_{ik})}.$ The output representation at position $i$ is

$o_i = \sum_{j=1}^n A_{ij} V_j,$ where $V_j$ denotes the $j$-th row of $V$.
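The definitions above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a reference implementation: the helper names (`matmul`, `softmax`, `causal_mask`, `self_attention`) and the toy matrices are assumptions made for the example, and a real model would use a tensor library and learned weights.

```python
import math

def matmul(a, b):
    """Plain list-of-lists matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """Additive mask M: 0 where j <= i, -inf where position j lies in the future."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def self_attention(X, W_q, W_k, W_v, mask=None):
    """Return (A, O) with A = softmax(Q K^T / sqrt(d_k) + M) and O = A V."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    S = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    if mask is not None:
        S = [[s + m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(S, mask)]
    A = [softmax(row) for row in S]  # row-wise softmax
    return A, matmul(A, V)

# Toy example: n = 3 tokens, d = d_k = 2 (values chosen arbitrarily).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 0.5]]
W_v = [[1.0, 2.0], [3.0, 4.0]]
A, O = self_attention(X, W_q, W_k, W_v, mask=causal_mask(3))
```

With the causal mask, the first token can only attend to itself, so its output row equals its own value vector; every row of `A` sums to one regardless of the mask.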

Multi-head self-attention employs $h$ parallel sets of $(W^Q, W^K, W^V)$ projections, enabling the model to jointly attend to information from different subspaces. Outputs from each head are concatenated and linearly projected back to the model dimension (Vaswani et al., 2017).
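Written out, the multi-head computation takes the following form. The per-head subscripts and the output matrix $W^O$ follow the notation of Vaswani et al. (2017) and do not appear elsewhere in this article; $h$ denotes the number of heads, and setting $d_k = d/h$ is a common choice rather than a requirement.

```latex
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(XW_i^Q)(XW_i^K)^\top}{\sqrt{d_k}} + M\right) XW_i^V,
\qquad i = 1, \dots, h,

\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,
\qquad W^O \in \mathbb{R}^{h d_k \times d}.
```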

This design ensures $O(1)$ path length between any pair of tokens, $O(n^2 d)$ complexity per layer, and maximally parallel execution, making it highly effective for global context integration and extensible to diverse architectures.

2. Affinity Matrix Perspective and Theoretical Generalizations

Self-attention is a special case of affinity-matrix-based information propagation. General affinity frameworks, such as Infinite Feature Selection (Inf-FS), compute relevance via powers of a fixed or learned affinity matrix $A$: $\sum_{l=1}^{\infty} A^l$ (with $A$ scaled so that the series converges), where each entry $A_{ij}$ of $A$ quantifies similarity between elements $i$ and $j$ (Roffo, 19 Jul 2025). In contrast, self-attention learns $A$ on-the-fly from the input and applies only a single-hop aggregation per layer, with global (multi-hop) propagation arising through stacking.

Inf-FS achieves closed-form multi-hop feature ranking, while Transformers implement deep, interleaved one-hop aggregations for dynamic representation learning. This affinity formalism unifies attention with affinity-based graph propagation, non-local vision blocks, and graph attention in GNNs.
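The contrast between one-hop aggregation and multi-hop propagation can be illustrated on a three-element chain. The sketch below is a toy under stated assumptions: the affinity matrix, the damping factor `alpha`, and the helper names are invented for the example, and the truncated series stands in for the closed-form sum only approximately.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def power_series(A, alpha, terms):
    """Truncated multi-hop sum: sum over k = 1..terms of (alpha * A)^k."""
    n = len(A)
    scaled = [[alpha * x for x in row] for row in A]
    total = [[0.0] * n for _ in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(terms):
        term = matmul(term, scaled)  # next power of the scaled affinity
        total = [[t + s for t, s in zip(t_row, s_row)]
                 for t_row, s_row in zip(total, term)]
    return total

# Row-normalized affinity on a chain 0 <-> 1 <-> 2 (with self-loops):
# elements 0 and 2 have no direct affinity.
A = [[0.5, 0.5, 0.0],
     [1 / 3, 1 / 3, 1 / 3],
     [0.0, 0.5, 0.5]]

one_hop = A                       # what a single attention layer applies
two_hop = matmul(A, A)            # what two stacked layers can reach
multi_hop = power_series(A, alpha=0.5, terms=60)
```

Here `one_hop[0][2]` is zero, while `two_hop[0][2]` and `multi_hop[0][2]` are positive: stacking or summing powers lets information travel between elements that are not directly connected.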

3. Architectural Structure, Parameterization, and Variants

The standard self-attention module can be extended or constrained through various design choices:

Head Specialization: In practical encoder models, many self-attention heads collapse to trivial positional patterns (diagonal, previous/next token, context aggregates), leading to redundancies. Explicitly replacing all but one head with fixed, non-learnable templates preserves or improves BLEU scores in low-resource machine translation (Raganato et al., 2020).
Hybrid and Synthetic Attention: Content-based attention may be replaced with "Synthesizer" modules that use random or per-token MLP-generated alignment matrices—removing query-key interaction—often without significant loss in translation or language modeling performance. Hybrid mixtures surpass standard dot-product attention in several tasks (Tay et al., 2020).
Locality and Translation-Invariance: Models can restrict attention to local neighborhoods, encode relative or translation-invariant positional biases, or employ hybrid global/local masking. Empirical studies show that most attention heads have a locality bias, and models with most or all heads constrained to local windows (±2 tokens) match or slightly exceed unconstrained accuracy while reducing computation by nearly half (Pande et al., 2020, Wennberg et al., 2021).
Causal and Bidirectional Patterns: The symmetry or directionality of self-attention weight matrices is determined by the training objective: encoder-only (bidirectional) models yield symmetric attention kernels, while decoder-only (autoregressive) models yield column-dominant, directional matrices. These properties can be exploited for faster convergence and improved interpretability (Saponati et al., 15 Feb 2025).
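The locality and causal constraints listed above are typically expressed as additive masks on the score matrix. The following sketch builds a local-window mask and a causal mask and combines them; the function names and the sequence length are assumptions for illustration, with the window of 2 chosen to match the ±2 tokens mentioned above.

```python
import math

NEG_INF = -math.inf

def local_mask(n, window):
    """Allow position i to attend only to positions j with |i - j| <= window."""
    return [[0.0 if abs(i - j) <= window else NEG_INF for j in range(n)]
            for i in range(n)]

def causal_mask(n):
    """Allow position i to attend only to positions j <= i."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def combine(*masks):
    """Elementwise sum of additive masks: an entry is blocked if any mask blocks it."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*masks)]

def allowed(row):
    """Number of positions a query row may attend to."""
    return sum(1 for v in row if v == 0.0)

n = 8
local = local_mask(n, window=2)
local_causal = combine(local, causal_mask(n))
kept_fraction = sum(allowed(r) for r in local) / (n * n)
```

Because blocked entries are `-inf`, they receive exactly zero weight after the softmax, and `kept_fraction` gives a rough sense of how many score entries a windowed head actually needs to compute.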

4. Memory Capacity, Entropy, and Expressive Limitations

Self-attention imposes a fundamental bottleneck on sequence-level working memory. In decoder-only Transformers trained on $N$-back tasks, the model's prediction accuracy on retrieving the symbol $N$ steps before the current position $t$ (i.e., at position $t-N$) degrades sharply with increasing $N$, despite the context window being much larger than $N$. Mechanistically, attention mass initially spreads uniformly but sharpens over training onto the $(t, t-N)$ diagonal; however, as $N$ increases, the softmax distribution loses focus, as quantified by increasing total entropy $H$, thereby diluting the signal and inducing "working memory" collapse (Gong et al., 2024).

This entropy-driven limitation parallels human executive attention theory, tying Transformer working memory directly to the dispersal properties of softmax-attention over distractors. Architectural remedies may entail explicit memory slots, enhanced positional encodings, or specialized long-range modules that decouple memory capacity from single-step attention entropy.

$N$ (N-back)	Test acc.	Entropy $H$
1	high	low
3	falls logarithmically with $N$	increases with $N$
6	near chance	high

Larger $N$ implies more diffuse attention, higher $H$, and lower accuracy, demonstrating an inherent, quantitative capacity threshold in self-attention.
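The link between diffuse attention and entropy can be made concrete with a toy calculation. This is not a reproduction of the N-back experiments: it assumes a single target position whose logit exceeds every distractor's by a fixed `gap` (a hypothetical parameter) and simply shows how softmax weight on the target shrinks, and entropy grows, as distractors are added.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def target_vs_distractors(num_distractors, gap=2.0):
    """Return (weight on the target, entropy) for one target among distractors."""
    scores = [gap] + [0.0] * num_distractors
    p = softmax(scores)
    return p[0], entropy(p)

# Target weight and attention entropy as the number of distractors grows.
results = [(m, *target_vs_distractors(m)) for m in (1, 3, 7, 15, 31)]
for m, weight, h in results:
    print(f"distractors={m:2d}  target weight={weight:.3f}  entropy={h:.3f}")
```

With the logit gap held fixed, the target's share of attention falls monotonically and the entropy rises, which is the dilution effect described above in miniature.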

5. Causal Interpretability and Structured Information Flow

A pre-trained Transformer's attention matrix can be interpreted as the coefficient matrix of a linear structural equation model (SEM) over token representations. Under this view, the contextualized output for each token is generated by a linear combination of all token embeddings weighted by elements of the attention matrix, with residual exogenous noise. The attention matrix thus encodes the direct causal-effect coefficients among tokens.

This mapping enables zero-shot, constraint-based causal graph recovery using partial correlations computed from the SEM covariance structure—realizing practical causal discovery on top of pretrained, frozen Transformer models (Rohekar et al., 2023). The mechanistic flow of information can be traced through products of attention matrices, revealing pathway-specific aggregation, local-to-global integration across layers, and offering insights into model interpretability and adversarial vulnerabilities (Hao et al., 2020, Wu et al., 2020).
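Tracing information flow through products of attention matrices can be sketched as follows. This is a deliberately simplified illustration, assuming one attention matrix per layer and ignoring residual connections, value projections, and feed-forward blocks; it is not the attribution procedure of any of the cited papers, and the function names and toy matrices are invented for the example.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def attention_flow(layers):
    """Multiply per-layer attention matrices, with later layers applied on the left."""
    flow = identity(len(layers[0]))
    for A in layers:
        flow = matmul(A, flow)
    return flow

# Layer 1: token 1 mixes in token 0. Layer 2: token 2 mixes in token 1.
layer1 = [[1.0, 0.0, 0.0],
          [0.5, 0.5, 0.0],
          [0.0, 0.0, 1.0]]
layer2 = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.5, 0.5]]

flow = attention_flow([layer1, layer2])
```

Neither layer lets token 2 attend to token 0 directly, yet `flow[2][0]` is nonzero: the contribution arrives via token 1, which is the kind of pathway-specific aggregation the product view exposes.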

6. Inductive Biases, Generalization, and Practical Implications

Self-attention inherits inductive biases from positional encoding (absolute, relative, translation-invariant), windowing constraints, and affinity normalization. Translation-invariant self-attention parameterizes position-bias via a small set of learnable functions over relative offsets, achieving near or superior performance to absolute-encoding Transformers with a minuscule parameter increase and full generalization to sequences longer than those seen in training (Wennberg et al., 2021).
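A position bias that depends only on the relative offset can be built for any sequence length from the same few parameters. The sketch below assumes a single smooth bump over the offset $i - j$; the kernel shape and the parameter names (`amplitude`, `sharpness`, `center`) are hypothetical choices for illustration and are not taken from the cited work.

```python
import math

def offset_bias(offset, amplitude=1.0, sharpness=0.5, center=0.0):
    """Hypothetical smooth kernel over the relative offset (i - j)."""
    return amplitude * math.exp(-abs(sharpness) * (offset - center) ** 2)

def position_bias(n, f=offset_bias):
    """n x n bias matrix whose entry (i, j) depends only on i - j."""
    return [[f(i - j) for j in range(n)] for i in range(n)]

# The same three parameters define the bias for any sequence length.
B4 = position_bias(4)
B9 = position_bias(9)
```

The resulting matrix is added to the attention scores before the softmax. Since entries depend only on the offset, the bias for a length-4 sequence is exactly the top-left block of the bias for a length-9 sequence, which is why such a parameterization extends to lengths not seen in training without new parameters.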

The modeling flexibility of self-attention underlies its adaptability to multiple domains (NLP, vision, audio, graph), but also necessitates careful architectural regularization—e.g., batch-normalized attention or head downsampling—for redundancy reduction and computational efficiency (Nguyen et al., 2024). Combining self-attention with convolutional or active-memory modules can exploit complementary strengths: convolutions favor local, simultaneous multi-token interaction, while self-attention excels at long-range dependency modeling (Dowdell et al., 2019).

7. Future Directions and Limitations

Self-attention-based architectures are at the center of research into Transformer scaling, interpretability, and efficiency. Open challenges include the design of memory-augmented or entropy-suppressing attention modules to overcome the inherent working memory bottleneck; exploration of causal and semantic structure encoded in attention graphs; more aggressive parameter sharing and locality injection for efficient modeling; and the integration of affinity-based perspectives for generalized, task-adaptive information routing.

The growing mechanistic and architectural understanding of self-attention, especially regarding its entropy-driven capacity ceiling and contextual aggregation strategies, is poised to drive further innovation in architectures targeting long-context reasoning, sequence memory, and robust representation learning (Gong et al., 2024, Roffo, 19 Jul 2025, Saponati et al., 15 Feb 2025).