Attention Sinking in Transformers

Updated 4 March 2026

Attention sinking is an emergent phenomenon in transformer models where attention disproportionately accumulates on tokens such as the BOS, irrespective of semantic content.
Research demonstrates that parking attention on low-norm, inert tokens mitigates representational collapse and stabilizes information flow across deep layers.
Interventions targeting attention sinks can enhance model efficiency through improved streaming inference, compression, and safety in large language models.

Attention sinking is a pervasive, emergent phenomenon in transformer-based architectures—in both language and multimodal models—whereby certain tokens attract a disproportionate share of self-attention from other sequence positions, often irrespective of their semantic content. Most commonly, this manifests as the concentration of attention on the initial token of a sequence (the BOS or special marker) across a large fraction of attention heads and layers. While initially regarded as a curiosity or inefficiency, recent research has rigorously demonstrated that attention sinking is both theoretically and practically crucial for the stability, capacity, robustness, and specialization of modern LLMs and related architectures.

1. Mathematical Foundations and Mechanistic Origins

The core mechanism of attention sinking arises from the combination of causal masking and the normalization constraint imposed by the softmax operation in self-attention layers. In a standard multi-head attention block, for a head $h$ at layer $\ell$, the attention weights take the form:

$\alpha_{ij}^{(\ell,h)} = \mathrm{softmax}_j \left( \frac{q_i^{(\ell,h)} \cdot k_j^{(\ell,h)}}{\sqrt{d}} \right)$

where $q_i$ and $k_j$ are the query and key projections for positions $i$ and $j$, and $d$ is the per-head key dimension. Under causal masking, the softmax runs only over positions $j \leq i$.

In the regime where no strong semantic match exists between a query and its candidate keys ("attention underload"), softmax normalization still enforces a unit-sum constraint, so the head must place its attention mass somewhere. Empirically, this surplus mass is consistently deposited on the first token, typically the BOS marker, forming an "attention sink" (Barbero et al., 3 Apr 2025, Fu et al., 1 Jan 2026, Gu et al., 2024). The pattern persists wherever softmax normalization is retained, and it intensifies as models become deeper and context lengths grow.
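The unit-sum constraint can be seen in a minimal NumPy sketch of a single causal head. The function name `causal_attention_weights`, the toy sizes, and the near-zero queries used to imitate "attention underload" are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def causal_attention_weights(q, k):
    """Return the (T, T) matrix alpha[i, j] of causal softmax attention weights."""
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # Causal mask: query position i may only attend to key positions j <= i.
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 16
# Assumption: near-zero queries, so no key is a strong match (underload).
q = 1e-3 * rng.standard_normal((T, d))
k = rng.standard_normal((T, d))
alpha = causal_attention_weights(q, k)

print(alpha.sum(axis=-1))  # every row sums to 1
print(alpha[-1])           # last query: close to uniform over all visible keys
```

Even when no key is preferred, each row still distributes a full unit of attention; the head cannot attend to nothing. Position 0 is the only key visible to every query under the causal mask, which is consistent with the first token becoming the default destination in trained models.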

A key insight is that by “parking” attention on a semantically inert, low-norm BOS token, specific heads in the model decouple themselves from further mixing, effectively slowing down representational collapse (reduction of token diversity in the representation space) and rank collapse (contraction of activations toward a constant mean in the hidden dimension) (Barbero et al., 3 Apr 2025). This dampens the otherwise exponential tendency toward over-mixing in deep transformers.

2. Emergence, Prevalence, and Structural Determinants

Attention sinks are not an accidental byproduct of pretraining idiosyncrasies; rather, they systematically emerge across architectures, data domains, and scales:

Sink formation is almost universal in standard causal LLMs, present through optimization on sufficient data regardless of the semantic value of the first token (Gu et al., 2024).
Larger context windows and deeper models promote higher sink-ratios and more heads acting as sinks (e.g., from 0% at context=128 to 65–75% by context=2048; from 46% to 78% sinks as models scale from 8B to 405B params) (Barbero et al., 3 Apr 2025).
Sinks can be explicitly controlled by tokenization and data packing strategies: fixing BOS at position 1 ensures almost 90% of heads learn sinks there; randomizing or removing the BOS token forces the model to select the next stable anchor (Barbero et al., 3 Apr 2025, Gu et al., 2024).
In encoder-only or bidirectionally attended models, attention sinks are less tied to position 1, but high-sink tokens still emerge as geometric reference points for information flow (Ruscio et al., 4 Aug 2025).

Sink tokens are constructed from a combination of (a) key vector norms and orientations, (b) softmax-induced simplex geometry—where placing one or more simplex vertices at a distinguished position creates a geometric anchor—and (c) positional encoding methods (e.g., standard RoPE, NTK-scaled RoPE, or absolute embeddings, each favoring different sink patterns) (Ruscio et al., 4 Aug 2025).

3. Functional Roles in Representation and Dynamics

Recent theoretical and empirical studies establish multiple, interlocking roles for attention sinks:

Regulation of Over-Mixing and Representational Collapse

Repeated mixing via multi-head attention, especially in deep or long-context transformers, rapidly drives token representations toward a low-dimensional or even constant subspace (“rank collapse”). By allocating substantial attention to a low-norm, inert sink token, many heads effectively “mute” themselves, localizing information propagation and preventing premature mixing (Barbero et al., 3 Apr 2025).

This is formalized via bounds on the Jacobian norm $\| J_{ij}^{(L)} \| = \big\| \frac{\partial v_j^{(L)}}{\partial v_i^{(0)}} \big\|$, which show that heads concentrating on sinks reduce the operator norm of long-range gradients and thereby slow representational collapse.
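The effect on mixing can be illustrated with a deliberately simplified linear toy model. This is not the construction used in the cited bound. It assumes a residual stream with one attention head per layer, a sink at position 0 whose value vector is exactly zero, and a fixed `sink_share` of each query's attention placed on that sink, with the remainder spread uniformly over positions 1..i. The names `mixing_matrix` and `self_share` are hypothetical.

```python
import numpy as np

def mixing_matrix(T, sink_share):
    """One toy residual attention layer as a (T, T) token-mixing matrix.

    Assumptions: token 0 is a sink with a zero value vector, every later
    query puts `sink_share` of its attention on the sink, and the rest is
    spread uniformly over positions 1..i.
    """
    M = np.eye(T)  # residual connection
    for i in range(1, T):
        # Attention on the sink contributes nothing because its value is zero.
        M[i, 1:i + 1] += (1.0 - sink_share) / i
    return M

def self_share(T, depth, sink_share):
    """Fraction of the last token's final state that still traces back to itself."""
    # In this linear model the layer-to-layer Jacobian is M, so the
    # input-to-output Jacobian over `depth` layers is M ** depth.
    J = np.linalg.matrix_power(mixing_matrix(T, sink_share), depth)
    return J[-1, -1] / J[-1].sum()

no_sink = self_share(T=8, depth=12, sink_share=0.0)
with_sink = self_share(T=8, depth=12, sink_share=0.9)
print(no_sink, with_sink)
```

In this toy setting the last token keeps a far larger share of its own information when most attention is parked on the inert sink, because the cross-token entries of the Jacobian are scaled by `1 - sink_share` at every layer.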

Redundancy, Specialization, and Mixture-of-Experts

Heads with persistent sink behavior are often functionally redundant: ablation studies show that they can be removed with negligible performance impact, especially in deeper layers (Sok et al., 11 Jan 2026, Sandoval-Segura et al., 4 Apr 2025). Conversely, the propensity to become an attention sink acts as a native gating mechanism within attention blocks, yielding an implicit Mixture-of-Experts (MoE): each head is gated by $G_t^{\,l,h} = 1 - A^{\,l,h}_{t,0}$, where $A^{\,l,h}_{t,0}$ is the attention weight that position $t$ places on the sink at position 0 in head $h$ of layer $l$ (written $\alpha_{t,0}^{(\ell,h)}$ in the notation of Section 1). Some heads specialize, while others collapse onto the sink (Fu et al., 1 Feb 2026).
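To make the gating reading concrete, the sketch below checks a simple identity: if the sink token's value vector is zero (an idealizing assumption made here, not a claim from the cited work), a head's output equals the gate $1 - A_{t,0}$ times ordinary attention renormalized over the non-sink keys. The helper names `sink_gate` and `gated_head_output` are hypothetical.

```python
import numpy as np

def sink_gate(attn):
    """Gate G_t = 1 - A[t, 0] for one head; attn has shape (T, T)."""
    return 1.0 - attn[:, 0]

def gated_head_output(attn, values):
    """Head output written as gate * (attention renormalized over non-sink keys)."""
    gate = sink_gate(attn)
    out = np.zeros_like(values)
    rest = attn[:, 1:]
    # Row 0 can only attend to the sink, so its gate is 0 and its output stays 0.
    active = gate > 1e-12
    renorm = rest[active] / gate[active, None]
    out[active] = gate[active, None] * (renorm @ values[1:])
    return out

rng = np.random.default_rng(0)
T, d = 5, 4
attn = np.tril(rng.random((T, T)) + 0.1)   # random causal attention pattern
attn /= attn.sum(axis=-1, keepdims=True)
values = rng.standard_normal((T, d))
values[0] = 0.0  # assumption: the sink's value vector is exactly zero

direct = attn @ values
gated = gated_head_output(attn, values)
print(np.allclose(direct, gated))
```

Under this assumption, a head whose attention on the sink approaches 1 has a gate near 0 and contributes almost nothing to the residual stream, which is the sense in which it is switched off.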

Quantization, Streaming, and Practical Deployment

Sink tokens serve as structural pivots for robust quantization (preserving full-precision key/value projections of sink tokens greatly mitigates errors in low-precision KV-cache) and enable efficient streaming inference (StreamingLLM maintains sink KVs across arbitrarily long context with only a very small permanent cache) (Xiao et al., 2023, Su et al., 6 Aug 2025).

Sinks also stabilize performance under context window extension and compression regimes across language and multimodal audio-visual models (Anand et al., 26 Oct 2025, Wong et al., 22 Dec 2025, Su et al., 6 Aug 2025).

4. Measurement, Taxonomy, and Generalizations

Quantitative assessment of attention sinks employs the “sink-score,” typically defined as:

$\text{sink-score}_k^{(\ell,h)} = \frac{1}{T - k} \sum_{t=k}^{T-1} \alpha_{t k}^{(\ell,h)}$

or, for BOS, the per-head average over query positions $\frac{1}{T}\sum_{t=0}^{T-1} \alpha_{t,0}^{(\ell,h)}$, which is the $k = 0$ case of the formula above (Sok et al., 11 Jan 2026, Wong et al., 22 Dec 2025). For detection, heads are flagged as “dormant” or “sink-dominated” if their sink-score exceeds a threshold (commonly 0.8–0.9), a classification that is often cross-validated by checking for output norm suppression (Sandoval-Segura et al., 4 Apr 2025).
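The sink-score and the threshold test can be sketched directly from the formula above. The two toy heads, the helper names, and the 0.8 cutoff (the lower end of the quoted range) are illustrative choices, not values from any particular model.

```python
import numpy as np

def sink_scores(attn, k=0):
    """Mean attention that queries t >= k place on key position k.

    attn has shape (..., T, T), for example (layers, heads, T, T).
    """
    T = attn.shape[-1]
    return attn[..., k:, k].sum(axis=-1) / (T - k)

def toy_sink_head(T, sink_share):
    """Causal head that puts `sink_share` of every later query on position 0."""
    attn = np.zeros((T, T))
    attn[0, 0] = 1.0
    for t in range(1, T):
        attn[t, 0] = sink_share
        attn[t, 1:t + 1] = (1.0 - sink_share) / t
    return attn

def toy_uniform_head(T):
    """Causal head that spreads attention uniformly over visible positions."""
    attn = np.tril(np.ones((T, T)))
    return attn / attn.sum(axis=-1, keepdims=True)

T = 8
# Shape (layers=1, heads=2, T, T): one sink-heavy head, one uniform head.
attn = np.stack([toy_sink_head(T, 0.95), toy_uniform_head(T)])[None]
scores = sink_scores(attn)
sink_dominated = scores > 0.8
print(scores, sink_dominated)
```

Only the first head is flagged. With attention maps exported from a real model, the same function applies unchanged as long as the last two axes are query and key positions.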

Recent research identifies a spectrum of sink types, not limited to the BOS or initial anchors:

Primary sinks appear early and persist through all layers (canonical BOS, system prompt tokens) (Wong et al., 22 Dec 2025).
Secondary sinks emerge in middle layers with limited lifetime, created via specific MLP projections that align arbitrary tokens with the BOS sink direction. Their formation is closely tied to the $\ell_0$-norm of the creating MLP residual output (Wong et al., 22 Dec 2025).
In multimodal and vision-LLMs, analogous "visual attention sinks" and "ViT sinks" are detected. These often correspond to high-norm image patch tokens that capture global context without necessarily being semantically relevant, yet exploiting them leads to more focused attention and better reasoning (Luo et al., 9 Oct 2025, Kang et al., 5 Mar 2025).

Sinks also manifest in masked diffusion LLMs (DLMs) and audio-visual LLMs, appearing at dynamically shifting positions and correlating with both stability and generalization, but with distinct robustness profiles compared to strictly causal models (Rulli et al., 17 Oct 2025, Anand et al., 26 Oct 2025).

5. Applications, Interventions, and Safety Implications

Compression and Pruning

Sink-aware pruning strategies successfully identify functionally redundant heads and layers, enabling aggressive model sparsification with minimal loss in language understanding or reasoning benchmarks (Sok et al., 11 Jan 2026, Sandoval-Segura et al., 4 Apr 2025). Structured redundancy, especially in deeper layers, is well-explained by the tendency for certain heads to function stably as attention sinks.

Streaming and KV-Cache Management

For windowed and streaming inference, the preservation of sink token KVs (even if semantically trivial) is critical; removing these tokens from the cache sharply degrades perplexity and destabilizes outputs. Techniques such as StreamingLLM and KVSink implement this via permanent sink caches or dynamic sink detection (Xiao et al., 2023, Su et al., 6 Aug 2025).
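A permanent sink cache can be sketched as a small eviction policy: keep the first few entries forever and keep a sliding window of recent ones. The class below is a hypothetical illustration in the spirit of that idea, not the StreamingLLM or KVSink implementation. The name `SinkKVCache` and the default sizes are assumptions, and positional re-indexing inside the cache is omitted.

```python
from collections import deque

class SinkKVCache:
    """Toy KV cache that pins the first `n_sink` entries and windows the rest."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []                     # never evicted
        self.recent = deque(maxlen=window)  # oldest entry dropped automatically

    def append(self, kv):
        """Add the key/value entry for the next token."""
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def entries(self):
        """Entries visible to attention: sinks first, then the recent window."""
        return self.sinks + list(self.recent)

cache = SinkKVCache(n_sink=4, window=8)
for position in range(20):
    cache.append(position)  # integers stand in for per-token (key, value) tensors

print(cache.entries())
```

After 20 tokens the cache holds positions 0 to 3 plus the last eight, so memory stays constant while the sink entries that heads rely on remain available.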

Mitigation of Spurious and Harmful Effects

Dedicated interventions can suppress, redistribute, or even exploit attention sink behavior:

Lazy Attention introduces elastic-softmax and head-wise positional discrimination to sparsify or eliminate unwanted sinks, sharpening attention focus (Fu et al., 1 Jan 2026).
Visual Attention Redistribution (VAR) reclaims attention budget from visually-irrelevant sink tokens, reallocating it to salient visual regions for improved multimodal model reasoning (Kang et al., 5 Mar 2025).
CTR-Sink constructs explicit task-specific sinks (anchors) in recommendation settings to compensate for semantic fragmentation, boosting predictive signal (Li et al., 5 Aug 2025).
Surgery regularization suppresses heads with harmful sink divergence, reducing the tendency of models to learn or amplify toxic behaviors during fine-tuning (Liu et al., 5 Feb 2026).
Backdoor and Safety Risks: The architectural regularity of sinks can be exploited by adversarial attacks: placing triggers at sink positions enables persistent, stealthy backdoor unlearning and reinstatement (Shang et al., 19 Oct 2025).

6. Theoretical and Geometric Interpretation

A geometric perspective interprets attention sinks as anchors establishing reference frames in the space of token representations. Under the simplex geometry of softmax attention, the model’s optimization objective naturally selects one or more origin points—centralized (BOS), distributed (multiple anchors), or bidirectional (start and end positions)—depending on position encoding schemes and regularization (Ruscio et al., 4 Aug 2025). This coordinate-system view subsumes both the emergence of sink tokens and their role in structuring information propagation throughout the network.

Spectral analysis shows that attention sink behavior is functionally tied to the exchange of information in the "U-dark" (low singular value) subspace of the unembedding matrix, allowing the model to “hide” large residual updates in directions that minimally perturb the output distribution (Cancedda, 2024).
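The idea of hiding updates in low-singular-value directions can be checked numerically on a synthetic matrix. The unembedding matrix below is random with a hand-picked decaying spectrum, and the sizes and update norm are arbitrary; none of these values come from the cited analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 32, 200

# Hypothetical unembedding matrix with singular values from 10 down to 1e-3.
U, _ = np.linalg.qr(rng.standard_normal((vocab, d_model)))
V, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
sigma = np.geomspace(10.0, 1e-3, d_model)
W_U = U @ np.diag(sigma) @ V.T  # logits = W_U @ hidden_state

# Right singular vectors: first row is the top direction, last row the "dark" one.
_, s, Vt = np.linalg.svd(W_U, full_matrices=False)
bright, dark = Vt[0], Vt[-1]

update_norm = 50.0  # a large residual-stream update
logit_shift_bright = np.linalg.norm(W_U @ (update_norm * bright))
logit_shift_dark = np.linalg.norm(W_U @ (update_norm * dark))
print(logit_shift_bright, logit_shift_dark)
```

An update of the same size moves the logits by the update norm times the corresponding singular value, so the dark direction barely changes the output distribution while remaining readable by later layers.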

7. Broader Implications and Future Directions

Understanding attention sinking unifies disparate reports of model instability, functional redundancy, compression anomalies, and streaming pathologies in LLMs. It establishes the necessity of explicit design choices in data packing, tokenization, and head specialization, and motivates interventions that regularize, repurpose, or suppress sink behavior in application-specific contexts. Future research is expected to expand on:

Adaptive and data-driven normalization schemes to replace or augment softmax, controlling sink prevalence (Fu et al., 1 Jan 2026, Gu et al., 2024).
The exploitation and suppression of sinks in safety, fairness, and backdoor mitigation (Liu et al., 5 Feb 2026, Shang et al., 19 Oct 2025).
Deliberate architectural design to steer reference frame geometry for specific inductive biases and efficiency objectives (Ruscio et al., 4 Aug 2025).
Extending and analyzing sink phenomena in emerging domains—multimodal, streaming, mixture-of-experts, and beyond.

The view emerging from this literature is that attention sinking is a principled, emergent mechanism: critical, sometimes problematic, and increasingly central to the theory and practice of large-scale transformer models (Barbero et al., 3 Apr 2025).