Attention Sinks in Transformer Models
- Attention sinks are structural features in transformer models where specific tokens absorb excessive attention, acting as stable anchors in high-dimensional spaces.
- They emerge from softmax normalization, optimization dynamics, and positional encoding, influencing model stability and preventing over-mixing of token representations.
- Attention sinks enable efficient streaming, model compression, and controlled quantization while also revealing potential pitfalls in interpretability and efficiency.
Attention sinks are a structural phenomenon in transformer-based models, in which certain tokens—often initial tokens, special markers, or ones with specific positional or activation characteristics—attract a disproportionately large share of the attention distribution from other tokens. These tokens act as “anchors” or “sinks,” absorbing excess attention, often independently of semantic content. Attention sinks have been observed across a broad spectrum of architectures (including autoregressive LLMs, encoders, vision transformers, multimodal and video transformers) and play both functional and pathological roles in information flow, model optimization, efficiency, and interpretability.
1. Mathematical Definition and Geometric Origins
The canonical transformer attention mechanism computes, for each query token $q_i$ and keys $k_j$, a weight
$$\alpha_{ij} = \frac{\exp\!\left(q_i^\top k_j / \sqrt{d}\right)}{\sum_{l=1}^{n} \exp\!\left(q_i^\top k_l / \sqrt{d}\right)},$$
mapping each query's scores onto the probability simplex $\Delta^{n-1}$. This transformation, via the softmax, introduces intrinsic curvature (the Fisher-Rao metric on the simplex) and enforces “zero-sum” conservation ($\sum_j \alpha_{ij} = 1$), creating competitive pressure among tokens.
A token at position $k$ is an attention sink if, for most queries $q_i$,
$$\alpha_{ik} \gg \frac{1}{n},$$
i.e., it receives far more than the uniform share of each query's attention mass.
This “vertical” pattern in attention matrices is seen as a prominent column, with many attention heads assigning significant weight to the same token—even across model architectures and data modalities (Gu et al., 14 Oct 2024, Lu et al., 21 Jul 2025, Ruscio et al., 4 Aug 2025).
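A minimal sketch of how such a column can be flagged from a post-softmax attention tensor (plain NumPy; the tensor shape, threshold, and function name are illustrative assumptions rather than any paper's detection protocol):

```python
import numpy as np

def find_sink_tokens(attn, threshold=0.3):
    """Flag tokens that receive a disproportionate share of attention.

    attn: array of shape (num_heads, seq_len, seq_len); rows are queries,
          columns are keys, and each row sums to 1 (post-softmax).
    A token is flagged when its mean incoming attention, averaged over
    heads and queries, far exceeds the uniform baseline 1/seq_len.
    """
    incoming = attn.mean(axis=(0, 1))        # mean attention each key receives
    return np.where(incoming > threshold)[0]

# Toy example: every query in every head sends most of its mass to token 0.
rng = np.random.default_rng(0)
attn = rng.random((4, 16, 16))
attn[..., 0] += 10.0                         # inflate column 0 before normalization
attn /= attn.sum(axis=-1, keepdims=True)     # renormalize rows to sum to 1
print(find_sink_tokens(attn))                # -> [0]
```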
Geometrically, attention sinks are an emergent solution for establishing a stable reference frame in high-dimensional representational spaces (Ruscio et al., 4 Aug 2025). In decoder-only models with rotary embeddings (RoPE), the initial token receives zero rotation and dominates as a “centralized” anchor; in encoder-only models (e.g., BERT variants), boundary tokens, such as [CLS] and [SEP], establish bidirectional anchors; in architectures with scaled position encodings, multiple reference points (distributed sinks) may emerge. These anchors are formal solutions to the geometric constraint of basis stabilization after mapping onto the simplex.
2. Mechanistic Origins: Optimization, Architecture, and Parameter Dynamics
The emergence of attention sinks is tied to optimization dynamics, data distribution, loss functions, and architectural choices.
- Optimization: After sufficient training, the first token typically becomes a sink as its key vector aligns in angle with most queries. Larger learning rates and moderate weight decay accelerate the emergence of this phenomenon, while increased weight decay beyond a threshold may suppress it (Gu et al., 14 Oct 2024).
- Data Distribution: The location of attention sinks often reflects data packing. Fixing a particular token, or altering packing strategies, can move the sink position, demonstrating a strong dependency on pretraining conventions.
- Loss Function: The autoregressive loss, which ignores the prediction of the first token, privileges it as an attention sink. Modifying the loss (e.g., using prefix LMs or loss reweighting) can influence where sinks form.
- Architecture: Pre-norm/post-norm choices, position encoding (absolute, learnable, rotary, ALiBi, NoPE), value and key initializations, and attention normalization (softmax vs. alternatives) all modulate sink strength, but most do not remove the fundamental phenomenon. Notably, replacing softmax normalization with sigmoid or non-normalized attention does eliminate sink emergence, at least in models up to moderate scale (Gu et al., 14 Oct 2024, Guo et al., 17 Oct 2024).
Softmax-based normalization, which enforces row sums to unity, causes interdependence among token attention weights. This bias is further exacerbated by the presence of tokens with unusually high cosine similarity between keys and queries, driving up their attention scores. In practice, the first token can have extremely high activation norms, and both its key and value vectors may be suppressed in norm yet perfectly aligned in direction, resulting in large softmax outputs but negligible value contributions (creating so-called “dormant” heads (Sandoval-Segura et al., 4 Apr 2025)). The mutual reinforcement mechanism, observed in simplified toy models and in pretraining dynamics, describes a feedback loop where heads shift more attention to tokens with small value states, whose subsequent value suppression (due to irrelevance for prediction) only strengthens their role as sinks (Guo et al., 17 Oct 2024).
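The zero-sum coupling that softmax imposes, and that sigmoid-style scoring avoids, can be seen in a toy example (illustrative NumPy only, not an implementation of any specific attention variant):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([2.0, 1.0, 0.5, 0.5])

# Softmax couples all weights: boosting one logit drains mass from the others.
before = softmax(logits)
boosted = logits.copy()
boosted[0] += 3.0
after = softmax(boosted)
print(before, after)     # tokens 1-3 lose attention although their logits are unchanged

# Sigmoid scores each key independently: no competition, no forced sink.
print(sigmoid(logits), sigmoid(boosted))   # tokens 1-3 keep identical scores
```

Boosting one logit under softmax necessarily drains mass from every other token, which is exactly the competitive pressure that lets a single key dominate; under sigmoid each key is scored independently, so no such pressure exists.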
3. Functional Roles: Information Flow, Compression, and Bottlenecks
Attention sinks are not mere pathologies—they fundamentally structure information flow in deep transformer models. The presence of a strong sink can:
- Prevent Over-Mixing & Rank Collapse: By concentrating attention on an anchor with little semantic contribution, models avoid uncontrolled mixing, thereby preserving meaningful inter-token distinctions and preventing representational collapse (Barbero et al., 3 Apr 2025, Queipo-de-Llano et al., 7 Oct 2025).
- Induce Compression Valleys: When a sink token’s activation norm becomes orders of magnitude larger than the others (massive activation), the singular value spectrum of the token representation matrix collapses, yielding a low-entropy (“compression valley”) phase (Queipo-de-Llano et al., 7 Oct 2025). This acts as a bottleneck, limiting the subspace of representation and forcing the model into a constrained geometry, a mechanism supported theoretically by lower bounds on the dominant singular value: for a representation matrix $H \in \mathbb{R}^{n \times d}$ whose sink row has norm $M = \|h_s\|_2$ and whose remaining rows have norms at most $m$,
$$\sigma_1(H) \ge M, \qquad \frac{\sigma_1(H)^2}{\sum_i \sigma_i(H)^2} \ge \frac{M^2}{M^2 + (n-1)m^2},$$
so a single massive activation forces spectral dominance and low representational entropy (Queipo-de-Llano et al., 7 Oct 2025).
- Enable Efficient Streaming & KV Cache Optimizations: In streaming language modeling, retaining a single sink token in the cache suffices for stable predictions, even when all but the most recent tokens are discarded. The StreamingLLM framework demonstrated that retaining a trained sink placeholder alongside a window of recent tokens enables stable long-context inference and achieves up to 22.2× speedup in streaming deployments (Xiao et al., 2023). Further, dedicated sink tokens (“Zero Sink” or learned sink placeholders) permit more aggressive cache truncation without loss in perplexity; a minimal cache-eviction sketch follows this list.
- Regulate Multi-Head Dynamics: Many attention heads, identified as dominated by sink tokens, produce near-zero outputs—a class termed “dormant heads.” Studies demonstrate that 14% or more of such heads can be disabled during inference with negligible performance loss (Sandoval-Segura et al., 4 Apr 2025). These heads emerge early in pretraining and may transition between dormant and active states, their prevalence varying with text characteristics.
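A minimal sketch of the StreamingLLM-style eviction rule referenced above, keeping a few permanent sink entries plus a sliding window of recent tokens (the class name, shapes, and window sizes are illustrative):

```python
from collections import deque

class SinkKVCache:
    """Toy KV cache keeping a few sink tokens plus a recent-token window.

    The first `num_sinks` entries are never evicted; everything else is a
    sliding window over the most recent tokens, mirroring the eviction rule
    popularized by StreamingLLM. Keys/values here are placeholders.
    """

    def __init__(self, num_sinks=1, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                       # (key, value) pairs, kept forever
        self.recent = deque(maxlen=window)    # (key, value) pairs, sliding window

    def append(self, key, value):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append((key, value))
        else:
            self.recent.append((key, value))  # deque evicts the oldest automatically

    def contents(self):
        return self.sinks + list(self.recent)

cache = SinkKVCache(num_sinks=1, window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
print([k for k, _ in cache.contents()])       # ['k0', 'k6', 'k7', 'k8', 'k9']
```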
4. Applications in Vision, Recommender, and Multimodal Systems
Vision Transformers and ViTs
Vision transformers manifest attention sink phenomena as “massive tokens” in mid-to-late layers, with certain patch tokens or [CLS] receiving excessive attention. These tokens are identified by their exceptionally high activation norms, sometimes exceeding that of the global [CLS] token. Their competition with artifact tokens, revealed by masking ablations, regulates information flow and motivates structured approximations such as Fast Nyström Attention (FNA), which exploits low-rank structure for efficient self-attention computation. Masking massive and artifact tokens further improves global feature aggregation for classification, retrieval, and segmentation, with minimal cost.
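A sketch of this masking idea, excluding unusually high-norm (“massive”) tokens before pooling ViT features (the norm cutoff and function name are illustrative assumptions):

```python
import numpy as np

def pool_without_massive_tokens(tokens, norm_factor=5.0):
    """Mean-pool patch tokens while excluding 'massive' high-norm tokens.

    tokens: (num_tokens, dim) patch embeddings from some ViT layer.
    A token is treated as massive/artifact when its L2 norm exceeds
    norm_factor times the median norm; the cutoff is illustrative.
    """
    norms = np.linalg.norm(tokens, axis=-1)
    keep = norms <= norm_factor * np.median(norms)
    return tokens[keep].mean(axis=0)

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))
patches[17] *= 100.0                       # simulate one massive-activation token
pooled = pool_without_massive_tokens(patches)
print(pooled.shape)                        # (64,)
```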
The EDIT architecture (Encoder-Decoder Image Transformer) mitigates such sinks by decoupling patch processing from [CLS] aggregation, using dedicated cross-attention in a layer-aligned decoder. This allows interpretable, progressive refinement from low-level image features without overwhelming attention on the CLS.
Multimodal and Video Transformers
In large multimodal models (LMMs), visual attention sinks appear as visual tokens with large activations in fixed “sink dimensions” (Kang et al., 5 Mar 2025). These tokens attract high attention from text queries but often contribute little useful information. The Visual Attention Redistribution (VAR) method recycles this surplus attention, reallocating it among informative visual tokens targeted by “image-centric” heads, improving performance on tasks ranging from VQA and hallucination to vision-centric benchmarks without additional training.
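A simplified schematic of the redistribution step (one query row, pre-identified sink indices); the actual VAR method selects image-centric heads and sink tokens from their activation patterns, which this sketch does not model:

```python
import numpy as np

def redistribute_visual_attention(attn_row, visual_idx, sink_idx):
    """Schematically reallocate attention from sink visual tokens.

    attn_row:   (seq_len,) post-softmax attention of one text query.
    visual_idx: indices of visual tokens in the sequence.
    sink_idx:   subset of visual_idx judged to be visual sinks.
    The surplus mass on sink tokens is moved onto the remaining visual
    tokens in proportion to their current attention.
    """
    out = attn_row.copy()
    surplus = out[sink_idx].sum()
    out[sink_idx] = 0.0
    keep = np.setdiff1d(visual_idx, sink_idx)
    if out[keep].sum() > 0:
        out[keep] += surplus * out[keep] / out[keep].sum()
    return out

row = np.array([0.05, 0.40, 0.10, 0.10, 0.35])   # token 1 is a visual sink
print(redistribute_visual_attention(row,
                                    visual_idx=np.array([1, 2, 3]),
                                    sink_idx=np.array([1])))
# -> [0.05, 0.0, 0.30, 0.30, 0.35]; total attention mass is preserved
```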
Video diffusion transformers exhibit attention sinks in later layers, where certain temporal/spatial tokens absorb almost all attention (“vertical stripes” in attention maps), yet contribute nearly nothing due to low value norms (Wen et al., 14 Apr 2025). Such heads are readily prunable, serving as targets for sparsification and efficient computation.
Recommender and Sequence Models
CTR-Sink is a representative method that inserts behavior-specific [SINK] tokens—constructed with external signals such as temporal distance—between sequential user behaviors (Li et al., 5 Aug 2025). These specially tuned sinks act as behavioral boundaries and absorb scattered attention, reducing semantic fragmentation. Two-stage training (sink-token-focused, then full-sequence fine-tuning) and bias-enhanced sink-to-sink connectivity significantly improve click-through rate prediction.
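A toy sketch of inserting behavior-separating sink markers that encode temporal distance; the token format and bucketing here are illustrative, not CTR-Sink's exact construction:

```python
def insert_behavior_sinks(behaviors, timestamps, bucket_seconds=3600):
    """Insert synthetic [SINK] markers between user-behavior segments.

    behaviors:  list of token lists, one per behavior (e.g., item titles).
    timestamps: one Unix timestamp per behavior.
    Each marker encodes a coarse temporal-distance bucket to the previous
    behavior, loosely following the CTR-Sink idea of building sinks from
    external signals such as time gaps.
    """
    sequence = []
    for i, tokens in enumerate(behaviors):
        if i > 0:
            gap = (timestamps[i] - timestamps[i - 1]) // bucket_seconds
            sequence.append(f"[SINK_gap{gap}]")
        sequence.extend(tokens)
    return sequence

seq = insert_behavior_sinks(
    behaviors=[["viewed", "shoes"], ["clicked", "ad"], ["bought", "shoes"]],
    timestamps=[0, 7200, 90000],
)
print(seq)   # sink markers separate the three behaviors with their time-gap buckets
```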
5. Model Compression, Quantization, and Efficiency
Attention sinks play a prominent role in extreme activation outlier propagation during forward passes (Son et al., 17 Jun 2024, Su et al., 6 Aug 2025). Their unique numerical characteristics (low QKV norms but high cosine similarity with queries) make them highly sensitive to quantization error and prone to introducing bias into quantization parameter calibration. KV cache quantization methods that blindly quantize all tokens perform poorly because attention sink tokens require preservation of high precision to maintain attention distributions.
- KVSink predicts sink tokens by detecting “stable outliers” (channels or token positions in hidden activations with top magnitude at a designated layer) and preserves these tokens at higher precision during quantization. This plug-and-play approach surpasses first-N token preservation (PFN), improving perplexity under quantization and reducing hardware overhead (Su et al., 6 Aug 2025); a minimal mixed-precision sketch follows this list.
- CushionCache inserts a learned prefix sequence, acting as an anchor, before the target tokens, which mitigates extreme activation outliers and drastically reduces per-tensor quantization error (Son et al., 17 Jun 2024). The prefix is found via greedy search and then quantization-aware tuning, yielding major improvements in perplexity and accuracy under low-bit quantization.
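A minimal mixed-precision sketch in the spirit of KVSink, keeping detected sink positions in full precision while uniformly quantizing the rest (per-token symmetric quantization; entirely illustrative, not the paper's implementation):

```python
import numpy as np

def quantize_kv_except_sinks(kv, sink_positions, num_bits=8):
    """Round-trip a KV tensor through uniform quantization, sparing sinks.

    kv:             (seq_len, dim) keys or values for one head.
    sink_positions: token indices kept in full precision (e.g., detected
                    via stable outliers, in the spirit of KVSink).
    """
    qmax = 2 ** (num_bits - 1) - 1
    sink_set = set(sink_positions)
    out = np.empty_like(kv)
    for i, row in enumerate(kv):
        if i in sink_set:
            out[i] = row                              # preserve sink rows exactly
            continue
        max_abs = np.abs(row).max()
        scale = max_abs / qmax if max_abs > 0 else 1.0
        out[i] = np.round(row / scale).clip(-qmax, qmax) * scale
    return out

rng = np.random.default_rng(0)
kv = rng.normal(size=(16, 64))
kv[0] *= 50.0                                         # token 0 behaves like a sink
deq = quantize_kv_except_sinks(kv, sink_positions=[0])
print(np.abs(deq - kv).max())                         # error appears only on non-sink rows
```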
The catch, tag, and release mechanism (Zhang et al., 2 Feb 2025) further ties the presence of attention sinks to the preservation of low-rank structure in the query, key, and value parameter matrices—implying that compression methods maintaining low-rank components (e.g., OATS) better preserve few-shot and chain-of-thought performance.
6. Theoretical Perspectives and Broader Implications
A geometric view synthesizes these findings: attention sinks are optimal solutions to the problem of anchoring reference frames in high-dimensional, normalized feature spaces (Ruscio et al., 4 Aug 2025). Centralized, distributed, and bidirectional reference frames (anchored by initial, several, or boundary tokens, respectively) are determined by architectural details such as positional encoding and information mapping onto the probability simplex.
A unified “mix-compress-refine” theory emerges: transformers process inputs with broad attention mixing (high entropy, diffuse mixing) in early layers, shift to a compression phase with strong attention sinks and low-entropy representation (bottleneck), and finish with selective refinement (norms re-equalize, attention sharpens) in late layers (Queipo-de-Llano et al., 7 Oct 2025). This tripartite structure illuminates why embedding tasks peak at intermediate layers, while generative tasks require deeper processing.
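The compression phase can be made concrete by measuring the entropy of a layer's singular-value spectrum; the following toy sketch (illustrative NumPy, not the cited paper's measurement code) shows how a single massive activation collapses that entropy:

```python
import numpy as np

def representation_entropy(hidden):
    """Entropy of the normalized singular-value spectrum of one layer's hidden states.

    hidden: (seq_len, dim) token representations. Low values indicate the
    'compression' phase, where one direction (typically driven by a massive
    sink activation) dominates the spectrum.
    """
    s = np.linalg.svd(hidden, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
diffuse = rng.normal(size=(32, 64))            # early-layer-like: spread-out spectrum
compressed = diffuse.copy()
compressed[0] *= 100.0                         # inject a massive sink activation
print(representation_entropy(diffuse), representation_entropy(compressed))
# the second value is far lower, mirroring the compression valley
```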
Attention sinks are thus pivotal for:
- Stabilizing information propagation and mitigating detrimental over-mixing (Barbero et al., 3 Apr 2025, Queipo-de-Llano et al., 7 Oct 2025).
- Enabling robust, efficient streaming and improved memory management.
- Structuring attention for downstream interpretability and efficiency-driven pruning.
- Guiding architectural choices (e.g., positional encoding, presence/absence of explicit sink tokens) and inspiring practical interventions in model compression, quantization, and retrieval.
However, not all attention sinks are functional: empirical calibration and redistribution methods (e.g., ACT (Yu et al., 22 Jun 2024), VAR (Kang et al., 5 Mar 2025), EAH (Zhang et al., 15 Nov 2024)) reveal that some sinks (especially on punctuation or trivial tokens) are detrimental, requiring selective suppression or redistribution.
7. Summary Table: Characteristics and Implications of Attention Sinks
| Category | Mechanism/Definition | Key Implications |
|---|---|---|
| Emergence | Softmax normalization, geometric anchor | Unavoidable under standard architectures |
| Detection | High attention weights on 1–few tokens | “Vertical” stripes, dormant heads |
| Roles | Over-mixing prevention, compression phase | Streaming, quantization, compression |
| Practical Fixes | Dedicated sink tokens, position encodings | KVSink, CushionCache, ACT, VAR, EAH |
| Pathologies | Dormant heads, uninformative sinks | Efficiency loss, possible accuracy loss |
| Applications | Language, vision, multimodal, recommender | Retained or mitigated for best accuracy |
Attention sinks are now understood as central organizing principles within transformers, structuring both representation geometry and information dynamics, while simultaneously presenting opportunities and pitfalls across a spectrum of modeling, deployment, and interpretative contexts.