
Attention Sink in Transformers

Updated 30 December 2025
  • An attention sink token is a token (e.g., <BOS> or [CLS]) that attracts most of the attention mass in transformer models.
  • The phenomenon appears as a dominant column in the self-attention matrix, driven by softmax normalization, with over 70% of heads showing sink behavior in many architectures.
  • This behavior affects model efficiency, optimization, and robustness, influencing strategies in streaming, quantization, pruning, and mitigation techniques.

The attention sink is a consistent and pervasive phenomenon in transformer-based architectures, especially large language models (LLMs), vision-language models, and multimodal transformers. An attention sink is a token (most often the first token, such as <BOS>, or special separators like [CLS]/[SEP]) that systematically attracts a disproportionately large share of attention mass across attention heads and layers, regardless of its semantic content. This property is intrinsic to models using softmax attention normalization and has far-reaching implications for model optimization, interpretability, deployment strategies, and the development of architectural and algorithmic mitigation techniques.

1. Formal Definition, Mathematical Formulation, and Empirical Detection

The attention sink phenomenon manifests when the self-attention matrix for a given head displays a pattern whereby most queries assign a significant fraction of their total attention to a single position $j^*$ (the sink). The general mathematical form for the attention matrix in layer $l$, head $h$ is $A^{l,h}_{i,j} = \mathrm{Softmax}\left(\frac{Q^{l,h}_i (K^{l,h}_j)^T}{\sqrt{d_h}}\right)$, where $Q$, $K$ are query/key projections of the previous-layer hidden states.

A concrete attention sink criterion is $A^{l,h}_{i,s} \gg \frac{1}{T} \sum_{j \neq s} A^{l,h}_{i,j}$ for most $i > 1$, i.e., the first column absorbs a much greater fraction of the mass than typical positions. An operational metric is

$\mathrm{Sink}_1^{\epsilon} = \text{fraction of heads with } A^{l,h}_{i,1} > \epsilon \text{ for most } i > 1.$

Thresholds like $\epsilon = 0.3$ are standard; universality is demonstrated when more than $70\%$ of heads in mainline models assign more than $30\%$ of attention to token 1 (Gu et al., 2024). For general detection, high column sums and low entropy in attention maps are indicators.
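
As a concrete illustration, $\mathrm{Sink}_1^{\epsilon}$ can be estimated directly from a model's attention maps. The sketch below assumes a Hugging Face-style causal LM run with an attention implementation that materializes attention weights (so `output_attentions=True` returns per-layer tensors of shape (batch, heads, seq, seq)); `model`, `tokenizer`, and the `query_frac` operationalization of "for most i > 1" are illustrative placeholders.

```python
import torch

@torch.no_grad()
def sink_metric(model, tokenizer, text, eps=0.3, query_frac=0.8):
    """Fraction of heads whose attention to the first token exceeds `eps`
    for at least `query_frac` of the queries (a Sink_1^eps estimate)."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**inputs, output_attentions=True)
    sink_heads, total_heads = 0, 0
    for attn in out.attentions:            # one (B, H, T, T) tensor per layer
        col0 = attn[0, :, 1:, 0]           # queries after the first token, key position 0
        frac = (col0 > eps).float().mean(dim=-1)   # per-head fraction of queries above eps
        sink_heads += (frac > query_frac).sum().item()
        total_heads += attn.shape[1]
    return sink_heads / total_heads
```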

Empirically, attention sinks appear not only at the beginning of sequences (BOS) but can also occur at later tokens—such as delimiters, punctuation, or specially designated sink tokens in domain-specific applications (Yu et al., 2024).

2. Mechanistic Origins: Softmax Constraint and Emergence in Training

The fundamental origin of the attention sink is the softmax normalization in attention: $\sum_j A^{l,h}_{i,j} = 1$, $A^{l,h}_{i,j} \geq 0$. A small consistent bias in the query-key dot product in favor of a particular position (usually due to the structure of positional encodings and the never-predicted status of BOS) is exponentially amplified, causing almost all the attention mass to aggregate on a single index.
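
The amplification is easy to reproduce numerically: a constant few-unit logit advantage for one position, combined with the exponential in softmax, yields a dominant attention share. A minimal sketch with illustrative numbers, not drawn from any particular model:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=64)      # typical query-key scores over 64 positions
logits[0] += 5.0                  # modest, consistent bias toward position 0

weights = softmax(logits)
print(f"attention on position 0:  {weights[0]:.2f}")        # a dominant share
print(f"mean attention elsewhere: {weights[1:].mean():.4f}") # squeezed toward zero
```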

During pretraining, attention sinks emerge as soon as the optimization is effective on sufficient data: sink metrics transition from zero to high values only after training loss has dropped (requiring full data coverage and suitable learning rates). The sink effect is tied to the loss function and data distribution; for example, with prefix-LM loss, the sink migrates to the prefix region (Gu et al., 2024).

Attention sink tokens correspond to learned "key-bias registers": $k_1^{l,h}$ is strategically placed so that $q_i^{l,h} \cdot k_1^{l,h}$ is systematically large for all $q_i$, absorbing most of the softmax probability. Value vectors at these positions are of vanishing norm, meaning the sink does not contribute useful content to the computation (Gu et al., 2024).
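
Both signatures of such a register (a sink key that scores highly against every query, and a sink value of unusually small norm) can be probed given one layer/head's projected tensors. A hedged sketch, assuming `q`, `k`, `v` are available as (T, d_h) tensors (e.g., captured via forward hooks, not shown here):

```python
import torch

def sink_register_stats(q, k, v, sink_idx=0):
    """q, k, v: (T, d_h) query/key/value tensors for a single layer and head.
    Returns (margin, value_ratio): how much the sink key out-scores the average key
    for a typical query, and the sink value norm relative to the mean value norm."""
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    margin = (scores[:, sink_idx] - scores.mean(dim=-1)).mean().item()
    value_ratio = (v[sink_idx].norm() / v.norm(dim=-1).mean()).item()
    return margin, value_ratio   # sink heads: large positive margin, value_ratio << 1
```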

Alternatives to softmax, such as unnormalized sigmoid kernels or kernelized linear attention, break the sum-to-one constraint and do not exhibit attention sinks even in multi-hundred-million parameter models (Gu et al., 2024).
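
The contrast with an unnormalized kernel is visible even on toy scores: under softmax, any mass granted to the biased position is taken from all the others, while an elementwise sigmoid scores positions independently and leaves them untouched. A minimal sketch (illustrative only; not the training-time comparison of Gu et al., 2024):

```python
import numpy as np

scores = np.random.default_rng(1).normal(size=32)
scores[0] += 4.0                            # same kind of biased position as above

softmax_w = np.exp(scores - scores.max())
softmax_w /= softmax_w.sum()                # sum-to-one: other positions get squeezed

sigmoid_w = 1.0 / (1.0 + np.exp(-scores))   # no normalization: no competition for mass

print("softmax weight on position 0:", round(float(softmax_w[0]), 2))
print("sigmoid weight on position 0:", round(float(sigmoid_w[0]), 2),
      "| mean sigmoid weight elsewhere:", round(float(sigmoid_w[1:].mean()), 2))
```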

3. Universality, Variants Across Architectures, and the Geometric Perspective

Attention sinks are universal in decoder-only LLMs (GPT-2, OPT, Pythia, LLaMA, Mistral, etc.), with more than 70% of heads showing sink behavior for the first token across datasets and fine-tuning regimes (Gu et al., 2024). The details vary across architectures with different positional encodings.

Sinks arise to establish stable reference frames in high-dimensional token spaces, anchoring each token’s representation against positional drift and enforcing geometric consistency (Ruscio et al., 4 Aug 2025).

4. Functional Role: Over-Mixing Control, Efficiency, and Inductive Bias

The presence of an attention sink provides an effective mechanism to control over-mixing and over-squashing in deep, long-context models. By routing attention to the sink, the network slows the mixing of information and prevents the representational collapse that would occur if softmax attention were spread too thinly (Barbero et al., 3 Apr 2025). As context length and model depth increase, sinks become stronger to maintain controllable propagation of information.

The sink implements a kind of dynamic depth gating or “no-op” behavior: heads attending exclusively to the sink with low-norm values effectively switch themselves off, reducing unnecessary or harmful information flow (Barbero et al., 3 Apr 2025; Gu et al., 2024).

This behavior is not limited to language modeling. In vision transformers (ViTs) and video diffusion transformers, sink phenomena manifest at the [CLS] token or certain positional anchors, potentially to the detriment of local feature extraction (Feng et al., 9 Apr 2025; Wen et al., 14 Apr 2025). In multimodal models, anomalously strong attention sinks may emerge on irrelevant visual tokens, leading to inefficiency and degraded downstream semantics (Kang et al., 5 Mar 2025).

5. Practical Implications: Streaming, KV Cache, Quantization, Robustness, and Model Efficiency

Streaming/Long-context Generation

Attention sinks are a core enabler of efficient windowed-key-value (KV) cache management in streaming LLMs. By always retaining the K/V of a small set of sink tokens (often the first 4), models achieve stable perplexity and accuracy for sequence lengths far exceeding the training window, eliminating the need for expensive full-attention or recomputation (Xiao et al., 2023). When models are pretrained with a dedicated <SINK> token, only one sink needs to be cached (Xiao et al., 2023).
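
A minimal sketch of this eviction policy, assuming the KV cache for one layer is stored as (n_heads, seq_len, head_dim) tensors (the function name and layout are illustrative, not the actual StreamingLLM API):

```python
import torch

def evict_kv(keys, values, n_sink=4, window=1024):
    """Keep KV entries for the first `n_sink` tokens plus the most recent `window` tokens.

    keys, values: tensors of shape (n_heads, seq_len, head_dim).
    Returns truncated (keys, values) whose length never exceeds n_sink + window.
    """
    seq_len = keys.shape[1]
    if seq_len <= n_sink + window:
        return keys, values
    keep = torch.cat([
        torch.arange(n_sink),                      # attention-sink tokens
        torch.arange(seq_len - window, seq_len),   # sliding window of recent tokens
    ])
    return keys[:, keep], values[:, keep]
```

In the actual streaming setting, positions are re-indexed within the truncated cache rather than kept from the original sequence; the sketch above shows only the retention rule.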

Quantization

Sink tokens are outliers in key/value activation norms. Quantization schemes must protect these entries (static “preserve-first-N” or dynamic methods like KVSink that predict sink tokens by tracing cross-layer outlier channels) to avoid catastrophic degradation, particularly at small bit-widths (Su et al., 6 Aug 2025). Sinks may occur at positions other than the very beginning, so hard-coded “first-N” heuristics are insufficient.
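
A static "preserve-first-N" scheme is simple to sketch: the first N (likely sink) tokens keep full-precision KV entries while the rest are quantized, here with a naive per-token symmetric int8 round-trip purely for illustration (KVSink's dynamic sink prediction is not shown):

```python
import torch

def quantize_int8(x):
    """Naive per-token symmetric int8 fake-quantization (round-trip for illustration)."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    return (x / scale).round().clamp(-127, 127) * scale

def quantize_kv_preserve_sinks(kv, n_preserve=4):
    """kv: tensor of shape (seq_len, head_dim). Keep the first `n_preserve`
    (likely sink) tokens in full precision and fake-quantize the rest."""
    out = kv.clone()
    out[n_preserve:] = quantize_int8(kv[n_preserve:])
    return out
```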

Model Compression and Pruning

Many heads whose attention is fully absorbed by sinks output near-zero norms (dormant heads) and can be omitted at inference (up to 14% of heads per model) with no impact on accuracy (Sandoval-Segura et al., 4 Apr 2025). Techniques like OrthoRank exploit the geometric movement of tokens toward the sink direction in hidden state space to dynamically prune uninformative tokens, thus improving computational efficiency (Shin et al., 5 Jul 2025).
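
Dormant heads can be flagged from their output norms alone. The sketch below assumes access to one layer's per-head attention outputs before the output projection; the relative threshold is an illustrative choice, not the exact criterion of Sandoval-Segura et al. (4 Apr 2025):

```python
import torch

def dormant_heads(head_outputs, rel_threshold=0.05):
    """head_outputs: (batch, n_heads, seq_len, head_dim) per-head attention outputs.
    A head is flagged as dormant if its mean output norm is a small fraction of the
    layer's average head output norm."""
    norms = head_outputs.norm(dim=-1).mean(dim=(0, 2))     # mean output norm per head
    return (norms < rel_threshold * norms.mean()).nonzero(as_tuple=True)[0].tolist()
```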

Vulnerabilities and Adversarial Implications

Attention sink positions act as privileged highways for triggers in prompt injection and backdoor unlearning attacks: their high attention mass enables triggers placed at sink locations to hijack the output with high reliability and persistence (Shang et al., 19 Oct 2025). Defenses include randomization of sink locations and regularization of sink attention mass.

Multimodal and Vision-Transformer Models

In vision models, attention sinks at the [CLS] or spurious sink tokens can distort patch-level processing. Encoder-decoder architectures such as EDIT decouple patch processing (encoder) from [CLS] summarization (decoder) using cross-attention to mitigate sink over-centralization and improve both performance and interpretability (Feng et al., 9 Apr 2025).

Failure Modes and Over-smoothing

In encoder-only models (BERT, RoBERTa), attention sinks at [SEP] or similar delimiters induce oversmoothing, degrading single-task and continual learning performance by flattening representations and propagating cross-task interference. Pre-scaling strategies that reweight attention mass away from sink tokens yield substantial gains in continual learning benchmarks (Bai et al., 2024).
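
One simple form such a reweighting can take (shown purely as an illustrative sketch; not necessarily the pre-scaling of Bai et al., 2024) is to downweight the attention columns of known delimiter/sink positions and renormalize:

```python
import torch

def rescale_sink_columns(attn, sink_positions, alpha=0.5):
    """attn: (n_heads, seq_len, seq_len) post-softmax attention weights.
    Downweight the columns of designated sink positions (e.g. [SEP]) by `alpha`
    and renormalize so each row still sums to one."""
    attn = attn.clone()
    attn[:, :, sink_positions] *= alpha
    return attn / attn.sum(dim=-1, keepdim=True)
```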

6. Mitigation and Control: Alternatives, Regularization, and Engineering Recommendations

Mitigation and control strategies include:

  • Replacing softmax: Unnormalized kernels (sigmoid, ReLU-plus-one) break the sum-to-one constraint and eliminate sink formation (Gu et al., 2024).
  • Architectural interventions: Addition of explicit key-bias or bias registers to redirect sink mass in a controlled fashion (Gu et al., 2024).
  • Regularization: Entropy penalties, head dropout, or decorrelation losses between sink and non-sink tokens to ensure more uniform attention distributions and decorrelate hidden-state alignment (Anand et al., 26 Oct 2025); a minimal sketch of one such penalty follows this list.
  • Layer-wise or head-wise ablation: Dynamic detection and pruning of dormant heads/tokens based on hidden-state norms or attention mass (Sandoval-Segura et al., 4 Apr 2025; Shin et al., 5 Jul 2025).
  • Training-time interventions: Managing learning rates, data fit, prefix/packing procedures, and reconsidering Adam versus SGD or the ReLU versus softmax activation choice in attention mechanisms (Gu et al., 2024; Guo et al., 2024).
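
As referenced in the regularization item above, a minimal sketch of such a penalty is an auxiliary loss term that discourages excess attention mass on the sink column; the form, threshold, and coefficient below are illustrative, not the exact losses of Anand et al. (26 Oct 2025).

```python
import torch

def sink_mass_penalty(attentions, sink_idx=0, budget=0.1):
    """attentions: iterable of (batch, heads, seq, seq) attention tensors (one per layer).
    Penalize, per head, any average attention mass on the sink column beyond `budget`."""
    penalty = 0.0
    for attn in attentions:
        sink_mass = attn[..., sink_idx].mean(dim=-1)          # mean over queries -> (B, H)
        penalty = penalty + torch.relu(sink_mass - budget).mean()
    return penalty

# Illustrative training-loop usage: loss = lm_loss + 0.01 * sink_mass_penalty(out.attentions)
```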

Control of attention sinks is critical for both eliminating pathologies and leveraging their functional role in long-context efficiency, model compression, or rapid few-shot adaptation via the “catch, tag, and release” mechanism (Zhang et al., 2 Feb 2025).

7. Broader Significance, Outstanding Questions, and Future Directions

Attention sinks are not an interpretability artifact or a purely negative feature; they are a structural property of softmax attention transformers arising from the requirements of geometric anchoring, optimization constraints, and representational stability. Their formation is rapid and robust during training, and their forms (centralized, distributed, bidirectional) are determined by the positional encoding scheme and problem domain (Ruscio et al., 4 Aug 2025).

Open questions include:

  • What is the full capacity–efficiency tradeoff curve for explicit sink engineering?
  • How can one dynamically manage or retrain sinks to optimize robustness, continual learning, or interpretability?
  • What alternatives to softmax or adaptive masking best combine the functional advantages of sinks with non-pathological attention patterns?

Expanding the geometric and mechanistic foundations promises more robust, efficient, and controllable transformer architectures across language, vision, and multimodal settings (Gu et al., 2024; Ruscio et al., 4 Aug 2025; Bai et al., 2024).
