
Attention Sink in Transformers

Updated 5 November 2025
  • Attention sink is a phenomenon in which tokens with little semantic content (e.g., [BOS], punctuation) attract excessive attention across transformer layers as a byproduct of softmax normalization.
  • A token is identified as a sink when the mean attention it receives exceeds α/N; the phenomenon affects model stability, optimization, and interpretability across diverse architectures.
  • Techniques like Attention Calibration (ACT) effectively redistribute attention, resulting in accuracy gains up to +7.3% while preserving key model functionalities.

Attention sink is a recurrent and structurally significant phenomenon in transformer-based deep learning models, characterized by the systematic and often disproportionate allocation of attention scores to specific tokens, typically those with low or no semantic importance (such as the sequence-initial token), across layers and heads. Originally observed in autoregressive LLMs, attention sinks are now recognized as a widespread emergent property in diverse model architectures, input modalities, and application domains, with direct consequences for optimization, interpretability, robustness, efficiency, and model control.

1. Definition and Mechanistic Origins

An attention sink is a token in the model's input sequence that receives a sharply higher cumulative attention score from other tokens, often irrespective of the token's semantic relevance. More formally, for an attention head $h$ and layer $l$ of a transformer decoder or encoder-decoder, the (query-to-key) attention matrix is

$$\mathbf{A}_h^l = \mathrm{Softmax}\!\left( \frac{f_Q^l(\mathbf{X}^l)\, f_K^l(\mathbf{X}^l)^\top}{\sqrt{d_k}} \right).$$

A token $j$ is designated an attention sink, relative to a hyperparameter $\alpha > 1$ and input sequence length $N$, if

$$a_h^l[j] = \frac{1}{N} \sum_{i=1}^{N} \mathbf{A}_h^l[i, j] > \frac{\alpha}{N},$$

where $a_h^l[j]$ is the mean attention score received by token $j$ across query rows.
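
As an illustration, the criterion above can be checked directly on a layer's attention maps. The sketch below is a minimal PyTorch implementation under the stated definition; the function name and the default value of $\alpha$ are illustrative rather than taken from the cited papers.

```python
import torch

def detect_attention_sinks(attn: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Flag sink tokens in one layer's attention maps.

    attn:  tensor of shape (num_heads, N, N); row i of each head holds query i's
           softmax-normalized distribution over key positions j.
    alpha: multiple of the uniform share 1/N above which a token counts as a sink.

    Returns a boolean tensor of shape (num_heads, N); entry [h, j] is True when
    token j receives more than alpha/N attention on average in head h.
    """
    n = attn.shape[-1]
    received = attn.mean(dim=1)      # mean attention received by each key token j
    return received > (alpha / n)    # boolean sink mask per head
```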

The origin of attention sinks lies in the probabilistic constraint enforced by the softmax normalization over tokens. Since the attention weights must sum to one, the model distributes “surplus” attention, especially when no strong contextual matches are present, preferentially towards certain positions—typically the sequence-initial token in autoregressive models, and more generally towards non-semantic tokens in any context. These tokens act as “anchors” or “reservoirs” to absorb excess attention, a byproduct of architectural inductive biases and training objectives (Gu et al., 14 Oct 2024, Xiao et al., 2023).
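
A toy numerical example of this constraint, with hand-picked logits used purely for illustration: when no key matches the query strongly, the softmax must still place a full unit of probability mass somewhere, and a position whose key aligns with a learned bias direction absorbs most of it.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hand-picked query-key logits for one query row (illustrative values only):
# no content token is a strong match, but position 0 (the would-be sink)
# carries a modestly larger score from a learned bias direction.
logits = np.array([2.0, 0.1, -0.2, 0.0, 0.1])
print(softmax(logits))   # roughly [0.65, 0.10, 0.07, 0.09, 0.10]
# The row must sum to 1, so the "surplus" mass that is not justified by
# content similarity is absorbed by position 0.
```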

2. Structural and Empirical Distribution

While attention sinks were first observed almost exclusively at the initial position (e.g., <s> or [BOS]), systematic empirical investigation has revealed that they can also emerge throughout the input sequence, particularly on tokens with low semantic information, such as punctuation, line breaks, or special formatting tokens (Yu et al., 22 Jun 2024). Their spatial and layerwise distribution shows characteristic properties:

  • Most frequent at the start of the sequence, but not confined to it.
  • Prominent in intermediate layers; less so in initial or final layers.
  • Non-semantic tokens (e.g., “.”, linebreak, or arbitrary non-content markers) often act as attention sinks.
  • Histograms of per-position sink occurrence typically show a strong peak at the initial position, and a nontrivial distribution elsewhere.

Attention sinks are empirically detected by averaging attention maps over layers, heads, and inference contexts—sharp vertical stripes in these maps indicate sink tokens (Yu et al., 22 Jun 2024, Xiao et al., 2023).
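
For example, such averaged maps can be extracted from any Hugging Face causal LM that returns attention weights; in the sketch below, the model choice, prompt, and threshold $\alpha$ are illustrative. It averages attention over layers, heads, and query positions and flags tokens whose mean received attention exceeds $\alpha/N$.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM that can return attention weights works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple of per-layer tensors, each (batch, heads, N, N).
attn = torch.stack(out.attentions)            # (layers, batch, heads, N, N)
mean_received = attn.mean(dim=(0, 1, 2, 3))   # average over layers, batch, heads, queries -> (N,)

alpha = 5.0                                   # illustrative sink threshold
n = mean_received.shape[0]
for j, token in enumerate(tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())):
    flag = "SINK" if mean_received[j] > alpha / n else ""
    print(f"{j:3d} {token!r:>12} {mean_received[j].item():.3f} {flag}")
```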

3. Impact on Model Performance and Interpretability

The effects of attention sinks are nuanced and highly position-dependent:

  • Beneficial cases: The initial sink (first token) is empirically and theoretically beneficial to model stability, particularly in autoregressive and long-context settings. It acts as a safeguard against “forgetting” and mitigates over-mixing pathologies (Yu et al., 22 Jun 2024, Barbero et al., 3 Apr 2025). In windowed or streaming attention, retaining attention-sink tokens in the context cache is necessary for stable inference; removing them induces catastrophic degradation (e.g., a more than 1000× increase in perplexity) (Xiao et al., 2023).
  • Detrimental cases: Distributed or intermediate sinks, especially those not aligned with useful semantics, impair model performance. They divert attention from semantically relevant context, reducing effective context length and lowering achievable accuracy across both multiple-choice and open-ended tasks. Directly reducing and redistributing attention from such sinks (excluding the initial sink) reliably improves accuracy on benchmark datasets, with observed per-task/model gains ranging from 0.3% to 7.3% (Yu et al., 22 Jun 2024).
  • Interpretability: The presence of spurious sinks complicates map interpretability and can introduce misleading structure in attribution analyses.

4. Attention Calibration and Direct Manipulation

The recognition that attention sinks can be detrimental when uncontrolled motivated the development of the Attention Calibration Technique (ACT), a robust, training-free inference-time procedure (Yu et al., 22 Jun 2024). ACT proceeds in two stages:

  1. Head Selection: A lightweight offline routine identifies attention heads and layers for which sink calibration improves performance (using a small held-out set).
  2. Dynamic Calibration: At inference, for each selected head/layer and input, tokens with $a_h^l[s] > \alpha / N$ are detected as sinks. Their received attention is reduced by a scale factor $\beta$ ($0 < \beta < 1$; default $\beta = 0.4$), and the subtracted attention is reallocated among non-sink tokens proportionally within each row, preserving normalization. No model parameters are updated; all adjustment is per-input and per-head (a minimal sketch of this step follows the list).
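
A minimal sketch of the dynamic calibration step for a single head, assuming the sink criterion above; the redistribution convention (keeping a fraction $\beta$ of sink attention and reallocating the rest proportionally) follows the description in this section and is a simplification, not the ACT reference implementation.

```python
import torch

def calibrate_attention(attn: torch.Tensor, alpha: float = 5.0, beta: float = 0.4) -> torch.Tensor:
    """ACT-style calibration of one head's (N, N) row-stochastic attention map.

    Sinks (excluding the beneficial initial token) keep only a fraction beta of the
    attention they receive; the freed mass is redistributed proportionally among
    non-sink tokens within each row, so every row still sums to 1.
    """
    n = attn.shape[-1]
    received = attn.mean(dim=0)          # mean attention received per key token
    sink = received > alpha / n          # detection criterion a_h^l[j] > alpha / N
    sink[0] = False                      # leave the initial attention sink untouched
    if not sink.any():
        return attn

    calibrated = attn.clone()
    freed = calibrated[:, sink].sum(dim=1, keepdim=True) * (1.0 - beta)  # per-row mass to move
    calibrated[:, sink] *= beta                                          # shrink sink columns

    non_sink = ~sink
    non_sink_mass = calibrated[:, non_sink].sum(dim=1, keepdim=True).clamp_min(1e-12)
    # Reallocate the freed mass proportionally to each row's existing non-sink attention.
    calibrated[:, non_sink] += freed * calibrated[:, non_sink] / non_sink_mass
    return calibrated
```

In a full model, this adjustment would be applied inside the attention forward pass, but only for the heads and layers selected offline in stage 1.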

Empirical application of ACT consistently and robustly improves accuracy, with the strongest effects for large models (up to +7.3% for Llama-30B). Sensitivity to the choice of $\alpha$, $\beta$, head selection, or calibration dataset is low. Visualizations confirm that ACT preserves the initial attention sink while eliminating irrelevant late-position sinks, sharpening focus on semantically grounding context (Yu et al., 22 Jun 2024).

5. Theoretical Foundations and Broader Connections

Recent research posits that attention sinks (and the related phenomenon of compression valleys) originate from the emergence of massive activations in the residual stream, which compress the internal representations into near rank-1 matrices (Queipo-de-Llano et al., 7 Oct 2025). Theoretical bounds show that a single token (e.g., [BOS]) with a disproportionately large norm in the hidden space ensures singular-value dominance, dramatic entropy reduction, and sharp attention focusing:

$$\sigma_1^2 \geq M + \alpha R, \qquad p_1 \geq \frac{c+\alpha}{c+1}, \qquad H(\mathbf{X}) \leq -p\log p - (1-p)\log(1-p) + (1-p)\log(r-1),$$

where $M$ is the sink norm, $R$ the sum of the other tokens' norms, $c = M/R$, $p$ the proportion of variance explained, and $H$ the entropy of the representations.
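
A small numerical illustration of this effect, with arbitrary dimensions and scaling chosen here for exposition rather than taken from the cited work: injecting one massive-norm token into an otherwise random representation matrix drives the top singular direction's share of variance toward 1 and collapses the spectral entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 128, 64

# Random "residual stream" states with unit-scale norms, plus one massive
# activation at position 0 standing in for the [BOS]/sink token.
X = rng.normal(size=(n_tokens, d_model))
X[0] *= 100.0

s = np.linalg.svd(X, compute_uv=False)
p = s**2 / (s**2).sum()                       # variance share per singular direction
spectral_entropy = -(p * np.log(p + 1e-12)).sum()

print(f"variance explained by top direction: {p[0]:.3f}")
print(f"spectral entropy: {spectral_entropy:.3f} (uniform max: {np.log(len(s)):.3f})")
# With the massive token present, p[0] is close to 1 and the entropy collapses,
# i.e. X is approximately rank-1, in line with the bounds above.
```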

This framework explains why the computational depth of LLMs can be partitioned into three phases: early mixing, compression (driven by attention sinks), and late selective refinement. Embedding and classification tasks benefit most from compressed (sink-dominated) mid-layers, while generative objectives prefer the re-expanded representations in late layers (Queipo-de-Llano et al., 7 Oct 2025, Barbero et al., 3 Apr 2025).

6. Practical Implications, Usage, and Caveats

Attention sink properties yield practical engineering levers:

  • Inference Acceleration: Knowing that only the initial sink token is required enables aggressive context truncation in streaming or long-form inference, provided the sink is retained in the cache (Xiao et al., 2023); see the sketch after this list.
  • Quantization & KV Cache: Sink tokens play an outsized role in attention computation. Preserving their key and value tensors at higher precision during KV cache quantization is necessary to avoid catastrophic performance drops—a finding underlying recent quantization schemes (Su et al., 6 Aug 2025).
  • Security and Robustness: Unanticipated manipulation of attention sinks, including relocation or targeted input perturbations, can open vulnerabilities, enabling backdoor access, evaluation evasion, or hallucination attacks in downstream applications (Shang et al., 19 Oct 2025, Wang et al., 25 Jan 2025).
  • Interpretability and Diagnostics: Visualization and targeted calibration of attention sinks provide insight into model information flow and potential biases but must distinguish between useful (initial position) and detrimental (late, non-semantic) sinks.
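
As a concrete illustration of the first bullet's sink-retention policy, the following sketch keeps a handful of initial sink positions plus a sliding window of recent positions when truncating a per-head KV cache; it is a simplified eviction rule in the spirit of streaming attention, not the implementation from the cited work, and the defaults are illustrative.

```python
import torch

def truncate_kv_cache(keys: torch.Tensor, values: torch.Tensor,
                      num_sink: int = 4, window: int = 1024):
    """Evict middle positions from a single head's KV cache.

    keys, values: tensors of shape (seq_len, head_dim)
    num_sink:     initial attention-sink positions that are always retained
    window:       most recent positions retained for local context
    """
    seq_len = keys.shape[0]
    if seq_len <= num_sink + window:
        return keys, values            # nothing to evict yet
    keep = torch.cat([
        torch.arange(num_sink),                    # sink tokens at the start of the sequence
        torch.arange(seq_len - window, seq_len),   # sliding window of recent tokens
    ])
    return keys[keep], values[keep]
```

Dropping the first `num_sink` indices from `keep` corresponds to evicting the sink tokens, the setting reported above to cause catastrophic perplexity degradation.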

Empirical evidence supports the universality of attention sinks in classic Transformer models trained with softmax normalization. However, the phenomenon can be removed or modulated by architectural modifications to the scoring kernel (e.g., using kernelized or sum-free attention, or alternative normalization), suggesting that attention sinks are contingent, not fundamental (Gu et al., 14 Oct 2024, Zuhri et al., 29 Apr 2025).
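
As one illustration of such a modification, the sketch below implements a softmax variant with an implicit extra zero logit in the denominator, so a query can leave probability mass unassigned instead of dumping it on a sink token; it is meant as a representative example of an alternative normalization, not the specific kernels used in the cited works.

```python
import torch

def softmax_off_by_one(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Compute exp(x_i) / (1 + sum_j exp(x_j)) along `dim`, numerically stabilized.

    Unlike standard softmax, rows need not sum to 1: the implicit "+1" slot can
    absorb surplus probability mass, removing the pressure to form a sink token.
    """
    m = scores.amax(dim=dim, keepdim=True).clamp_min(0.0)   # shift by max(scores, 0)
    exp_scores = torch.exp(scores - m)
    denom = torch.exp(-m) + exp_scores.sum(dim=dim, keepdim=True)
    return exp_scores / denom
```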

7. Summary Table

| Aspect | Findings and Implications |
| --- | --- |
| Definition | Tokens (often of low semantic content) receiving abnormally high attention |
| Occurrence | Initial token most frequent, but not unique; sinks appear throughout the input |
| Identification criterion | Token-wise received attention $a_h^l[i] > \alpha/N$ |
| Impact on LLM accuracy | Initial sink is beneficial; other sinks are typically detrimental |
| Calibration technique | Training-free per-head sink reduction and redistribution (ACT) |
| Typical accuracy gains | +0.3% to +7.3% across models/tasks, robust to parameter choice |
| Usage | Plug-in at inference; orthogonal to prompt design or fine-tuning |
| Broader implications | Efficient tuning, interpretability, security, context management |

Attention sink thus represents both a theoretical challenge and practical opportunity. Understanding and managing this phenomenon is central to the development of interpretable, controllable, and high-performing transformer-based systems across text and multimodal domains.
