Sink Tokens in Transformer Models

Updated 6 March 2026

Sink tokens are tokens in Transformer models that attract a disproportionately large fraction of attention; they most often sit at the first position of a sequence.
Quantitative studies use metrics such as the sink rate and report that up to 90–99% of attention heads in high layers assign more than 0.3 of their attention mass to the initial token.
These tokens influence information flow, model stability, and efficiency, while also posing risks such as backdoor vulnerabilities during inference and continual learning.

Attention sinks, also termed "sink tokens," are a recurring architectural and representational phenomenon in Transformer-based models in which a small subset of tokens (most commonly the first position in a sequence) accumulates a disproportionately large fraction of attention mass across layers and heads, irrespective of semantic content. This phenomenon is universal across LLMs, vision-LLMs, and multimodal transformers, and has significant functional, algorithmic, and practical implications for model training, inference, compression, and design (Gu et al., 2024).

1. Formal Definition and Quantitative Characterization

Attention sinks are defined per attention head and layer as the token positions receiving a much larger mean attention weight than all others. In standard multi-head self-attention, for a head $h$ at layer $l$, the attention weights are computed as

$\alpha_{ij}^{(l,h)} = \mathrm{Softmax}_j\left(\frac{q_i^\top k_j}{\sqrt{d_h}}\right)$

where $q_i$ and $k_j$ are the query and key projections and $d_h$ is the head dimension. The sink token is any position $j^*$ such that

$\alpha_{ij^*}^{(l,h)} \gg \mathbb{E}_{j\neq j^*} [\alpha_{ij}^{(l,h)}]$

across most query positions $i$.
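The definition above can be checked numerically on a toy head. The following is a minimal sketch, not a reference implementation: the helper names (`attention_weights`, `find_sink`) and the `ratio` threshold standing in for "$\gg$" are assumptions made for illustration, and no causal mask is applied, to match the formula as written.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """alpha[i][j] = Softmax_j(q_i . k_j / sqrt(d_h)) for a single head."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

def find_sink(alpha, ratio=3.0):
    """Return j* if its mean received attention exceeds `ratio` times the
    mean over all other positions, else None. `ratio` is a hypothetical
    stand-in for the informal '>>' in the definition."""
    n_q, n_k = len(alpha), len(alpha[0])
    col_mean = [sum(alpha[i][j] for i in range(n_q)) / n_q for j in range(n_k)]
    j_star = max(range(n_k), key=lambda j: col_mean[j])
    others = [col_mean[j] for j in range(n_k) if j != j_star]
    if others and col_mean[j_star] > ratio * (sum(others) / len(others)):
        return j_star
    return None

# Toy head: key 0 is aligned with every query, the other keys are not.
queries = [[1.0, 0.0], [1.0, 0.5], [1.0, -0.5]]
keys = [[4.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
alpha = attention_weights(queries, keys)
print(find_sink(alpha))
```

In this toy setup the first key wins for every query, so it is flagged as the sink; with identical keys the attention is uniform and no position qualifies.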

A widely used measurement is the sink rate:

$\mathrm{SinkRate}(\epsilon_s) = \frac{1}{L} \sum_{l=1}^L \frac{1}{H} \sum_{h=1}^H \mathbb{1}(\alpha_1^{l,h} > \epsilon_s)$

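where $\alpha_1^{l,h}$ is the average attention assigned to the first token (often the BOS token) in head $h$, layer $l$; $L$ is the number of layers, $H$ the number of heads per layer, and $\epsilon_s$ the threshold (Zuhri et al., 29 Apr 2025, Gu et al., 2024).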
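As an illustration, the sink rate can be computed directly from a table of per-head first-token attention averages. This is a minimal sketch: the function name `sink_rate`, the nested-list input layout, and the toy numbers are assumptions, and the default threshold of 0.3 is the value quoted in this article.

```python
def sink_rate(first_token_attention, eps_s=0.3):
    """Fraction of heads whose average first-token attention exceeds eps_s,
    averaged over layers.

    first_token_attention[l][h] holds alpha_1^{l,h} for layer l, head h.
    """
    per_layer = [
        sum(1 for a in heads if a > eps_s) / len(heads)
        for heads in first_token_attention
    ]
    return sum(per_layer) / len(per_layer)

# Hypothetical 3-layer, 2-head model: sinks strengthen with depth.
toy = [
    [0.05, 0.10],  # layer 1: no head above the threshold
    [0.45, 0.20],  # layer 2: one of two heads
    [0.80, 0.95],  # layer 3: both heads
]
print(sink_rate(toy))
```

Because the indicator is averaged first over heads and then over layers, the result is a number in [0, 1] that can be compared across models with different depths and head counts.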

Empirically, in large-scale pre-trained models, up to 90–99% of heads at high layers assign more than 0.3 of their attention mass to the first token. Sinks also appear in small models once trained on sufficient data (≥500M tokens for 60M Tiny-LLaMA), and are tightly correlated with the loss function, data distribution, and architecture (Gu et al., 2024).

2. Mechanistic Origins: Softmax Normalization and Positional Anchors

Attention sink formation is primarily driven by the normalization constraint enforced by softmax attention. The sum-to-one property causes any persistent logit gap in the dot-product scores to result in a "dumping" of excess attention onto the dominant key token. In practical LLMs with relative positional encodings (e.g., RoPE), the first token often maintains a higher alignment to the queries across the sequence, even with a small norm. This establishes the first token as a reference anchor or coordinate frame—a geometric attractor in the model's representational space (Ruscio et al., 4 Aug 2025).
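The effect of a persistent logit gap can be made concrete with an idealized case. Assume, purely for illustration, that one key's logit exceeds the logits of the other $n-1$ keys by a constant gap $g$ and that those other logits are all equal; the dominant key then receives softmax mass $1 / (1 + (n-1)e^{-g})$. The sketch below evaluates this expression; the function name and the example values are assumptions.

```python
import math

def first_token_mass(gap, n_tokens):
    """Softmax mass on a single key whose logit exceeds the other
    n_tokens - 1 (equal) logits by `gap`. Idealized: real logits vary."""
    return 1.0 / (1.0 + (n_tokens - 1) * math.exp(-gap))

# Under this idealization, a fixed gap still captures most of the
# attention mass even when the context holds a thousand tokens.
for gap in (0.0, 4.0, 8.0):
    print(gap, round(first_token_mass(gap, n_tokens=1000), 4))
```

With no gap the mass is simply $1/n$; as the gap grows, the sum-to-one constraint concentrates mass on the dominant key rather than letting the total shrink.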

Attention sinks can be mitigated or removed by replacing softmax with attention kernels that do not enforce sum-to-one normalization (e.g., unnormalized sigmoid, ELU+1, identity) or with a rectified, non-sum-to-one operation such as Softpick; with these replacements, sink rates approach 0% (Zuhri et al., 29 Apr 2025, Gu et al., 2024).
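To see why dropping the normalization matters, compare softmax with an elementwise sigmoid on a query for which no key is a good match. This is a minimal sketch with made-up logits; it illustrates only the unnormalized-sigmoid case and does not implement Softpick or any other published kernel.

```python
import math

def softmax(scores):
    """Numerically stable softmax: weights always sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_weights(scores):
    """Unnormalized elementwise sigmoid: each weight is independent."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Hypothetical logits: every key is a poor match, the first slightly less so.
scores = [-3.0, -5.0, -5.0, -5.0]

soft = softmax(scores)
sig = sigmoid_weights(scores)
print("softmax first-token weight:", round(soft[0], 3))
print("sigmoid total weight:      ", round(sum(sig), 3))
```

Softmax is forced to hand most of its unit mass to the least-bad key, whereas the sigmoid weights can all stay near zero, so no token needs to absorb the excess.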

3. Roles in Information Flow and Model Dynamics

Attention sinks play a dual role in information propagation. Theoretically, they act as "no-op" or "reference" tokens, limiting pathological over-mixing or rank collapse in deep, wide-context Transformers by absorbing excess attention mass and thereby gating the exponential spread of information through the network layers (Barbero et al., 3 Apr 2025).

This mechanism has a stabilizing effect: perturbations or spurious information are attenuated by being routed to the sink, which maintains sharper representational distinctions across tokens and avoids collapse into low-rank spaces. Consequently, removing or disrupting sink tokens during inference causes abrupt performance degradation, especially in long-context regimes.

Empirical studies confirm that the presence of strong attention sinks sharply reduces sensitivity to upstream input changes beyond the first few layers and that this property becomes more pronounced with model scale and context size (Barbero et al., 3 Apr 2025, Gu et al., 2024).

4. Variations: Primary and Secondary Sinks, and Multimodal Generalizations

Primary attention sinks are global anchors that emerge in the first or second layer and persist throughout the model. Secondary sinks—lesser-known but statistically significant—can arise in middle layers, often triggered by high-norm outputs from specific MLP modules. These secondary sinks are transient, can occupy non-initial positions, and their emergence compensates for phases when the primary sink temporarily weakens, forming complex attention dynamics referred to as "sink levels" in scaling studies (Wong et al., 22 Dec 2025).

In multimodal transformers and vision-LLMs, analogous phenomena are observed. Visual tokens with large feature-norm (often corresponding to background or structural patches) act as ViT-origin attention sinks, attracting high attention but carrying little content information (Luo et al., 9 Oct 2025, Fan et al., 28 Feb 2026). These sinks are nearly invariant across inputs and can be effectively pruned or down-weighted without affecting performance (Fan et al., 28 Feb 2026, Kang et al., 5 Mar 2025).
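One simple way to surface candidate visual sinks is to look for tokens whose feature norm is far above the typical norm. The sketch below is an illustrative heuristic only: the function name, the median-based rule, and the `factor` value are assumptions, not the criterion used in the cited works.

```python
import math
import statistics

def high_norm_tokens(features, factor=5.0):
    """Return indices of tokens whose L2 feature norm exceeds `factor`
    times the median norm. `factor` is a hypothetical threshold."""
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    median_norm = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > factor * median_norm]

# Toy patch features: the third token has an outsized norm.
features = [[1.0, 0.0], [0.0, 1.0], [30.0, 40.0], [1.0, 1.0]]
print(high_norm_tokens(features))
```

The flagged indices would then be candidates for pruning or down-weighting, subject to checking that task performance is unaffected.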

5. Algorithmic Exploitation: Streaming, Compression, and Efficient Inference

Attention sinks are the basis of several streaming techniques for language and video models. KV cache optimizations such as StreamingLLM generalize to effectively unbounded contexts by keeping the initial sink tokens in memory together with a sliding window of the most recent tokens; all other past tokens are evicted. For standard LLM deployments, as few as four initial tokens (or one learnable sink token after specialized pretraining) suffice to maintain perplexity and accuracy at long contexts, yielding substantial memory and speed gains of up to 22× faster per token at fixed cache (Xiao et al., 2023, Yi et al., 4 Dec 2025).
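A sink-preserving cache policy can be sketched in a few lines. The version below assumes that the cache keeps the first `n_sink` positions together with a rolling window of the most recent positions; the function name, the parameter names, and the window size are illustrative assumptions, and the sketch tracks token positions only, not the key and value tensors themselves.

```python
def streaming_cache_positions(seq_len, n_sink=4, window=8):
    """Token positions retained in the KV cache after `seq_len` tokens:
    the first `n_sink` positions plus the most recent `window` positions."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # nothing to evict yet
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

# After 20 tokens, the middle of the sequence has been evicted.
kept = streaming_cache_positions(20, n_sink=4, window=8)
print(kept)
```

The cache size is bounded by `n_sink + window` regardless of how long the stream runs, which is where the memory savings come from.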

SF-Attn and variants in SinkLoRA and Deep Sink architectures generalize this strategy to local/global mixtures, using a small set of learnable sink tokens as global communicators in otherwise windowed sparse attention setups. These mechanisms recover the modeling power of dense attention at near-sparse time and memory cost (Zhang, 2024, Yi et al., 4 Dec 2025).
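The local/global mixture can be pictured as an attention mask in which every query sees a few sink positions plus a local window. The following is a generic sketch under that assumption; it is not the SF-Attn, SinkLoRA, or Deep Sink implementation, and the function and parameter names are hypothetical.

```python
def sink_window_mask(seq_len, n_sink=2, window=3):
    """mask[i][j] is True when query i may attend to key j: j must not be
    in the future, and must be either a sink position or within the last
    `window` positions up to and including i."""
    return [
        [(j <= i) and (j < n_sink or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sink_window_mask(8)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `n_sink + window` allowed entries, so the cost per query stays constant while the sink columns provide a path for global information.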

In token and neuron pruning, sink-aware methods penalize unstable or context-dependent sinks (critical in diffusion LMs where anchor stability is poor), reallocating compute to consistently useful tokens and improving accuracy–sparsity trade-offs (Myrzakhan et al., 19 Feb 2026).

6. Risks, Failure Modes, and Mitigation Strategies

While attention sinks are functionally useful, they also introduce vulnerabilities and training pathologies. Security risks include their exploitation as backdoor triggers during LLM unlearning; placing malicious triggers at sink positions magnifies the backdoor effect and enables restoration of supposedly "forgotten" knowledge (Shang et al., 19 Oct 2025). In continual learning, over-reliance on common sink tokens (such as [SEP] in BERT/RoBERTa) induces over-smoothing and cross-task interference, harming task transfer (Bai et al., 2024). In mixture-of-experts (MoE)-like native attention forms, excessive sink use causes head collapse, whereby a few attention heads monopolize computation. Sink-aware load balancing mitigates this effect, improving head utilization and overall accuracy (Fu et al., 1 Feb 2026).

Effective mitigation strategies include:

Replacing softmax with unnormalized or rectified kernels (e.g., Softpick) (Zuhri et al., 29 Apr 2025, Gu et al., 2024);
Introducing explicit, learnable key-only biases to decouple sink behavior from real tokens (Gu et al., 2024);
Scaling down positional encoding biases or distributing anchoring load across multiple tokens (Ruscio et al., 4 Aug 2025);
Pre-scaling or entropy-based regularization to enforce attention diversity, especially in continual learning (Bai et al., 2024);
Careful architectural design of KV caching and window masking to control sink formation and impact (Xiao et al., 2023, Zhang, 2024).
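The key-only bias idea in the list above can be illustrated by giving softmax one extra slot that has a logit but no value vector. This is a simplified sketch: it assumes the bias is represented as a single scalar `sink_logit` (in a trained model it would come from a query-key product with a learnable key), and the function name is hypothetical.

```python
import math

def softmax_with_sink_logit(scores, sink_logit=0.0):
    """Softmax over the real-token scores plus one extra 'sink' slot.
    Returns (weights_for_real_tokens, sink_weight). Mass assigned to the
    sink slot carries no value, so it contributes nothing to the output."""
    m = max(scores + [sink_logit])
    exps = [math.exp(s - m) for s in scores]
    sink = math.exp(sink_logit - m)
    total = sum(exps) + sink
    return [e / total for e in exps], sink / total

# Hypothetical logits where no real token is a good match.
real, sink = softmax_with_sink_logit([-3.0, -5.0, -5.0, -5.0])
print("mass on real tokens:", round(sum(real), 3))
print("mass on sink slot:  ", round(sink, 3))
```

The excess mass lands on the dedicated slot instead of on the first real token, which is the sense in which sink behavior is decoupled from real tokens.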

7. Broader Implications and Research Outlook

Understanding and controlling attention sinks is foundational for robust and efficient deployment of Transformers in long-context, streaming, continual learning, and multimodal tasks. They are deeply intertwined with architectural choices—especially normalization, positional encodings, masking, and objective formulation—and with emergent behaviors such as massive activation, over-mixing avoidance, and cross-modal alignment (Sun et al., 5 Mar 2026, Wong et al., 22 Dec 2025).

Current research continues to investigate the interplay between attention sinks, sparsity, structural redundancy, and implicit gating. Softmax normalization remains a key driver of sink formation, and alternative attention architectures are an active area of development for both efficiency and improved interpretability. The dynamic between primary and secondary sinks, and their role in scaling behavior and model specialization, signals a rich avenue for future mechanistic and theoretical analysis.

References: