Attention Sinks in Transformer LLMs

Updated 7 March 2026

Attention sinks are token positions in Transformer LLMs that accumulate an excessive share of self-attention regardless of semantic content.
They influence model behavior by controlling information flow, preventing rank collapse, and enabling efficient compression and redundancy management.
Attention sinks offer practical benefits for robust quantization, structured pruning, and architectural improvements, enhancing model stability and interpretability.

Attention sinks in Transformer-based LLMs are token positions (most commonly the first token) that consistently attract a disproportionately large share of the self-attention mass in many heads and layers, independent of semantic content. Empirically, these sinks are not mere artifacts but serve as structural elements with implications for information flow, redundancy, compression, quantization robustness, and interpretability across model scales and tasks.

1. Mathematical Definition and Empirical Detection

An attention sink position $k$ in layer $\ell$, head $h$ is characterized by the mean attention mass

$\mathrm{sink\text{-}score}_k^{(\ell,h)} = \frac{1}{T} \sum_{i=1}^T \alpha_{ik}^{(\ell,h)},$

where $\alpha_{ik}^{(\ell,h)}$ is the softmax attention from query $i$ to key $k$. When $\mathrm{sink\text{-}score}_k^{(\ell,h)} > \tau$ for a threshold $\tau$ (commonly $0.3$), head $h$ is designated a sink head for position $k$ (Barbero et al., 3 Apr 2025, Sun et al., 5 Mar 2026, Queipo-de-Llano et al., 7 Oct 2025). Aggregating over all heads and layers, the "sink ratio" is defined as the average fraction of heads that concentrate at least $\tau$ of their attention on some fixed position, typically the first (BOS) token. Detection involves forward-passing sequences, extracting attention maps, and thresholding average per-key scores (Sun et al., 5 Mar 2026).
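The detection procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not the cited authors' code: the function names (`sink_scores`, `sink_positions`, `sink_ratio`) and the toy attention maps are assumptions made for the example, and a real pipeline would read the attention maps from a model's forward pass.

```python
from typing import List

Matrix = List[List[float]]  # attention[i][k]: softmax weight from query i to key k


def sink_scores(attention: Matrix) -> List[float]:
    """Mean attention mass received by each key position k (column mean)."""
    num_queries = len(attention)
    num_keys = len(attention[0])
    return [
        sum(row[k] for row in attention) / num_queries
        for k in range(num_keys)
    ]


def sink_positions(attention: Matrix, tau: float = 0.3) -> List[int]:
    """Key positions whose sink score exceeds the threshold tau."""
    return [k for k, score in enumerate(sink_scores(attention)) if score > tau]


def sink_ratio(heads: List[Matrix], position: int = 0, tau: float = 0.3) -> float:
    """Fraction of heads whose sink score at `position` exceeds tau."""
    flagged = sum(1 for head in heads if sink_scores(head)[position] > tau)
    return flagged / len(heads)


# Toy causal attention maps (hypothetical data): each row sums to 1 and
# no query attends to a future key.
sink_head = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.7, 0.1, 0.1, 0.1],
]
local_head = [
    [1.0, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 0.1, 0.9],
]

print(sink_positions(sink_head))
print(sink_positions(local_head))
print(sink_ratio([sink_head, local_head]))
```

Under causal masking the first query can only attend to the first key, so position 0 always receives some mass. On short sequences this inflates its score, which is one reason to average over reasonably long inputs before thresholding.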

2. Mechanistic Origins: Interaction of Softmax, Norms, and Architecture

Attention sink formation is rooted in the mechanics of softmax normalization: softmax forces each query's attention weights to sum to one, so when no semantically relevant key is available, the unused mass is "dumped" onto a token that is causally visible to all queries, typically the first token. This effect is reinforced when the query and key vectors at these positions have large or atypical norms (Barbero et al., 3 Apr 2025, Zhang et al., 2 Feb 2025, Qiu et al., 30 Jan 2026). Spectral analyses of embedding/unembedding operators show that attention sinks reside in the subspace of right singular vectors with the smallest singular values ("dark directions"); signals placed here are effectively "invisible" to logits, allowing the residual stream to absorb attention mass without influencing next-token prediction (Cancedda, 2024).
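The "dumping" effect of the sum-to-one constraint can be illustrated with a toy attention row. This is only a sketch: the fixed `sink_bias` logit is a stand-in, assumed here for illustration, for whatever makes the sink key score consistently higher than uninformative keys (such as the atypical norms mentioned above), and the logit values are made up.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; the outputs always sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(content_logits: List[float], sink_bias: float) -> List[float]:
    """Attention weights over [sink, content keys...] for a single query."""
    return softmax([sink_bias] + content_logits)


# No content key is relevant: all content logits are low.
no_match = attention_row([-2.0, -2.0, -2.0], sink_bias=2.0)

# One content key is clearly relevant: its logit dominates the sink's.
match = attention_row([-2.0, 6.0, -2.0], sink_bias=2.0)

print(round(no_match[0], 3))  # mass on the sink when nothing matches
print(round(match[0], 3))     # mass on the sink when a key matches
```

In the first case the sink absorbs almost all of the mass, because the weights must sum to one and no other key competes. In the second, the relevant key takes over and the sink's share becomes small.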

Pre-norm architectures in particular facilitate the co-occurrence of attention sinks and massive activations (spikes): rare, high-magnitude features propagate through residuals and normalization in a way that spike tokens (BOS, special delimiters) develop near-constant hidden state vectors, further promoting their status as attention sinks (Sun et al., 5 Mar 2026, Queipo-de-Llano et al., 7 Oct 2025).

3. Functional Roles: Information Flow, Redundancy, and Sparsity

Attention sinks serve several distinct purposes:

Mixing control and collapse avoidance: Sinks throttle excessive mixing (over-mixing) that leads to rank collapse and representational collapse, preventing information from becoming indiscriminately entangled across positions (Barbero et al., 3 Apr 2025, Queipo-de-Llano et al., 7 Oct 2025).
Rescaling via outlier-driven normalization: Outlier tokens in attention sinks cooperate with attention and RMSNorm to rescale the magnitude of other (non-outlier) tokens/dimensions. The result is a more stable residual stream, integral to effective training and inference (Qiu et al., 30 Jan 2026).
Mixture-of-Experts (MoE) gating: The gating induced by attention sinks acts as an implicit Mixture-of-Experts mechanism, where a head's contribution is modulated by the attention mass routed to the sink position. This results in head inactivity (head collapse) for many heads, with potential for load balancing by explicit regularization (Fu et al., 1 Feb 2026).
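The gating interpretation can be made explicit with a short derivation. It is a sketch under one assumption that the text above does not state: the value vector at the sink position is close to zero. The symbols $s$ (sink position), $v_k$ (value vector at key $k$), and $o_i^{(\ell,h)}$ (head output at query $i$) are introduced here for illustration, and the split requires $\alpha_{is}^{(\ell,h)} < 1$:

$o_i^{(\ell,h)} = \sum_{k} \alpha_{ik}^{(\ell,h)} v_k = \alpha_{is}^{(\ell,h)} v_s + \left(1 - \alpha_{is}^{(\ell,h)}\right) \sum_{k \neq s} \tilde{\alpha}_{ik}^{(\ell,h)} v_k, \qquad \tilde{\alpha}_{ik}^{(\ell,h)} = \frac{\alpha_{ik}^{(\ell,h)}}{1 - \alpha_{is}^{(\ell,h)}}.$

The renormalized weights $\tilde{\alpha}_{ik}^{(\ell,h)}$ sum to one over the non-sink keys, so the second term is an ordinary attention average scaled by $1 - \alpha_{is}^{(\ell,h)}$. If $v_s \approx 0$, that factor behaves like a scalar gate on the head: it is close to zero when nearly all mass is routed to the sink, which matches the description of such heads as inactive, and close to one when the sink receives little attention.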

In multimodal transformers, similar visual attention sinks arise within background or uninformative regions, absorbing excessive attention mass, which can then be redistributed to enhance information flow in vision-language tasks (Kang et al., 5 Mar 2025).

4. Compression, Quantization, and Pruning Implications

Attention sinks are tightly connected to representational compression. The emergence of massive BOS activations coincides with low-rank singular-value spectra (compression valleys), forming bottleneck phases where features are highly linearly separable—embeddings perform optimally at intermediate (post-compression, pre-refinement) layers (Queipo-de-Llano et al., 7 Oct 2025).

For model compression, heads with high BOS sink scores are functionally redundant. Structured pruning that targets such heads can remove a substantial subset of attention parameters while preserving, or even improving, downstream performance, outperforming magnitude- or activation-based heuristics (Sok et al., 11 Jan 2026). Efficient quantization schemes rely on preserving the precision of sink tokens; failure to do so leads to activation outliers and performance collapse. Plug-and-play approaches, such as KVSink or CushionCache, identify and protect sink-token K/V entries in quantized inference, enabling ultra-low-bit precision with minimal degradation (Su et al., 6 Aug 2025, Son et al., 2024).
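A toy example shows why protecting sink-token entries can matter for low-bit K/V quantization. It is not the KVSink or CushionCache algorithm: the shared-scale symmetric quantizer, the helper names, and the cache values are all assumptions chosen to make the effect visible, with the sink modeled simply as one position with outlier magnitude.

```python
from typing import List, Sequence, Set

Vector = List[float]


def quantize_dequantize(values: Sequence[float], bits: int, scale_max: float) -> Vector:
    """Symmetric uniform quantization to `bits` bits, mapped back to floats."""
    levels = 2 ** (bits - 1) - 1
    step = scale_max / levels
    return [max(-levels, min(levels, round(v / step))) * step for v in values]


def quantize_cache(cache: List[Vector], bits: int, protected: Set[int]) -> List[Vector]:
    """Quantize a K or V cache with one shared scale, skipping protected positions."""
    kept = [v for t, row in enumerate(cache) if t not in protected for v in row]
    scale_max = max(abs(v) for v in kept)  # scale is set by the largest kept value
    return [
        list(row) if t in protected else quantize_dequantize(row, bits, scale_max)
        for t, row in enumerate(cache)
    ]


def max_error(cache: List[Vector], quantized: List[Vector], positions: List[int]) -> float:
    """Largest absolute reconstruction error over the given positions."""
    return max(abs(a - b) for t in positions for a, b in zip(cache[t], quantized[t]))


# Hypothetical cache: position 0 plays the sink token with outlier magnitude.
cache = [[40.0, -35.0], [0.5, -0.25], [0.1, 0.4], [-0.3, 0.2]]
others = [1, 2, 3]

naive = quantize_cache(cache, bits=4, protected=set())
guarded = quantize_cache(cache, bits=4, protected={0})

print(max_error(cache, naive, others))    # outlier inflates the shared scale
print(max_error(cache, guarded, others))  # sink kept in full precision
```

In the naive case the outlier sets the quantization step, so every ordinary entry rounds to zero. Keeping the single sink position in full precision lets the remaining entries use a much finer step.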

5. Pretraining, Dynamics, and Specialization

During pretraining, attention sinks and dormant (inactive) heads emerge rapidly, reaching substantial prevalence early but displaying dynamic transitions: heads may alternate between dormant (sink-focused, near-zero output) and active (semantically routed) modes based on input structure and training evolution (Sandoval-Segura et al., 4 Apr 2025, Guo et al., 2024). Tasks that induce symbol-heavy or structured inputs (e.g., code, lists) boost dormancy, while prose activates more heads. This active–dormant mechanism is not static; heads respond to both dataset and token statistics (Sandoval-Segura et al., 4 Apr 2025, Guo et al., 2024). "Catch–tag–release" describes how sinks not only aggregate attention but induce outlier features that label downstream tokens for later retrieval—a pattern necessary for tasks like subsequence averaging (Zhang et al., 2 Feb 2025).

6. Mitigation, Regularization, and Alternative Designs

Several strategies can mitigate pathological sink formation or harness it for improved performance:

Regularization and gating: Imposing load-balancing or gating losses during training activates dormant heads and distributes attention more equitably, ameliorating head collapse (Fu et al., 1 Feb 2026). Input-conditioned gates in the attention mechanism further reduce the need for implicit sinks (Sun et al., 5 Mar 2026).
Architectural choices: Swapping softmax for ReLU or bounded nonlinearities, or employing sandwich/post-norm architectures, decouples or attenuates massive activations and attention sinks without hurting performance (Sun et al., 5 Mar 2026, Guo et al., 2024).
Elastic normalization: Relaxing the softmax constraint (Elastic-Softmax) in attention eliminates forced mass allocation and enables sparse, sink-free attention patterns, yielding high sparsity with no loss in accuracy (Fu et al., 1 Jan 2026).
Strategic prefixing: Prefixing attention sinks (e.g., as artificial K/V cache entries) suppresses activation outliers and ensures quantization-friendly downstream activations (Son et al., 2024).
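The difference between a sum-to-one constraint and a relaxed one can be seen on a single row of attention scores. The sketch below compares softmax with a ReLU weighting scaled by sequence length. This is one possible ReLU variant chosen for illustration, not necessarily the formulation used in the cited works, and it is not Elastic-Softmax; the scores are made up.

```python
import math
from typing import List


def softmax_weights(scores: List[float]) -> List[float]:
    """Standard softmax: weights are positive and always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_weights(scores: List[float]) -> List[float]:
    """ReLU weights scaled by 1/n: negative scores get exactly zero weight."""
    n = len(scores)
    return [max(0.0, s) / n for s in scores]


# Hypothetical pre-activation scores for one query.
no_match = [-1.0, -2.0, -1.5, -3.0]   # no key is relevant
one_match = [-1.0, 4.0, -1.5, -3.0]   # key 1 is relevant

softmax_total = sum(softmax_weights(no_match))
relu_total = sum(relu_weights(no_match))
relu_match = relu_weights(one_match)

print(softmax_total)  # mass must go somewhere, even with no relevant key
print(relu_total)     # the head can simply output nothing
print(relu_match)     # only the relevant key receives weight
```

Because the ReLU row is allowed to sum to zero, the head has no leftover mass to park on a sink position, which is the intuition behind removing the constraint.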

7. Broader Impact on Interpretability and Model Design

The existence of attention sinks necessitates careful interpretation of attention maps: heads with persistent sink focus (especially on BOS) often yield negligible downstream contribution, misleading simplistic attribution. HONOR and BOS-sink criteria provide robust, model-agnostic ways to filter out such dead-weight heads for analysis and pruning (Sandoval-Segura et al., 4 Apr 2025, Sok et al., 11 Jan 2026). The three-phase "Mix–Compress–Refine" theory posits that LLM computation is organized into broad early mixing, a compressed bottleneck dominated by sinks, and late selective refinement. Embedding, classification, and retrieval tasks benefit from bottleneck phases, while text generation and reasoning rely on late refinement, which helps explain why different tasks perform best at different layer depths (Queipo-de-Llano et al., 7 Oct 2025).

Efficient deployment, compression, robustness against both quantization artifacts and sink-exploiting backdoor attacks (Shang et al., 19 Oct 2025), and interpretability in LLMs all depend on understanding, measuring, and, where necessary, controlling attention sinks. Ongoing research continues to explore how alternative normalization, gating, and sparsification schemes can replace or remove sink-induced behaviors with improved efficiency and transparency.