Attention Sink Suppression in Transformers

Updated 2 December 2025
  • Attention sinks are a phenomenon in which transformer models allocate excess attention mass to semantically irrelevant tokens due to normalization constraints, affecting language, vision, and multimodal tasks; attention sink suppression refers to the family of techniques for mitigating this behavior.
  • Empirical analyses reveal that methods such as sigmoid attention, rectified softpick functions, and gating mechanisms can dramatically lower sink rates, stabilize activations, and enhance model quantization.
  • Architectural adaptations including dedicated KV caching, encoder-decoder separations, and spectral filtering techniques effectively redistribute attention, ensuring robust performance across diverse applications.

Attention sink suppression encompasses theoretical, algorithmic, architectural, and practical advances in mitigating the persistent allocation of attention mass to tokens (or spatial locations) that serve no semantic purpose but absorb excess probability due to constraints of normalization in transformer models. The phenomenon is observed across core LLMs, vision transformers, state-space models, multimodal architectures, and even in cognitive and developmental neuroscience.

1. Theoretical Origins of Attention Sinks

In standard transformer-based architectures, self-attention normalizes the dot-product of queries and keys across the sequence so that attention weights sum to one:

$$a_{t,i} = \mathrm{softmax}\Big(\frac{q_t \cdot k_i}{\sqrt{d}}\Big), \quad \sum_i a_{t,i} = 1$$

This normalization mandates that when queries contain no strong informational preference, the resulting probability mass must be "dumped" somewhere, empirically into one or more tokens that act as attention sinks. In autoregressive LLMs (e.g., Llama-2, MPT, Falcon), layers from index two onward allocate the bulk of attention to the initial (often BOS) token, independent of its semantics or positional distance, due to its persistent visibility during all decoding steps (Xiao et al., 2023).

This phenomenon is not limited to language; in ViTs the [CLS] token attracts most attention mass, distorting feature integration among image patches (Feng et al., 9 Apr 2025). In state-space models, artificial anchors mimic sink behavior to stabilize recurrences (Meng et al., 1 Aug 2024). Sink emergence is shown to be a geometric effect: the first token's key/projected vector lies on a separable manifold and achieves maximized cosine similarity with most queries, giving rise to anomalously large softmax logits and persistent sink allocation (Gu et al., 14 Oct 2024).
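
This behavior is easy to observe directly. The sketch below, assuming a Hugging Face causal LM that can return attention weights (the model name, prompt, and eager-attention setting are illustrative), prints the average attention mass each layer assigns to the first token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model and prompt are illustrative; any small causal LM that can return
# attention weights (eager attention implementation) will show the effect.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, N, N) tensor per layer.
for layer, attn in enumerate(out.attentions):
    mass_on_first = attn[0, :, 1:, 0].mean().item()   # mean mass sent to token 0
    print(f"layer {layer:2d}: mean attention to first token = {mass_on_first:.3f}")
```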

2. Empirical Characterization and Measurement

Attention sinks are rigorously quantified by aggregating the attention weight from all positions to the candidate sink token(s). Formally, for token $j$ in layer $l$, with $H$ attention heads, sequence length $N$, and per-head attention matrix $A_h^l$:

$$\alpha_j^l = \frac{1}{H(N-j+1)} \sum_{h=1}^{H} \sum_{k=j}^{N} A_h^l[k,j]$$

A high $\alpha_j^l$ over tokens or patch indices signals sink behavior; thresholds (e.g., $\varepsilon_s = 0.3$ or $0.6$) are used to compute the "sink rate," i.e., the fraction of attention heads exhibiting sink behavior (Zuhri et al., 29 Apr 2025).
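
A minimal sketch of this metric under 0-based indexing, assuming the attention weights for one layer are available as a tensor of shape (H, N, N); the threshold default mirrors the value quoted above.

```python
import torch

def sink_score(attn: torch.Tensor, j: int = 0) -> float:
    """alpha_j for one layer: average attention mass received by token j
    from all heads and all query positions k >= j.

    attn: (H, N, N) attention weights, rows summing to 1.
    """
    H, N, _ = attn.shape
    return (attn[:, j:, j].sum() / (H * (N - j))).item()

def sink_rate(attn: torch.Tensor, j: int = 0, eps: float = 0.3) -> float:
    """Fraction of heads whose average attention to token j exceeds eps."""
    per_head = attn[:, j:, j].mean(dim=1)      # (H,)
    return (per_head > eps).float().mean().item()
```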

Massive activations—outlier hidden-state values—are tightly coupled: sink tokens display large magnitudes in a structured subset of coordinates (often layer-dependent), producing activation kurtosis orders of magnitude beyond non-sink tokens (Zuhri et al., 29 Apr 2025, Anand et al., 26 Oct 2025). In quantization and streaming deployment, these activation outliers induce dynamic range blowups, forcing specialized schemes for KV-cache management (Su et al., 6 Aug 2025).

3. Algorithmic Suppression Strategies

Suppression approaches fall into architectural modification, functional replacement, regularization, and dynamic masking:

A. Dedicated Sink Tokens and KV Cache Engineering

  • StreamingLLM caches the most recent $L$ tokens plus a fixed set of $S$ initial tokens (or a learned placeholder), concatenating them as $[\text{sink\_KV}, \text{window\_KV}]$ and computing attention on this compact set. This stabilizes normalization and restores full-context perplexity at $O(L)$ token cost (Xiao et al., 2023); see the sketch after this list.
  • In quantization, KVSink dynamically detects sink tokens via stable cross-layer activation outliers and preserves their full-precision KVs. This dramatically reduces quantization error, outperforming static "preserve the first $N$ tokens" strategies (Su et al., 6 Aug 2025).
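
A minimal sketch of the sink-plus-window cache policy described in the first bullet above; the cache representation and the parameter names num_sink and window are illustrative, not StreamingLLM's actual API.

```python
def evict_kv_cache(cache, num_sink=4, window=2048):
    """Keep the first num_sink entries (the attention sinks) plus the most
    recent `window` entries, dropping everything in between.

    cache: list of per-token (key, value) pairs, ordered oldest to newest.
    """
    if len(cache) <= num_sink + window:
        return cache
    sink_kv = cache[:num_sink]        # always-visible initial tokens
    window_kv = cache[-window:]       # rolling recent context
    # (StreamingLLM additionally reassigns positions within the cache; omitted here.)
    return sink_kv + window_kv
```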

B. Normalization-Relaxation and Activation Rectification

  • Sigmoid attention, $S_{ij} = \sigma(Q_i K_j^\top + b_j)$, removes the sum-to-one constraint and thus inter-token competition, effectively setting the sink rate to $\approx 2\%$ vs. $45\%$ under softmax (Gu et al., 14 Oct 2024).
  • Softpick replaces softmax with a rectified function:

$$\text{Softpick}(x)_i = \frac{\text{ReLU}(e^{x_i} - 1)}{\sum_j |e^{x_j} - 1|}$$

This renders up to 47% of attention entries exactly zero, drops the sink rate to 0%, and suppresses activation outliers, without loss in LLM task accuracy or perplexity, and with enhanced robustness to low-bit quantization (Zuhri et al., 29 Apr 2025).
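
A minimal sketch of the softpick function as defined above, written as a drop-in replacement for softmax over the key dimension; the epsilon guard and the note on masked positions are assumptions for numerical robustness, not part of the published formulation.

```python
import torch

def softpick(scores: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    """ReLU(e^x - 1) / sum_j |e^{x_j} - 1|.

    Outputs need not sum to 1 and can be exactly zero, removing the
    pressure to dump leftover probability mass onto a sink token.
    """
    ex = torch.expm1(scores)                       # e^x - 1, stable near 0
    num = torch.relu(ex)
    den = ex.abs().sum(dim=dim, keepdim=True)
    # Note: with causal masks, masked logits should be excluded from `den`
    # rather than set to -inf, since |e^{-inf} - 1| = 1.
    return num / (den + eps)
```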

C. Regularization and Feature Decorrelation

  • In multimodal and audio-visual LLMs, a decorrelation loss penalizes cosine similarity between BOS and other tokens' hidden states:

$$\mathcal{L}_{\text{decor}} = \frac{1}{(N-1)(L-2)} \sum_{l=2}^{L-1} \sum_{i=1}^{N-1} \left[\text{cos-sim}\big(H^l[i], H^l[0]\big)\right]^2$$

This disbands shared attractor directions and mitigates both attention sinks and massive activations (Anand et al., 26 Oct 2025).
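
A minimal sketch of this penalty, assuming hidden_states is a list of per-layer tensors of shape (N, d) with the BOS hidden state at index 0; the choice of which layers count as "intermediate" is illustrative.

```python
import torch
import torch.nn.functional as F

def decorrelation_loss(hidden_states):
    """Mean squared cosine similarity between each token's hidden state and
    the BOS hidden state, averaged over intermediate layers.

    hidden_states: list of length L of tensors with shape (N, d).
    """
    L = len(hidden_states)
    layers = range(2, L - 1)                   # intermediate layers (illustrative)
    loss = hidden_states[0].new_zeros(())
    for l in layers:
        H = hidden_states[l]                               # (N, d)
        sims = F.cosine_similarity(H[1:], H[0:1], dim=-1)  # (N-1,)
        loss = loss + (sims ** 2).mean()
    return loss / max(len(layers), 1)
```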

D. Gating and Sparsity

  • A sigmoid gate applied after scaled dot-product attention introduces query-dependent sparse masking: $Y' = Y \odot \sigma(XW_\theta)$. This reduces the average attention sink mass F-Attn from $0.467$ to $0.048$ and lowers outlier amplitudes by an order of magnitude (Qiu et al., 10 May 2025).
  • Weak Attention Suppression (WAS) in speech recognition dynamically zeros out low-probability entries below a query-dependent threshold, sparsifying attention and suppressing sinks in highly correlated frame sequences (Shi et al., 2020).
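
A minimal sketch of the post-attention sigmoid gate from the first bullet above, for a single head; the module structure and parameter names are illustrative rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Scaled dot-product attention followed by a query-dependent
    elementwise sigmoid gate: Y' = Y * sigmoid(X @ W_theta)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)   # W_theta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        y = F.scaled_dot_product_attention(self.q(x), self.k(x), self.v(x))
        return y * torch.sigmoid(self.gate(x))    # suppress sink-driven mass
```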

4. Architectural Reframing and Modal Extensions

Suppression is not confined to token-level mechanisms; architectural changes can eliminate attention sinks by design:

  • Encoder-decoder separation (EDIT in vision transformers): The decoder attends only to patch outputs at each layer rather than a global [CLS] token, abrogating sink accumulation and redistributing attention to relevant regions, leading to more interpretable attention maps and higher classification/segmentation performance (Feng et al., 9 Apr 2025).
  • Grouped FIR filtering and sink injection in SSMs: Learnable prompt sinks act as stability anchors for state recurrences, with echo-injection matrices ensuring no group decays to zero (Meng et al., 1 Aug 2024).
  • Visual Attention Redistribution (VAR) in multimodal models: Surplus attention allocated to irrelevant visual sink tokens is recycled within image-centric heads, focusing attention budget onto semantically relevant patches. Masking out sink tokens leaves performance unchanged, verifying their non-essential nature (Kang et al., 5 Mar 2025).

5. Spectral Filtering and Parameter-Efficient Suppression

Recent work links the implementation of sink behavior to the tail end of the singular vector spectrum (the "dark signals") of embedding and unembedding matrices. Suppression via spectral filtering is realized as:

$$P_u^{(\text{tail})} = \sum_{b=b_0}^{k} P_u^{(b)}$$

Zeroing these dark components eliminates pathological sink allocation at the cost of higher loss. Sink-preserving "band-pass" filters retain just head and tail singular vectors, achieving a favorable trade-off between quality and sink suppression (Cancedda, 14 Feb 2024).
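
A minimal sketch of removing the "dark" tail subspace from hidden states, assuming the projector is built from the right singular vectors of the unembedding matrix with the smallest singular values; the number of tail dimensions is illustrative.

```python
import torch

def tail_projector(W_u: torch.Tensor, tail_dims: int = 64) -> torch.Tensor:
    """Projector onto the span of the tail_dims right singular vectors of
    the unembedding matrix W_u (vocab, d) with smallest singular values."""
    # torch.linalg.svd returns singular values in descending order.
    _, _, Vh = torch.linalg.svd(W_u, full_matrices=False)
    V_tail = Vh[-tail_dims:]               # (tail_dims, d)
    return V_tail.T @ V_tail               # (d, d) projector

def filter_dark_signals(h: torch.Tensor, P_tail: torch.Tensor) -> torch.Tensor:
    """Zero the tail ('dark signal') component of hidden states h."""
    return h - h @ P_tail
```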

Low-rank parameterizations in QK, KV, or value matrices support catch–tag–release mechanisms for dynamic subsequence selection; pruning these structures (as in OATS approaches) directly suppresses sinks and outlier features (Zhang et al., 2 Feb 2025). However, wholesale removal catastrophically degrades downstream capabilities, necessitating careful regularization or explicit bias replacement.

6. Impact, Limitations, and Cross-Modal Generalization

Attention sink suppression enables stable long-context streaming (up to 4M+ tokens without retraining) (Xiao et al., 2023), robust low-precision quantization (Zuhri et al., 29 Apr 2025, Su et al., 6 Aug 2025), reliable visual and multimodal grounding (Feng et al., 9 Apr 2025, Kang et al., 5 Mar 2025), improved training stability, and consistent Word Error Rate reductions for speech recognition models under high compression (Anand et al., 26 Oct 2025, Shi et al., 2020).

Diffusion LLMs have moving, context-dependent sinks that, when masked, lead to only minor performance drops—revealing fundamental differences from autoregressive models, which collapse when their sinks are suppressed (Rulli et al., 17 Oct 2025).

A significant limitation is the trade-off between removing sinks (and their associated outlier mass) and preserving relevant bias or selection mechanisms; efficient suppression often requires model architectural changes, precise spectral filtering, or fine-tuned regularization weights. Sink suppression is essential when precision, interpretability, or context extrapolation is paramount.

7. Developmental, Cognitive, and Neurobiological Parallels

In cognitive neuroscience, surround suppression (sometimes labeled "sink suppression") delineates the active inhibition of stimuli adjacent to the focus of attention, manifesting as reduced discrimination accuracy near attended targets. This effect emerges reliably only after age 12 and is tightly linked to the maturation of top-down networks (fronto-parietal projections) and hierarchical selective tuning (Wong-Kee-You et al., 2018). The selective-tuning model encodes pass/suppress recursions that mirror the annular sink fields in transformer architectures: focal winner-take-all (WTA) selection and annular surround inhibition.

Table: Suppression Strategies and Quantitative Impact

| Paper / Approach | Core Suppression Mechanism | Sink Rate or Metric |
|---|---|---|
| (Xiao et al., 2023) StreamingLLM | Window + sink KV cache; learnable placeholder | PPL matches full recompute |
| (Gu et al., 14 Oct 2024) Sigmoid Attention | Remove normalization; sigmoid kernel | Sink rate ≈ 2% (vs 45%) |
| (Zuhri et al., 29 Apr 2025) Softpick (rectified attn) | ReLU(e^x - 1) / abs sum; sum ≠ 1 | 0% sink rate; 47% sparsity |
| (Qiu et al., 10 May 2025) SDPA + Sigmoid Gate | Query-dependent gating post-SDPA | F-Attn 0.048 (vs 0.467) |
| (Su et al., 6 Aug 2025) KVSink | Outlier detection at emergence layer | ~12.6% PPL reduction |
| (Anand et al., 26 Oct 2025) Cosine decorrelation | Penalize alignment to BOS direction | WER: ~11% drop (VSR, 5x compression) |
| (Feng et al., 9 Apr 2025) EDIT (ViT) | Encoder/decoder separation; layer-wise attn | +1.9% ImageNet-1k accuracy |
| (Cancedda, 14 Feb 2024) Spectral filtering | Drop tail singular-vector subspace | NLL +0.3 increase (mid-band) |

Attention sink suppression thus integrates advances in normalization, gating, spectral filtering, specialized cache engineering, and architectural reframing. These developments underpin efficient, high-quality, interpretable, and scalable sequence modeling in transformers and state-space models, with established impact across language, vision, speech, and multimodal domains.
