Attention Sinks & Compression Valleys
- Attention sinks and compression valleys are phenomena in transformer models characterized by massive activations drawing disproportionate attention and inducing low-entropy, rank-reduced representations.
- They arise from the softmax normalization in attention mechanisms, which concentrates focus on specific tokens and compresses information in intermediate layers across language, vision, and multimodal architectures.
- Understanding these intertwined behaviors informs strategies for model pruning, quantization, and efficient inference, ultimately optimizing performance and computational efficiency in diverse AI tasks.
Attention sinks and compression valleys are two recurring, deeply linked phenomena observed in transformer-based architectures for language, vision, multimodal, and generative models. Attention sinks refer to tokens or features that systematically attract a disproportionate share of the model's attention—often independent of semantic import—while compression valleys denote phases or regions in model depth or representation space where information is compressed into low-dimensional structures, typically accompanied by concentrated singular value spectra and reduced entropy. Recent research has revealed that both phenomena often originate from the formation of massive activations or outlier features and are fundamentally connected, shaping information flow, computational efficiency, and compression in modern large models.
1. Definitions and Mechanistic Origins
Attention sinks are tokens—initial (e.g., beginning-of-sequence [BoS]), punctuational, or structural markers—that consistently receive high attention weights across heads, layers, and modalities, even in the absence of strong semantic demand (Gu et al., 14 Oct 2024, Zhang et al., 2 Feb 2025, Yu et al., 22 Jun 2024). The phenomenon arises primarily due to the intrinsic coupling of scores enforced by softmax normalization in standard attention, as demonstrated by the equation
$$\alpha_{i,1} \;=\; \frac{\exp\!\big(q_i^{\top} k_1 / \sqrt{d}\big)}{\sum_{j \le i} \exp\!\big(q_i^{\top} k_j / \sqrt{d}\big)}, \qquad \sum_{j \le i} \alpha_{i,j} = 1,$$
where $\alpha_{i,1}$ is the attention from token $i$ to the first token: because each row of attention weights must sum to one, any mass assigned to the first token is necessarily withdrawn from all other tokens.
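To make the coupling concrete, the following minimal PyTorch sketch (the `sink_fraction` helper and the synthetic outlier key are illustrative choices, not taken from the cited papers) computes causal softmax attention weights and reports how much attention mass the first token absorbs once its key becomes an outlier:

```python
import torch

def causal_softmax_attention(q, k):
    """Standard scaled dot-product attention weights under a causal mask.

    q, k: (seq_len, d_head). Each output row sums to 1, so any attention
    given to the first token is necessarily taken away from the others.
    """
    d = q.shape[-1]
    scores = (q @ k.T) / d ** 0.5
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    return torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)

def sink_fraction(attn, sink_pos=0):
    """Average attention mass flowing to the sink position (e.g., BoS)."""
    return attn[sink_pos + 1:, sink_pos].mean().item()

torch.manual_seed(0)
q, k = torch.randn(16, 64), torch.randn(16, 64)
k[0] = 5.0 * q.mean(dim=0)  # outlier-like first key, loosely mimicking a massive activation
print(f"attention absorbed by token 0: {sink_fraction(causal_softmax_attention(q, k)):.2f}")
```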
Compression valleys are discrete model regions (often at intermediate depth) where the singular value spectrum of the layer’s residual stream exhibits pronounced anisotropy: almost all energy is concentrated in one or a few directions (Queipo-de-Llano et al., 7 Oct 2025). For a representation matrix $X \in \mathbb{R}^{T \times d}$ of $T$ token states, a “compression valley” is characterized mathematically by a dominant leading singular value $\sigma_1$ accompanied by a sharp decrease in the Shannon entropy $H(X) = -\sum_i p_i \log p_i$ computed over the normalized spectrum $p_i = \sigma_i^2 / \sum_j \sigma_j^2$.
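The valley can be diagnosed directly from a layer's residual stream. The sketch below (a minimal NumPy example; `spectral_stats` is our own helper name) computes the normalized singular spectrum of a token matrix, its Shannon entropy, and the dominant singular ratio used throughout this section:

```python
import numpy as np

def spectral_stats(X):
    """Spectral entropy and dominant singular ratio of a token matrix X (T, d).

    p_i = sigma_i^2 / sum_j sigma_j^2 is the normalized spectrum;
    H = -sum_i p_i log p_i drops inside a compression valley, and
    r = p_1 approaches 1 as the representation collapses toward rank 1.
    """
    sigma = np.linalg.svd(X, compute_uv=False)
    p = sigma**2 / np.sum(sigma**2)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return entropy, p[0]

T, d = 128, 768
X = np.random.randn(T, d)
X[0] *= 50.0                      # inject a massive token (e.g., BoS)
H, r = spectral_stats(X)
print(f"spectral entropy {H:.2f}, dominant singular ratio {r:.2f}")
```

Injecting a single massive row drives the entropy down and the dominant ratio toward 1, which is the numerical signature of a compression valley.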
The mechanisms underlying both are unified: when a token (typically BoS) accumulates a massive activation norm, two outcomes are observed (Queipo-de-Llano et al., 7 Oct 2025):
- Attention heads sink to the massive token, resulting in a valley in the attention allocation landscape.
- The representation matrix becomes nearly rank-1, compressing information (compression valley).
This connection is quantified by the bound
$$\frac{\sigma_1^2}{\sum_i \sigma_i^2} \;\ge\; \frac{M + \rho}{M + S},$$
where $M = \|x_{\text{sink}}\|^2$ is the squared norm of the massive token, $S = \sum_{t \neq \text{sink}} \|x_t\|^2$ the sum over all other tokens, and $\rho = \sum_{t \neq \text{sink}} \langle x_t, \hat{x}_{\text{sink}} \rangle^2$ encodes the alignment of the remaining tokens with the massive token’s direction.
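The bound can be verified numerically. The following sketch (our own check, reusing the notation above on a synthetic matrix with one massive row) computes $M$, $S$, and $\rho$ and confirms that the dominant singular ratio is at least $(M+\rho)/(M+S)$:

```python
import numpy as np

def sink_bound_terms(X, sink=0):
    """M, S, and alignment rho for the massive-token lower bound."""
    x_sink = X[sink]
    u = x_sink / np.linalg.norm(x_sink)          # unit vector along the massive token
    others = np.delete(X, sink, axis=0)
    M = float(x_sink @ x_sink)                   # squared norm of the massive token
    S = float(np.sum(others**2))                 # total squared norm of the rest
    rho = float(np.sum((others @ u) ** 2))       # energy of the rest along the sink direction
    return M, S, rho

X = np.random.randn(128, 768)
X[0] *= 50.0
sigma = np.linalg.svd(X, compute_uv=False)
r = sigma[0]**2 / np.sum(sigma**2)
M, S, rho = sink_bound_terms(X)
assert r >= (M + rho) / (M + S) - 1e-9           # the lower bound holds
print(f"r = {r:.3f} >= {(M + rho) / (M + S):.3f}")
```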
2. Empirical Manifestations Across Architectures
LLMs
Attention sinks appear universally during pre-training—even in models with tens of millions of parameters—emerging early once optimization effectively reduces training loss (Gu et al., 14 Oct 2024). Their position is determined by data distribution and loss function; e.g., BoS becomes the attention sink because it is the only token never penalized in next-token prediction. Massive activations propagate through the residual stream via architectural factors such as pre-norm residual connections (Queipo-de-Llano et al., 7 Oct 2025, Gu et al., 14 Oct 2024). Intermediate layers coincide with pronounced compression valleys as measured by a spike in the leading singular value and a drop in matrix entropy.
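As an illustration of how sink emergence is measured in practice, the sketch below assumes access to per-layer attention tensors of shape (heads, seq, seq), e.g., collected through forward hooks or an `output_attentions`-style flag; the helper name and the toy data are ours:

```python
import torch

def bos_sink_scores(attentions, sink_pos=0):
    """Per-layer fraction of attention mass on the sink position.

    `attentions` is a list of tensors with shape (heads, seq, seq), one per
    layer. A sharp rise of these scores at intermediate depth is the signature
    of an attention sink co-occurring with the compression valley.
    """
    scores = []
    for layer_attn in attentions:
        # average over heads and over query positions (skip the sink's own row)
        mass = layer_attn[:, sink_pos + 1:, sink_pos].mean()
        scores.append(mass.item())
    return scores

# toy example: random attention maps for a 12-layer, 8-head, 32-token model
fake_attn = [torch.softmax(torch.randn(8, 32, 32), dim=-1) for _ in range(12)]
for layer, s in enumerate(bos_sink_scores(fake_attn)):
    print(f"layer {layer:2d}: attention to BoS = {s:.3f}")
```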
Vision and Multimodal Transformers
Massive “sink” tokens (“CLS” for vision, high-norm ViT tokens in VLMs) are dominant in mid-to-late transformer layers (Lu et al., 21 Jul 2025, Luo et al., 9 Oct 2025). In LVLMs, ViT attention sinks correspond to high-level semantic elements (e.g., main scene objects), and their propagation into the LLM reasoning stage enhances global reasoning. Mutually suppressive behavior among massive and artifact tokens further structures both the attention and the value space (Lu et al., 21 Jul 2025).
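A simple way to locate candidate ViT sinks is by hidden-state norm. The sketch below flags tokens whose norm is a z-score outlier within a layer; the threshold of 3 is an illustrative assumption rather than a value from the cited papers:

```python
import torch

def high_norm_tokens(hidden, z_thresh=3.0):
    """Indices of tokens whose norm is a z-score outlier in a (T, d) layer output."""
    norms = hidden.norm(dim=-1)
    z = (norms - norms.mean()) / (norms.std() + 1e-6)
    return torch.nonzero(z > z_thresh).flatten().tolist()

hidden = torch.randn(197, 768)        # e.g., a ViT layer output: CLS + 196 patch tokens
hidden[0] *= 20.0                     # a massive CLS-like token
hidden[57] *= 15.0                    # a high-norm patch token (artifact/sink candidate)
print("candidate sink tokens:", high_norm_tokens(hidden))
```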
Generative and Compression Models
In learned image compression and diffusion transformers, local spatial or graph-structured attention mechanisms (e.g., k-NN attention, windowed attention) can avoid formation of redundant attention sinks by restricting token neighborhoods, which deepens “compression valleys” in the rate-distortion landscape (Spadaro et al., 3 Oct 2024, Yuan et al., 12 Jun 2024, Zhang et al., 28 Mar 2025). In streaming and dialogue models, special “end-of-utterance” tokens or retained BoS tokens act as sinks that anchor history, enabling scalable memory usage at little cost in performance (Xiao et al., 2023, Li et al., 13 Mar 2024).
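The sketch below illustrates the general principle of restricting attention to a local neighborhood (as in windowed attention), which prevents every query from reaching a single global sink token; it is a generic illustration rather than the specific mechanism of any cited codec or diffusion model:

```python
import torch

def windowed_attention(q, k, v, window=8):
    """Scaled dot-product attention where each query only attends to keys within
    `window` positions of itself, so no single token can absorb global attention."""
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    idx = torch.arange(T)
    local = (idx[:, None] - idx[None, :]).abs() <= window
    scores = scores.masked_fill(~local, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(64, 32) for _ in range(3))
out = windowed_attention(q, k, v, window=4)
print(out.shape)  # torch.Size([64, 32])
```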
3. Theoretical Analysis and Unified Frameworks
The theoretical connection between attention sinks and compression valleys is formalized by lower bounds on the leading singular value and on the entropy reduction induced by the emergence of massive activations (Queipo-de-Llano et al., 7 Oct 2025):
$$\sigma_1^2 \;\ge\; M + \rho, \qquad \text{leading to} \qquad H(X) \;\le\; -\,r \log r \;-\; (1 - r)\log\frac{1-r}{T-1},$$
with $r = \sigma_1^2 / \sum_i \sigma_i^2 \ge (M+\rho)/(M+S)$ the dominant singular ratio.
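As a worked step (our reconstruction from elementary properties of the spectral entropy, not a verbatim quotation of the paper), the entropy bound follows because, once the leading spectral weight $r$ is fixed, entropy is maximized by spreading the remaining mass uniformly over the other $T-1$ spectral weights:

```latex
\begin{align*}
p_1 &= r \;=\; \frac{\sigma_1^2}{\sum_i \sigma_i^2} \;\ge\; \frac{M+\rho}{M+S},\\
H(X) &= -\sum_i p_i \log p_i \;\le\; -r\log r \;-\; (1-r)\log\frac{1-r}{T-1},
\end{align*}
% With p_1 = r fixed, the tail entropy -\sum_{i\ge 2} p_i \log p_i is maximized
% when the remaining mass 1-r is uniform over the other T-1 spectral weights;
% as r -> 1 (a single massive token dominates), the bound forces H(X) -> 0.
```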
The Mix-Compress-Refine theory posits three depthwise computation phases (Queipo-de-Llano et al., 7 Oct 2025); a heuristic layer-labeling sketch follows the list below:
- Broad mixing: Early layers diffuse information across tokens. No dominant norms.
- Compressed computation: Intermediate layers produce massive activations and attention sinks, yielding rank-reduced representation and compressed entropy (“valley”).
- Selective refinement: Late layers re-equalize norms and redistribute attention for detailed or token-specific generation.
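One rough way to see the three phases empirically is to label layers from their depthwise spectral-entropy profile (e.g., computed with the `spectral_stats` helper sketched in Section 1). The sketch below is a heuristic diagnostic of ours, with an arbitrary 25% threshold, not the paper's formal phase definition:

```python
import numpy as np

def label_phases(entropies, frac=0.25):
    """Heuristically label layers as mix / compress / refine.

    Layers whose spectral entropy sits in the lowest `frac` of the observed
    range are tagged 'compress' (the valley); layers before the valley are
    'mix', layers after it are 'refine'.
    """
    e = np.asarray(entropies)
    cutoff = e.min() + frac * (e.max() - e.min())
    valley = np.where(e <= cutoff)[0]
    labels = []
    for layer in range(len(e)):
        if layer < valley.min():
            labels.append("mix")
        elif layer <= valley.max():
            labels.append("compress")
        else:
            labels.append("refine")
    return labels

# toy per-layer entropy profile with a mid-depth valley
profile = [4.1, 4.0, 3.8, 2.1, 1.2, 1.1, 1.4, 2.6, 3.5, 3.9]
print(label_phases(profile))
```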
Compression valleys facilitate separation of embedding versus generative utility: static embeddings are optimal at the valley due to their high linear separability; generative performance requires full-depth passage through the refinement phase.
4. Consequences for Compression and Efficiency
Model Pruning, Quantization, and Streaming
Attention sinks can act as “key biases” independent of value computation, so pruning or quantization techniques that fail to preserve the structures associated with sink and outlier features often degrade performance, producing “compression valleys” in the accuracy-versus-compression trade-off (Zhang et al., 2 Feb 2025, Gholami et al., 7 Mar 2025, Su et al., 6 Aug 2025). Empirically, methods that retain a sparse-plus-low-rank decomposition of the QKV weights (e.g., OATS) maintain the catch, tag, and release mechanism underpinning few-shot and averaging tasks, whereas dense pruning erases attention sinks and degrades function (Zhang et al., 2 Feb 2025). In KV cache quantization, preserving tokens identified as sinks is critical for mitigating precision loss and performance drops (Su et al., 6 Aug 2025).
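To illustrate the general sparse-plus-low-rank idea in the spirit of OATS, the sketch below uses a simplified alternating projection of ours (not the published algorithm): a weight matrix is split into a low-rank term, which can retain the outlier/sink subspace, and a sparse residual that keeps the largest remaining weights:

```python
import numpy as np

def sparse_plus_low_rank(W, rank=8, sparsity=0.9, n_iter=20):
    """Approximate W ~= L + S with L low-rank and S sparse (largest entries kept)."""
    L = np.zeros_like(W)
    S = np.zeros_like(W)
    for _ in range(n_iter):
        # low-rank step: truncated SVD of the residual W - S
        U, sig, Vt = np.linalg.svd(W - S, full_matrices=False)
        L = (U[:, :rank] * sig[:rank]) @ Vt[:rank]
        # sparse step: keep only the largest-magnitude entries of W - L
        R = W - L
        thresh = np.quantile(np.abs(R), sparsity)
        S = np.where(np.abs(R) >= thresh, R, 0.0)
    return L, S

W = np.random.randn(256, 256)
W[:, 0] += 10.0                     # an outlier column, as produced by a sink feature
L, S = sparse_plus_low_rank(W)
print("relative reconstruction error:", np.linalg.norm(W - L - S) / np.linalg.norm(W))
```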
Efficient Inference and Latency Optimization
Strategies such as windowed or streaming attention exploit and preserve attention sinks—e.g., maintaining the first several KV states as anchor points—enabling constant or sublinear memory/compute scaling in sequence length (Xiao et al., 2023, Li et al., 13 Mar 2024). StreamingLLM demonstrates that by designating and maintaining dedicated sink tokens (either persisted or learned at pre-training), it is possible to process millions of tokens with stable perplexity and minimal decoding latency impact (Xiao et al., 2023).
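A minimal sketch of the sink-preserving eviction policy described above, assuming a simple list-of-tensors KV cache; the class is illustrative and not StreamingLLM's actual implementation:

```python
import torch

class SinkWindowKVCache:
    """Keep the first `n_sink` KV pairs (attention sinks) plus a sliding
    window of the most recent `window` pairs; evict everything in between."""

    def __init__(self, n_sink=4, window=1024):
        self.n_sink, self.window = n_sink, window
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.n_sink + self.window:
            # evict the oldest non-sink entry; sink tokens at the front are never dropped
            del self.keys[self.n_sink]
            del self.values[self.n_sink]

    def get(self):
        return torch.stack(self.keys), torch.stack(self.values)

cache = SinkWindowKVCache(n_sink=4, window=8)
for t in range(100):
    cache.append(torch.randn(64), torch.randn(64))
K, V = cache.get()
print(K.shape)  # torch.Size([12, 64]): 4 sink tokens + 8 recent tokens
```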
Sparse and Structured Attention Mechanisms
Approaches such as Native Sparse Attention and graph-based attention selectively control which tokens may act as sinks or anchors, dynamically modulating attention allocation and avoiding excessive information compression in local valleys, thus retaining both efficiency and context fidelity (especially in long-video and high bit-rate generative tasks) (Song et al., 2 Oct 2025, Spadaro et al., 3 Oct 2024).
5. Task-Dependent and Modal Manifestations
- Language Tasks: Sinks at BoS or punctuation may be adaptive (for streaming or few-shot contexts) or detrimental (diluting attention from semantically critical tokens). Calibration techniques like ACT can suppress excessive or undesired sinks at inference to increase accuracy without retraining (Yu et al., 22 Jun 2024).
- Vision and VLMs: In Vision Transformers, high-norm “ViT sinks” strongly influence downstream reasoning when propagated into the LLM, encoding global or high-level semantics. Selective reordering (“sink-to-the-front”) or dual-projection architectures further permit task-aware leveraging of sink contributions (Luo et al., 9 Oct 2025).
- Compression Tasks: Graph-based attention blocks and sparsified attention lead to steep compression valleys (marked entropy minima), allowing higher bit allocation to critical features while suppressing redundancy (Spadaro et al., 3 Oct 2024, Yuan et al., 12 Jun 2024).
- Retrieval and Multimodal Generation: Adaptive compression algorithms (e.g., AttnComp’s Top-P strategy) exploit attention sinks to guide context filtering, retaining only segments with cumulative attention above relevance thresholds. Valleys in the compression profile reflect concentrated information after irrelevant content is excised (Luo et al., 22 Sep 2025); a minimal Top-P filtering sketch follows this list.
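The sketch below is a simplified rendering of Top-P-style context filtering driven by attention mass (our illustration of the idea behind AttnComp, not its actual implementation): segments are ranked by attention and retained until cumulative relevance first reaches the threshold:

```python
import numpy as np

def top_p_segments(segment_scores, p=0.8):
    """Return indices of the highest-attention segments whose cumulative
    (normalized) attention mass first reaches `p`; the rest are dropped."""
    scores = np.asarray(segment_scores, dtype=float)
    probs = scores / scores.sum()
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p) + 1)]
    return sorted(keep.tolist())

# per-segment attention mass from the query over retrieved context chunks
scores = [0.02, 0.40, 0.05, 0.30, 0.03, 0.20]
print(top_p_segments(scores, p=0.8))   # e.g., [1, 3, 5] -> keep the three dominant segments
```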
6. Practical Guidance and Future Implications
- Compression and pruning methods must preserve low-rank and outlier-subspace terms that underlie attention sink formation to avoid dramatic drops in reasoning or few-shot accuracy (Zhang et al., 2 Feb 2025).
- Dynamic detection and calibration of sinks—during both training and inference—offer a means to boost model accuracy and efficiency without retraining (Yu et al., 22 Jun 2024, Su et al., 6 Aug 2025).
- The identification of connections between sink behavior and compression valleys informs new directions for phase-aware early exiting, hardware-optimized attention, and modally adaptive architectures. For embedding extraction, intermediate compressed representations should be favored, while for generative or token-specific tasks, late-layer refinement is necessary (Queipo-de-Llano et al., 7 Oct 2025).
- Altering the attention kernel normalization (e.g., sigmoid rather than softmax) may mitigate emergent sink formation (Gu et al., 14 Oct 2024); however, the impact on downstream performance and model expressivity remains a subject of ongoing research. A minimal softmax-versus-sigmoid comparison is sketched after this list.
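The contrast between the two normalizations can be seen in a few lines (an illustrative comparison only; practical sigmoid-attention variants add bias and scale corrections not shown here): softmax couples each row's scores so that mass given to the first token is drained from the rest, whereas elementwise sigmoid scores positions independently and leaves the remaining tokens' weights untouched:

```python
import torch

def attn_weights(q, k, kind="softmax"):
    """Causal attention weights under softmax vs. elementwise sigmoid scoring."""
    d = q.shape[-1]
    scores = (q @ k.T) / d ** 0.5
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    if kind == "softmax":
        return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
    # sigmoid: each query-key pair is scored independently (no sum-to-one coupling)
    return torch.sigmoid(scores).masked_fill(mask, 0.0)

torch.manual_seed(0)
q, k = torch.randn(16, 64), torch.randn(16, 64)
k[0] = 5.0 * q.mean(dim=0)            # outlier-like first key aligned with typical queries
for kind in ("softmax", "sigmoid"):
    w = attn_weights(q, k, kind)
    on_sink = w[1:, 0].mean().item()
    on_rest = w[1:, 1:].sum(dim=-1).mean().item()   # total weight left for non-sink tokens
    print(f"{kind:8s}: weight on token 0 = {on_sink:.2f}, weight left for the rest = {on_rest:.2f}")
```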
7. Summary Table: Key Connections
| Phenomenon | Mechanistic Cause | Impact | Mitigation/Exploitation |
|---|---|---|---|
| Attention sink | Massive activations (e.g., BoS) | Disproportionate attention, key-bias effect | Sink-token preservation/calibration |
| Compression valley | Dominant singular spectrum | Low entropy, low-rank subspace, information selection | Low-rank preservation, task-aware selection |
| Outlier features | Catch, tag, and release via sink | Enables functional “averaging”/reasoning | Sparse-plus-low-rank compression |
| KV quantization error | Sinks stretch quantization range | Performance drops in low-bit regimes | Dynamic sink detection (KVSink) |
Attention sinks and compression valleys are deeply intertwined phenomena rooted in the formation and propagation of massive activations within deep transformers. Recognizing and leveraging their connection enables efficiency, compression, and improved reasoning capability across a variety of model families and modalities, while also revealing crucial strategies for future architecture design and model deployment.