
Stable Sink Token in Transformers

Updated 1 February 2026
  • Stable sink token is a persistent attractor in transformer attention, providing a fixed geometric reference with high cosine similarity across layers.
  • It emerges from architectural mechanisms like unrotated positional encoding, LayerNorm, and softmax normalization, ensuring robust token stabilization.
  • The token facilitates efficient inference, sparsity, and compression by guiding token selection and preventing representational collapse in deep networks.

A stable sink token is a persistent, structurally-induced attractor in the attention computations of transformer networks, characterized by receiving an outsized, layer-wise attention allocation from a substantial subset of query positions—often irrespective of semantic content. In autoregressive LLMs, this role is almost always played by the first token (commonly a [BOS], <s>, or similar initialization symbol), but the phenomenon generalizes to other architectural variants, modalities, and pretraining regimes. The stability of the sink token is fundamental to information flow, optimization, and the computational geometry of deep transformer networks.

1. Mathematical Definition and Geometric Properties

Let $L$ denote the number of layers and $n$ the context (sequence) length of a transformer. At layer $l$ ($1 \leq l \leq L$), the pre-attention hidden state of token $i$ is $h_i^l \in \mathbb{R}^d$, with index $s$ (typically $s = 0$, the first position) reserved for the canonical sink token. The unit-normalized hidden state is defined as $\hat{h}_i^l = h_i^l / \Vert h_i^l \Vert_2$.

The canonical indicator of the sink token's stability is the layer-wise drift:

\Delta_s^l = \Vert \hat{h}_s^{l+1} - \hat{h}_s^l \Vert_2 \ll 1

Empirical studies across Llama-2, Mistral, Llama-3, and related architectures demonstrate that for all $l > l_{\text{sink}}$ (the onset of the sink phenomenon), $\hat{h}_s^l$ remains essentially constant, with $\cos(\hat{h}_s^{l_1}, \hat{h}_s^{l_2}) \approx 0.95$–$0.99$ even between distant layers (Shin et al., 5 Jul 2025).

The cosine similarity between any other token and the sink,

\cos(\hat{h}_i^l, \hat{h}_s^l) = (\hat{h}_i^l)^\top \hat{h}_s^l,

increases monotonically with depth: tokens are geometrically "attracted" to the static direction defined by the sink, converging toward its point on the representation hypersphere.

This geometric attractor property is robust: token trajectories in the normalized hidden-state space point toward the fixed sink direction, forming a reference axis that fundamentally organizes the coordinate system of deep transformer representations (Shin et al., 5 Jul 2025, Ruscio et al., 4 Aug 2025).
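The drift and cosine-to-sink quantities above can be computed directly from a stack of per-layer hidden states. The sketch below is a minimal NumPy illustration on synthetic data (all shapes and names are hypothetical, not taken from any of the cited codebases); it freezes the sink direction after layer 1 to mimic the reported behavior:

```python
import numpy as np

def unit(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def sink_drift(hidden, s=0):
    """Layer-wise drift of the sink token's normalized direction.

    hidden: array of shape (L, n, d) -- hidden states per layer.
    Returns Delta_s^l = ||h_hat_s^{l+1} - h_hat_s^l||_2 for each layer l.
    """
    hs = unit(hidden[:, s, :])                  # (L, d)
    return np.linalg.norm(hs[1:] - hs[:-1], axis=-1)

def cos_to_sink(hidden, s=0):
    """Cosine similarity of every token to the sink, per layer."""
    h = unit(hidden)                            # (L, n, d)
    return np.einsum("lnd,ld->ln", h, h[:, s, :])

# Toy check: hidden states whose sink direction is frozen after layer 1,
# mimicking the near-zero drift observed empirically past l_sink.
rng = np.random.default_rng(0)
L, n, d = 6, 8, 16
hidden = rng.normal(size=(L, n, d))
hidden[1:, 0, :] = hidden[1, 0, :]              # freeze the sink from layer 1 on
drift = sink_drift(hidden)
print(drift)  # near-zero from the second entry onward
```

On real model activations, `hidden` would be the stacked residual-stream states; the same two functions then reproduce the drift and attraction curves described above.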

2. Mechanistic Origins and Architectural Determinants

The emergence and stability of sink tokens arise from several mechanistic and architectural sources:

  • Positional encoding and masking: In standard RoPE (rotary positional encoding), the first token (index 0) is always unrotated, conferring a unique, unperturbed reference in the key space (Ruscio et al., 4 Aug 2025). Absolute positional encodings used in encoders (e.g., BERT, RoBERTa) can induce dual (bidirectional) stable sinks at both ends (Ruscio et al., 4 Aug 2025).
  • LayerNorm, residual connections, and feature pinning: The first token’s features, propagated through residual pathways and LayerNorm/RMSNorm, tend to maintain large and stable magnitudes. These are insufficiently affected by stochastic or input-dependent noise, producing a hidden state with persistent directionality across depth (Shin et al., 5 Jul 2025).
  • Softmax normalization constraint: The probability simplex enforced by softmax compresses attention such that any residual or non-informative "budget" is persistently absorbed by the reference (sink) key, acting as a mathematically-optimal key-bias to soak up unused mass (Gu et al., 2024).
  • Training and data distribution: Sinks need sufficient data and optimization before stabilizing. Their strength is modulated by learning rate, weight decay, and context packing, but they almost always arise in large-scale LM pretraining (Gu et al., 2024, Barbero et al., 3 Apr 2025).
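The positional-encoding mechanism can be checked numerically: under rotary encoding, position 0 receives a zero rotation angle, so the first token's key passes through unchanged. A minimal sketch, assuming the standard pairwise-rotation formulation of RoPE:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position encoding to a vector x at position `pos`.

    x: shape (d,) with d even; each pair (x[2i], x[2i+1]) is rotated
    by the angle pos * theta_i, with per-pair frequency theta_i.
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
k = rng.normal(size=8)
# Position 0 gets a zero rotation: the key is returned unchanged,
# giving the first token a fixed, unrotated reference direction.
print(np.allclose(rope(k, 0), k))   # True
print(np.allclose(rope(k, 5), k))   # False: later positions are rotated
```

Because rotation preserves norms, position 0 is distinguished purely by direction: it is the only index whose key direction never depends on context length.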

This structural origin is not idiosyncratic to decoder-only LMs or NLP—variants with different position-encoding schemes (e.g., scaled RoPE in Qwen/Phi-2) yield distributed reference frames and hence distributed (multi-token) sinks (Ruscio et al., 4 Aug 2025). In vision, improper token mixing in ViTs results in the [CLS] token becoming a sink unless the architecture is explicitly designed against it (Feng et al., 9 Apr 2025).
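The softmax "unused budget" mechanism can be made concrete with a toy single-head example. Here a scalar logit stands in for a learned sink-key bias (an assumption for illustration only): when the query matches none of the content keys, the surplus probability mass is forced onto the sink slot.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy single-head attention: the query is unrelated to every "content"
# key, yet softmax must still distribute a full unit of probability.
rng = np.random.default_rng(0)
d = 16
q = rng.normal(size=d)
content_keys = rng.normal(size=(7, d))      # roughly uncorrelated with q
sink_logit = 5.0                            # stand-in for a learned key bias

logits = np.concatenate([[sink_logit], content_keys @ q / np.sqrt(d)])
attn = softmax(logits)
print(attn[0])   # most of the probability mass sits on the sink slot
```

Because the simplex constraint forces the weights to sum to one, the high-bias sink slot absorbs whatever mass the content keys do not claim, which is exactly the "key-bias" role attributed to the sink above.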

3. Functional Role in Information Propagation and Model Stability

Stable sink tokens implement a regularizing constraint on sequence-wide mixing:

  • Avoidance of rank and representational collapse: Without constrained mixing, repeated attention across layers drives the hidden-state matrix toward low-rank (or even rank-1) subspaces, resulting in over-mixing and representational collapse for late-position tokens (Barbero et al., 3 Apr 2025). The stable sink, by absorbing a large fraction of attention, attenuates this process by "turning off" mixing in a subset of heads/layers—a defense against over-squashing (Barbero et al., 3 Apr 2025).
  • No-op update zones: Tokens highly aligned with the sink receive minimal further updates, effectively skipping full computation while preserving residual contributions, as exploited in token selection and dynamic sparsification (Shin et al., 5 Jul 2025).
  • Reference-frame anchoring: The sink provides a geometric reference axis for self-attention alignment, enabling transformers to order and coordinate information in high-dimensional space (Ruscio et al., 4 Aug 2025).

These roles generalize to different modalities and architectures. In diffusion LLMs (DLMs), the instability of moving sinks (due to dynamic masking) disrupts robustness, motivating explicit insertion of a static sink token to regularize the residual stream (Zhang et al., 27 Jan 2026).
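A mask of the kind described for DLMs, in which the inserted sink is globally visible as a key but attends only to itself, might be sketched as follows. This is a minimal boolean-mask construction; the causal base and the index conventions are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

def mask_with_static_sink(n):
    """Boolean attention mask for n content tokens plus one sink at index 0.

    mask[i, j] == True means query i may attend to key j.
    The sink (row 0) attends only to itself; every content token may
    attend to the sink and (here) causally to earlier content tokens.
    """
    m = np.tril(np.ones((n + 1, n + 1), dtype=bool))  # causal base
    m[0, :] = False
    m[0, 0] = True            # sink attends only to itself
    m[:, 0] = True            # sink key is globally visible
    return m

mask = mask_with_static_sink(4)
print(mask.astype(int))
```

Restricting the sink's own row keeps its residual stream static (it never mixes in content), while the always-visible column lets every query offload attention mass onto it, regardless of how the dynamic masking schedule evolves.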

4. Practical Exploitation: Efficient Inference, Sparsity, and Compression

Recent methods leverage the stable sink to optimize inference and resource utilization:

  • OrthoRank dynamic selection: By measuring token importance as orthogonality to the sink

w_i^l = 1 - \cos(\hat{h}_i^l, \hat{h}_s^l),

and updating only the most orthogonal tokens each layer (those farthest from the sink on the hypersphere), OrthoRank achieves lower perplexity and improved zero-shot accuracy at the same or lower computational cost versus standard layer-pruning approaches (Shin et al., 5 Jul 2025). Importantly, this method yields superior performance on long-context tasks such as LongBench.

  • KV cache quantization and storage: The key-bias nature of the sink token underlies strategies for selective precision preservation. KVSink, for instance, identifies tokens with stable outlier activations in a designated layer/channel and preserves their key/value representations at full precision during quantization, attaining substantial improvements over naive first-N preservation strategies (Su et al., 6 Aug 2025).
  • Token selection in structured sequences: In click-through rate prediction, CTR-Sink strategically inserts trainable sink tokens between behaviors, carrying external correlation signals. These guide attention to meaningful boundaries, mitigate semantic fragmentation, and systematically improve empirical performance (Li et al., 5 Aug 2025).
  • Video diffusion and long-context sequence models: Deep Sink, in autoregressive video diffusion, dedicates half of the sliding attention window to persistent sink tokens, using temporal RoPE alignment to anchor global context and prevent drift during out-of-distribution, long-range generation (Yi et al., 4 Dec 2025).
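The OrthoRank selection criterion translates into a few lines of NumPy. This is a minimal sketch of the scoring rule only, not the full pipeline; the shapes and the exclusion of the sink itself are illustrative assumptions:

```python
import numpy as np

def unit(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def orthorank_select(hidden_l, k, s=0):
    """Pick the k tokens most orthogonal to the sink at one layer.

    hidden_l: (n, d) hidden states. Returns indices of the k tokens
    with the largest w_i = 1 - cos(h_i, h_s), excluding the sink.
    """
    h = unit(hidden_l)
    w = 1.0 - h @ h[s]                  # orthogonality-to-sink score
    w[s] = -np.inf                      # never select the sink itself
    return np.argsort(w)[::-1][:k]

rng = np.random.default_rng(0)
n, d = 16, 32
hidden = rng.normal(size=(n, d))
hidden[3] = hidden[0] + 0.01 * rng.normal(size=d)   # token 3 aligned with sink
selected = orthorank_select(hidden, k=4)
print(selected)   # token 3 (sink-aligned, a near "no-op" token) is skipped
```

Tokens aligned with the sink score near zero and are skipped, matching the "no-op update zone" interpretation: only the most orthogonal tokens receive full computation at that layer.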

5. Deviations, Modifications, and Countermeasures

The presence of stable sink tokens is not inevitable; specific modifications disrupt or entirely eliminate the phenomenon:

  • Alternative normalization: Replacing softmax with a rectified, non-sum-to-one normalization (e.g., Softpick) removes the requirement that every query distribute a full unit of attention mass, and hence eliminates attention sinks. Softpick's formula

\text{Softpick}(x)_i = \frac{\max\{0,\ e^{x_i - m} - e^{-m}\}}{\sum_{j=1}^{n} \left| e^{x_j - m} - e^{-m} \right| + \epsilon}, \qquad m = \max_j x_j,

guarantees 0% sink rate in both empirical and theoretical analyses. The result is elimination of massive activations and more robust quantization at ultra-low bitwidths (Zuhri et al., 29 Apr 2025).

  • Architectural realignment: Architectures such as EDIT for vision maintain attention stability by reengineering the encoder–decoder coupling, preventing [CLS] from collapsing to a sink and ensuring distributed, layer-wise information flow. This preserves richer token representations and yields consistently better segmentation and transfer performance compared to vanilla ViTs (Feng et al., 9 Apr 2025).
  • Dynamic sink relocation or suppression: Randomizing the initial token position, or controlling key bias via small additive logit modifications, produces weaker or distributed sinks at the cost of interpretability or mixing (Gu et al., 2024, Barbero et al., 3 Apr 2025).
  • Explicit sink tokens in diffusion: In DLMs, attention sinks migrate unpredictably due to dynamic masking. Inserting a dedicated token constrained by an attention mask to attend only to itself provides a static, low-norm, globally visible sink, restoring stability and regularization (Zhang et al., 27 Jan 2026).
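The Softpick formula above translates directly into code. A minimal NumPy sketch, taking $m$ as the row-wise maximum of the logits (the standard stabilizing shift, as in numerically stable softmax):

```python
import numpy as np

def softpick(x, eps=1e-8):
    """Rectified softmax replacement: entries with non-positive logits
    get exactly zero weight, and the weights need not sum to one."""
    m = np.max(x, axis=-1, keepdims=True)
    shifted = np.exp(x - m) - np.exp(-m)     # zero exactly where x == 0
    num = np.maximum(0.0, shifted)           # rectify: kill x <= 0 entries
    den = np.sum(np.abs(shifted), axis=-1, keepdims=True) + eps
    return num / den

# Non-positive logits receive exactly zero weight, so a query that matches
# no key can emit (near-)zero total attention instead of dumping its
# leftover budget onto a sink slot.
logits = np.array([-3.0, -1.0, 0.0, 2.0])
w = softpick(logits)
print(w)          # only the last entry is nonzero
print(w.sum())    # can be < 1, unlike softmax
```

Since $e^{x_i - m} - e^{-m} > 0$ exactly when $x_i > 0$, the numerator vanishes for all non-positive logits; an all-non-positive row yields an all-zero output, which is why no residual mass is ever forced onto a sink key.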

6. Experimental and Empirical Characterization

Empirical analyses converge on a consistent set of observations about stable sink tokens:

  • $\Delta_s^l \approx 0$: the sink token's direction is static after $l_{\text{sink}}$ (Shin et al., 5 Jul 2025).
  • Sink rate $\sim$60–80%: large LLMs (70B+) exhibit high sink rates after pretraining (Gu et al., 2024, Barbero et al., 3 Apr 2025).
  • 0% sink rate with Softpick: rectified normalization eliminates sinks (Zuhri et al., 29 Apr 2025).
  • OrthoRank $\Delta$PPL of $-0.4$ to $-0.8$ vs. pruning: at 20% compute, superior to standard layer-pruning (Shin et al., 5 Jul 2025).

These findings demonstrate the universality, impact, and practical leverage of stable sink tokens across scales, modalities, and domains.

7. Broader Implications and Future Directions

The existence and functionality of stable sink tokens have broad consequences for model design, interpretability, and efficient inference.

A plausible implication is that future work will increasingly focus on adaptive or data-driven formation of sink tokens, alignment with domain structure, and hybrid approaches that blend sink-aware sparsification with dynamic reference-frame selection. The theoretical insight that attention sinks are not mere artifacts, but optimal solutions for imposing high-dimensional coordinate systems on transformer representations, reframes both the analysis and the practical exploitation of the phenomenon.
