Why do LLMs attend to the first token?
(2504.02732v3)
Published 3 Apr 2025 in cs.CL
Abstract: LLMs tend to attend heavily to the first token in the sequence -- creating a so-called attention sink. Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it. Attention sinks have been connected to quantisation difficulties, security issues, and streaming attention. Yet, while many works have provided conditions in which they occur or not, a critical question remains shallowly answered: Why do LLMs learn such patterns and how are they being used? In this work, we argue theoretically and empirically that this mechanism provides a method for LLMs to avoid over-mixing, connecting this to existing lines of work that study mathematically how information propagates in Transformers. We conduct experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing influence the sink behaviour. We hope that this study provides a new practical perspective on why attention sinks are useful in LLMs, leading to a better understanding of the attention patterns that form during training.
The paper shows that attention sinks focused on the first token help control over-mixing and prevent representational collapse in deep LLMs.
It derives an over-squashing bound that incorporates multi-head attention, revealing how model depth and context length amplify sensitivity in token representations.
Empirical tests on models like Gemma 7B and LLaMa 3.1 confirm that a fixed BOS token is crucial for maintaining stable attention patterns and mitigating performance degradation.
This paper investigates the phenomenon of "attention sinks" in LLMs, where attention heads disproportionately focus on the first token of a sequence (often the BOS token), even if it lacks semantic meaning. Instead of viewing sinks as problematic, the authors argue they are a functionally useful mechanism learned by LLMs to prevent "over-mixing" of information, particularly in deep models trained on long contexts (2504.02732).
The core hypothesis connects attention sinks to theoretical concepts like rank collapse, representational collapse, and over-squashing, which describe how information can degrade or become overly uniform as it propagates through deep Transformers or across long sequences. The authors posit that attention sinks act as a control mechanism to slow down this information mixing, thereby preserving distinct token representations deeper into the model and for longer contexts (2504.02732).
Theoretical Contributions and Validation:
Connecting Collapse Phenomena: The paper establishes a theoretical link, showing that rank collapse (representations converging towards their mean) is a stronger condition that implies representational collapse (adjacent token representations becoming indistinguishable) (2504.02732). These phenomena highlight the tendency towards over-mixing.
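As a brief illustration of why this implication holds (the notation below is assumed for exposition, not quoted from the paper): rank collapse bounds every token's distance to the layer mean, and the triangle inequality then bounds the distance between any two adjacent tokens.

```latex
% Hedged sketch, assuming v_i^{(\ell)} denotes token i's representation at layer \ell
% and \bar v^{(\ell)} = \frac{1}{n}\sum_k v_k^{(\ell)} the mean representation at that layer.
\[
\max_i \bigl\| v_i^{(\ell)} - \bar v^{(\ell)} \bigr\| \le \delta_\ell
\;\Longrightarrow\;
\bigl\| v_i^{(\ell)} - v_{i+1}^{(\ell)} \bigr\|
\le \bigl\| v_i^{(\ell)} - \bar v^{(\ell)} \bigr\|
  + \bigl\| \bar v^{(\ell)} - v_{i+1}^{(\ell)} \bigr\|
\le 2\,\delta_\ell .
\]
% As \delta_\ell \to 0 (rank collapse), adjacent representations become
% indistinguishable, i.e. representational collapse.
```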
Over-Squashing Bound: An extended theoretical bound on over-squashing (how much a perturbation at an input token i affects an output token j) is derived, explicitly incorporating multi-head attention. This bound shows that the sensitivity ∂v_j^(L)/∂v_i^(0) grows with the depth L and with the attention-path weights ᾱ_ij^(ℓ) (2504.02732).
Attention sinks, by concentrating the attention weights α_ij^(ℓ,h) on the first token, effectively reduce the weights along paths between other tokens, thus lowering sensitivity and mitigating over-mixing (2504.02732).
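The general shape of such a bound can be sketched as follows (the constant C and the exact form of the path weights are assumptions made for illustration; see the paper for the precise statement):

```latex
% Hedged sketch of the bound's shape, not the paper's exact inequality.
\[
\left\| \frac{\partial v_j^{(L)}}{\partial v_i^{(0)}} \right\|
\;\lesssim\;
C^{L}\,\bigl( \bar A^{(L)} \bar A^{(L-1)} \cdots \bar A^{(1)} \bigr)_{ji},
\qquad
\bar A^{(\ell)} = \tfrac{1}{H} \sum_{h=1}^{H} A^{(\ell,h)},
\]
% where A^{(\ell,h)} is the attention matrix of head h in layer \ell and C absorbs
% weight norms and Lipschitz constants. Expanding the matrix product gives a sum over
% attention paths from i to j of products of per-layer weights. When most heads place
% their mass on the first token, every inter-token factor along such a path is small,
% the path products shrink, and the sensitivity of v_j^{(L)} to v_i^{(0)} drops --
% exactly the over-mixing control described above.
```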
Perturbation Experiments (Gemma 7B): Empirical tests on Gemma 7B validate this. Perturbing a single token (e.g., 'greatest' -> 'best') causes significantly larger changes in subsequent token representations throughout the model when the BOS token (the sink) is removed compared to when it is present. Removing the BOS token also leads to smoother (less sparse) attention maps, indicating increased mixing (2504.02732).
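A minimal sketch of this kind of perturbation probe, using the Hugging Face transformers API (the model name, prompts, and token positions are illustrative assumptions, and the two prompts are assumed to tokenise to the same length):

```python
# Hedged sketch: compare how much downstream hidden states move when one token is
# swapped, with vs. without a BOS token prepended. Not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-7b"  # assumption: any causal LM with a BOS token works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def hidden_states(text: str, prepend_bos: bool):
    """Return per-layer hidden states for `text`, optionally prepending BOS."""
    ids = tok(text, add_special_tokens=False, return_tensors="pt").input_ids
    if prepend_bos:
        ids = torch.cat([torch.tensor([[tok.bos_token_id]]), ids], dim=1)
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states  # tuple of (num_layers + 1) tensors [1, seq, dim]

base = "This is the greatest city in the world"
pert = "This is the best city in the world"  # single-token swap (assumed same length)

for use_bos in (True, False):
    h_base = hidden_states(base, use_bos)
    h_pert = hidden_states(pert, use_bos)
    # Mean L2 change in final-layer representations of the tokens after the swap.
    diff = (h_base[-1][0, -3:] - h_pert[-1][0, -3:]).norm(dim=-1).mean().item()
    print(f"BOS={use_bos}: mean change in downstream representations = {diff:.4f}")
```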
Approximate No-Ops: Analysis of a specific head in Gemma 7B suggests sinks facilitate an "approximate no-op" behavior. The head attends heavily to the BOS token (which has a low-norm value vector) by default, effectively passing information through the residual connection with minimal change. It only attends sharply to other tokens (like an apostrophe) when specific conditions are met. This allows heads to remain selectively inactive, further controlling information flow (2504.02732).
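A small numeric sketch of why attending to a low-norm value vector amounts to an approximate no-op (all numbers below are made up for illustration):

```python
# Hedged sketch: if a head puts nearly all of its attention on a BOS position whose
# value vector has near-zero norm, the head's additive contribution to the residual
# stream is negligible, so the residual passes through almost unchanged.
import numpy as np

d = 8
rng = np.random.default_rng(0)

v_bos = 1e-3 * rng.standard_normal(d)      # low-norm value vector for BOS (assumed)
v_other = rng.standard_normal((5, d))      # value vectors for content tokens
values = np.vstack([v_bos, v_other])       # [seq, d]

# Sink-like attention: ~99% of the mass on position 0 (the BOS token).
attn = np.array([0.99, 0.002, 0.002, 0.002, 0.002, 0.002])

head_out = attn @ values                   # weighted sum of value vectors
residual = rng.standard_normal(d)          # incoming residual-stream state

print("norm(head output) =", np.linalg.norm(head_out))
print("norm(residual)    =", np.linalg.norm(residual))
print("relative change   =", np.linalg.norm(head_out) / np.linalg.norm(residual))
# The relative change is tiny: the head behaves as an approximate no-op
# until it chooses to attend sharply to some other token.
```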
Empirical Scaling Trends:
Context Length: Models (120M parameters) were pre-trained from scratch with varying context lengths (128 to 2048). Results show that models trained on longer contexts develop significantly stronger attention sinks, supporting the hypothesis that sinks are needed to counteract increased mixing potential in longer sequences (2504.02732).
Model Size (LLaMa 3.1 Family): Analysis across LLaMa 3.1 models (8B, 70B, 405B) reveals that larger, deeper models exhibit a much higher proportion of heads forming strong sinks. For LLaMa 3.1 405B, nearly 80% of attention heads direct the bulk of their attention to the first token (measured at an ϵ = 0.8 sink threshold). This aligns with the theory that deeper models require stronger mechanisms to prevent over-mixing (2504.02732).
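A hedged sketch of a sink metric in this spirit (the paper's exact definition may differ in details such as how the averaging is done; the function below is illustrative):

```python
# Sketch of a sink metric: the fraction of (layer, head) pairs whose average
# attention to the first token exceeds a threshold eps.
import torch

def sink_metric(attentions: list[torch.Tensor], eps: float = 0.8) -> float:
    """attentions: per-layer attention tensors of shape [batch, heads, queries, keys]."""
    flags = []
    for layer_attn in attentions:
        # Average attention each head pays to key position 0 (the first token).
        to_first = layer_attn[..., 0].mean(dim=(0, 2))  # -> [heads]
        flags.append(to_first > eps)
    return torch.cat(flags).float().mean().item()

# Usage (assumed workflow): run a causal LM forward pass with output_attentions=True
# and pass outputs.attentions to sink_metric; a value near 0.8 would mean roughly
# 80% of heads behave as strong sinks at this threshold.
```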
Role of the BOS Token and Data Packing:
Experiments were conducted to determine whether the specific BOS token is special or whether any token in the first position can serve as a sink.
If a model is trained without a BOS token always fixed at the start, it learns to use whatever token happens to be first as the sink (2504.02732).
If a model is trained with the BOS token fixed at the first position (a common pre-training strategy), the model becomes reliant on that specific token for the sink mechanism. Removing the BOS token during inference in this case drastically degrades performance and eliminates the sink pattern (2504.02732).
This indicates that while the sink mechanism is a general strategy to control mixing, its specific implementation can depend on pre-training choices like data packing and the consistent presence of a BOS token (2504.02732).
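The contrast between the two training regimes can be sketched with a toy packing routine (token ids, the window length, and the packing details are illustrative assumptions, not the paper's pipeline):

```python
# Hedged sketch of the two data-packing choices discussed above: "fix_bos=True"
# mimics always placing a BOS token at the start of every training window, while
# "fix_bos=False" lets whatever token falls first in the window act as the
# potential sink.
EOS, BOS = 2, 1  # assumed separator / beginning-of-sequence token ids

def pack(docs: list[list[int]], window: int, fix_bos: bool) -> list[list[int]]:
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc + [EOS])          # documents are concatenated either way
    chunks = []
    step = window - 1 if fix_bos else window
    for start in range(0, len(stream) - step + 1, step):
        chunk = stream[start:start + step]
        if fix_bos:
            chunk = [BOS] + chunk           # every window begins with the BOS token
        chunks.append(chunk)
    return chunks

docs = [[11, 12, 13, 14], [21, 22, 23], [31, 32, 33, 34, 35]]
print(pack(docs, window=6, fix_bos=True))   # position 0 is always BOS
print(pack(docs, window=6, fix_bos=False))  # position 0 is an arbitrary token
```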
In conclusion, the paper presents compelling theoretical arguments and empirical evidence that attention sinks are not merely an artifact but a useful, learned mechanism in LLMs. They counteract over-mixing and representational collapse, which is particularly crucial for the stability and performance of very deep models and those operating on long contexts. This understanding has implications for model analysis, interpretation, and potentially architecture design (2504.02732).