- The paper identifies a specific circuit—comprising learned query bias, MLP-enhanced positional encoding, and structured key projections—that causes GPT-2's attention sinks.
- It employs targeted causal interventions and ablation experiments to demonstrate that this circuit is both necessary and sufficient for the anomalous first-position attention.
- Results highlight the need for architecture-specific mitigation strategies to address attention sinks in Transformer-based models.
Mechanistic Dissection of Attention Sinks in GPT-2: Circuit Architecture and Implications
Introduction
The paper "A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation" (2604.14722) provides an in-depth causal analysis of the attention sink phenomenon in Transformer models, specifically GPT-2 architectures employing learned query biases and absolute positional encodings. An attention sink is the persistent, semantically invariant concentration of attention on the initial position in a sequence. This effect is robust across datasets, tasks, and even varying architectural choices, but its underlying cause remains only partially understood.
The authors integrate structural analysis and targeted causal interventions to explicitly identify a parameter-level circuit responsible for the sink in GPT-2. They empirically demonstrate that the interaction of (i) learned query bias, (ii) first-layer MLP-transformed positional encoding, and (iii) structured key projections is both necessary and sufficient to produce the large source-agnostic score shift driving the attention sink. Importantly, the paper shows that this specific circuit is not universal, highlighting the necessity for architecture-aware mitigation of attention sinks.
Theoretical Framework and Mechanism
The attention computation yields a pre-softmax logit matrix in which, for every head and layer, the term
$$\Delta_{j,h}^{(l)} = b_{Q,h}^{(l)}\, W_{K,h}^{(l)\top}\, x_j^{(l)\top}$$
constitutes a target-position-specific but source-invariant additive shift. The paper rigorously demonstrates that for GPT-2, $\Delta_{1,h}^{(l)}$ (the j = 1 case) is consistently and anomalously large, biasing all tokens to disproportionately attend to position 1, the canonical "sink" position.
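This decomposition can be checked numerically. The sketch below is a toy illustration with random weights (not GPT-2's actual parameters): it verifies that a pre-softmax score matrix built from biased queries splits exactly into a source-dependent term plus a source-invariant shift that depends only on the key position j.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5  # toy head dimension and sequence length

W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_Q = rng.normal(size=d)       # learned query bias (GPT-2 has one per head)
X = rng.normal(size=(n, d))    # residual-stream inputs x_1 .. x_n

Q = X @ W_Q + b_Q              # queries include the bias
K = X @ W_K                    # keys
scores = Q @ K.T               # pre-softmax logits, shape (n, n)

# Source-invariant shift: Delta_j = b_Q . k_j, identical for every query row i
delta = b_Q @ K.T              # shape (n,), one value per key position j
content = (X @ W_Q) @ K.T      # the source-dependent remainder

# Every row of the score matrix carries the same additive shift delta
assert np.allclose(scores, content + delta[None, :])
```

The sink claim is that in trained GPT-2 this shift is anomalously large at j = 1; here the weights are random, so `delta` has no special structure at any position.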
Figure 1: The source-agnostic shift $\Delta_{j,h}^{(l)}$ is systematically large at the first position, reflecting a strong prior toward attending to the first token.
To parse the origin of this effect, the authors introduce the concept of "Effective Positional Encoding" (EPE), defined as
$$\mathrm{EPE}_i = \mathrm{MLP}^{(1)}(p_i) + p_i$$
where $p_i$ is the learned positional embedding and $\mathrm{MLP}^{(1)}$ is the first-layer feed-forward block. Through ablation and alignment studies, the paper demonstrates that (i) $\mathrm{EPE}_1$ at position 1 exhibits massive, input-invariant activations, (ii) its key projection aligns strongly with the learned query bias, and (iii) its largest activations sit in the coordinates with the largest key-projection weights, together forming a circuit for sink formation.
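The EPE construction and the alignment probe can be sketched in a few lines. This is a minimal toy version with random weights, so the cosine value it prints is arbitrary; in trained GPT-2 the paper reports strong alignment between the key-projected $\mathrm{EPE}_1$ and the query bias.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2's MLP blocks
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(1)
d, d_ff, n = 8, 32, 5

P = rng.normal(size=(n, d))                  # learned positional embeddings p_i
W_in, W_out = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))

# Effective positional encoding: first-layer MLP output plus the residual path
EPE = gelu(P @ W_in) @ W_out + P             # EPE_i = MLP^{(1)}(p_i) + p_i

# Alignment probe: cosine between the key-projected EPE_1 and the query bias
W_K = rng.normal(size=(d, d))
b_Q = rng.normal(size=d)
k1 = EPE[0] @ W_K
cos = k1 @ b_Q / (np.linalg.norm(k1) * np.linalg.norm(b_Q))
print(f"cos(EPE_1 W_K, b_Q) = {cos:.3f}")    # near 1 in GPT-2; arbitrary here
```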
Figure 2: The first-position key projection $\mathrm{EPE}_1 W_{K,h}$ is strongly aligned with the query bias, driving the sink.
Figure 3: EPE robustly matches the actual positional signal over all positions, with nearly perfect correspondence at position 1.
Figure 4: $\Delta_{1,h}^{(l)}$ is large exactly in the coordinates where the bias projection is large, indicating coordinate-level co-adaptation.
Empirical and Causal Validation
The authors' methodology is comprehensive: they compute per-head, per-layer $\Delta_{1,h}^{(l)}$ statistics across standardized datasets and validate the source of the sink through four distinct lenses.
Causal Interventions and Necessity/Sufficiency
The paper's strongest claims rest on a suite of targeted forward-pass interventions. Nullifying the query bias, removing the first positional encoding, swapping EPEs across positions, disabling the first-layer MLP, or zeroing the small set of "massive activation" coordinates each substantially diminishes the sink, as quantified by the BOS-attention metric, which drops to a small fraction of its baseline value (the exact reduction depends on the intervention). Control interventions (nullifying the BOS token embedding, swapping raw positional embeddings, or zeroing randomly chosen coordinates) have negligible effect, ruling out these confounds.
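The flavor of these interventions can be reproduced on a toy attention head. The sketch below manufactures a sink by pointing the query bias along the first position's key, then "intervenes" by nullifying the bias and re-measuring attention on position 0. This is an illustration of the intervention logic under stated assumptions (random weights, no causal mask), not the paper's actual experiment.

```python
import numpy as np

def bos_attention(X, W_Q, W_K, b_Q):
    """Mean attention mass on position 0 across all query positions
    (softmax over all keys; no causal mask, for simplicity)."""
    scores = (X @ W_Q + b_Q) @ (X @ W_K).T / np.sqrt(X.shape[1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[:, 0].mean()

rng = np.random.default_rng(2)
d, n = 8, 6
X = rng.normal(size=(n, d))
W_Q, W_K = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))

# Engineer a sink: align the query bias with the first position's key
k0 = X[0] @ W_K
b_Q = 8.0 * k0 / np.linalg.norm(k0)

baseline = bos_attention(X, W_Q, W_K, b_Q)
ablated = bos_attention(X, W_Q, W_K, np.zeros(d))  # nullify the query bias
print(f"BOS attention: baseline={baseline:.2f}, bias ablated={ablated:.2f}")
```

With the bias in place, every query row inherits the same large logit at position 0; zeroing the bias collapses the concentration back toward uniform, mirroring the direction of the paper's ablation results.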
Distributed Circuitry and Architectural Specificity
A central claim, underpinned by the authors' broader analysis and supported by the ablation experiments, is that none of the three circuit components (query bias, first-layer MLP-processed position, or key-projection structure) is universally necessary across Transformer variants. Alternative architectures without query biases, MLP blocks, or explicit positional encoding still exhibit sinks, implying that different architectures construct alternative mechanistic routes to the same sink bias (e.g., repurposing a roughly constant feature direction to play the role of the query bias). This property is strongly corroborated by parallel empirical and theoretical work (Gu et al., 2024; Sun et al., 2024; Qiu et al., 10 May 2025; Qiu et al., 30 Jan 2026), which identifies other possible formations and mitigation strategies.
Implications for Mitigation and Future Directions
The analysis directly influences both theoretical and practical perspectives:
- Mitigation Strategies Must Be Architecture-Specific: Because there is no universal parameter pathway for the attention sink, attempts to mitigate it via pre-training architectural edits (e.g., removing the query bias) or post-training corrections (e.g., zeroing the positional encoding) are likely to be subverted by optimization reconstructing alternative circuits.
- Functional Role and Robustness: While the mechanistic origin is dissected, the paper does not establish a functional rationale for the sink (beneficial, harmful, or both). Other works suggest plausible roles, such as countering over-mixing or guaranteeing computational compositionality (Barbero et al., 3 Apr 2025, Ran-Milo, 12 Mar 2026).
Future research directions are clear: large-scale, cross-architecture mechanistic studies are needed to enumerate sink circuits in non-GPT-2 families (e.g., RoPE, ALiBi, Mamba), investigate learning-stage dynamics of the sink pathway emergence, and design robust, architecture-aware mitigation schemes for diverse application domains.
Figure 6: Full (non-truncated) histogram of the source-agnostic shift $\Delta_{j,h}^{(l)}$, confirming the comparative outlier status of the first position.
Figure 7: A few coordinates of $\mathrm{EPE}_1$ exhibit massive activations, serving as specialized conduits for the sink mechanism.
Figure 8: Full (non-truncated) histogram of $\mathrm{EPE}_1$ coordinate magnitudes, highlighting the outlier property of the "massive activation" coordinates.
Conclusion
This work elucidates the mechanistic underpinnings of attention sinks in GPT-2-class architectures, tracing the effect to a specific, dissectible circuit formed by the interaction of learned query bias, first-layer MLP-amplified positional encoding, and key-projection specialization. Through causal interventions, the circuit is shown to be both necessary and sufficient in GPT-2, but not a universal mechanism, motivating per-architecture mitigation strategies. While secondary contributors remain and the sink's functional utility is unresolved, this mechanistic advance lays a foundation for robust causal interpretability and targeted model editing in Transformer-based systems.