- The paper identifies a specific circuit—comprising learned query bias, MLP-enhanced positional encoding, and structured key projections—that causes GPT-2's attention sinks.
- It employs targeted causal interventions and ablation experiments to demonstrate that this circuit is both necessary and sufficient for the anomalous first-position attention.
- Results highlight the need for architecture-specific mitigation strategies to address attention sinks in Transformer-based models.
Mechanistic Dissection of Attention Sinks in GPT-2: Circuit Architecture and Implications
Introduction
The paper "A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation" (2604.14722) provides an in-depth causal analysis of the attention sink phenomenon in Transformer models, specifically GPT-2 architectures employing learned query biases and absolute positional encodings. An attention sink is the persistent, semantically invariant concentration of attention on the initial position in a sequence. This effect is robust across datasets, tasks, and even varying architectural choices, but its underlying cause remains only partially understood.
The authors integrate structural analysis and targeted causal interventions to explicitly identify a parameter-level circuit responsible for the sink in GPT-2. They empirically demonstrate that the interaction of (i) learned query bias, (ii) first-layer MLP-transformed positional encoding, and (iii) structured key projections is both necessary and sufficient to produce the large source-agnostic score shift driving the attention sink. Importantly, the paper shows that this specific circuit is not universal, highlighting the necessity for architecture-aware mitigation of attention sinks.
Theoretical Framework and Mechanism
The attention computation yields a pre-softmax logit matrix in which, for every head and layer, the term
$$\Delta_{j,h}^{(l)} = b_{Q,h}^{(l)}\, W_{K,h}^{(l)\top}\, x_j^{(l)\top}$$
constitutes a target-position-specific but source-invariant additive shift. The paper rigorously demonstrates that for GPT-2, $\Delta_{1,h}^{(l)}$ (the j = 1 case) is consistently and anomalously large, biasing all tokens to disproportionately attend to position 1, the canonical "sink" position.
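This decomposition can be checked numerically. The sketch below is a toy illustration with random weights (not GPT-2's actual parameters): it verifies that a pre-softmax score matrix built from biased queries splits exactly into a source-dependent term plus a source-invariant shift that depends only on the key position j.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5  # toy head dimension and sequence length

W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_Q = rng.normal(size=d)       # learned query bias (GPT-2 has one per head)
X = rng.normal(size=(n, d))    # residual-stream inputs x_1 .. x_n

Q = X @ W_Q + b_Q              # queries include the bias
K = X @ W_K                    # keys
scores = Q @ K.T               # pre-softmax logits, shape (n, n)

# Source-invariant shift: Delta_j = b_Q . k_j, identical for every query row i
delta = b_Q @ K.T              # shape (n,), one value per key position j
content = (X @ W_Q) @ K.T      # the source-dependent remainder

# Every row of the score matrix carries the same additive shift delta
assert np.allclose(scores, content + delta[None, :])
```

The sink claim is that in trained GPT-2 this shift is anomalously large at j = 1; here the weights are random, so `delta` has no special structure at any position.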
Figure 1: The source-agnostic shift $\Delta_{j,h}^{(l)}$ is systematically large at the first position, reflecting a strong prior toward attending to the first token.
To parse the origin of this effect, the authors introduce the concept of "Effective Positional Encoding" (EPE), defined as
$$\mathrm{EPE}_i = \mathrm{MLP}^{(1)}(p_i) + p_i$$
where $p_i$ is the learned positional embedding and $\mathrm{MLP}^{(1)}$ is the first-layer feed-forward block. Through ablation and alignment studies, the paper demonstrates that (i) $\mathrm{EPE}_1$ at position 1 exhibits massive, input-invariant activations, (ii) its key projection aligns strongly with the learned query bias, and (iii) its largest activations sit in the coordinates with the largest key-projection weights, together forming a circuit for sink formation.
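The EPE construction and the alignment probe can be sketched in a few lines. This is a minimal toy version with random weights, so the cosine value it prints is arbitrary; in trained GPT-2 the paper reports strong alignment between the key-projected $\mathrm{EPE}_1$ and the query bias.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2's MLP blocks
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(1)
d, d_ff, n = 8, 32, 5

P = rng.normal(size=(n, d))                  # learned positional embeddings p_i
W_in, W_out = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))

# Effective positional encoding: first-layer MLP output plus the residual path
EPE = gelu(P @ W_in) @ W_out + P             # EPE_i = MLP^{(1)}(p_i) + p_i

# Alignment probe: cosine between the key-projected EPE_1 and the query bias
W_K = rng.normal(size=(d, d))
b_Q = rng.normal(size=d)
k1 = EPE[0] @ W_K
cos = k1 @ b_Q / (np.linalg.norm(k1) * np.linalg.norm(b_Q))
print(f"cos(EPE_1 W_K, b_Q) = {cos:.3f}")    # near 1 in GPT-2; arbitrary here
```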
Figure 2: The first-position key projection $\mathrm{EPE}_1 W_{K,h}$ is strongly aligned with the query bias, driving the sink.
Figure 3: EPE robustly matches the actual positional signal over all positions, with nearly perfect correspondence at position 1.
Figure 4: $\Delta_{1,h}^{(l)}$ is large exactly in the coordinates where the bias projection is large, indicating coordinate-level co-adaptation.
Empirical and Causal Validation
The authors' methodology is comprehensive: they compute per-head, per-layer $\Delta_{1,h}^{(l)}$ statistics across standardized datasets and validate the source of the sink through four distinct lenses.
Causal Interventions and Necessity/Sufficiency
The paper's strongest claims rest on a suite of targeted forward-pass interventions. Nullifying the query bias, removing the first positional encoding, swapping EPEs across positions, disabling the first-layer MLP, or zeroing the small set of "massive activation" coordinates each substantially diminishes the sink, as quantified by the BOS-attention metric, which drops to a small fraction of its baseline value (the exact reduction depends on the intervention). Control interventions (nullifying the BOS token embedding, swapping raw positional embeddings, or zeroing randomly chosen coordinates) have negligible effect, ruling out these confounds.
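The flavor of these interventions can be reproduced on a toy attention head. The sketch below manufactures a sink by pointing the query bias along the first position's key, then "intervenes" by nullifying the bias and re-measuring attention on position 0. This is an illustration of the intervention logic under stated assumptions (random weights, no causal mask), not the paper's actual experiment.

```python
import numpy as np

def bos_attention(X, W_Q, W_K, b_Q):
    """Mean attention mass on position 0 across all query positions
    (softmax over all keys; no causal mask, for simplicity)."""
    scores = (X @ W_Q + b_Q) @ (X @ W_K).T / np.sqrt(X.shape[1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[:, 0].mean()

rng = np.random.default_rng(2)
d, n = 8, 6
X = rng.normal(size=(n, d))
W_Q, W_K = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))

# Engineer a sink: align the query bias with the first position's key
k0 = X[0] @ W_K
b_Q = 8.0 * k0 / np.linalg.norm(k0)

baseline = bos_attention(X, W_Q, W_K, b_Q)
ablated = bos_attention(X, W_Q, W_K, np.zeros(d))  # nullify the query bias
print(f"BOS attention: baseline={baseline:.2f}, bias ablated={ablated:.2f}")
```

With the bias in place, every query row inherits the same large logit at position 0; zeroing the bias collapses the concentration back toward uniform, mirroring the direction of the paper's ablation results.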
Distributed Circuitry and Architectural Specificity
A central claim, underpinned by the authors' broader analysis and supported by the ablation experiments, is that none of the three circuit components (query bias, first-layer MLP-processed position, or key-projection structure) is universally necessary across Transformer variants. Alternative architectures without query biases, MLP blocks, or explicit positional encoding still exhibit sinks, implying that different architectures construct alternative mechanistic routes to the same sink bias (e.g., repurposing a roughly constant feature direction to play the role of the query bias). This property is strongly corroborated by parallel empirical and theoretical work (Gu et al., 2024; Sun et al., 2024; Qiu et al., 10 May 2025; Qiu et al., 30 Jan 2026), which identifies other possible formations and mitigation strategies.
Implications for Mitigation and Future Directions
The analysis directly influences both theoretical and practical perspectives:
- Mitigation Strategies Must Be Architecture-Specific: Because there is no universal parameter pathway for the attention sink, attempts to mitigate it via pre-training architectural edits (e.g., removing the query bias) or post-training corrections (e.g., zeroing the positional encoding) are likely to be subverted by optimization reconstructing alternative circuits.
- Functional Role and Robustness: While the mechanistic origin is dissected, the paper does not establish a functional rationale for the sink (beneficial, harmful, or both). Other works suggest plausible roles, such as countering over-mixing or guaranteeing computational compositionality (Barbero et al., 3 Apr 2025, Ran-Milo, 12 Mar 2026).
Future research directions are clear: large-scale, cross-architecture mechanistic studies are needed to enumerate sink circuits in non-GPT-2 families (e.g., RoPE, ALiBi, Mamba), investigate learning-stage dynamics of the sink pathway emergence, and design robust, architecture-aware mitigation schemes for diverse application domains.
Figure 6: Full (non-truncated) histogram of the source-agnostic shift $\Delta_{j,h}^{(l)}$, confirming the comparative outlier status of the first position.
Figure 7: A few coordinates of $\mathrm{EPE}_1$ exhibit massive activations, serving as specialized conduits for the sink mechanism.
Figure 8: Full (non-truncated) histogram of $\mathrm{EPE}_1$ coordinate magnitudes, highlighting the outlier property of the "massive activation" coordinates.
Conclusion
This work elucidates the mechanistic underpinnings of attention sinks in GPT-2-class architectures, tracing the effect to a specific, dissectible circuit formed by the interaction of learned query bias, first-layer MLP-amplified positional encoding, and key-projection specialization. Through causal interventions, the circuit is shown to be both necessary and sufficient in GPT-2, but not a universal mechanism, motivating per-architecture mitigation strategies. While secondary contributors remain and the sink's functional utility is unresolved, this mechanistic advance lays a foundation for robust causal interpretability and targeted model editing in Transformer-based systems.