Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence

Published 10 Jun 2026 in stat.ML, cond-mat.dis-nn, and cs.LG | (2606.12058v1)

Abstract: Attention is the key mechanism underlying in-context learning in transformers, and attention patterns have been observed empirically to emerge abruptly during training. We present a Bayesian theory of feature learning in attention; we then focus on how the copy subcircuit in the first layer of an induction head is learned by analyzing a single-layer softmax attention network trained on a copy task. We derive a closed-form posterior over the attention matrix and reduce it to a low-dimensional order parameter space. This reduction reveals a phase transition in the amount of training data, which we verify using both Bayesian sampling and standard training with Adam. We contrast our results with linear attention and find that softmax attention exhibits a first-order phase transition while in linear attention an initial second-order phase transition is followed by a smooth, continuous evolution toward the structured attention pattern (crossover). Our work provides a first-principles theoretical account of the abrupt emergence of the copy subcircuit, reminiscent of the one observed in training LLMs.


Authors (6)

Summary

The paper introduces a Bayesian framework that analytically characterizes phase transitions in attention and the emergence of copy heads.
It employs closed-form analysis of the attention posterior to reveal distinct phase transitions: a continuous shift in linear attention versus an abrupt jump in softmax attention.
Empirical validations align theoretical predictions with loss curves and order parameters, offering insights into in-context learning circuit development.

Phase Transitions in Attention: Bayesian Theory and Copy Head Emergence

Overview

"Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence" (2606.12058) develops a principled analytical framework to characterize how structured attention patterns, notably copy heads as the first subcircuit of induction heads, emerge in Transformer architectures. The approach utilizes a Bayesian feature learning perspective, enabling closed-form analysis of the attention matrix posterior and explicit identification of low-dimensional order parameters. These order parameters reveal phase transitions—abrupt qualitative shifts in learned attention mechanisms—whose nature (first- or second-order) is shown to depend crucially on the choice of attention activation (softmax vs. linear). The results provide mechanistic clarity on the emergence of in-context learning (ICL) circuits and offer predictions for loss landscapes and empirical transition points.

Analytical Framework and Model

The work isolates the copy mechanism—essential for induction heads and bigram ICL—in a tractable synthetic setting: a shift-by-one copy task with i.i.d. one-hot sequence inputs and their shifted copies as targets. It employs a single-layer, single-head attention model parameterized such that weights and activations admit large-deviation and Gaussian process analysis.
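The task setup can be illustrated with a minimal NumPy sketch. The function names, the shift direction (the target at position t is the input at position t-1), and the all-zero target at position 0 are assumptions made for this illustration; the source does not specify them.

```python
import numpy as np

def make_copy_task(P, L, V, seed=0):
    """Sample P i.i.d. token sequences of length L over a vocabulary of size V.

    Returns one-hot inputs X of shape (P, L, V) and targets Y with
    Y[:, t] = X[:, t-1]. Position 0 has no predecessor, so its target is
    left as all zeros (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    tokens = rng.integers(0, V, size=(P, L))
    X = np.eye(V)[tokens]          # one-hot encoding, shape (P, L, V)
    Y = np.zeros_like(X)
    Y[:, 1:] = X[:, :-1]           # shift by one position
    return X, Y

def shift_attention(L):
    """Ideal copy pattern: position t attends only to position t-1."""
    return np.eye(L, k=-1)

X, Y = make_copy_task(P=4, L=6, V=5)
A = shift_attention(6)
# Mixing positions with the ideal pattern reproduces the targets.
pred = np.einsum("ts,psv->ptv", A, X)
```

Here `pred` equals `Y` exactly, which is the sense in which a sharp sub-diagonal attention pattern solves the task.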

The Bayesian formulation proceeds by:

Marginalizing over model parameters to directly obtain the posterior over attention matrices.
Reducing this high-dimensional distribution to a two-dimensional order parameter space $(\hat{c}_1, \hat{c}_G)$ for linear attention (uniform pooling and copy modes), or a single parameter after the softmax constraint.
Analyzing the negative log-posterior (action) landscape to identify phase transitions as a function of training dataset size $P$.
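To make the idea of a two-dimensional summary concrete, the sketch below projects an attention matrix onto a uniform-pooling pattern and a shift-by-one copy pattern by least squares. These empirical coefficients are a hypothetical stand-in chosen for illustration; they are not the paper's definition of $(\hat{c}_1, \hat{c}_G)$, which the source does not spell out.

```python
import numpy as np

def pattern_coefficients(A):
    """Least-squares fit A ~ c1 * U + cG * G for an (L, L) attention matrix.

    U is uniform pooling (every entry 1/L) and G is the shift-by-one copy
    pattern (ones on the first sub-diagonal). Both reference patterns and
    their normalization are assumptions of this sketch.
    """
    L = A.shape[0]
    U = np.full((L, L), 1.0 / L)
    G = np.eye(L, k=-1)
    basis = np.stack([U.ravel(), G.ravel()], axis=1)   # shape (L*L, 2)
    coeffs, *_ = np.linalg.lstsq(basis, A.ravel(), rcond=None)
    return float(coeffs[0]), float(coeffs[1])

L = 8
U = np.full((L, L), 1.0 / L)
G = np.eye(L, k=-1)
c1_mix, cG_mix = pattern_coefficients(0.3 * U + 0.7 * G)
```

With this normalization a purely uniform matrix gives coefficients close to (1, 0), a pure shift matrix gives (0, 1), and the mixture above recovers its own weights.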

Scaling analyses are derived in a regime with large model dimension and strong regularization, leveraging equivalent kernel approximations.

Emergence of Structured Attention and Transition Order

Linear Attention

For linear attention, the transition in learned attention patterns occurs via a second-order phase transition. As the number of training sequences $P$ increases:

Pooling Phase: Initially, the model does not utilize context; all attention weights are uniform.
Second-Order Transition: At a critical $P^{*}_{\hat{1}} \sim V$, uniform pooling emerges: the order parameter $\hat{c}_1$ grows continuously from zero, enabling exploitation of unigram statistics.
Crossover to Copy: At a second, larger scale $P^{*}_{\hat{G}} \sim L^{3/2}V^{1/2}$, the order parameter $\hat{c}_G$ governing the copy pattern begins to increase continuously, overtaking $\hat{c}_1$. This stage is a crossover, not a genuine second transition.

This structure is analytically traced to an approximate symmetry in the action, which is broken at the pooling transition and results in a smooth reorganization to the copy regime, manifesting in progressive sharpening of the attention pattern and a gradual loss decrease.

Softmax Attention

In stark contrast, softmax attention exhibits a single, first-order phase transition at a critical sample size $P^{*}_{\hat{G}} \sim L \log L$:

The action landscape supports two minima (uniform and copy). With increasing $P$, the global minimum jumps discontinuously from uniform to copy, with the copy order parameter $\hat{c}_G$ exhibiting an abrupt increase.
This yields a sharp, non-analytic drop in loss and no precursor indicative of the coming transition.
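The difference between the two transition orders can be illustrated with textbook Landau-type toy actions. The polynomials below are generic examples, not the action derived in the paper, and the control parameter `a` is a hypothetical stand-in for a quantity that decreases as $P$ grows.

```python
import numpy as np

c_grid = np.linspace(0.0, 1.5, 3001)      # candidate order-parameter values

def global_minimizer(action, a):
    """Return the grid value of c that minimizes action(c, a)."""
    return c_grid[np.argmin(action(c_grid, a))]

def second_order(c, a):
    # One minimum that moves away from c = 0 continuously once a < 0.
    return a * c**2 + c**4

def first_order(c, a):
    # Two competing minima; the global one switches at a = 1/4.
    return a * c**2 - c**4 + c**6

a_values = np.linspace(0.5, -0.5, 201)
c_second = np.array([global_minimizer(second_order, a) for a in a_values])
c_first = np.array([global_minimizer(first_order, a) for a in a_values])

# Largest change of the minimizer between neighbouring values of a.
step_second = np.abs(np.diff(c_second)).max()
step_first = np.abs(np.diff(c_first)).max()
print("second-order largest step:", step_second)
print("first-order largest step:", step_first)
```

In the second toy action the minimizer jumps from 0 to roughly 0.7 in a single step of `a`, whereas in the first it leaves zero in small increments that shrink further as the grid in `a` is refined.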

The closed-form posterior directly quantifies the critical points and the sample-complexity scaling in both cases.
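As a rough way to compare the quoted scales, the sketch below evaluates them for given $L$ and $V$. The `~` relations hold only up to constants, so setting every prefactor to 1 and using the natural logarithm are assumptions of this sketch; the numbers indicate relative ordering, not predicted thresholds.

```python
import math

def critical_sample_sizes(L, V):
    """Evaluate the quoted scaling forms with all prefactors set to 1."""
    return {
        "linear_pooling": float(V),               # P* ~ V
        "linear_copy": L**1.5 * V**0.5,           # P* ~ L^{3/2} V^{1/2}
        "softmax_copy": L * math.log(L),          # P* ~ L log L
    }

estimates = critical_sample_sizes(L=64, V=16)
for name, value in estimates.items():
    print(f"{name}: {value:.1f}")
```

For these example values the linear-attention copy scale is much larger than the softmax one, which matches the ordering implied by the scaling forms when $V$ is not much smaller than $L$.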

Quantitative Results and Empirical Validation

Theoretical predictions align quantitatively with empirical training trajectories of networks optimized via both Adam and SGLD, across varying $P$, sequence lengths $L$, and vocabulary sizes $V$.
In linear attention, the order parameters $(\hat{c}_1, \hat{c}_G)$ follow the predicted transitions and the loss curves show two plateaus separated by them, with strong agreement between theory and experiment.
For softmax attention, the observed abrupt loss drop precisely matches the predicted first-order boundary, independent of optimizer choice.
The transition types map cleanly to observable metrics: for linear attention, fluctuations in order parameters precede the transition (enabling early warning), while for softmax, no such precursor exists.
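One simple way to look for such a precursor is to track the variance of a measured order parameter over a sliding window. The sketch below uses a synthetic series and a hypothetical helper name; how the order parameter is sampled (for example across SGLD draws or training seeds) is left open, since the source does not specify it.

```python
import numpy as np

def rolling_variance(series, window):
    """Variance of `series` over each consecutive window of length `window`."""
    x = np.asarray(series, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(x, window)
    return windows.var(axis=1)

# Synthetic example: a quiet stretch followed by strong fluctuations.
quiet = np.zeros(50)
noisy = np.tile([1.0, -1.0], 25)
variance = rolling_variance(np.concatenate([quiet, noisy]), window=10)
```

A sustained rise in this statistic would be the kind of signal described for the linear-attention case; for a first-order transition the same statistic would be expected to stay flat until the jump.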

Implications for ICL, Predictability, and Monitoring

Predictability of Emergence: The transition order impacts the ability to anticipate emergent capabilities. In softmax attention (first-order), structured behavior emerges with no continuous progress indicators, foreshadowing intrinsic unpredictability in the onset of attention-mediated ICL capacities. For linear attention (second-order/crossover), precursor signals—elevated fluctuations—enable (partial) forecasting prior to full capability expression.

Phase Boundaries and Scaling Laws: The derived expressions for critical sample complexity provide theoretical scaling laws predicting when copy circuits will emerge as a function of data size, sequence length, and vocabulary size, supporting the construction of monitoring regimes for capability acquisition.

Mechanistic Clarity: The Bayesian perspective distinguishes the mechanistic origin of the transition type as rooted in the attention nonlinearity (softmax vs. linear), resolving empirical ambiguities about the abruptness of induction head formation and clarifying architectural contributions to emergent phenomena.

Monitoring and Safety: The results expose fundamental limits: for architectures with first-order transitions, reliable monitoring of impending capability shifts from observable (loss or parameter) trajectories may be impossible in principle. Second-order and crossover transitions may still permit robust progress tracking, emphasizing the need for mechanistic theory in safety-aligned forecasting.

Limitations and Future Directions

The analysis focuses on single-head, single-layer attention and a supervised copy task to enable tractability; extension to multi-layer, multi-head, and causal-masked settings remains open.
The framework relies on i.i.d./uniform inputs and on the large-model-dimension, equivalent-kernel regime; investigating structured or natural data, finite-size effects, and alternative parameterizations is a prospective avenue.
The future integration of this theory with scaling law approaches and application to more complex ICL behaviors—beyond bigram tasks and copy heads—could provide comprehensive scaling predictivity for emergent circuits in LLMs.

Conclusion

This study formulates a general Bayesian theory elucidating phase transitions in attention networks and explicates the abrupt emergence of copy heads as a function of data availability. The work categorically demonstrates that the attention activation function determines the transition order—continuous (second-order/crossover) for linear attention versus discontinuous (first-order) for softmax. These findings yield fine-grained predictions for when and how ICL circuits emerge and supply rigorous foundations for tracking and interpreting capability transitions in Transformer models. The implications are significant for scaling-law design, mechanistic interpretability, and the theoretical limits of monitoring emergent behaviors in large-scale deep learning systems.
