
Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence

Published 10 Jun 2026 in stat.ML, cond-mat.dis-nn, and cs.LG | (2606.12058v1)

Abstract: Attention is the key mechanism underlying in-context learning in transformers, and attention patterns have been observed empirically to emerge abruptly during training. We present a Bayesian theory of feature learning in attention; we then focus on how the copy subcircuit in the first layer of an induction head is learned by analyzing a single-layer softmax attention network trained on a copy task. We derive a closed-form posterior over the attention matrix and reduce it to a low-dimensional order parameter space. This reduction reveals a phase transition in the amount of training data, which we verify using both Bayesian sampling and standard training with Adam. We contrast our results with linear attention and find that softmax attention exhibits a first-order phase transition while in linear attention an initial second-order phase transition is followed by a smooth, continuous evolution toward the structured attention pattern (crossover). Our work provides a first-principles theoretical account of the abrupt emergence of the copy subcircuit, reminiscent of the one observed in training LLMs.

Summary

  • The paper introduces a Bayesian framework that analytically characterizes phase transitions in attention and the emergence of copy heads.
  • It employs closed-form analysis of the attention posterior to reveal distinct phase transitions: a continuous shift in linear attention versus an abrupt jump in softmax attention.
  • Empirical validations align theoretical predictions with loss curves and order parameters, offering insights into in-context learning circuit development.

Phase Transitions in Attention: Bayesian Theory and Copy Head Emergence

Overview

"Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence" (2606.12058) develops a principled analytical framework to characterize how structured attention patterns, notably copy heads as the first subcircuit of induction heads, emerge in Transformer architectures. The approach utilizes a Bayesian feature learning perspective, enabling closed-form analysis of the attention matrix posterior and explicit identification of low-dimensional order parameters. These order parameters reveal phase transitions—abrupt qualitative shifts in learned attention mechanisms—whose nature (first- or second-order) is shown to depend crucially on the choice of attention activation (softmax vs. linear). The results provide mechanistic clarity on the emergence of in-context learning (ICL) circuits and offer predictions for loss landscapes and empirical transition points.

Analytical Framework and Model

The work isolates the copy mechanism—essential for induction heads and bigram ICL—in a tractable synthetic setting: a shift-by-one copy task with i.i.d. one-hot sequence inputs and their shifted copies as targets. It employs a single-layer, single-head attention model parameterized such that weights and activations admit large-deviation and Gaussian process analysis.
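The setting described above can be made concrete with a minimal sketch. It is not the paper's parameterization: the function names (`make_copy_task`, `attend`, `shift_scores`), the shift direction (position i copies position i-1), the `None` target at position 0, and the use of a purely positional score matrix are all assumptions made for illustration.

```python
import math
import random


def make_copy_task(num_seqs, seq_len, vocab, seed=0):
    """Sample i.i.d. token sequences with shift-by-one targets.

    Assumption: position 0 has no predecessor, so its target is None.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(num_seqs):
        tokens = [rng.randrange(vocab) for _ in range(seq_len)]
        targets = [None] + tokens[:-1]
        data.append((tokens, targets))
    return data


def one_hot(token, vocab):
    return [1.0 if v == token else 0.0 for v in range(vocab)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def attend(scores, tokens, vocab, kind="softmax"):
    """Mix one-hot inputs using an L x L positional score matrix.

    kind="softmax": each row of scores is normalized with softmax.
    kind="linear":  scores are used directly as mixing weights.
    """
    inputs = [one_hot(t, vocab) for t in tokens]
    outputs = []
    for row in scores:
        weights = softmax(row) if kind == "softmax" else row
        out = [sum(w * x[v] for w, x in zip(weights, inputs))
               for v in range(vocab)]
        outputs.append(out)
    return outputs


def shift_scores(seq_len, strength):
    """Hypothetical copy pattern: query position i favors key position i-1."""
    return [[strength if j == i - 1 else 0.0 for j in range(seq_len)]
            for i in range(seq_len)]
```

With `kind="linear"` and `strength=1.0`, every output row after the first reproduces the one-hot vector of the previous token exactly; with `kind="softmax"`, the same happens only approximately and only as `strength` grows, since each row must sum to one.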

The Bayesian formulation proceeds by:

  • Marginalizing over model parameters to directly obtain the posterior over attention matrices.
  • Reducing this high-dimensional distribution to a two-dimensional order parameter space $(\hat{c}_1, \hat{c}_G)$ for linear attention (uniform pooling and copy modes), or to a single parameter once the softmax constraint is imposed.
  • Analyzing the negative log-posterior (action) landscape to identify phase transitions as a function of training dataset size $P$.

Scaling analyses are derived in a regime with large model dimension and strong regularization, leveraging equivalent kernel approximations.
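To illustrate what a reduction to two modes can look like in practice, the sketch below fits an attention matrix as a combination of a uniform pooling pattern and a shift-by-one copy pattern by least squares. This is an illustrative diagnostic only: the paper's order parameters $(\hat{c}_1, \hat{c}_G)$ may be defined differently, and the helper names and the choice of patterns are assumptions.

```python
def inner(a, b):
    """Frobenius inner product of two matrices given as lists of rows."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def pooling_pattern(L):
    """Uniform pooling: every position attends equally to all positions."""
    return [[1.0 / L] * L for _ in range(L)]


def copy_pattern(L):
    """Copy mode (assumed shift direction): position i attends to i-1."""
    return [[1.0 if j == i - 1 else 0.0 for j in range(L)] for i in range(L)]


def fit_two_modes(attn):
    """Least-squares coefficients (c1, cG) with attn ~ c1 * U + cG * G.

    Solves the 2x2 normal equations; requires sequence length L >= 2.
    """
    L = len(attn)
    U, G = pooling_pattern(L), copy_pattern(L)
    uu, ug, gg = inner(U, U), inner(U, G), inner(G, G)
    au, ag = inner(attn, U), inner(attn, G)
    det = uu * gg - ug * ug
    c1 = (au * gg - ag * ug) / det
    cG = (ag * uu - au * ug) / det
    return c1, cG
```

The two patterns are not orthogonal, which is why the sketch solves the joint normal equations rather than projecting onto each pattern separately.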

Emergence of Structured Attention and Transition Order

Linear Attention

For linear attention, structured attention emerges through a second-order phase transition followed by a crossover. As the number of training sequences $P$ increases:

  1. Context-Free Phase: Initially, the model does not use context, and the pooling order parameter $\hat{c}_1$ is zero.
  2. Second-Order Transition: At a critical $P^{*}_{\hat{1}} \sim V$, uniform pooling emerges: the order parameter $\hat{c}_1$ increases continuously from zero, enabling exploitation of unigram statistics.
  3. Crossover to Copy: At a second (larger) scale $P^{*}_{\hat{G}} \sim L^{3/2}V^{1/2}$, the order parameter $\hat{c}_G$ governing the copy pattern begins to increase continuously, overtaking $\hat{c}_1$. This stage is a crossover, not a genuine second transition.

This structure is analytically traced to an approximate symmetry in the action, which is broken at the pooling transition and results in a smooth reorganization to the copy regime, manifesting in progressive sharpening of the attention pattern and a gradual loss decrease.

Softmax Attention

In stark contrast, softmax attention exhibits a single, first-order phase transition at a critical sample size $P^{*}_{\hat{G}} \sim L \log L$:

  • The action landscape supports two minima (uniform and copy). With increasing $P$, the global minimum jumps discontinuously from uniform to copy, with the copy order parameter $\hat{c}_G$ exhibiting an abrupt increase.
  • This yields a sharp, non-analytic drop in loss, with no precursor signaling the coming transition.

The closed-form posterior directly quantifies the critical points and sample complexity scaling in both cases.
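The difference between the two transition orders can be illustrated with generic Landau-type toy actions. These are textbook forms chosen for illustration, not the action derived in the paper; the control variable `t` is a hypothetical stand-in for a rescaled distance from the critical dataset size, and all function names are assumptions.

```python
def argmin_on_grid(f, lo=0.0, hi=1.5, n=1501):
    """Return the grid point in [lo, hi] that minimizes f."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(grid, key=f)


def second_order_action(c, t):
    """Single well that flattens at t = 0 and then moves away from c = 0."""
    return -t * c**2 + c**4


def first_order_action(c, t):
    """Two wells (near c = 0 and c = 1) whose relative depth flips at t = 0."""
    return c**2 * (1.0 - c)**2 - t * c**2


def order_parameter(action, t):
    """Global minimizer of the toy action at control value t."""
    return argmin_on_grid(lambda c: action(c, t))


for t in (-0.10, -0.01, 0.01, 0.10):
    print(t,
          order_parameter(second_order_action, t),
          order_parameter(first_order_action, t))
```

In the first toy the minimizer stays at zero for negative `t` and then grows smoothly (as the square root of `t / 2`), while in the second it stays at zero until `t` changes sign and then jumps to a value near one.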

Quantitative Results and Empirical Validation

  • Theoretical predictions align quantitatively with empirical training trajectories of networks optimized via both Adam and SGLD, across varying dataset sizes $P$, sequence lengths $L$, and vocabulary sizes $V$.
  • In linear attention, the order parameters $(\hat{c}_1, \hat{c}_G)$ and loss curves display two loss plateaus separated by the predicted transitions, with strong agreement between theory and experiments.
  • For softmax attention, the observed abrupt loss drop precisely matches the predicted first-order boundary, independent of optimizer choice.
  • The transition types map cleanly to observable metrics: for linear attention, fluctuations in order parameters precede the transition (enabling early warning), while for softmax, no such precursor exists.

Implications for ICL, Predictability, and Monitoring

Predictability of Emergence: The transition order impacts the ability to anticipate emergent capabilities. In softmax attention (first-order), structured behavior emerges with no continuous progress indicators, foreshadowing intrinsic unpredictability in the onset of attention-mediated ICL capacities. For linear attention (second-order/crossover), precursor signals—elevated fluctuations—enable (partial) forecasting prior to full capability expression.
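The presence or absence of a fluctuation precursor can be sketched with the same kind of generic toy actions. In a Gaussian approximation, the variance of the order parameter around a minimum scales as the inverse curvature of the action there. The actions, the control variable `t`, the omitted temperature-like prefactor, and the function names below are illustrative assumptions, not quantities from the paper.

```python
def second_order_action(c, t):
    """Toy action with a continuous transition at t = 0."""
    return -t * c**2 + c**4


def first_order_action(c, t):
    """Toy action with a discontinuous transition at t = 0."""
    return c**2 * (1.0 - c)**2 - t * c**2


def curvature(f, c, h=1e-4):
    """Central finite-difference estimate of the second derivative of f at c."""
    return (f(c + h) - 2.0 * f(c) + f(c - h)) / (h * h)


def fluctuation_scale(action, t, c_min=0.0):
    """Gaussian variance estimate at the minimum c_min: 1 / curvature.

    Assumes c_min is a local minimum of the action at this t
    (true for c_min = 0 when t < 0 in both toy actions).
    """
    return 1.0 / curvature(lambda c: action(c, t), c_min)


for t in (-0.5, -0.1, -0.01):
    print(t,
          fluctuation_scale(second_order_action, t),
          fluctuation_scale(first_order_action, t))
```

As `t` approaches zero from below, the estimate for the second-order toy grows without bound, while for the first-order toy it stays nearly constant, which is the sense in which a first-order transition offers no fluctuation-based early warning.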

Phase Boundaries and Scaling Laws: The derived expressions for critical sample complexity provide theoretical scaling laws predicting when copy circuits will emerge as a function of data size, sequence length, and vocabulary size, supporting the construction of monitoring regimes for capability acquisition.
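As a rough aid for reading these scaling laws, the sketch below evaluates the three scales quoted in this summary for a given sequence length and vocabulary size. The scalings fix only the functional form: the `prefactor` argument is an unknown constant set to 1 here purely as an assumption, and the function name and dictionary keys are hypothetical.

```python
import math


def critical_sizes(seq_len, vocab, prefactor=1.0):
    """Order-of-magnitude critical dataset sizes from the quoted scalings.

    linear pooling transition:  P ~ V
    linear copy crossover:      P ~ L^(3/2) * V^(1/2)
    softmax copy transition:    P ~ L * log(L)
    The true prefactors are not given by the scalings (assumed 1 here).
    """
    return {
        "linear_pooling": prefactor * vocab,
        "linear_copy_crossover": prefactor * seq_len**1.5 * vocab**0.5,
        "softmax_copy": prefactor * seq_len * math.log(seq_len),
    }


if __name__ == "__main__":
    for name, value in critical_sizes(seq_len=16, vocab=64).items():
        print(f"{name}: {value:.1f}")
```

Because the prefactors are unknown, the outputs are meaningful only as relative trends, for example how each scale changes when the sequence length is doubled.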

Mechanistic Clarity: The Bayesian perspective distinguishes the mechanistic origin of the transition type as rooted in the attention nonlinearity (softmax vs. linear), resolving empirical ambiguities about the abruptness of induction head formation and clarifying architectural contributions to emergent phenomena.

Monitoring and Safety: The results point to fundamental limits: for architectures with first-order transitions, observable loss and parameter trajectories may carry no reliable advance signal of an impending capability shift. Second-order/crossover transitions may still permit progress tracking, emphasizing the need for mechanistic theory in safety-oriented forecasting.

Limitations and Future Directions

  • The analysis focuses on single-head, single-layer attention and a supervised copy task to enable tractability; extension to multi-layer, multi-head, and causal-masked settings remains open.
  • The framework relies on i.i.d./uniform inputs and on equivalent-kernel approximations in the large-dimension, strongly regularized regime; structured or natural data, finite-size effects, and alternative parameterizations remain to be investigated.
  • The future integration of this theory with scaling law approaches and application to more complex ICL behaviors—beyond bigram tasks and copy heads—could provide comprehensive scaling predictivity for emergent circuits in LLMs.

Conclusion

This study formulates a Bayesian theory of phase transitions in attention networks and explains the abrupt emergence of copy heads as a function of data availability. In the single-layer copy-task setting analyzed, the attention activation function determines the transition order: continuous (second-order/crossover) for linear attention versus discontinuous (first-order) for softmax. These findings yield predictions for when and how ICL circuits emerge and supply a theoretical basis for tracking and interpreting capability transitions in Transformer models. The implications bear on scaling-law design, mechanistic interpretability, and the theoretical limits of monitoring emergent behaviors in large-scale deep learning systems.
