Transition Phenomena in Transformers

Updated 1 May 2026

Transitions in transformers are qualitative reorganizations of dynamics, geometry, and computation caused by changes in training, inference, or initialization.
Key insights reveal transitions analogous to physical phase changes, including order–chaos boundaries, token clustering, and shifts in attention structure that drive model behavior.
Practical implications suggest that tuning initialization and architecture parameters can harness these transitions to optimize trainability and generalization.

A transition in transformers is a qualitative reorganization of the dynamics, geometry, or internal computations of a transformer model, occurring during training, during inference, or as a function of architecture or initialization. These transitions manifest as abrupt changes in loss curves, activation geometry, attention structure, in-context algorithm behaviors, and trainability, often aligning with core notions of phase transition from statistical mechanics. They have been observed at multiple scales: in training dynamics (e.g., grokking), geometric propagation (the order–chaos boundary), in-context learning regimes, and token clustering. Recent studies have mapped these transitions to the emergence of linguistic, logical, or computational structures within the network, providing rigorous, often spectral or order-parameter-based signatures of critical boundaries between distinct phases.

1. Geometric and Dynamical Phase Transitions in Deep Transformers

The geometry of signal propagation in deep, randomly initialized transformers exhibits a sharp order–chaos phase transition as a function of the initialization hyperparameters: the strengths of the attention and MLP residual branches and the weight variances. This can be quantitatively described by tracking the evolution of the Gram matrix of token representations as a permutation-symmetric dynamical system. In the pure self-attention limit (no MLPs), networks collapse to a rank-1 manifold, meaning all tokens align and exhibit minimal variability. Injecting nonlinear MLP residuals introduces a competition: in the ordered regime, token embeddings collapse towards a line; in the chaotic regime they separate into the vertices of a regular $n$-simplex, corresponding to maximal mutual separation among tokens. The phase boundary is captured by two Lyapunov exponents: the angle exponent governs the stability of directions in representation space, and the gradient exponent governs the stability of backpropagated gradients. Minimal test loss is achieved where both exponents vanish simultaneously, which identifies a codimension-2 critical surface in hyperparameter space; the two conditions are jointly necessary and sufficient for trainability in deep stacks (Cowsik et al., 2024).
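The rank-1 collapse in the attention-only limit can be illustrated with a small numerical experiment. The following is a minimal sketch, not the setup of Cowsik et al.: it assumes a single head, fresh Gaussian weights of variance $1/d$ in every layer, no skip connection, and a per-token rescaling after each layer. The function names and default sizes are hypothetical. It tracks the mean off-diagonal entry of the normalized Gram matrix (the mean pairwise cosine similarity between tokens) as depth increases.

```python
import numpy as np


def mean_pairwise_cosine(x):
    """Mean off-diagonal entry of the cosine-normalized Gram matrix of the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    gram = unit @ unit.T
    n = x.shape[0]
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))


def softmax_rows(z):
    """Numerically stable softmax applied to each row."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def attention_only_stack(n_tokens=8, d=32, depth=50, seed=0):
    """Propagate random tokens through an attention-only stack (assumed toy model).

    Returns the mean pairwise cosine similarity at the input and after each layer.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, d))
    history = [mean_pairwise_cosine(x)]
    for _ in range(depth):
        # Fresh random query, key, and value matrices with variance 1/d.
        wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        logits = (x @ wq) @ (x @ wk).T / np.sqrt(d)
        x = softmax_rows(logits) @ x @ wv
        # Rescale every token to norm sqrt(d), a stand-in for layer normalization.
        x = x * np.sqrt(d) / np.linalg.norm(x, axis=1, keepdims=True)
        history.append(mean_pairwise_cosine(x))
    return history


if __name__ == "__main__":
    h = attention_only_stack()
    print(f"input: {h[0]:.3f}, final layer: {h[-1]:.6f}")
```

Under these assumptions the similarity should start near 0 for random tokens and approach 1 with depth, which is the ordered (collapsed) phase. Reproducing the chaotic phase would require adding an MLP residual branch, which this sketch omits.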

2. Stochastic Initialization and Token Clustering Transitions

Transformers commonly employ random initialization of key, query, and especially value matrices. When these stochastic effects are accounted for, a fundamentally new class of phase transition emerges in the token geometry. In the infinite-depth and diffusion-scaling limit under layerwise RMS normalization, token trajectories converge to a system of interacting particle stochastic differential equations (SDEs) on the unit sphere, driven by a shared Brownian matrix noise. Here, deterministic regimes predict total cluster collapse (rank-1 geometry), but intrinsic noise permits bifurcations: for two tokens, a sharp phase boundary at $\beta_c(d) = \frac{1}{2}\cosh^{-1}(d-2)$ separates a "collapsed" (single cluster) phase from one in which antipodal configurations can arise and become attracting. This antipodal bifurcation extends to $N>2$ tokens, inducing persistent, stable multicentric clusters. Empirical evidence confirms that suppressing initialization noise abolishes these new phases and significantly degrades downstream accuracy; intrinsic noise is thus both mathematically indispensable and beneficial for trainability and expressivity (Fedorov et al., 29 Jan 2026).
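To get a feel for how the two-token threshold scales with embedding dimension, the formula quoted above can be evaluated directly. This is only a numerical evaluation of $\beta_c(d) = \frac{1}{2}\cosh^{-1}(d-2)$ as stated in the text; the function name is hypothetical, and the restriction $d \ge 3$ is simply the domain on which the inverse hyperbolic cosine is real.

```python
import math


def beta_critical(d: int) -> float:
    """Evaluate beta_c(d) = 0.5 * arccosh(d - 2), as quoted in the text."""
    if d < 3:
        raise ValueError("arccosh(d - 2) is real only for d >= 3")
    return 0.5 * math.acosh(d - 2)


if __name__ == "__main__":
    for d in (3, 4, 16, 64, 512):
        print(d, round(beta_critical(d), 4))
```

Because $\cosh^{-1}(x)$ grows like $\ln(2x)$ for large $x$, the threshold increases only logarithmically with $d$.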

3. Transitions in Training Dynamics and In-Context Computation

Transformers can undergo abrupt computation-regime transitions during training, observable as sudden drops in loss (abrupt learning), shifts from trivial heuristics (e.g., copying) to nontrivial inference (e.g., matrix completion or in-context regression), or discrete shifts in complexity of the learned internal algorithm.

In matrix completion tasks, training proceeds through a copying phase—where the network trivially replicates visible entries—followed by a rapid, coordinated transition to a true completion phase. This shift is triggered only when the correct configuration of positional embeddings, attention patterns, and MLP subcircuits percolates through the network, in analogy to percolation or first-order phase transitions in statistical mechanics (Gopalani et al., 2024).
For in-context linear regression, "transient ridge" transitions appear: transformers first implement a Gaussian-prior-like ridge regression with strong out-of-distribution generalization, then, at a critical training step $t_{\rm crit}$, specialize to a discrete minimum-MSE solution memorizing the training tasks, as predicted by Bayesian internal model selection and the local learning coefficient (Carroll et al., 29 Jan 2025).
Learning induction heads (mechanisms for implementing in-context copying and retrieval) also exhibits a sharp transition: networks stall on lazy $n$-gram solutions before abrupt onset of rich induction at critical scales of parameter or head magnitude, with plateaued loss followed by rapid collapse (Wang et al., 2024). Layerwise or multi-stage gradient flow analysis exposes explicit thresholds that govern this transition.
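The two in-context regression solutions contrasted above can be written down as reference estimators. The following is a hedged sketch using standard textbook forms, not code from Carroll et al.: the ridge estimator corresponds to a Gaussian prior over tasks, and the discrete minimum-MSE estimator is the posterior mean under a uniform prior over a finite set of training tasks with Gaussian label noise. The function names, the noise variance `sigma2`, and the uniform prior are assumptions made for illustration.

```python
import numpy as np


def ridge_estimate(x, y, lam):
    """Ridge regression weights: (X^T X + lam * I)^{-1} X^T y."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)


def dmmse_estimate(x, y, tasks, sigma2):
    """Posterior-mean weights under a uniform prior over a finite task set.

    tasks has shape (K, d); each row is one memorized training task vector.
    sigma2 is the assumed Gaussian label-noise variance.
    """
    residuals = y[None, :] - tasks @ x.T            # shape (K, n)
    log_w = -0.5 * (residuals ** 2).sum(axis=1) / sigma2
    log_w -= log_w.max()                            # stabilize the exponentials
    w = np.exp(log_w)
    w /= w.sum()
    return w @ tasks                                # convex combination of tasks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tasks = rng.standard_normal((4, 3))
    x = rng.standard_normal((20, 3))
    y = x @ tasks[2]                                # context from a training task
    print(ridge_estimate(x, y, 1e-3))
    print(dmmse_estimate(x, y, tasks, 0.01))
```

On a context generated by a task outside the finite set, the ridge estimate still tracks the new task while the discrete estimator can only return a mixture of memorized tasks, which is the generalization gap the transition describes.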

These regime shifts are nonlocal in parameter space and are typically not aligned with gradual changes in standard loss metrics, instead correlating with higher-order geometric or algorithmic diagnostics.

4. Spectral and Topological Phase Transitions in Representation Space

Recent analyses of the representation manifolds of large transformers report that, as depth or scale increases, these networks exhibit statistical-physics-style topological phase transitions in their representation geometry, coinciding with the onset of multi-step reasoning and object permanence.

Tracking the covariance spectrum of activations across layers, one observes a sharp reduction in effective dimensionality, with the emergence of spectral spikes departing from the random-matrix Marchenko–Pastur bulk, a sudden drop in the participation ratio, and onset of spectral tail collapse at a critical normalized depth $\gamma_c \approx 0.42$.
The transition is captured quantitatively by an order parameter $\Omega(h) = 1 - \|h\|_1/(\sqrt{d}\|h\|_2)$, which measures departure from a homogeneous superposition ("liquid") to a sparse, localized regime ("solid object"). Both the mean $\mathbb{E}[\Omega]$ and variance $\operatorname{Var}[\Omega]$ display a discontinuity at the phase transition.
This transition corresponds to the formation of transient class objects (TCOs): low-dimensional, robustly invariant submanifolds associated with logical/semantic class separability, identified via fixed points in a renormalization group-like coarse-graining of the layerwise distribution. In this low-entropy regime, the transformer’s representations support object permanence and multi-step, logically structured reasoning (Alpay et al., 16 Jan 2026).
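Both diagnostics named above are simple to compute from a hidden vector or a covariance spectrum. The sketch below implements $\Omega(h)$ exactly as written in the text. For the participation ratio it assumes the common definition $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, which the text does not spell out. The function names are hypothetical.

```python
import numpy as np


def localization_order_parameter(h):
    """Omega(h) = 1 - ||h||_1 / (sqrt(d) * ||h||_2), as defined in the text."""
    h = np.asarray(h, dtype=float)
    d = h.size
    return 1.0 - np.abs(h).sum() / (np.sqrt(d) * np.linalg.norm(h))


def participation_ratio(eigenvalues):
    """Effective dimensionality (sum lam)^2 / sum lam^2 (assumed definition)."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam.sum() ** 2 / (lam ** 2).sum()


if __name__ == "__main__":
    d = 16
    dense = np.ones(d)                 # homogeneous superposition
    sparse = np.zeros(d)
    sparse[0] = 1.0                    # fully localized activation
    print(localization_order_parameter(dense))   # equals 0 up to rounding
    print(localization_order_parameter(sparse))  # equals 1 - 1/sqrt(d)
    print(participation_ratio(np.ones(d)))       # isotropic spectrum gives d
```

By the Cauchy–Schwarz inequality $\|h\|_1 \le \sqrt{d}\,\|h\|_2$, so $\Omega$ lies in $[0, 1 - 1/\sqrt{d}]$, with the lower end attained by equal-magnitude entries and the upper end by one-hot vectors.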

5. Criticality in Grokking and Cascade Geometry

Critical transitions are also observed during the "grokking" phenomenon: late-phase, sudden generalization in overparameterized models after an extended period of perfect memorization.

Using thresholded diffusion update/Olami–Feder–Christensen (TDU–OFC) avalanche probes, the effective cascade dimension $D(t)$ of gradient updates is tracked. A sharp, temporally localized crossing of $D(t)$ through its value at the Gaussian diffusion fixed point marks the precise onset of grokking in transformer models, with the memorization and generalization regimes lying on opposite sides of that value (Wang, 6 Apr 2026).
Avalanches exhibit heavy tails in the pre-transition phase, with suppression of these tails post-transition, and finite-size scaling collapse confirms the macroscopic nature of the transition.
Ungrokked runs and shadow-probe controls demonstrate the specificity and non-invasiveness of $D(t)$ as a robust, global order parameter, providing an early-warning signature for dynamical regime change.

6. Transitions in Attention Structure, Learnability, and Emergent Coherence

Transitions in transformers are closely paralleled by abrupt changes in attention structure and learnability.

In unsupervised tasks, the sharp jump in the ability to learn structured data (quantified by the final self-supervised loss) and the accompanying sudden rise in attention entropy both accurately recover phase boundaries such as the critical temperature in the 2D Ising model. Structured attention blocks and low-entropy patterns emerge in the ordered phase, while delocalized, high-entropy attention characterizes the disordered regime (Özönder, 8 Oct 2025).
In LLMs, careful probes of vocabulary statistics, such as the index of dispersion and Kullback–Leibler divergence from Poisson baselines, reveal abrupt, phase-transition-like reorganizations in the formation and diversity of correct and incorrect words. The transition is not detected by cross-entropy loss or standard validation metrics, emphasizing that higher-order statistics provide true order parameters for linguistic coherence (Hong et al., 16 Nov 2025).
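The diagnostics mentioned in this section are all elementary statistics. The following sketch shows one plausible way to compute them; it is not the procedure of the cited studies. It assumes attention rows are probability vectors, uses the population variance for the index of dispersion, and takes the Poisson baseline to have the same mean as the observed counts. All function names are hypothetical.

```python
import math

import numpy as np


def attention_entropy(attn):
    """Shannon entropy (in nats) of each attention row; 0 * log 0 is taken as 0."""
    attn = np.asarray(attn, dtype=float)
    safe = np.where(attn > 0, attn, 1.0)   # log(1) = 0 removes zero entries
    return -(attn * np.log(safe)).sum(axis=-1)


def index_of_dispersion(counts):
    """Variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    counts = np.asarray(counts, dtype=float)
    return counts.var() / counts.mean()


def kl_from_poisson(counts):
    """KL divergence from the empirical count distribution to a Poisson
    distribution with matching mean, summed over the observed count values."""
    counts = np.asarray(counts, dtype=int)
    lam = counts.mean()
    values, freq = np.unique(counts, return_counts=True)
    p = freq / freq.sum()
    # Poisson log-pmf: k * log(lam) - lam - log(k!)
    log_q = np.array(
        [k * math.log(lam) - lam - math.lgamma(k + 1) for k in values]
    )
    return float((p * (np.log(p) - log_q)).sum())


if __name__ == "__main__":
    print(attention_entropy(np.full((2, 4), 0.25)))  # uniform rows give log(4)
    print(index_of_dispersion([0, 4, 0, 4]))         # overdispersed counts
    print(kl_from_poisson([0, 4, 0, 4]))
```

Low row entropy indicates localized attention and high entropy indicates delocalized attention, while an index of dispersion far from 1 or a large divergence from the Poisson baseline flags count statistics that a mean-based loss would not reveal.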

These results indicate that the emergence of structure, from attention patterning to output regularity, tracks discrete phase boundary crossings that mirror physical transitions between disordered and ordered states.

7. Implications and Universality

Transition phenomena in transformer models reveal both mechanistic principles and practical consequences:

Proper stochastic initialization drives the network into dynamically favorable phases, preventing rank collapse and improving downstream accuracy (Fedorov et al., 29 Jan 2026).
The edge-of-chaos and critical-gradient intersections provide a theoretically grounded recipe for choosing initialization hyperparameters to maximize trainability (Cowsik et al., 2024).
Transitions are not artifacts of scale: evidence from small models, synthetic tasks, and controlled statistical probes demonstrates universality, plausibly extending to emergent behaviors in large-scale LLMs and hierarchical representations (Hong et al., 16 Nov 2025).
Spectral, geometric, and algorithmic diagnostics, rather than standard loss curves, are essential for identifying the true phase boundaries governing computation and generalization in transformers.

A plausible implication is that further advances in mechanistic interpretability, initialization protocol, and architecture are likely to arise from a unified treatment of these transition phenomena as genuine, order-parameter-driven phase transitions in deep neural computation.