Grokking Phase Transition in Neural Nets
- The grokking phase transition is a phenomenon in over-parameterized neural networks where perfect training accuracy precedes delayed, abrupt generalization.
- The analysis employs rigorous order parameters, such as the Representation Quality Index and critical exponents, to map transitions between memorization and structural comprehension.
- Investigations reveal that tuning regularization, data regimes, and model efficiency can induce or suppress this dynamic phase shift in learning behavior.
The grokking phase transition denotes a distinctive learning dynamic in over-parameterized neural networks (and related systems) in which perfect training accuracy is achieved well before any generalization emerges, followed, often after many thousands or millions of additional training steps, by an abrupt and dramatic rise in test accuracy. This phenomenon displays hallmarks of a dynamical phase transition and has been investigated at multiple theoretical and empirical levels in both neural and non-neural models.
1. Characterization of the Grokking Phase Transition
Grokking is defined as the delayed emergence of generalization long after training loss has reached zero. Empirically, networks trained on algorithmic tasks (e.g., modular arithmetic, parity, arithmetic on groups) achieve perfect or near-perfect training accuracy rapidly, while validation/test accuracy remains at chance. Only after a substantial “stall” does a sharp transition occur, resulting in perfect or near-perfect test accuracy (Liu et al., 2022, Žunkovič et al., 2022, Nanda et al., 2023, Qiye et al., 14 Dec 2024). The transition is frequently observed as a bifurcation between qualitatively distinct learning regimes: initial memorization followed by structural comprehension.
Several canonical phases are identified:
- Comprehension (rapid generalization and representation formation)
- Grokking (memorization precedes generalization, with delayed emergence of structure)
- Memorization (memorizing decoder with unstructured embeddings, no generalization)
- Confusion (failure to adequately fit even the training set)
These phases are frequently mapped out in phase diagrams over hyperparameters such as decoder learning rate and weight decay (Liu et al., 2022).
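As a concrete illustration, the following is a minimal sketch of such an experiment: modular addition learned by a small embedding-plus-MLP model trained with AdamW and strong weight decay. The hyperparameters are illustrative rather than taken from any cited study; with a suitable setting, training accuracy saturates quickly while test accuracy stays near chance for many steps before (possibly) jumping.

```python
# Minimal grokking-style experiment on modular addition (illustrative hyperparameters).
import torch
import torch.nn as nn

P = 97                      # modulus of the arithmetic task
TRAIN_FRAC = 0.4            # fraction of all (a, b) pairs used for training
torch.manual_seed(0)

# Full dataset: every pair (a, b) with label (a + b) mod P.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(TRAIN_FRAC * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

class ModAddNet(nn.Module):
    """Token embeddings for a and b, concatenated and fed to a small MLP decoder."""
    def __init__(self, p, d_embed=64, d_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(p, d_embed)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_embed, d_hidden), nn.ReLU(), nn.Linear(d_hidden, p)
        )
    def forward(self, ab):
        e = self.embed(ab)                       # (batch, 2, d_embed)
        return self.mlp(e.flatten(start_dim=1))  # (batch, p) logits

model = ModAddNet(P)
# Weight decay supplies the regularization pressure typically needed to see grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(20000):   # grokking typically needs many steps past zero train loss
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:6d}  train acc {accuracy(train_idx):.3f}  "
              f"test acc {accuracy(test_idx):.3f}")
```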
The transition is further characterized by a quantitative order parameter, such as the Representation Quality Index (RQI), which measures the alignment of learned representations with the true algorithmic structure of the task (e.g., parallelogram relations in embeddings corresponding to arithmetic laws). The phase transition is marked by a sharp jump in RQI—in synchrony with the jump in test accuracy (Liu et al., 2022, Žunkovič et al., 2022).
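A simple proxy in the spirit of the RQI can be computed directly from the embedding matrix. The sketch below counts the fraction of sampled parallelogram relations that hold to within a tolerance, reusing `model.embed` and `P` from the sketch above; the exact RQI definition in Liu et al. (2022) differs in its normalization and aggregation.

```python
# RQI-style proxy: fraction of quadruples (a, b, c, d) with a + b ≡ c + d (mod P)
# whose (normalized) embeddings satisfy E(a) + E(b) ≈ E(c) + E(d).
import torch

def parallelogram_score(embed_weight, p, n_samples=2000, rel_tol=0.1, seed=0):
    g = torch.Generator().manual_seed(seed)
    E = embed_weight / embed_weight.norm(dim=-1, keepdim=True)   # normalize embeddings
    a, b, c = (torch.randint(0, p, (n_samples,), generator=g) for _ in range(3))
    d = (a + b - c) % p                        # enforce a + b ≡ c + d (mod p)
    lhs, rhs = E[a] + E[b], E[c] + E[d]
    rel_err = (lhs - rhs).norm(dim=-1) / (lhs.norm(dim=-1) + 1e-8)
    return (rel_err < rel_tol).float().mean().item()

# e.g. parallelogram_score(model.embed.weight.detach(), P), tracked alongside test
# accuracy, should jump at the grokking transition if structured embeddings form.
```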
2. Theoretical Frameworks and Analytical Results
Grokking has been modeled rigorously as a phase transition, with a variety of interpretations and order parameters:
- Effective Theory Approach: An effective loss is defined as a function of the normalized embeddings; its minimization predicts the emergence of structured representations. Embedding quality is measured by the number of parallelogram relations satisfied within the embedding space.
- Critical Exponents: In analytically tractable perceptron and ball models, the test error close to the transition scales as a power law in the distance from the critical point, with critical exponents that depend on the dimension of the feature space and the geometry of the data distribution (Žunkovič et al., 2022).
- First-Order & Second-Order Transitions: The transition can manifest as either a first-order (discontinuous, with a free energy landscape supporting multiple minima) or a continuous (second-order) transition, depending on the model and the underlying task (Rubin et al., 2023, Žunkovič et al., 2022). In particular, for two-layer networks and modular arithmetic, there are sharp, first-order transitions in test performance as control parameters are varied.
In classification with logistic loss, the grokking phase transition is amplified in data that are on the verge of linear separability, with long overfitting plateaus near the interpolation threshold, directly analogous to critical slowing down in physical systems (Beck et al., 6 Oct 2024).
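Operationally, the critical-exponent analyses above amount to fitting a power law to the test error as a function of the distance from the transition point. The sketch below shows such a fit on synthetic data; the data, the control parameter, and the recovered exponent are placeholders, not results from the cited papers.

```python
# Estimating a critical exponent by fitting err ≈ C * (x - x_c)^beta for x > x_c,
# where x is a control parameter (e.g., training-set size). Synthetic data only.
import numpy as np

def fit_critical_exponent(x, test_err, x_c):
    """Least-squares fit of log(err) = beta * log(x - x_c) + log(C)."""
    mask = (x > x_c) & (test_err > 0)
    log_dx, log_err = np.log(x[mask] - x_c), np.log(test_err[mask])
    beta, log_c = np.polyfit(log_dx, log_err, deg=1)
    return beta, np.exp(log_c)

# Synthetic example: err = 0.5 * (x - 100)^(-0.5) with multiplicative noise.
rng = np.random.default_rng(0)
x = np.linspace(110, 400, 30)
err = 0.5 * (x - 100.0) ** -0.5 * np.exp(0.05 * rng.standard_normal(len(x)))
beta, c = fit_critical_exponent(x, err, x_c=100.0)
print(f"estimated exponent beta ≈ {beta:.2f}")   # should be close to -0.5
```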
3. Mechanistic and Structural Interpretability
Mechanistic analyses reveal that grokking corresponds to the discovery and amplification of structured computational mechanisms. For modular addition, reverse engineering shows that transformers implement a sparse “Fourier multiplication” algorithm, using DFT-based representations and trigonometric identities to mirror algorithmic composition (Nanda et al., 2023).
- Progress Measures such as “restricted loss” (performance when only key Fourier frequencies are kept) and “excluded loss” (performance when these frequencies are ablated) demonstrate that, while the test error transitions abruptly, the underlying circuit forms gradually until a critical “cleanup” phase in which memorization remnants are removed and generalization ensues (Nanda et al., 2023); a minimal Fourier-ablation sketch follows this list.
- Structural Transition: In models trained on parity or modular arithmetic, a dense subnetwork (many contributing neurons, poor generalization) is eventually supplanted by a sparse subnetwork (few dominant neurons, perfect generalization) via rapid norm growth in select neurons. This sparsification aligns with classic results on representational minimality for algorithmic tasks (Merrill et al., 2023).
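The following sketch illustrates the restricted/excluded-loss idea in simplified form: filter the logits' dependence on the two inputs in Fourier space, keeping only a chosen set of frequencies (restricted) or ablating them (excluded), then re-evaluate the loss. The frequency bookkeeping and normalization in Nanda et al. (2023) are more careful than this, and the key frequencies shown in the usage comment are purely illustrative.

```python
# Hedged sketch of "restricted" vs "excluded" loss via Fourier filtering of logits.
import torch
import torch.nn.functional as F

def fourier_filtered_loss(logits, labels, key_freqs, p, keep=True):
    """logits: (p, p, p) tensor indexed by (a, b, class); labels: (p, p) long tensor of (a+b) % p."""
    spec = torch.fft.fft2(logits, dim=(0, 1))        # FFT over the two input dimensions
    mask = torch.zeros(p, p, 1, dtype=torch.bool)
    for f in key_freqs:                              # keep key frequencies and their conjugates
        for g in (f, (-f) % p):
            mask[g, :, :] = True
            mask[:, g, :] = True
    mask[0, 0, :] = True                             # always keep the constant component
    cond = mask if keep else ~mask
    spec = torch.where(cond, spec, torch.zeros_like(spec))
    filtered = torch.fft.ifft2(spec, dim=(0, 1)).real
    return F.cross_entropy(filtered.reshape(-1, p), labels.reshape(-1)).item()

# restricted = fourier_filtered_loss(all_logits, all_labels, key_freqs=[14, 35, 41], p=97, keep=True)
# excluded   = fourier_filtered_loss(all_logits, all_labels, key_freqs=[14, 35, 41], p=97, keep=False)
```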
Mechanistic and information-theoretic tools (including higher-order mutual information and the O-information) expose the onset of grokking as a shift from redundancy-dominated (memorization) to synergy-driven (collective, generalizing) circuits (Clauw et al., 16 Aug 2024).
4. Role of Data Regimes, Regularization, and Model Efficiency
The onset and timescale of grokking are sensitive to several factors:
- Critical Data Size: Delayed generalization occurs only when the training data exceed a task- and model-specific critical size. Below this threshold, only memorization is possible; near it, grokking occurs; and far above it, memorization and generalization proceed concurrently (Zhu et al., 19 Jan 2024). The critical data size increases with model size.
- Weight Decay and Initialization: The classical “Goldilocks zone” is defined by an intermediate range of decoder capacity and regularization (e.g., weight decay, learning rate), allowing enough pressure for the model to discover structured solutions rather than falling into confusion or rote memorization. Tuning weight decay and initialization can induce, delay, or suppress grokking (Liu et al., 2022, Clauw et al., 16 Aug 2024).
- Circuit Efficiency: Grokking occurs when training implicitly transitions from an inefficient, high-norm memorization circuit to an efficient generalization circuit; the latter yields larger logits for a fixed norm and is favored by weight decay. For dataset sizes above the critical threshold, the memorizing circuit becomes less efficient and is outcompeted (Varma et al., 2023).
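A rough way to track this competition is to measure how much correct-class logit the network produces per unit of parameter norm. The sketch below is an illustrative diagnostic in this spirit, not the exact efficiency definition used by Varma et al. (2023).

```python
# Rough "circuit efficiency" probe: mean correct-class logit per unit of global L2 parameter norm.
import torch

def circuit_efficiency(model, inputs, labels):
    with torch.no_grad():
        logits = model(inputs)
        correct_logit = logits.gather(1, labels.unsqueeze(1)).mean().item()
        sq_norm = sum((p ** 2).sum() for p in model.parameters())
        param_norm = torch.sqrt(sq_norm).item()
    return correct_logit / param_norm

# Tracked over training, this ratio would be expected to rise as the generalizing
# circuit (larger logits per unit norm) outcompetes the memorizing one.
```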
Novel behaviors include ungrokking (reverting from generalization back to memorization when training data are reduced midrun) and semi-grokking (delayed generalization to intermediate test accuracy at the threshold) (Varma et al., 2023).
5. Compression, Complexity, and Feature Learning
The grokking transition aligns with implicit compression—the reduction in operational model complexity:
- Linear Mapping Number (LMN): Introduced as a complexity measure, the LMN decreases linearly with the test loss as the model compresses from a high-complexity memorization solution to a low-complexity generalization solution; it is argued to be an analogue of Kolmogorov complexity for neural networks (Liu et al., 2023).
- Feature Rank: In deep MLPs, sharp drops in the internal feature rank, often following double-descent patterns, are empirically tied to the transition to generalization. These changes in feature-space rank serve as more sensitive indicators of the effective phase transition than global weight-norm decreases (Fan et al., 29 May 2024); a minimal effective-rank probe is sketched after this list.
- Tensor Network Models: Even in MPS (Matrix Product State) models, grokking manifests as an entanglement transition: the entanglement entropy switches from a volume law (random-like representation) to a sub-volume law (localized, structured information), with generalization marked by eigenvalue evaporation in the singular spectrum (Pomarico et al., 13 Mar 2025).
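The feature-rank probe referenced above can be approximated with an entropy-based effective rank of a layer's activation matrix. The sketch below assumes `acts` is a (samples x features) matrix of hidden activations; this is only one of several reasonable definitions of effective rank.

```python
# Minimal feature-rank probe: exponential of the entropy of the normalized singular values.
import torch

def effective_rank(acts: torch.Tensor, eps: float = 1e-12) -> float:
    s = torch.linalg.svdvals(acts - acts.mean(dim=0, keepdim=True))
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

# e.g. effective_rank(hidden_layer_activations) plotted over training steps would be
# expected to drop sharply around the grokking transition in deep MLPs.
```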
6. Nuanced Phenomena: Anti-Grokking and Glassy Relaxation
Recent research highlights rich dynamics beyond initial grokking:
- Anti-Grokking: A newly identified phase in which, after a period of perfect test accuracy, generalization collapses and test accuracy falls precipitously even though training accuracy remains perfect. The collapse is associated with the correlation trap: certain layers develop outlier singular values, detected via the HTSR layer exponent α < 2 and deviations from the Marchenko-Pastur law, signaling over-correlation and imminent overfitting (Prakash et al., 4 Jun 2025). None of the standard progress measures (weight norm, sparsity, standard entropy) differentiates grokking from this collapse, but the HTSR α metric robustly tracks all three phases (a rough spectral probe is sketched after this list).
- Glassy Relaxation Analogy: Grokking is framed as a computational glass relaxation rather than a barrier-limited first-order transition. The memorization phase is analogous to rapid quenching into a glassy, non-equilibrium state; the grokking transition corresponds to slow entropic relaxation toward equilibrium. Empirical sampling of the Boltzmann entropy landscape reveals no entropy barrier between memorization and generalization—a challenge to first-order transition models—and emphasizes the advantage of high-entropy generalizing solutions. Optimizers designed via Wang-Landau molecular dynamics can eliminate grokking entirely, supporting an entropic, rather than norm-constrained, perspective (Zhang et al., 16 May 2025).
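As a rough illustration of the spectral diagnostics used in the anti-grokking analysis, the sketch below estimates a power-law tail exponent for a layer's empirical spectral density with a simple Hill estimator; dedicated tools (e.g., the WeightWatcher library) implement the actual HTSR α fit far more carefully, and the tail fraction used here is an arbitrary choice.

```python
# Hill-estimator sketch of a heavy-tailed-spectrum (HTSR-style) layer diagnostic.
import torch

def esd_tail_alpha(weight: torch.Tensor, tail_frac: float = 0.2) -> float:
    """Rough tail-exponent estimate for the eigenvalues of W W^T; values well below ~2 flag a heavy-tailed layer."""
    W = weight.detach().float()
    evals = torch.linalg.svdvals(W) ** 2           # eigenvalues of W W^T
    evals, _ = torch.sort(evals, descending=True)
    k = max(2, int(tail_frac * len(evals)))        # number of tail eigenvalues used in the fit
    tail = evals[:k]
    x_min = tail[-1]
    return 1.0 + k / (torch.log(tail / x_min).sum().item() + 1e-12)

# Tracking esd_tail_alpha(layer.weight) per layer over training is one way to watch for
# the over-correlated, outlier-dominated spectra associated with generalization collapse.
```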
7. Implications, Practical Considerations, and Future Directions
The body of work on grokking phase transitions shows that:
- Delayed generalization is a macroscopic signature of hidden, gradual structural changes (whether in circuit topology, mutual information, or entanglement) that only manifest observably after a critical threshold is passed.
- Tuning network regularization, initialization, and data fraction can sharply alter or even prevent the occurrence of grokking.
- Structural exploration—identifying and promoting efficient, sparse, or high-entropy subnetworks—can accelerate generalization and prevent generalization collapse. Notably, structural signatures such as periodicity, block-circulant structure, and spectral “alpha” metrics provide robust diagnostic tools and, in some cases, the ability to predict overfitting without test data.
- There is no consensus that the phase transition is always first-order; empirical entropy sampling finds no barrier, supporting a glass-relaxation analogy.
- Information-theoretic and physics-inspired tools (e.g., progress measures, entropy landscapes, and spectral statistics) are increasingly central to both understanding and controlling the emergence and stability of generalization.
Suggested directions for future work include scaling mechanistic interpretability to larger models and broader domains, further exploring glassy/entropic optimization strategies, formalizing structural diagnostics (e.g., HTSR α-based regularization), and linking these phase transitions to algorithmic generalization in still more complex systems.