
Grokking: Delayed Generalization

Updated 4 December 2025
  • The paper demonstrates that grokking is a delayed generalization phenomenon where training accuracy saturates before a sudden improvement in test performance.
  • It utilizes experimental protocols on a 2D Ising model with PCA diagnostics and weight sparsity measures to reveal phase transitions in network structure.
  • The study shows that regularization and weight decay drive the network from dense memorization to a sparse, physics-aligned subnetwork that enables effective generalization.

Grokking, also termed delayed generalization, is a training phenomenon in which a machine learning model achieves near-perfect training accuracy long before it attains acceptable performance on held-out test data. During grokking, the training metric saturates early (memorization), while the generalization metric (most often test accuracy or test loss) remains near chance level for an extended period before a sharp, sudden improvement. This behavior has been observed in feedforward networks classifying 2D Ising model configurations, on algorithmic datasets, in vision classification, and in other learning regimes. Mechanistically, grokking is typically associated with phase transitions in network structure and feature representation, driven by regularization, weight decay, loss landscape geometry, and optimization dynamics (Hutchison et al., 29 Oct 2025, Liu et al., 2022, Humayun et al., 23 Feb 2024).

1. Formal Characterization of Grokking

Grokking is rigorously captured by two time scales: the fitting time $T_\mathrm{fit}$, at which the training accuracy $A_\mathrm{train}(t)$ approaches its maximum (often unity), and the generalization time $T_\mathrm{grok}$, at which the test accuracy $A_\mathrm{test}(t)$ rises sharply. The delay

$$\Delta_\mathrm{grok} = T_\mathrm{grok} - T_\mathrm{fit}$$

quantifies the grokking window (Hutchison et al., 29 Oct 2025). Formally, for a loss function $\ell$ and prediction $f_\theta(x)$, the train and test accuracies at epoch $t$ are

$$A_\mathrm{train}(t) = \frac{1}{N_\mathrm{train}} \sum_{i=1}^{N_\mathrm{train}} \mathbb{I}\left[y_i = \hat{y}_i(t)\right]$$

$$A_\mathrm{test}(t) = \frac{1}{N_\mathrm{test}} \sum_{i=1}^{N_\mathrm{test}} \mathbb{I}\left[y_i = \hat{y}_i(t)\right]$$

where $\mathbb{I}$ is the indicator function. Grokking is the regime in which $A_\mathrm{train}(t)$ saturates at $T_\mathrm{fit}$ while $A_\mathrm{test}(t)$ lingers near chance until $T_\mathrm{grok}$ (Hutchison et al., 29 Oct 2025, Liu et al., 2022, Kozyrev, 17 Dec 2024).
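
A minimal sketch of how these quantities can be extracted from recorded accuracy curves; the threshold values and the use of NumPy arrays indexed by epoch are illustrative assumptions, not part of the formal definition:

```python
import numpy as np

def grokking_delay(train_acc, test_acc, fit_thresh=0.99, grok_thresh=0.95):
    """Estimate T_fit, T_grok, and Delta_grok from per-epoch accuracy curves.

    train_acc, test_acc: 1D arrays of accuracies indexed by epoch.
    fit_thresh, grok_thresh: illustrative thresholds for "saturated" training
    accuracy and "generalized" test accuracy (assumed, not prescribed).
    """
    t_fit = int(np.argmax(train_acc >= fit_thresh))   # first epoch where training accuracy saturates
    t_grok = int(np.argmax(test_acc >= grok_thresh))  # first epoch where test accuracy jumps
    return t_fit, t_grok, t_grok - t_fit

# Synthetic example: training accuracy ramps up over ~100 epochs,
# test accuracy stays near chance until a sharp rise at epoch 800.
epochs = np.arange(1000)
train_acc = np.clip(epochs / 100, 0, 1)
test_acc = np.where(epochs < 800, 0.25, 0.97)
print(grokking_delay(train_acc, test_acc))  # (99, 800, 701)
```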

2. Experimental Protocols and Diagnostics

In the 2D Ising model context, grokking was studied using an MLP with 4 hidden layers (48 neurons each), ReLU activations, a softmax output, and cross-entropy loss. Weight decay ($\lambda = 5 \times 10^{-4}$) and unbiased initialization were fixed. Data comprised $8 \times 8$ lattices equilibrated by Metropolis updates, with energies binned into four equipopulated classes over $[-128, +128]$, and two types of test sets: random spins and $n$-inverted (flipped) spins (Hutchison et al., 29 Oct 2025).
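
A minimal PyTorch sketch of a network in the spirit of this protocol; the layer widths, weight decay, and four-class output follow the description above, while the optimizer choice and learning rate are assumptions:

```python
import torch
import torch.nn as nn

# 8x8 Ising configurations flattened to 64 inputs; energies binned into 4 classes.
model = nn.Sequential(
    nn.Linear(64, 48), nn.ReLU(),
    nn.Linear(48, 48), nn.ReLU(),
    nn.Linear(48, 48), nn.ReLU(),
    nn.Linear(48, 48), nn.ReLU(),
    nn.Linear(48, 4),  # logits; the softmax is folded into the loss below
)

criterion = nn.CrossEntropyLoss()
# Weight decay lambda = 5e-4 as in the protocol; Adam and lr=1e-3 are assumptions.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

def train_step(x, y):
    """One optimization step on a batch of flattened spin configurations x and class labels y."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```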

Quantitative diagnostics track the training/test losses $L_\mathrm{train}(t)$, $L_\mathrm{test}(t)$ and accuracies, supplemented by:

  • Sparsity measures: Fraction of near-zero weights

$$s = \frac{1}{M} \sum_{j=1}^{M} \mathbb{I}\left[|w_j| < \epsilon\right], \qquad \epsilon \approx 10^{-3}$$

  • PCA-based layer analysis: Eigenvalue spectrum, number of principal components needed to reach 90% variance ($k_{90\%}$), and class separation in PC projections.

During grokking, $k_{90\%}$ initially rises (reflecting growing representational confusion) and then collapses abruptly, marking a transition to low-dimensional structure. Activation clustering by class becomes visible only post-grokking (Hutchison et al., 29 Oct 2025).
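
The two diagnostics above can be sketched as follows; the threshold $\epsilon \approx 10^{-3}$ follows the definition given earlier, while the use of scikit-learn's PCA on a single hidden layer's activations is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

def weight_sparsity(weights, eps=1e-3):
    """Fraction of near-zero weights s, pooled over a list of weight arrays."""
    flat = np.concatenate([w.ravel() for w in weights])
    return float(np.mean(np.abs(flat) < eps))

def k90(activations):
    """Number of principal components needed to capture 90% of the variance
    of a (n_samples, n_units) matrix of hidden-layer activations."""
    pca = PCA().fit(activations)
    cum = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum, 0.90) + 1)
```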

3. Structural and Geometric Phase Transition

Grokking is interpreted as a network-structural phase transition. Initially, the network is densely connected: most ReLUs fire and weights are $O(1)$. As training proceeds through the grokking window $[T_\mathrm{fit}, T_\mathrm{grok}]$, weight decay drives many connections to near-zero, and substantial fractions of ReLUs become permanently inactive. What remains is a sparse subnetwork that carries out the classification (Hutchison et al., 29 Oct 2025).

Key signatures include:

  • Weight histogram transitions: Broad distributions become sharply peaked at zero.
  • PCA spectra: The number of significant components peaks and then drops.
  • Dense (“liquid”) to sparse (“solid”) subnetwork formation.

This behavior is analogized to a first-order thermodynamic phase transition: the model’s output entropy proxy,

$$\Delta P(t) = \langle P_\mathrm{max} - P_\mathrm{second} \rangle$$

shows characteristic dips at the critical epoch, reminiscent of peaks in specific heat in physical phase transitions (Hutchison et al., 29 Oct 2025).
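
A sketch of this confidence-gap proxy, computed from softmax outputs on a held-out batch; the batch-level averaging is an assumption:

```python
import numpy as np

def confidence_gap(probs):
    """Mean gap <P_max - P_second> between the largest and second-largest
    softmax probability over a batch of shape (n_samples, n_classes).
    A dip in this quantity marks the critical epoch of the transition."""
    top2 = np.sort(probs, axis=1)[:, -2:]        # second-largest and largest probability per sample
    return float(np.mean(top2[:, 1] - top2[:, 0]))
```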

4. Mechanisms Underpinning Delayed Generalization

Before grokking, high path multiplicity and feature confusion dominate—the network can memorize but cannot generalize. As weight decay (or other regularization) suppresses non-essential paths and compresses the parameter space, the network is driven to discover and utilize a sparse “lottery ticket” subnetwork that encodes class-distinguishing physical features. The dimensionality reduction in weight, gradient, and activation space minimizes mutual confusion and eliminates alternative pathways, allowing generalizable class clustering in activation space (Hutchison et al., 29 Oct 2025, Liu et al., 2023).

Furthermore, representation learning requirements and initialization scale modulate grokking delay (Liu et al., 2022). Non-trivial feature or embedding representations force the network to remain in overfit (high-norm) regimes until suitable structures emerge, at which point test performance surges.

5. Metrics and Computational Insights

Grokking manifests across several metrics:

  • Loss landscape shape: Training loss is “L”-shaped; test loss is “U”-shaped versus weight norm.
  • Weight and activation sparsity: Sharp post-grokking increase in the fraction of near-zero weights, with a corresponding drop in active units.
  • PCA eigenstructure: Compression of component count at grokking.
  • Output entropy: Dip at critical transition (Hutchison et al., 29 Oct 2025, Liu et al., 2022).

Network compression, measured via linear mapping number (LMN), further evidences grokking as a phase of reduced complexity, linearly correlated with improvements in test loss. LMN tracks the effective number of distinct linear maps executed by the network; its monotonic decline reflects compression and readiness to generalize (Liu et al., 2023).
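
The exact LMN measure is defined in (Liu et al., 2023); a rough illustrative proxy, not the paper's measure, is to count the distinct ReLU on/off patterns the network realizes over a dataset, since each pattern corresponds to one linear map:

```python
import numpy as np
import torch

def count_linear_regions(model, x):
    """Rough proxy for the number of distinct linear maps: count the unique
    ReLU on/off patterns realized over a batch x for an nn.Sequential MLP.
    Illustrative stand-in only, not the LMN of Liu et al. (2023)."""
    patterns = []
    h = x
    with torch.no_grad():
        for layer in model:
            h = layer(h)
            if isinstance(layer, torch.nn.ReLU):
                patterns.append((h > 0).to(torch.int8))   # per-layer activation mask
    codes = torch.cat(patterns, dim=1).cpu().numpy()      # (n_samples, total_relu_units)
    return len(np.unique(codes, axis=0))                  # distinct activation patterns
```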

6. Synthesis and Generalization

Grokking in the Ising-model MLP and similar architectures is best conceptualized as a delayed, network-structure phase transition: from a dense, high-entropy, overfitted regime to a sparse, low-entropy, physics-aligned subnetwork that generalizes to unseen data. This transition is driven by regularization and the optimization dynamics traversing the loss landscape. Analytic and PCA-based diagnostics, alongside entropy-proxy measures and sparsity statistics, robustly signal this behavior. The phenomenon demonstrates that memorization is fundamentally easier than generalization, with the latter demanding the formation of compressed, physically valid representations—enabled only after prolonged exploration of parameter space (Hutchison et al., 29 Oct 2025).

References:

  • Hutchison et al. (29 Oct 2025). Grokking in the Ising Model.
  • Liu et al. (2022). Omnigrok: Grokking Beyond Algorithmic Data.
  • Liu et al. (2023). Grokking as Compression: A Nonlinear Complexity Perspective.
  • Humayun et al. (23 Feb 2024). Deep Networks Always Grok and Here is Why.
