Grokking Analysis in Neural Networks

Updated 22 April 2026
  • Grokking is characterized by rapid memorization (near-perfect training accuracy by time t1) followed by a prolonged plateau in test performance that ends in a sharp transition at t2.
  • It examines the interplay of optimizer dynamics, regularization, and data distribution, demonstrating phase-transition behaviors in both linear and nonlinear models.
  • Empirical measures like dropout robustness curves and embedding bimodality provide actionable metrics to predict grokking onset and guide model optimization.

Grokking is the phenomenon in machine learning where a model achieves near-perfect accuracy on the training set rapidly (memorization), but requires many additional optimization steps before exhibiting a sudden, sharp transition to high generalization performance (test accuracy). This effect has been observed across a variety of tasks, architectures, and optimization regimes, and challenges classical assumptions underlying convergence and generalization in neural network learning. While originally identified in algorithmic and modular arithmetic tasks, recent studies demonstrate that grokking arises in both synthetic and real-world settings, and can occur in linear as well as deep nonlinear models. Modern research elucidates grokking as an interplay among optimizer dynamics, implicit and explicit regularization, model architecture, data distribution, and phase-transition–like mechanisms.

1. Core Definitions and Canonical Phenomenology

Grokking is formally defined by the delayed alignment of training and test accuracy under continual training. Given model parameters $\theta(t)$ at optimization step $t$, training loss $L_{\text{train}}(t)$ rapidly approaches zero, while test loss $L_{\text{test}}(t)$ remains elevated (near-chance) for an extended plateau before abruptly dropping (Levi et al., 2023, Carvalho et al., 3 Feb 2025). Precise characterization involves:

  • Memorization time $t_1$: $L_{\text{train}}(t_1) \leq \varepsilon_{\text{train}}$.
  • Generalization time $t_2$: $L_{\text{test}}(t_2) \leq \varepsilon_{\text{test}}$, with $t_2 \gg t_1$.
  • Grokking delay: $\Delta_{\mathrm{grok}} = t_2 - t_1$.

The observed signature is a sharp, step-like rise in test accuracy (or collapse in test loss) at $t_2$, in contrast to the smooth ascent of training accuracy leading up to $t_1$. This dynamic appears robust across datasets, ranging from modular addition to image and sentiment classification tasks (Golechha, 2024, Carvalho et al., 3 Feb 2025).
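The definitions above translate directly into a small measurement routine. The following is a minimal sketch, assuming losses are logged once per optimization step; the function names, the thresholds, and the toy curves are illustrative choices, not values from the cited papers.

```python
from typing import Optional, Sequence, Tuple


def first_step_below(losses: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first step index whose loss is <= threshold, or None if never reached."""
    for step, loss in enumerate(losses):
        if loss <= threshold:
            return step
    return None


def grokking_delay(
    train_loss: Sequence[float],
    test_loss: Sequence[float],
    eps_train: float = 1e-2,
    eps_test: float = 1e-2,
) -> Optional[Tuple[int, int, int]]:
    """Return (t1, t2, t2 - t1), or None if either threshold is never crossed."""
    t1 = first_step_below(train_loss, eps_train)
    t2 = first_step_below(test_loss, eps_test)
    if t1 is None or t2 is None:
        return None
    return t1, t2, t2 - t1


# Hypothetical logged curves: training loss falls quickly, test loss plateaus.
train_curve = [1.0, 0.1, 0.005, 0.001, 0.001, 0.001]
test_curve = [1.0, 0.9, 0.9, 0.9, 0.5, 0.004]
print(grokking_delay(train_curve, test_curve))  # (2, 5, 3)
```

Because both times are threshold crossings, the reported delay depends on the chosen tolerances, so any comparison across runs should hold them fixed.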

2. Mechanistic Models: Linear Solvable Cases and the "Grokking Without Understanding" Result

Linear estimators trained on linear tasks can manifest grokking despite the absence of deep, hierarchical, or compositional structure. In the canonical Gaussian teacher-student model (Levi et al., 2023):

  • Inputs are Gaussian; fixed teacher weights generate the targets; student weights are trained to reproduce them.
  • Training proceeds under full gradient flow, with or without weight decay.

The time evolution of both train and test (generalization) loss is analytically tractable: each loss is expressed through modes that decay at rates set by the eigenvalues of the empirical data covariance.

The difference in decay rates among these eigenvalues (large modes decay fast, small modes slow) yields a regime where $L_{\text{train}}$ becomes very small long before $L_{\text{test}}$, precipitating the grokking effect. Notably, the sharp test-accuracy jump is a threshold artifact with no underlying shift in "understanding" or learned representation. This demonstrates that delayed generalization can be an artifact of measurement rather than an emergent algorithmic phase and can occur in models with no meaningful feature learning (Levi et al., 2023).
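The mode-by-mode picture can be illustrated numerically. The sketch below is a simplified stand-in rather than the exact expressions of Levi et al.: it assumes gradient flow without weight decay, an isotropic population covariance, and a train loss of the form 0.5 * sum_i lam_i * d_i(t)^2 with d_i(t) = d_i(0) * exp(-lr * lam_i * t), where d_i is the student-teacher mismatch along the i-th eigen-direction. The names `mode_losses`, `eigvals`, `mismatch0` and all numerical values are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def mode_losses(
    eigvals: Sequence[float], mismatch0: Sequence[float], lr: float, t: float
) -> Tuple[float, float]:
    """Train and test loss at time t in the covariance eigenbasis (assumed model)."""
    train = 0.0
    test = 0.0
    for lam, d0 in zip(eigvals, mismatch0):
        d_sq = (d0 * math.exp(-lr * lam * t)) ** 2  # squared mismatch of this mode
        train += 0.5 * lam * d_sq  # train loss weights each mode by its eigenvalue
        test += 0.5 * d_sq         # test loss (isotropic inputs) weights modes equally
    return train, test


def first_time_below(values: Sequence[float], eps: float) -> Optional[int]:
    """First integer time at which the value drops to eps or below."""
    for t, value in enumerate(values):
        if value <= eps:
            return t
    return None


# One fast mode and one slow mode, equal initial mismatch (toy values).
eigvals = [1.0, 0.01]
mismatch0 = [1.0, 1.0]
curves: List[Tuple[float, float]] = [
    mode_losses(eigvals, mismatch0, lr=1.0, t=float(t)) for t in range(1000)
]
t1 = first_time_below([c[0] for c in curves], eps=1e-3)
t2 = first_time_below([c[1] for c in curves], eps=1e-3)
print(t1, t2)
```

In this toy setting the slow mode barely contributes to the train loss because it is weighted by its small eigenvalue, yet it contributes fully to the test loss, so the same threshold is crossed much later on the test curve without any change in the kind of solution being learned.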

3. Predictors and Progress Measures for Grokking

A diverse set of quantitative metrics forecasts the onset and characterizes the mechanism of grokking, extending beyond simple weight norms:

| Measure | Definition/Role | Predictive Alignment with Grokking |
|---|---|---|
| Dropout Robustness Curve | Test accuracy vs. dropout rate at checkpoints | Slope changes signal grokking onset |
| MC-Dropout Variance | Variation of MC test accuracy under dropout | Variance peak predicts transition |
| Embedding Bimodality | Cosine/histogram structure in learned embeddings | Bimodality emergence ≈ grok point |
| Activation Sparsity | Fraction of inactive neurons in ReLU layers | Minima/rises correlate with onset |
| Absolute Weight Entropy | Shannon entropy over absolute weight values | Sharp entropy decrease at grokking |
| Local Circuit Complexity | KL divergence of outputs after weight ablation | Drops abruptly as generalization forms |

These measures correlate with grokking time and, when combined in regression, explain over 90% of grokking variance (Salah et al., 15 Jul 2025, Golechha, 2024).

Notably, $L_2$ weight norms, previously conjectured as the primary driver, are neither necessary nor sufficient for grokking; models may grok with both rising and falling norms, or in weight-norm ranges far from the "goldilocks zone" (Golechha, 2024).
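Two of the measures listed above are simple enough to sketch directly. The code below is one plausible reading, not the cited authors' implementation: activation sparsity is taken as the fraction of exactly-zero post-ReLU activations, and absolute weight entropy as the Shannon entropy of absolute weights normalized to sum to one. The normalization and both function names are assumptions made for illustration.

```python
import math
from typing import Sequence


def activation_sparsity(activations: Sequence[Sequence[float]]) -> float:
    """Fraction of post-ReLU activations (rows = examples) that are exactly zero."""
    total = sum(len(row) for row in activations)
    if total == 0:
        raise ValueError("no activations provided")
    inactive = sum(1 for row in activations for a in row if a == 0.0)
    return inactive / total


def absolute_weight_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of |w_i| / sum_j |w_j| (assumed normalization)."""
    abs_w = [abs(w) for w in weights]
    total = sum(abs_w)
    if total == 0.0:
        raise ValueError("all weights are zero")
    probs = [a / total for a in abs_w if a > 0.0]  # zero weights contribute nothing
    return -sum(p * math.log(p) for p in probs)


# Toy checkpoint: 3 of 4 activations inactive; weights spread evenly vs. concentrated.
print(activation_sparsity([[0.0, 1.0], [0.0, 0.0]]))
print(absolute_weight_entropy([0.5, -0.5, 0.5, -0.5]))
print(absolute_weight_entropy([2.0, 0.0, 0.0, 0.0]))
```

Tracking either quantity across checkpoints, rather than at a single step, is what makes it usable as a progress measure: the signal of interest is the change around the transition.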

4. Theoretical Frameworks: Phase Transition and Representation Learning

Multiple theoretical frameworks formalize grokking as a phase transition. In continuous and discrete neural systems:

  • Linear networks: The transition is governed by the slowest-decaying covariance modes, with a sharp yet analytically tractable delay (Levi et al., 2023).
  • Nonlinear and teacher-student models: Both first- and second-order phase transition analogies arise. In some models, the test error follows a scaling form near the grokking point, with analytically calculated exponents (Žunkovič et al., 2022).
  • Feature learning: Adaptive-kernel approaches show that grokking corresponds to an abrupt alignment of feature space with the teacher, analogous to nucleation in a first-order transition, with a mixed/droplet phase in which only some neurons have learned the useful feature (Rubin et al., 2023).
  • Phase diagrams: An effective theory yields four learning phases (comprehension, grokking, memorization, confusion) with transitions separated by hyperparameters such as learning rates and weight decay, and a "Goldilocks zone" where representation learning and grokking can occur (Liu et al., 2022).

These models quantitatively relate grokking delay, order parameter evolution, and generalization to key system parameters, and predict diverging grokking time near phase boundaries.

5. Optimization, Regularization, and Data-Distribution Dependencies

Grokking is highly sensitive to optimizer dynamics, regularization, and data regime:

  • Implicit bias and optimizer structure: AdamW, through decoupled weight decay and anisotropic noise, gates generalizing solutions behind a stability ceiling; grokking is then a variance-limited phase transition in which gradient variance must accumulate to cross a "spectral gate", reflecting the interaction between optimizer noise and landscape curvature (Acharya et al., 16 Mar 2026).
  • Regularization target: Grokking emerges with any regularizer that promotes a structural property of the solution (e.g., sparsity or low rank), not just the $L_2$ norm. When models possess solutions with that property, small nonzero regularization or sufficient depth can delay generalization until the property is realized (Notsawo et al., 6 Jun 2025).
  • Data distribution and initialization: Inadequate or imbalanced sampling of subcategories can induce a persistent train-test distribution gap, which is necessary for delayed generalization. Grokking disappears if the distribution shift is resolved, regardless of dataset size (Carvalho et al., 3 Feb 2025).
  • Embedding dynamics and bilinear coupling: The embedding layer in transformers and MLPs is a central driver of grokking; rare-token stagnation due to infrequent updates, and bilinear coupling between embeddings and downstream weights, produce saddle points and optimization slowdowns that align with prolonged grokking plateaus (AlquBoj et al., 21 May 2025).

Grokking can thus be eliminated or accelerated by modifying regularization (target or magnitude), data selection or sampling, parameter initialization scale, batch size, or optimizer structure.
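Since decoupled weight decay is one of the levers named above, it may help to see how it enters the update. The following is a minimal scalar-list sketch of a standard AdamW step, written in plain Python for readability; it is not taken from any of the cited papers, and the function name and default hyperparameters are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def adamw_step(
    params: Sequence[float],
    grads: Sequence[float],
    m: Sequence[float],
    v: Sequence[float],
    step: int,
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
    weight_decay: float = 1e-2,
) -> Tuple[List[float], List[float], List[float]]:
    """One AdamW update; `step` counts from 1. Returns (params, m, v)."""
    new_params: List[float] = []
    new_m: List[float] = []
    new_v: List[float] = []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Moment estimates use the raw gradient only (no decay term mixed in).
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        m_hat = m_i / (1.0 - beta1 ** step)  # bias correction
        v_hat = v_i / (1.0 - beta2 ** step)
        # Decoupled weight decay: shrink the parameter directly.
        p = p - lr * weight_decay * p
        # Adaptive gradient step.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# With a zero gradient, only the decay term moves the parameter.
decayed, _, _ = adamw_step([1.0], [0.0], [0.0], [0.0], step=1)
print(decayed)
```

The design point is that the decay term bypasses the adaptive rescaling: adding an L2 penalty to the loss would instead pass it through the moment estimates, which changes its effective strength per parameter.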

6. Conceptual Interpretations and Cautions

The sharp test-accuracy jump is frequently not associated with a qualitative transition from "memorization" to "algorithmic understanding." In linear networks and many practical deep networks, this is an artifact of slow convergence in statistically hard directions, not the sudden emergence of new representations (Levi et al., 2023, Carvalho et al., 3 Feb 2025). In sequence and modular tasks, empirical evidence points to gradual, not abrupt, consolidation of circuit-like reasoning paths (He et al., 14 Jan 2026).

Furthermore, metrics such as weight norms may track grokking empirically, but only more nuanced measures—such as activation sparsity, circuit complexity, and geometry of the embedding space—reliably forecast grokking and provide mechanistic insight (Golechha, 2024, Gu et al., 4 Apr 2025).

7. Empirical and Theoretical Generality, Open Problems

Extensive validation across architectures (linear nets, MLPs, LSTMs, Transformers, ResNets), data regimes (synthetic, real-world), optimization settings (Adam, SGD, SGLD), and regularization schemes demonstrates that grokking is a generic phenomenon, provided distribution shift or phase-separated loss landscapes persist (Levi et al., 2023, Golechha, 2024, Carvalho et al., 3 Feb 2025, AlquBoj et al., 21 May 2025, Notsawo et al., 6 Jun 2025).

Open directions include fine-grained prediction of grokking delay from underlying spectral properties, extension of theoretical results to large-scale and structured models (transformers, convolutional nets), and development of interventions to eliminate undesirable grokking delay or leverage delayed generalization in practice (Gu et al., 4 Apr 2025, Carvalho et al., 3 Feb 2025).

