Grokking Analysis in Neural Networks
- Grokking is characterized by rapid memorization (near-perfect training accuracy by time t1) followed by a prolonged plateau in test performance until a sharp transition occurs at time t2.
- It examines the interplay of optimizer dynamics, regularization, and data distribution, demonstrating phase-transition behaviors in both linear and nonlinear models.
- Empirical measures like dropout robustness curves and embedding bimodality provide actionable metrics to predict grokking onset and guide model optimization.
Grokking is the phenomenon in machine learning where a model achieves near-perfect accuracy on the training set rapidly (memorization), but requires many additional optimization steps before exhibiting a sudden, sharp transition to high generalization performance (test accuracy). This effect has been observed across a variety of tasks, architectures, and optimization regimes, and challenges classical assumptions underlying convergence and generalization in neural network learning. While originally identified in algorithmic and modular arithmetic tasks, recent studies demonstrate that grokking arises in both synthetic and real-world settings, and can occur in linear as well as deep nonlinear models. Modern research elucidates grokking as an interplay among optimizer dynamics, implicit and explicit regularization, model architecture, data distribution, and phase-transition–like mechanisms.
1. Core Definitions and Canonical Phenomenology
Grokking is formally defined by the delayed alignment of training and test accuracy under continual training. Given model parameters $\theta_t$ at optimization step $t$, training loss rapidly approaches zero, while test loss remains elevated (with test accuracy near chance) for an extended plateau before abruptly dropping (Levi et al., 2023, Carvalho et al., 3 Feb 2025). Precise characterization involves:
- Memorization time $t_1$: the first step at which training accuracy becomes near-perfect.
- Generalization time $t_2$: the first step at which test accuracy becomes near-perfect.
- Grokking delay $\Delta t = t_2 - t_1$.
The observed signature is a sharp, step-like rise in test accuracy (or collapse in test loss) at $t_2$, in contrast to the smooth ascent of training accuracy up to $t_1$. This dynamic appears robust across datasets, ranging from modular addition to image and sentiment classification tasks (Golechha, 2024, Carvalho et al., 3 Feb 2025).
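The memorization time, generalization time, and grokking delay can be read off logged accuracy curves. The sketch below is illustrative only: the function name `grokking_times`, the 0.99 accuracy threshold, and the synthetic curves are assumptions made for the example, not definitions taken from the cited papers.

```python
import numpy as np

def grokking_times(train_acc, test_acc, threshold=0.99):
    """Return (t1, t2, delay) as step indices; None where the threshold is never reached."""
    train_acc = np.asarray(train_acc, dtype=float)
    test_acc = np.asarray(test_acc, dtype=float)

    def first_crossing(curve):
        # Index of the first step whose accuracy meets the (assumed) threshold.
        hits = np.flatnonzero(curve >= threshold)
        return int(hits[0]) if hits.size else None

    t1 = first_crossing(train_acc)   # memorization time
    t2 = first_crossing(test_acc)    # generalization time
    delay = t2 - t1 if t1 is not None and t2 is not None else None
    return t1, t2, delay

# Synthetic curves: fast smooth rise in training accuracy, late sigmoidal jump in test accuracy.
steps = np.arange(1000)
train_curve = 1.0 - np.exp(-steps / 20.0)
test_curve = 0.1 + 0.9 / (1.0 + np.exp(-(steps - 700) / 10.0))

t1, t2, delay = grokking_times(train_curve, test_curve)
print(t1, t2, delay)
```

The choice of threshold shifts both times, so comparisons of grokking delay across runs are only meaningful when the same threshold is used throughout.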
2. Mechanistic Models: Linear Solvable Cases and the "Grokking Without Understanding" Result
Linear estimators trained on linear tasks can manifest grokking despite the absence of deep, hierarchical, or compositional structure. In the canonical Gaussian teacher-student model (Levi et al., 2023):
- Gaussian inputs; fixed teacher weights that generate the targets; student weights trained to match them.
- Training proceeds under full gradient flow, with or without weight decay.
The time evolution of both the training loss and the test (generalization) loss is analytically tractable. Both losses decompose over the eigenmodes of the empirical data covariance, with eigenvalues $\lambda_i$: under gradient flow, the component of the student-teacher mismatch along mode $i$ decays exponentially at a rate proportional to $\lambda_i$. The difference in decay rates among the modes (large-$\lambda_i$ modes decay fast, small-$\lambda_i$ modes slow) yields a regime where the training loss becomes very small long before the generalization loss does, precipitating the grokking effect. Notably, the sharp test-accuracy jump is a threshold artifact with no underlying shift in "understanding" or learned representation. This demonstrates that delayed generalization can be an artifact of measurement rather than an emergent algorithmic phase and can occur in models with no meaningful feature learning (Levi et al., 2023).
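The mode-by-mode picture can be checked numerically. The following is a minimal sketch under stated assumptions, not the exact setup of Levi et al.: a single-output linear teacher, isotropic Gaussian inputs (so the population covariance is the identity), zero student initialization, no weight decay, and arbitrary values for the dimension, sample count, learning rate, and loss threshold. With training loss $D^\top \Sigma_{\mathrm{tr}} D$ for mismatch $D$ = student minus teacher, gradient flow gives each eigenmode the factor $e^{-2\eta\lambda_i t}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 60                                  # assumed input dimension and sample count
X = rng.standard_normal((n, d))                # isotropic Gaussian training inputs
teacher = rng.standard_normal(d) / np.sqrt(d)  # fixed teacher weights
student0 = np.zeros(d)                         # assumed zero initialization

cov = X.T @ X / n                              # empirical data covariance
eigvals, eigvecs = np.linalg.eigh(cov)
mismatch0 = eigvecs.T @ (student0 - teacher)   # initial mismatch in the eigenbasis

def losses(times, lr=1.0):
    """Closed-form training and generalization loss under gradient flow (no weight decay)."""
    times = np.asarray(times, dtype=float)[:, None]
    # dD/dt = -2 * lr * cov @ D, so mode i decays as exp(-2 * lr * lambda_i * t).
    mismatch = mismatch0[None, :] * np.exp(-2.0 * lr * eigvals[None, :] * times)
    train = np.sum(eigvals[None, :] * mismatch**2, axis=1)  # modes weighted by lambda_i
    gen = np.sum(mismatch**2, axis=1)                       # identity population covariance
    return train, gen

def first_below(times, curve, eps):
    """First time on the grid at which the curve drops below eps, or None."""
    idx = np.flatnonzero(curve < eps)
    return float(times[idx[0]]) if idx.size else None

times = np.logspace(-2, 6, 800)
train_loss, gen_loss = losses(times)
t_train = first_below(times, train_loss, 1e-3)
t_gen = first_below(times, gen_loss, 1e-3)
print("train loss below 1e-3 at t ~", t_train)
print("generalization loss below 1e-3 at t ~", t_gen)
```

Because the training loss weights each mode by its eigenvalue, the slow small-eigenvalue modes barely register in it, while they dominate the generalization loss at late times. Moving `n` closer to `d` pushes the smallest eigenvalues toward zero and lengthens the gap between the two crossing times.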
3. Predictors and Progress Measures for Grokking
A diverse set of quantitative metrics forecasts the onset and characterizes the mechanism of grokking, extending beyond simple weight norms:
| Measure | Definition/Role | Predictive Alignment with Grokking |
|---|---|---|
| Dropout Robustness Curve | Test accuracy vs. dropout rate at checkpoints | Slope changes signal grokking onset |
| MC-Dropout Variance | Variation of MC test accuracy under dropout | Variance peak predicts transition |
| Embedding Bimodality | Cosine/histogram structure in learned embeddings | Bimodality emergence ≈ grok point |
| Activation Sparsity | Fraction of inactive neurons in ReLU layers | Minima/rises correlate with onset |
| Absolute Weight Entropy | Shannon entropy over absolute weight values | Sharp entropy decrease at grokking |
| Local Circuit Complexity | KL divergence of outputs after weight ablation | Drops abruptly as generalization forms |
These measures correlate with grokking time and, when combined in a regression, explain over 90% of grokking variance (Salah et al., 15 Jul 2025, Golechha, 2024).
Notably, $\ell_2$ weight norms, previously conjectured to be the primary driver, are neither necessary nor sufficient for grokking; models may grok with either rising or falling norms, or in weight-norm ranges far from the "goldilocks zone" (Golechha, 2024).
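Two of the tabulated measures are simple enough to compute directly from a checkpoint. The sketch below is a hedged illustration: the function names are hypothetical, and normalizing the absolute weights into a probability distribution before taking the entropy is an assumption of this example; the cited works may define the quantities differently.

```python
import numpy as np

def activation_sparsity(post_relu):
    """Fraction of post-ReLU activations that are exactly zero (inactive units)."""
    a = np.asarray(post_relu)
    return float(np.mean(a == 0))

def absolute_weight_entropy(weights, eps=1e-12):
    """Shannon entropy of |w| after normalizing to sum to one (normalization is assumed)."""
    w = np.abs(np.asarray(weights, dtype=float)).ravel()
    p = w / (w.sum() + eps)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))

# Toy example: a pre-activation vector and two weight vectors.
pre_act = np.array([-1.0, 2.0, -3.0, 4.0])
sparsity = activation_sparsity(np.maximum(pre_act, 0.0))
h_uniform = absolute_weight_entropy([0.5, -0.5, 0.5, -0.5])
h_peaked = absolute_weight_entropy([1.0, 0.0, 0.0, 0.0])
print(sparsity, h_uniform, h_peaked)
```

Logged at every checkpoint alongside test accuracy, such values can be inspected for the minima, rises, or sharp drops that the table associates with grokking onset.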
4. Theoretical Frameworks: Phase Transition and Representation Learning
Multiple theoretical frameworks formalize grokking as a phase transition. In continuous and discrete neural systems:
- Linear networks: The transition is governed by the slowest-decaying covariance modes, which produce a sharp yet analytically tractable grokking delay (Levi et al., 2023).
- Nonlinear and teacher-student models: Both first- and second-order phase transition analogies arise. In some models, the test error decays near the grokking point with analytically calculated exponents (Žunkovič et al., 2022).
- Feature learning: Adaptive-kernel approaches show that grokking corresponds to an abrupt alignment of feature space with the teacher, analogous to nucleation in a first-order transition, with a mixed/droplet phase in which only some neurons have learned the useful feature (Rubin et al., 2023).
- Phase diagrams: An effective theory yields four learning phases (comprehension, grokking, memorization, confusion) with transitions separated by hyperparameters such as learning rates and weight decay, and a "Goldilocks zone" where representation learning and grokking can occur (Liu et al., 2022).
These models quantitatively relate grokking delay, order parameter evolution, and generalization to key system parameters, and predict diverging grokking time near phase boundaries.
5. Optimization, Regularization, and Data-Distribution Dependencies
Grokking is highly sensitive to optimizer dynamics, regularization, and data regime:
- Implicit bias and optimizer structure: AdamW, with decoupled weight decay and anisotropic noise, gates generalizing solutions behind a stability ceiling. Grokking is then a variance-limited phase transition in which gradient variance must accumulate to cross a "spectral gate", reflecting the interaction between optimizer noise and landscape curvature (Acharya et al., 16 Mar 2026).
- Regularization target: Grokking emerges with any regularizer that promotes a property $P$ (e.g., sparsity or low rank), not just the $\ell_2$ norm. When models possess solutions with property $P$, small nonzero regularization or sufficient depth can delay generalization until $P$ is realized (Notsawo et al., 6 Jun 2025).
- Data distribution and initialization: Inadequate or imbalanced sampling of subcategories can induce a persistent train-test distribution gap, which is necessary for delayed generalization. Grokking disappears if the distribution shift is resolved, regardless of dataset size (Carvalho et al., 3 Feb 2025).
- Embedding dynamics and bilinear coupling: The embedding layer in transformers and MLPs is a central driver of grokking; rare-token stagnation due to infrequent updates, and bilinear coupling between embeddings and downstream weights, produce saddle points and optimization slowdowns that align with prolonged grokking plateaus (AlquBoj et al., 21 May 2025).
Grokking can thus be eliminated or accelerated by modifying regularization (target or magnitude), data selection or sampling, parameter initialization scale, batch size, or optimizer structure.
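For readers unfamiliar with the decoupled weight decay mentioned in the optimizer bullet, the sketch below writes out one AdamW update in NumPy. It follows the standard AdamW rule, in which the decay shrinks the parameters directly instead of being added to the gradient; the function name `adamw_step` and the default hyperparameters are illustrative assumptions and are not taken from the cited papers.

```python
import numpy as np

def adamw_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; `step` counts from 1. Returns (param, m, v)."""
    # Decoupled weight decay: shrink parameters directly, independent of the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1.0 - beta1**step)
    v_hat = v / (1.0 - beta2**step)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# With a zero gradient, only the decay acts on the parameters.
p0 = np.array([1.0, -2.0])
zeros = np.zeros_like(p0)
p_decay, _, _ = adamw_step(p0, zeros, zeros, zeros, step=1, lr=0.1, weight_decay=0.5)

# With no decay, the first step moves each parameter by about lr against the gradient sign.
p_grad, _, _ = adamw_step(zeros, np.array([1.0, -1.0]), zeros, zeros, step=1,
                          lr=0.1, weight_decay=0.0)
print(p_decay, p_grad)
```

Sweeping `weight_decay` (including zero) while holding the other settings fixed is one simple way to probe how the regularization magnitude shifts or removes the grokking delay in a given setup.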
6. Conceptual Interpretations and Cautions
The sharp test-accuracy jump is frequently not associated with a qualitative transition from "memorization" to "algorithmic understanding." In linear networks and many practical deep networks, this is an artifact of slow convergence in statistically hard directions, not the sudden emergence of new representations (Levi et al., 2023, Carvalho et al., 3 Feb 2025). In sequence and modular tasks, empirical evidence points to gradual, not abrupt, consolidation of circuit-like reasoning paths (He et al., 14 Jan 2026).
Furthermore, metrics such as weight norms may track grokking empirically, but only more nuanced measures—such as activation sparsity, circuit complexity, and geometry of the embedding space—reliably forecast grokking and provide mechanistic insight (Golechha, 2024, Gu et al., 4 Apr 2025).
7. Empirical and Theoretical Generality, Open Problems
Extensive validation across architectures (linear nets, MLPs, LSTMs, Transformers, ResNets), data regimes (synthetic, real-world), optimization settings (Adam, SGD, SGLD), and regularization schemes demonstrates that grokking is a generic phenomenon, provided distribution shift or phase-separated loss landscapes persist (Levi et al., 2023, Golechha, 2024, Carvalho et al., 3 Feb 2025, AlquBoj et al., 21 May 2025, Notsawo et al., 6 Jun 2025).
Open directions include fine-grained prediction of grokking delay from underlying spectral properties, extension of theoretical results to large-scale and structured models (transformers, convolutional nets), and development of interventions to eliminate undesirable grokking delay or leverage delayed generalization in practice (Gu et al., 4 Apr 2025, Carvalho et al., 3 Feb 2025).
Key References:
- "Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding" (Levi et al., 2023)
- "Tracing the Path to Grokking: Embeddings, Dropout, and Network Activation" (Salah et al., 15 Jul 2025)
- "Progress Measures for Grokking on Real-world Tasks" (Golechha, 2024)
- "Grokking Explained: A Statistical Phenomenon" (Carvalho et al., 3 Feb 2025)
- "Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking" (Xu, 18 Feb 2026)
- "Grokking Beyond the Euclidean Norm of Model Parameters" (Notsawo et al., 6 Jun 2025)