
End-to-End Grokking Dynamics

Updated 3 February 2026
  • The paper demonstrates that grokking is characterized by a delayed jump in test accuracy followed by anti-grokking collapse, uniquely marked by shifts in the heavy-tailed spectral metric α.
  • The methodology reveals that analyzing singular value spectra and embedding uniformity effectively distinguishes the phases from underfitting to overfitting across diverse model types.
  • Controlled interventions like weight decay, learning rate adjustments, and data augmentation offer actionable strategies to modulate grokking and enhance reliable generalization.

Grokking refers to the delayed, often abrupt improvement in test accuracy that occurs long after a model has achieved perfect or near-perfect training accuracy, with recent literature extending this phenomenon to encompass eventual generalization collapse after overfitting—so-called "anti-grokking". End-to-end experimental and theoretical studies over the past several years have established grokking as a robust, cross-domain phenomenon, appearing in deep neural networks, linear models, Gaussian processes, Bayesian neural networks, and even in models regularized toward properties other than weight norm. Progress in this area has also systematized how to diagnose, predict, and manipulate grokking with principled spectral, information-theoretic, and mechanistic metrics, as well as with programmatic data interventions.

1. Phases of Grokking Dynamics and Anti-Grokking Collapse

Empirical studies consistently delineate three distinct phases in standard grokking workflows on algorithmic and real-world datasets (Prakash et al., 4 Jun 2025):

  1. Pre-grokking (Underfitting): After a rapid rise, training accuracy saturates near 100%, while test accuracy stays near chance. Heavy-tailed metrics (e.g., the HTSR exponent $\alpha$) indicate layers with random-like singular spectra ($\alpha \gtrsim 5$), transitioning to weak correlation as $\alpha$ approaches $2$.
  2. Grokking (Delayed Generalization): At a characteristic critical training duration, test accuracy jumps from chance to near-perfect, while training accuracy remains unchanged. Spectral metrics reveal all important layers converging to the "fat-tailed" regime ($2 \lesssim \alpha \lesssim 5$), signifying robust, generalizing structure.
  3. Anti-grokking (Generalization Collapse): After extended training (order $10^7$ steps for zero weight decay), test accuracy collapses while training accuracy persists at 100%. This new phase is marked by spectral exponents $\alpha < 2$ and the emergence of correlation traps (outlier singular values in randomized layer weight matrices), reflecting severe overfitting and atypical weight correlations.

Contrary to prior benchmarks and progress measures, only the heavy-tailed self-regularization (HTSR) metric $\alpha$ distinguishes all three phases, in particular detecting the anti-grokking collapse; alternative metrics (activation sparsity, entropy, circuit complexity, $\ell^2$ norm) fail to predict late-stage loss of generalization (Prakash et al., 4 Jun 2025).
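The HTSR exponent can be estimated directly from a layer's singular values. Below is a minimal numpy sketch using a Hill-type tail fit; the helper names are illustrative, and the metric in the literature is computed by fitting a power law to the full empirical spectral density, so treat this as a rough approximation rather than the published procedure:

```python
import numpy as np

def hill_alpha(evals, k):
    """Hill-type estimate of the power-law PDF exponent alpha for the
    largest k values of a spectrum, assuming rho(x) ~ x^(-alpha) in the tail."""
    tail = np.sort(evals)[-k:]                       # k largest eigenvalues
    return 1.0 + k / np.sum(np.log(tail / tail[0]))  # tail[0] is the tail cutoff

def htsr_alpha(W, k_frac=0.1):
    """Approximate HTSR alpha for a weight matrix W: fit the right tail of
    the eigenvalue spectrum of W W^T (i.e., the squared singular values)."""
    evals = np.linalg.svd(W, compute_uv=False) ** 2
    k = max(5, int(k_frac * evals.size))
    return hill_alpha(evals, k)
```

Monitoring this quantity per layer over training is what distinguishes the fat-tailed generalizing regime ($2 \lesssim \alpha \lesssim 5$) from the $\alpha < 2$ regime that precedes collapse.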

Phase progression in an MLP on MNIST (1k samples, zero weight decay):

| Step | $10^2$ | $10^5$ | $10^6$ | $10^7$ |
| --- | --- | --- | --- | --- |
| Train Acc | 100% | 100% | 100% | 100% |
| Test Acc | 10% | 10% | 95% | 50% (collapse in anti-grokking) |
| HTSR $\alpha$ (avg) | 4.0 ± 0.6 | 2.9 ± 0.2 | 2.9 ± 0.2 | 1.1 ± 0.3 |

2. Core Mechanisms: Spectral, Algorithmic, and Geometric Perspectives

Multiple independent research programs converge on the following mechanisms:

  • Spectral Self-Regularization: The transition into and out of generalization is accompanied by measurable changes in the singular value spectrum of learned weight matrices. The $\alpha$ exponent, obtained by fitting a power law to the right tail of the empirical spectral density, uniquely delineates the three phases. The anti-grokking collapse is preceded by $\alpha < 2$ and a surge in "correlation traps" (outlier eigenvalues in the elementwise-randomized layer weights, defined by exceedance over the Marchenko–Pastur bulk edge and KS statistics on the empirical vs. MP distributions) (Prakash et al., 4 Jun 2025).
  • Feature Uniformity and Embedding Geometry: The transition to generalization can be predicted by measures of embedding uniformity. Specifically, the Main Embedding Difference (MED), which tracks the mean pairwise norm of adjacent token embeddings, drops in lockstep with the test loss at the grokking transition. Minimizing the weight norm under perfect training fit forces embeddings into uniform configurations within equivalence classes, and generalization only emerges once the test set is adequately "covered" by uniform embeddings (Gu et al., 4 Apr 2025).
  • Circuit Emergence and Cleanup: Mechanistic reverse engineering of modular addition and related algorithmic tasks reveals grokking as the process of gradual feature amplification (e.g., formation of DFT/Fourier circuits), with the test "jump" coinciding with the rapid removal of memorizing components by weight decay (Nanda et al., 2023).
  • Spline Region Migration and Local Complexity: In DNNs, local complexity—the number of affine spline partition regions near the data—first peaks as the network memorizes. As training continues, these regions migrate away from training points and concentrate near the decision boundary, enabling a robust generalization partition and delayed robustness to adversarial perturbations. The double-descent of local complexity quantitatively marks the onset of grokking (Humayun et al., 2024).
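The correlation-trap diagnostic above can be sketched in a few lines: shuffle the entries of a weight matrix (destroying learned correlations while preserving the element distribution) and count eigenvalues of the shuffled matrix that escape the Marchenko–Pastur bulk edge. The function name, the number of shuffles, and the 5% edge tolerance are illustrative choices, not the paper's exact protocol:

```python
import numpy as np

def correlation_traps(W, n_shuffles=10, tol=1.05, seed=0):
    """Average number of eigenvalues of elementwise-shuffled copies of W
    that exceed the Marchenko-Pastur bulk edge. Shuffling destroys learned
    correlations but keeps the element distribution, so any surviving
    outlier reflects unusually large individual weights (a 'trap')."""
    rng = np.random.default_rng(seed)
    n, m = W.shape
    q = min(n, m) / max(n, m)                         # aspect ratio of W
    traps = 0
    for _ in range(n_shuffles):
        Ws = rng.permutation(W.ravel()).reshape(n, m)
        sigma2 = Ws.var()                             # element variance sets MP scale
        edge = sigma2 * (1.0 + np.sqrt(q)) ** 2       # MP bulk upper edge
        evals = np.linalg.svd(Ws, compute_uv=False) ** 2 / max(n, m)
        traps += int(np.sum(evals > tol * edge))      # tol gives slack for edge fluctuations
    return traps / n_shuffles
```

A healthy random-like matrix yields roughly zero traps, while a matrix containing a few anomalously large entries produces outliers that survive shuffling.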

3. Universality and Model-Agnostic Extensions

Grokking is not unique to deep neural networks. It is exhibited in:

  • Overparameterized Linear and Ridge Regression: Analytic and empirical studies provide closed-form predictions for the grokking time as a function of sample size $n$, model size $m$, weight decay $\lambda$, and initialization variance. The key insight is the separation between the decay of the in-training-space components (fast, driven by data) and the out-of-training-space components (slow, driven by regularization), which explains the grokking delay. Explicit formulas quantify $t_{\text{grok}}$ and show that it increases as $1/\lambda$ (Xu et al., 27 Jan 2026, Levi et al., 2023).
  • Gaussian Processes, Bayesian Neural Networks, and Alternative Regularizers: Grokking manifests in GP regression and classification, BNNs, and models with explicit $\ell_1$ (sparsity) or nuclear-norm (low-rank) regularization. The phenomenon is preserved whenever the optimization landscape contains high-complexity solutions that minimize training error more easily than their low-complexity (but well-generalizing) counterparts, and an explicit or implicit regularizer gradually guides the model toward the latter over a long timescale (Miller et al., 2023, Notsawo et al., 6 Jun 2025).
  • Depth and Overparameterization: Deeper networks can exhibit "ungrokking", i.e., they generalize without explicit regularization, through implicit bias, but the grokking delay persists and is driven by architecture-specific factors (Notsawo et al., 6 Jun 2025).
  • Data Selection and Inducing Grokking: Data coherence and spurious (irrelevant) feature augmentation can amplify or induce grokking, aligning with a general "complexity + error" minimization principle: initial memorization is favored in high-complexity directions, while eventual regularization gradually enables general solutions (Notsawo et al., 6 Jun 2025, Miller et al., 2023).
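The fast/slow decomposition for overparameterized linear regression can be demonstrated directly. In the sketch below (all sizes and rates are arbitrary illustrations), gradient descent on ridge-regularized least squares drives the training residual to zero quickly, while the component of the weights orthogonal to the data row space shrinks only at the weight-decay rate $(1 - \eta\lambda)$ per step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                          # n samples << d features
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[0] = 1.0                         # ground-truth signal direction
y = X @ w_star
w0 = rng.normal(size=d)                 # large random init
w = w0.copy()
eta, lam = 0.01, 1e-3

for t in range(20000):
    # GD on MSE + (lam/2) * ||w||^2
    w -= eta * (X.T @ (X @ w - y) / n + lam * w)

# Projector onto the row space of X: data-term gradients live here, so the
# orthogonal (null-space) part of w decays exactly as (1 - eta*lam)^t.
P = X.T @ np.linalg.solve(X @ X.T, X)
train_mse = np.mean((X @ w - y) ** 2)   # tiny early, stays tiny
null_norm = np.linalg.norm(w - P @ w)   # still large: decays on the slow 1/lam scale
```

After 20,000 steps the training error is essentially zero, yet the null-space norm has only shrunk by the factor $(1 - \eta\lambda)^t \approx 0.82$, making the delayed-generalization timescale explicit.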

4. Quantitative Progress Measures and Interpretability

A range of progress measures has been systematically evaluated for their ability to diagnose and predict grokking phases:

  • HTSR $\alpha$ (Heavy-Tail Exponent): Fitted to the singular value spectra of layer weights; $\alpha \approx 2$ signifies optimal generalization, with $\alpha < 2$ warning of overfitting-induced collapse (Prakash et al., 4 Jun 2025).
  • Correlation Traps: Detected via empirical spectral density analysis of randomized weight matrices and validated with KS-tests. Present exclusively in anti-grokking (Prakash et al., 4 Jun 2025).
  • Main Embedding Difference (MED): Measures cyclic embedding uniformity; tracks the test loss with high fidelity (Gu et al., 4 Apr 2025).
  • Local Complexity: Counts local partition regions; double-descent signals impending grokking (Humayun et al., 2024).
  • Information-theoretic Diagnostics: Includes Perturbed Mutual Information, Perturbed Entropy, and predictive differences (MID/ED), which display sharp changes precisely at the grokking transition, outperforming classical weight-norm metrics for detection and early-warning (Tan et al., 2023).
  • Group Symmetry Metrics: For tasks with algebraic structure (e.g., modular addition), learned invariance (e.g., commutativity) can be tracked and found to be coincident with, or even determinative of, the grokking jump (Tan et al., 2023).
| Metric | Detects anti-grokking | Predicts grokking onset | Theoretical threshold | Comments |
| --- | --- | --- | --- | --- |
| HTSR $\alpha$ | Yes | Yes (at $\alpha = 2$) | Universal ($\alpha \approx 2$) | Unique in delineating all phases |
| Activation sparsity | No | Inflection only | No | Smoothly increasing; continues after breakdown |
| Weight entropy | No | No unique signature | No | Smooth monotonic trends |
| Local complexity | No | Double-descent signals | No | Predicts phase transition |
| Info-theoretic | N/A | Yes (precise; zero lag) | No | Coincide exactly with accuracy jump |

5. Practical Manipulation: Controlling, Amplifying, and Eliminating Grokking

Experimental and theoretical work provides concrete interventions to manipulate grokking:

  • Weight Decay and Regularization: Increasing weight decay ($\lambda$) or analogous regularization strength (e.g., $\ell_1$, nuclear norm) reduces or eliminates the grokking delay; as $\lambda \to 0$, the grokking gap grows (e.g., $t_{\mathrm{grok}} \propto 1/\lambda$) (Xu et al., 27 Jan 2026, Notsawo et al., 6 Jun 2025). In transformers, decoder weight decay or dropout eliminates grokking in favor of immediate generalization (Liu et al., 2022).
  • Learning Rate and Initialization: Learning rate scales both pre-grokking and generalization times similarly; initialization variance appears logarithmically in grokking time—large values amplify the phenomenon (Xu et al., 27 Jan 2026).
  • Batch Size and Regularization: Smaller batch size (higher gradient noise) expands the grokking region by slowing generalization (Liu et al., 2022).
  • Data Augmentation and Selection: Augmenting datasets with structured, even factually incorrect, synthetic relational data in multi-hop reasoning tasks massively raises the inference-to-atomic-fact ratio above the critical grokking threshold, forcing transformer models to discover relational reasoning circuits rather than memorizing idiosyncratic tuples. OOD accuracy jumps from sub-60% to above 95% post grokking (Abramov et al., 29 Apr 2025). Amplifying grokking can also be achieved via data coherence (e.g., selection by leverage scores in low-rank or sparse tasks), which increases delay and required sample size (Notsawo et al., 6 Jun 2025).
  • Exploiting Symmetry and Perturbation: Explicitly regularizing toward known structural symmetries (e.g., commutativity), injecting input perturbations (Gaussian noise), or using information-theoretic tracking can both accelerate the onset of grokking and provide early warning of potential failure to generalize (Tan et al., 2023).
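The $t_{\mathrm{grok}} \propto 1/\lambda$ scaling from the weight-decay intervention above can be checked with one line of arithmetic: if the out-of-span weight component contracts by $(1 - \eta\lambda)$ per step, the number of steps needed to shrink it by any fixed factor is inversely proportional to $\lambda$. The function name and numbers below are illustrative:

```python
import numpy as np

def steps_to_shrink(eta, lam, factor=0.5):
    """Steps until a component decaying as (1 - eta*lam)^t falls to `factor`
    of its initial size -- a simple proxy for the grokking delay."""
    return int(np.ceil(np.log(factor) / np.log(1.0 - eta * lam)))

t_lam = steps_to_shrink(0.01, 1e-3)    # baseline weight decay
t_2lam = steps_to_shrink(0.01, 2e-3)   # doubling weight decay roughly halves the delay
```

Doubling $\lambda$ halves the delay, and letting $\lambda \to 0$ sends the delay to infinity, which is the regime where anti-grokking collapse becomes reachable.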

6. Theoretical Frameworks and Limiting Cases

  • Analytic and Rigorous Bounds for Grokking Delay: In overparameterized linear models, precise formulas for pre- and post-generalization epochs are available, and the difference can be arbitrarily tuned via regularization (Xu et al., 27 Jan 2026, Levi et al., 2023). A plausible implication is that architectural or optimization settings that separate the effective "complexity" and "error" minimization regimes in the optimization landscape universally admit grokking-type delays.
  • General GD+Regularization Theory: For any learning system minimizing $g(x) + \beta h(x)$, where $g$ is error and $h$ is a complexity/regularization objective, if $g$ alone is minimized rapidly but $h$ is minimized only under $\beta \ll 1$ on a slower timescale, then two-phase (memorization/generalization) dynamics, i.e., grokking, are generic (Notsawo et al., 6 Jun 2025).
  • Failure Modes for Proxy Metrics: In regimes regularized toward sparsity or low rank, classical weight norms increase post-fit, and thus cannot serve as universal proxies for generalization. Instead, progress in the principal property ($\ell_1$, nuclear norm, or analogous) must be tracked (Notsawo et al., 6 Jun 2025).
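A minimal toy instance of the $g + \beta h$ picture (all quantities chosen for illustration): take $g(x) = (x_1 + x_2 - 1)^2$, whose zero set is an entire line of "memorizing" solutions, and $h(x) = \|x\|^2$. Gradient descent reaches the zero-error line almost immediately, then drifts along it toward the min-norm point $(1/2, 1/2)$ on the much slower timescale $\sim 1/(\eta\beta)$:

```python
import numpy as np

eta, beta = 0.1, 1e-3
x = np.array([2.0, -2.0])                        # starts off the solution line
g = lambda x: (x[0] + x[1] - 1.0) ** 2           # "training error"
dist = lambda x: np.linalg.norm(x - 0.5)         # distance to min-norm solution (0.5, 0.5)

snapshots = {}
for t in range(1, 20001):
    grad_g = 2.0 * (x[0] + x[1] - 1.0) * np.ones(2)  # fast, data-like term
    grad_h = 2.0 * x                                  # slow, complexity term
    x = x - eta * (grad_g + beta * grad_h)
    if t in (200, 20000):
        snapshots[t] = (g(x), dist(x))
# By step 200, g(x) ~ 0 ("memorized") but x is still far from (0.5, 0.5);
# only after ~1/(eta*beta) steps does the complexity term pull it there.
```

The two-phase trajectory (error collapses first, complexity shrinks much later) is exactly the memorization-then-generalization signature, generated by nothing more than a timescale separation between $g$ and $\beta h$.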

7. Significance and Open Directions

End-to-end grokking results collectively establish that delayed generalization is not an isolated artifact of particular architectures, tasks, or losses, but a generic outcome of dynamics where rapid error minimization in high-capacity models precedes the slower, structured compression required for robust generalization. The observed anti-grokking collapse highlights the risk of indefinite training under vanishing regularization and the necessity of suitable spectral monitoring for overfitting diagnostics (Prakash et al., 4 Jun 2025).

A plausible implication is that understanding and steering grokking-type behaviors will be essential for reliable deployment of overparameterized models in real-world and safety-critical contexts. Open research areas include identifying universal progress measures for tasks with implicit regularization, the constitutive roles of data geometry and model capacity, connections to double-descent risk, and harnessing the compressive regime (as evidenced by linear $\zeta$-$L$ tradeoffs in steady training (Manning-Coe et al., 3 Feb 2025)) for principled model compression.
