
Grokfast Algorithm: Accelerating Neural Grokking

Updated 22 February 2026
  • Grokfast Algorithm is an optimization technique that accelerates grokking in overparameterized neural networks by equalizing gradient progress.
  • It leverages spectral decomposition to boost slow, generalization-inducing gradients, achieving significant speedups across tasks.
  • The method employs EMA-based amplification and egalitarian gradient descent to facilitate uniform updates and practical integration with common training regimes.

The Grokfast Algorithm and the closely related Egalitarian Gradient Descent (EGD) form a family of optimization modifications designed to accelerate the “grokking” phenomenon in overparameterized neural networks. Grokking is characterized by an extended delay between reaching perfect training accuracy and the abrupt onset of high generalization accuracy, a delay that makes training computationally expensive. Grokfast addresses this by spectrally decomposing the gradient sequence and amplifying its slow, generalization-inducing components; EGD instead preconditions gradients so that all principal directions evolve at equal rates. Reported experiments show speedups of roughly 20× to 100× on modular arithmetic, subsampled MNIST, and sparse parity, with faster convergence also reported for language modeling and graph learning (Lee et al., 2024, Pasand et al., 6 Oct 2025).

1. Grokking Phenomenon and Plateau Dynamics

Grokking refers to the sharply delayed transition from overfitting to generalization during training. Typically, a model will achieve near-zero training loss rapidly, while validation/test accuracy remains at chance for a protracted plateau, sometimes spanning thousands of training iterations. This delay is especially pronounced in algorithmic tasks such as modular arithmetic and sparse parity, where the model eventually "discovers" a correct hypothesis and test accuracy spikes suddenly (Lee et al., 2024, Pasand et al., 6 Oct 2025).

An analysis of gradient descent dynamics reveals that this plateau arises from highly asymmetric convergence speeds along different spectral modes of the gradient covariance. Fast directions—corresponding to large singular values—are quickly optimized, leading to memorization. Slow directions, governed by small singular values, are responsible for generalization but progress orders of magnitude more slowly, producing the empirical plateau (Pasand et al., 6 Oct 2025).
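The asymmetry between fast and slow directions can be illustrated on a toy problem. The sketch below is not the analysis from the cited paper: it assumes a one-dimensional quadratic loss per direction, and the curvatures, step size, and tolerance are hypothetical values chosen only to make the gap visible.

```python
def first_step_below(curvature, lr=0.5, tol=1e-3, max_steps=100_000):
    """Run gradient descent on 0.5 * curvature * x**2 from x = 1.

    Returns the first step at which |x| <= tol, or None if never reached.
    """
    x = 1.0
    for step in range(1, max_steps + 1):
        x -= lr * curvature * x  # gradient of the quadratic is curvature * x
        if abs(x) <= tol:
            return step
    return None


# Hypothetical curvatures: one fast ("memorization") and one slow
# ("generalization") direction, three orders of magnitude apart.
fast_steps = first_step_below(curvature=1.0)
slow_steps = first_step_below(curvature=1e-3)

print(f"fast direction converges after {fast_steps} steps")
print(f"slow direction converges after {slow_steps} steps")
```

With the same step size, the slow direction needs over a thousand times more steps than the fast one, which is the toy analogue of the plateau described above.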

2. Spectral Decomposition of Gradient Trajectories

The Grokfast approach conceptualizes the sequence of parameter gradients $\{g_t\}$ as a discrete-time random signal. By applying spectral decomposition, the gradient signal is separated into:

  • High-frequency (fast) components: Drive rapid fitting and memorization.
  • Low-frequency (slow) components: Underpin the slower emergence of generalization.

Formally, for each parameter coordinate, $g_t = g_t^{(\text{slow})} + g_t^{(\text{fast})}$, where $g_t^{(\text{slow})} = (h * g)_t$ for some low-pass filter $h$. Boosting $g_t^{(\text{slow})}$ accelerates movement along generalization-relevant directions, effectively shrinking the grokking plateau (Lee et al., 2024).
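To see what "low-pass" means concretely, the sketch below takes one assumed choice of $h$, the exponential moving average $\mu_t = \alpha \mu_{t-1} + (1-\alpha) g_t$, and evaluates its magnitude response at the two extreme frequencies. The value of `alpha` is illustrative only, and `ema_gain` is a hypothetical helper name.

```python
import cmath
import math


def ema_gain(omega, alpha):
    """Magnitude response of mu_t = alpha * mu_{t-1} + (1 - alpha) * g_t.

    omega is the angular frequency in radians per step (0 = constant signal,
    pi = a signal that flips sign every step).
    """
    transfer = (1.0 - alpha) / (1.0 - alpha * cmath.exp(-1j * omega))
    return abs(transfer)


alpha = 0.9  # illustrative value, not a recommendation from the source
slow_gain = ema_gain(0.0, alpha)      # constant gradient component
fast_gain = ema_gain(math.pi, alpha)  # sign-alternating component

print(f"gain at omega = 0:  {slow_gain:.3f}")
print(f"gain at omega = pi: {fast_gain:.3f}")
```

A constant component passes through with gain 1, while the fastest oscillation is attenuated to $(1-\alpha)/(1+\alpha)$, so the filter output isolates the slowly varying part of the gradient.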

Alternatively, a matrix-level spectral normalization (as in EGD) equalizes the singular values of the gradient update, ensuring uniform progress across modes (Pasand et al., 6 Oct 2025).

3. Algorithmic Modifications and Implementation

Two main strategies have been established for operationalizing Grokfast:

A. Spectral Gradient Amplification

The canonical Grokfast update modifies the parameter gradients as:

$$\hat{g}_t = g_t + \lambda \, g_t^{(\text{slow})}$$

where $\lambda > 0$ is an amplification factor. Two filtering mechanisms are considered:

  • Moving Average (MA): A windowed average of the past $w$ gradients per parameter.
  • Exponential Moving Average (EMA): A momentum-style recursion $\mu_t = \alpha \mu_{t-1} + (1-\alpha) g_t$, with the final update $\hat{g}_t = g_t + \lambda \mu_t$.

The EMA variant is preferred for its constant memory overhead, independent of any window length, and for its ease of integration: it requires only a single additional buffer per parameter (Lee et al., 2024).
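The memory difference between the two filters can be sketched for a single scalar gradient coordinate. The class and function names below are hypothetical, the EMA buffer is assumed to start at zero, and the window, momentum, and amplification values are illustrative only.

```python
from collections import deque


class MovingAverageFilter:
    """Windowed mean of the most recent `window` gradients."""

    def __init__(self, window):
        self.buffer = deque(maxlen=window)  # oldest entries drop out automatically

    def update(self, grad):
        self.buffer.append(grad)
        return sum(self.buffer) / len(self.buffer)

    def stored_values(self):
        return len(self.buffer)


class EmaFilter:
    """Recursion mu_t = alpha * mu_{t-1} + (1 - alpha) * g_t."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.mu = 0.0  # assumed initial value

    def update(self, grad):
        self.mu = self.alpha * self.mu + (1.0 - self.alpha) * grad
        return self.mu

    def stored_values(self):
        return 1


def amplify(grad, slow, lam):
    """Amplified gradient: g_hat = g + lam * g_slow."""
    return grad + lam * slow


ma = MovingAverageFilter(window=100)
ema = EmaFilter(alpha=0.98)
for _ in range(200):
    ma_out = ma.update(1.0)   # feed a constant gradient of 1.0
    ema_out = ema.update(1.0)

print("MA stores", ma.stored_values(), "values; EMA stores", ema.stored_values())
print("amplified gradient (EMA):", amplify(1.0, ema_out, lam=2.0))
```

Per coordinate, the MA filter holds up to `window` past gradients while the EMA holds one running value, which is the practical reason given above for preferring the EMA variant.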

B. Egalitarian Gradient Descent (EGD)

EGD whitens the empirical gradient so that every principal direction evolves at the same speed. For a gradient matrix $G_t$ at time $t$:

$$\tilde{G}_t = (G_t G_t^\top)^{-1/2} G_t$$

When $G_t$ has SVD $G_t = U \Sigma V^\top$, the transformation yields $\tilde{G}_t = U V^\top$, making all singular values unity so that the update moves equally along all axes. Practically, this can be implemented with exact SVD (per layer) or approximate covariance tracking for scalability (Pasand et al., 6 Oct 2025).
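A minimal NumPy sketch of the exact per-layer variant follows. It assumes the whitening step takes the form $(G G^\top)^{-1/2} G$ and that the gradient matrix has full row rank; the function names and the test matrix are hypothetical, and this is not the cited authors' implementation.

```python
import numpy as np


def egalitarian_gradient(grad):
    """Set all singular values of `grad` to 1: U S V^T -> U V^T (assumes full rank)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def whitened_gradient(grad):
    """The same map written as (G G^T)^{-1/2} G; assumes G has full row rank."""
    eigvals, eigvecs = np.linalg.eigh(grad @ grad.T)
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return inv_sqrt @ grad


rng = np.random.default_rng(0)
# Rows scaled very differently to mimic fast and slow directions.
G = rng.standard_normal((3, 5)) * np.array([[10.0], [1.0], [0.1]])

G_egd = egalitarian_gradient(G)
singular_values = np.linalg.svd(G_egd, compute_uv=False)
print("singular values after EGD:", singular_values)
```

The SVD form avoids forming an explicit inverse square root; the second function is included only to show that both expressions describe the same transformation.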

Pseudocode Summary (EMA variant)

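For each training step $t$, with the buffer $\mu_0$ initialized before training (for example, to zero):

  • Compute the gradient $g_t$ with a standard backward pass.
  • Update the slow component: $\mu_t = \alpha \mu_{t-1} + (1-\alpha) g_t$.
  • Amplify: $\hat{g}_t = g_t + \lambda \mu_t$.
  • Pass $\hat{g}_t$ to the optimizer in place of $g_t$ (Lee et al., 2024).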
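The EMA variant can be sketched as a runnable loop on a toy problem. This is an illustration under stated assumptions rather than the authors' reference code: the loss is a two-dimensional quadratic, the optimizer is plain gradient descent, the EMA buffer starts at zero, and `lr`, `alpha`, and `lam` are hypothetical values.

```python
import numpy as np


def grokfast_ema_step(theta, grad, ema, lr=0.1, alpha=0.98, lam=2.0):
    """One gradient-descent step using the EMA-amplified gradient.

    Returns the updated parameters and the updated EMA buffer.
    """
    ema = alpha * ema + (1.0 - alpha) * grad  # slow component mu_t
    amplified = grad + lam * ema              # g_hat_t = g_t + lam * mu_t
    return theta - lr * amplified, ema


def run(lam, steps=200):
    """Minimize the toy loss 0.5 * sum(curv * theta**2) for a fixed number of steps."""
    curv = np.array([1.0, 0.01])  # hypothetical fast and slow curvatures
    theta = np.ones(2)
    ema = np.zeros(2)             # buffer initialized to zero (an assumption)
    for _ in range(steps):
        grad = curv * theta
        theta, ema = grokfast_ema_step(theta, grad, ema, lam=lam)
    return theta


theta_plain = run(lam=0.0)     # lam = 0 recovers plain gradient descent
theta_grokfast = run(lam=2.0)
print("slow coordinate, plain:    ", theta_plain[1])
print("slow coordinate, amplified:", theta_grokfast[1])
```

On this toy loss the amplified run makes more progress along the low-curvature coordinate in the same number of steps, while the high-curvature coordinate still converges.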

4. Empirical Results and Performance Gains

Extensive experiments validate Grokfast and EGD across tasks:

Task | Model | Baseline grokking delay | Grokfast/EGD delay | Speedup
Modular multiplication (mod 97) | 2-layer Transformer | ~40,000 steps | ~800 steps | 50×
MNIST (subsampled) | 3-layer MLP | ~44,000 steps | ~2,000 steps | 22×
Sparse parity (n, k) | 2-layer MLP | 100–200 epochs | 3–8 epochs | 20–100×

EGD and Grokfast consistently reduce plateau lengths by 20×–100× without loss of final accuracy. For instance, in modular addition, vanilla SGD plateaus for several hundred epochs before validation accuracy rises, while EGD triggers grokking within 5–10 epochs (Lee et al., 2024, Pasand et al., 6 Oct 2025).

Fast convergence is also reported on language (IMDb-LSTM), graph (QM9-GCNN), and other domains, with early increases in validation accuracy relative to standard training.

5. Hyperparameters, Trade-offs, and Integration

Key parameters for Grokfast-EMA are the amplification factor $\lambda$ and the EMA momentum $\alpha$. For the MA variant, a window size $w$ in [50, 200] is typical. EGD is hyperparameter-free, relying only on the standard optimizer step size.

Integration:

  • Both approaches are compatible with weight decay (WD ≈ 0.005–0.02 enhances effect), Adam, momentum, and dropout.
  • A two-stage regime—conventional training up to overfitting, followed by Grokfast—can further shorten time-to-generalization.
  • The unfiltered high-frequency component should not be dropped, as both memorization and generalization modes are necessary.
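The integration points above can be combined in a single update rule. The sketch below is one possible arrangement, not a prescribed recipe: the function name, the switch criterion based on training accuracy, the choice to keep updating the EMA buffer during the first stage, and all default values are assumptions. The raw gradient is always kept, and only the slow component is added on top of it.

```python
def two_stage_update(theta, grad, ema, train_acc, *, lr=0.1, alpha=0.98,
                     lam=2.0, weight_decay=0.01, switch_acc=0.99):
    """One scalar parameter update under a two-stage schedule.

    Stage 1 (train_acc < switch_acc): conventional step with weight decay.
    Stage 2 (train_acc >= switch_acc): add the amplified slow component.
    Returns the updated parameter and the updated EMA buffer.
    """
    ema = alpha * ema + (1.0 - alpha) * grad  # tracked in both stages
    if train_acc >= switch_acc:
        grad = grad + lam * ema               # raw gradient is kept, slow part added
    theta = theta - lr * (grad + weight_decay * theta)
    return theta, ema


# Same inputs, before and after the model has fit the training set.
theta_stage1, ema_stage1 = two_stage_update(1.0, 0.5, 0.0, train_acc=0.50)
theta_stage2, ema_stage2 = two_stage_update(1.0, 0.5, 0.0, train_acc=1.00)
print(theta_stage1, theta_stage2)
```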

Trade-offs:

  • MA filtering stores the past $w$ gradients, so its memory overhead grows linearly with the window size; EMA keeps a single extra buffer per parameter, and EGD remains comparably light when practical low-rank approximations are used.
  • An excessively large $\lambda$ or an inappropriate filter bandwidth ($\alpha$ for EMA, $w$ for MA) destabilizes training.
  • Full SVD in EGD presents computational overhead in wide layers; mitigations include diagonal/block approximations, low-rank covariance, or randomized SVD.

6. Theoretical Guarantees and Limitations

For linear/quadratic objectives, Grokfast and EGD remove the dependence on the spectral condition number. In classic SGD, convergence along the slowest mode requires on the order of $\kappa$ steps, where $\kappa$ is the condition number; under EGD, all directions contract at the same rate, making the plateau length independent of the spectrum (Pasand et al., 6 Oct 2025).
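The scaling with the condition number can be checked in closed form. The sketch below is an idealization, not the cited derivation: it assumes a quadratic with curvatures in $[1/\kappa, 1]$ and unit step size for plain gradient descent, and it models equalized dynamics simply as every mode contracting by the same hypothetical factor $(1 - \text{lr})$ per step.

```python
import math


def plain_gd_steps(kappa, tol=1e-3):
    """Steps for gradient descent to shrink the slowest mode by a factor tol.

    Assumes kappa > 1, curvatures in [1 / kappa, 1], and step size 1, so the
    slowest mode contracts by (1 - 1 / kappa) per step.
    """
    return math.ceil(math.log(tol) / math.log(1.0 - 1.0 / kappa))


def equalized_steps(lr=0.5, tol=1e-3):
    """Steps when every mode contracts by the same factor (1 - lr) per step."""
    return math.ceil(math.log(tol) / math.log(1.0 - lr))


for kappa in (10, 100, 1000):
    print(kappa, plain_gd_steps(kappa), equalized_steps())
```

The plain gradient descent count grows roughly tenfold each time $\kappa$ grows tenfold, whereas the equalized count does not depend on $\kappa$ at all.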

The methodology extends to nonlinear networks under mild smoothness assumptions, layer-wise. In practical deep learning, EGD’s efficacy is robust for layers with well-defined principal axes but may require block-wise or approximate updates in very large models. Computational cost may limit exact EGD on transformer-scale layers, although low-rank decompositions offer mitigation.

7. Extensions, Applications, and Practitioner Guidance

Grokfast and EGD are applicable to diverse architectures and tasks, demonstrating broad empirical generality. Further developments include streaming principal component analysis for efficient spectral tracking, integration with adaptive gradient methods, and adaptation for transformer models with structural approximations.

Practitioner advice:

  • Implement Grokfast by adding a one-line gradient modification after the backward pass.
  • Calibrate $\lambda$ and $\alpha$ via small exploratory runs, monitoring early validation accuracy gains.
  • Combine with moderate weight decay for maximal effect.
  • Monitor for destabilization if filter/hyperparameters depart from recommended regimes.
  • Employ SVD approximations (e.g., low-rank, block-diagonal) for scalable EGD in large-scale settings.

Grokfast and EGD make empirically elusive sudden generalization practical within everyday training budgets, offering significant speed and resource savings for research and production environments (Lee et al., 2024, Pasand et al., 6 Oct 2025).
