
Grokfast Algorithm: Accelerating Neural Grokking

Updated 22 February 2026
  • The Grokfast algorithm is an optimization technique that accelerates grokking in overparameterized neural networks by equalizing gradient progress.
  • It leverages spectral decomposition to boost slow, generalization-inducing gradients, achieving significant speedups across tasks.
  • The method employs EMA-based amplification and egalitarian gradient descent to facilitate uniform updates and practical integration with common training regimes.

The Grokfast algorithm, together with the closely related Egalitarian Gradient Descent (EGD), forms a family of optimization modifications designed to dramatically accelerate the “grokking” phenomenon in overparameterized neural networks. Grokking is characterized by an extended delay between the achievement of perfect training accuracy and the abrupt onset of high generalization accuracy, often resulting in significant computational cost and practical inefficiency. The Grokfast methodology addresses this by spectrally decomposing the optimization process and amplifying the slow, generalization-inducing gradient components, or by preconditioning gradients so that all principal directions evolve at equal rates. Robust empirical evidence demonstrates consistent order-of-magnitude speedups in a variety of tasks, including modular arithmetic, parity problems, image recognition, language modeling, and graph learning (Lee et al., 2024, Pasand et al., 6 Oct 2025).

1. Grokking Phenomenon and Plateau Dynamics

Grokking refers to the sharply delayed transition from overfitting to generalization during training. Typically, a model will achieve near-zero training loss rapidly, while validation/test accuracy remains at chance for a protracted plateau, sometimes spanning thousands of training iterations. This delay is especially pronounced in algorithmic tasks such as modular arithmetic and sparse parity, where the model eventually "discovers" a correct hypothesis and test accuracy spikes suddenly (Lee et al., 2024, Pasand et al., 6 Oct 2025).

An analysis of gradient descent dynamics reveals that this plateau arises from highly asymmetric convergence speeds along different spectral modes of the gradient covariance. Fast directions—corresponding to large singular values—are quickly optimized, leading to memorization. Slow directions, governed by small singular values, are responsible for generalization but progress orders of magnitude more slowly, producing the empirical plateau (Pasand et al., 6 Oct 2025).
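A toy quadratic makes this asymmetry concrete: under plain gradient descent, each eigen-mode contracts as (1 − η·s)^k, so a mode with small curvature s needs far more steps to converge. The curvatures and learning rate below are illustrative values, not numbers from the cited papers.

```python
# Gradient descent on f(x, y) = 0.5 * (s_fast * x**2 + s_slow * y**2):
# each coordinate is an independent eigen-mode contracting as (1 - lr*s)^k.
# s_fast, s_slow, and lr are illustrative, not values from the papers.
s_fast, s_slow, lr = 1.0, 0.01, 0.9

x, y = 1.0, 1.0
steps_fast = steps_slow = None
for k in range(1, 100_000):
    x -= lr * s_fast * x  # fast (memorization-like) mode
    y -= lr * s_slow * y  # slow (generalization-like) mode
    if steps_fast is None and abs(x) < 1e-3:
        steps_fast = k
    if abs(y) < 1e-3:
        steps_slow = k
        break

print(steps_fast, steps_slow)  # the slow mode takes orders of magnitude longer
```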

2. Spectral Decomposition of Gradient Trajectories

The Grokfast approach conceptualizes the sequence of parameter gradients {g_t} as a discrete-time random signal. By applying spectral decomposition, the gradient signal is separated into:

  • High-frequency (fast) components: Drive rapid fitting and memorization.
  • Low-frequency (slow) components: Underpin the slower emergence of generalization.

Formally, for each parameter coordinate, g_t = g_t^{(slow)} + g_t^{(fast)}, where g_t^{(slow)} = (h ∗ g)_t for some low-pass filter h. Boosting g_t^{(slow)} accelerates movement along generalization-relevant directions, effectively shrinking the grokking plateau (Lee et al., 2024).
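The per-coordinate split can be sketched with an EMA playing the role of the low-pass filter h. The 1-D gradient trace and the value of α below are synthetic, purely for illustration.

```python
import numpy as np

# Slow/fast decomposition of a scalar gradient trace: an EMA with
# coefficient alpha acts as the low-pass filter h; the slow component is
# the filtered signal and the fast component is the residual.
def decompose(grads, alpha=0.98):
    slow = np.empty_like(grads)
    mu = 0.0
    for t, g in enumerate(grads):
        mu = alpha * mu + (1 - alpha) * g  # low-pass filtered gradient
        slow[t] = mu
    fast = grads - slow                    # high-frequency residual
    return slow, fast

# Synthetic signal: a slow drift buried in per-step noise.
rng = np.random.default_rng(0)
t = np.arange(2000)
grads = 0.05 * np.sin(2 * np.pi * t / 2000) + rng.normal(0.0, 1.0, t.size)

slow, fast = decompose(grads)
assert np.allclose(slow + fast, grads)  # g_t = g_t^(slow) + g_t^(fast)
```

The residual definition makes the decomposition exact by construction, while the EMA strongly suppresses the high-frequency noise in the slow channel.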

Alternatively, a matrix-level spectral normalization (as in EGD) aligns all singular directions of the gradient update, ensuring uniform progress across modes (Pasand et al., 6 Oct 2025).

3. Algorithmic Modifications and Implementation

Two main strategies have been established for operationalizing Grokfast:

A. Spectral Gradient Amplification

The canonical Grokfast update modifies the parameter gradients as:

ĝ_t = g_t + λ g_t^{(slow)}

where λ > 0 is an amplification factor. Two filtering mechanisms are considered:

  • Moving Average (MA): a windowed average of the past w gradients per parameter.
  • Exponential Moving Average (EMA): a momentum-style recursion μ_t = α μ_{t−1} + (1 − α) g_t, with the final update ĝ_t = g_t + λ μ_t.

The EMA variant is preferred for its O(model size) memory usage and ease of integration: it requires only a single additional buffer per parameter (Lee et al., 2024).

B. Egalitarian Gradient Descent (EGD)

EGD whitens the empirical gradient so that every principal direction evolves at the same speed. For a gradient matrix G_t at time t:

G̃_t = (G_t G_t^⊤)^{−1/2} G_t

When G_t has SVD U_t S_t V_t^⊤, the transformation yields U_t V_t^⊤, setting all singular values to one so that the update moves equally along every axis. Practically, this can be implemented with exact SVD (per layer) or approximate covariance tracking for scalability (Pasand et al., 6 Oct 2025).
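A minimal sketch of the whitening step on a single layer's gradient matrix, using NumPy and an arbitrary random stand-in for G:

```python
import numpy as np

# EGD whitening: with SVD G = U S V^T, the transform (G G^T)^{-1/2} G
# reduces to U V^T, i.e. G with every singular value set to 1.
def egd_whiten(G):
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 5))   # stand-in gradient matrix for one layer
G_tilde = egd_whiten(G)

# All singular values of the whitened gradient equal 1.
print(np.linalg.svd(G_tilde, compute_uv=False))
```

For rank-deficient G the SVD route corresponds to using the pseudo-inverse in (G_t G_t^⊤)^{−1/2}, which is why it is the numerically safer implementation.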

Pseudocode Summary (EMA variant)

import torch  # assumes model, loader, loss_fn, optimizer are already defined

mu = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
lam, alpha = 2.0, 0.98  # amplification λ and EMA momentum α (see Section 5 for ranges)

for inputs, targets in loader:
    loss = loss_fn(model(inputs), targets)
    loss.backward()  # g_t lands in param.grad
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.grad is None:
                continue
            mu[name] = alpha * mu[name] + (1 - alpha) * param.grad  # EMA (slow component)
            param.grad += lam * mu[name]  # amplified update ĝ_t = g_t + λ μ_t
    optimizer.step()
    optimizer.zero_grad()
(Lee et al., 2024)

4. Empirical Results and Performance Gains

Extensive experiments validate Grokfast and EGD across tasks:

| Task | Model | Baseline Grok Delay | Grokfast/EGD Delay | Speedup |
| --- | --- | --- | --- | --- |
| modular × mod 97 | 2-layer TF | ~40,000 steps | ~800 steps | ×50 |
| MNIST subsampled | 3-layer MLP | ~44,000 steps | ~2,000 steps | ×22 |
| Sparse Parity (n, k) | 2-layer MLP | 100–200 epochs | 3–8 epochs | ×20–100 |

EGD and Grokfast consistently reduce plateau lengths by 20×–100× without loss of final accuracy. For instance, in modular addition, vanilla SGD plateaus for several hundred epochs before validation accuracy rises, while EGD triggers grokking within 5–10 epochs (Lee et al., 2024, Pasand et al., 6 Oct 2025).

Fast convergence is also reported on language (IMDb-LSTM), graph (QM9-GCNN), and other domains, with validation accuracy rising earlier than under standard training.

5. Hyperparameters, Trade-offs, and Integration

Key parameters for Grokfast-EMA are the amplification λ (0.1 ≤ λ ≤ 5) and the EMA momentum α (0.8 ≤ α ≤ 0.99). For the MA variant, a window size w in [50, 200] is typical. EGD is hyperparameter-free, relying only on the standard optimizer step size.

Integration:

  • Both approaches are compatible with weight decay (WD ≈ 0.005–0.02 enhances effect), Adam, momentum, and dropout.
  • A two-stage regime—conventional training up to overfitting, followed by Grokfast—can further shorten time-to-generalization.
  • The unfiltered high-frequency component should not be dropped, as both memorization and generalization modes are necessary.

Trade-offs:

  • MA filtering requires O(w × model size) memory; EMA needs only O(model size), as does EGD with practical low-rank approximations.
  • An excessively large λ or an inappropriate filter bandwidth (α → 1) destabilizes training.
  • Full SVD in EGD presents computational overhead in wide layers; mitigations include diagonal/block approximations, low-rank covariance, or randomized SVD.
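One of the mitigations named above, randomized SVD, can be sketched with a Halko-style range sketch. The target rank and matrix shape below are illustrative assumptions, not values from the papers.

```python
import numpy as np

# Randomized-SVD stand-in for the exact per-layer SVD in EGD: sketch the
# range of G with a Gaussian test matrix, then do a small dense SVD.
def randomized_whiten(G, rank, rng):
    omega = rng.normal(size=(G.shape[1], rank))   # Gaussian test matrix
    Q, _ = np.linalg.qr(G @ omega)                # orthonormal range basis
    U_small, _, Vt = np.linalg.svd(Q.T @ G, full_matrices=False)
    return (Q @ U_small) @ Vt                     # ≈ U V^T of G

rng = np.random.default_rng(0)
G = rng.normal(size=(512, 64))                    # wide-layer gradient stand-in
G_tilde = randomized_whiten(G, rank=64, rng=rng)
```

With rank equal to min(G.shape) this recovers the exact whitening; smaller ranks trade fidelity for cost on very wide layers.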

6. Theoretical Guarantees and Limitations

For linear/quadratic objectives, Grokfast and EGD collapse the dependence on the spectral condition number. In classic SGD, convergence along the slowest mode requires O(κ log(1/ε)) steps, where κ is the condition number; under EGD, all directions contract as (1 − η)^k, making the plateau length O(log(1/ε)), independent of the spectrum (Pasand et al., 6 Oct 2025).
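The contrast can be spelled out with the standard quadratic-model calculation (a sketch consistent with the stated bounds, assuming step size η = 1/L):

```latex
% Hessian eigenvalues in [\mu, L], condition number \kappa = L/\mu, step \eta = 1/L.
% Plain gradient descent, slowest mode:
|e_k| = (1 - \eta\mu)^k |e_0| = \left(1 - \tfrac{1}{\kappa}\right)^k |e_0|
\;\Rightarrow\;
k \ge \frac{\log(1/\epsilon)}{-\log\!\left(1 - \tfrac{1}{\kappa}\right)}
\approx \kappa \log(1/\epsilon).
% EGD: every mode contracts at the same rate,
|e_k| = (1 - \eta)^k |e_0|
\;\Rightarrow\;
k \ge \frac{\log(1/\epsilon)}{-\log(1 - \eta)} = O(\log(1/\epsilon)),
% independent of \kappa.
```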

The methodology extends layer-wise to nonlinear networks under mild smoothness assumptions. In practical deep learning, EGD’s efficacy is robust for layers with well-defined principal axes but may require block-wise or approximate updates in very large models. Computational cost may limit exact EGD on transformer-scale layers, although low-rank decompositions offer mitigation.

7. Extensions, Applications, and Practitioner Guidance

Grokfast and EGD are applicable to diverse architectures and tasks, demonstrating broad empirical generality. Further developments include streaming principal component analysis for efficient spectral tracking, integration with adaptive gradient methods, and adaptation for transformer models with structural approximations.

Practitioner advice:

  • Implement Grokfast by adding a short gradient modification after the backward pass.
  • Calibrate λ\lambda and α\alpha via small exploratory runs, monitoring early validation accuracy gains.
  • Combine with moderate weight decay for maximal effect.
  • Monitor for destabilization if filter/hyperparameters depart from recommended regimes.
  • Employ SVD approximations (e.g., low-rank, block-diagonal) for scalable EGD in large-scale settings.

Grokfast and EGD turn the otherwise elusive sudden-generalization transition into a practical target within everyday training budgets, offering significant speed and resource savings for research and production environments (Lee et al., 2024, Pasand et al., 6 Oct 2025).
