Grokfast: Accelerated Grokking by Amplifying Slow Gradients (2405.20233v2)

Published 30 May 2024 in cs.LG and cs.AI

Abstract: One puzzling artifact in machine learning dubbed grokking is where delayed generalization is achieved tenfolds of iterations after near perfect overfitting to the training data. Focusing on the long delay itself on behalf of machine learning practitioners, our goal is to accelerate generalization of a model under grokking phenomenon. By regarding a series of gradients of a parameter over training iterations as a random signal over time, we can spectrally decompose the parameter trajectories under gradient descent into two components: the fast-varying, overfitting-yielding component and the slow-varying, generalization-inducing component. This analysis allows us to accelerate the grokking phenomenon more than $\times 50$ with only a few lines of code that amplifies the slow-varying components of gradients. The experiments show that our algorithm applies to diverse tasks involving images, languages, and graphs, enabling practical availability of this peculiar artifact of sudden generalization. Our code is available at https://github.com/ironjr/grokfast.


Summary

  • The paper introduces Grokfast, a method that amplifies slow gradient components to significantly expedite the grokking process.
  • It employs gradient spectral decomposition and low-pass filtering to isolate and enhance slow-varying gradients during training.
  • Empirical results on algorithmic tasks, MNIST, QM9, and IMDb demonstrate substantial reductions in training time and improved generalization.

Grokfast: Accelerated Grokking by Amplifying Slow Gradients

The paper "Grokfast: Accelerated Grokking by Amplifying Slow Gradients" by Jaerin Lee et al. addresses a significant phenomenon in machine learning known as grokking. Grokking involves instances where models, having initially overfitted to training data, undergo delayed but sudden generalization after extensive additional training iterations. The authors propose a method to expedite this generalization process, leveraging gradient decomposition and spectral analysis.

Background and Motivation

The grokking phenomenon was first observed in training scenarios involving a two-layer Transformer using algorithmic datasets, such as modular arithmetic. Despite achieving near-perfect training accuracy early on, the model did not generalize well to unseen data until much later in the training process. Existing theories have related grokking to the double descent phenomenon but have not fully characterized its mechanisms.

Given the computational cost of waiting out the grokking delay, the primary motivation of this work is to make models that exhibit grokking practical by accelerating the generalization phase. The proposed method extends the usefulness of such models under the resource constraints common among machine learning practitioners.

Proposed Method: Gradient Spectral Decomposition

The authors introduce a novel approach by treating the series of gradients during training as a stochastic signal over time. This method spectrally decomposes parameter trajectories into fast-varying (overfitting) and slow-varying (generalization-inducing) components. The core hypothesis is that amplifying the slow-varying gradient components can expedite grokking.

The primary algorithm, termed Grokfast, integrates the following steps (a minimal code sketch follows the list):

  1. Gradient Filtering: The gradients are processed through a low-pass filter, isolating the slow component.
  2. Gradient Amplification: The slow components are amplified and added back to the original gradients before being fed into the optimizer.
  3. Optimizer Application: This modified gradient is then applied using standard optimization algorithms (e.g., SGD or Adam).
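
The following is a minimal PyTorch sketch of these steps, assuming a simple mean over a fixed window of recent gradients as the low-pass filter; the function name and the hyperparameters `window_size` and `lamb` are illustrative choices, not the authors' reference implementation.

```python
from collections import deque

import torch


def grokfast_ma_step(model, histories, window_size=100, lamb=5.0):
    """Low-pass filter each parameter's gradient with a windowed mean and
    amplify the slow component. Call after loss.backward() and before
    optimizer.step(); `histories` is a dict that persists across steps."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        hist = histories.setdefault(name, deque(maxlen=window_size))
        hist.append(p.grad.detach().clone())                # collect raw gradients
        if len(hist) == window_size:                        # filter once the window is full
            slow = torch.stack(list(hist)).mean(dim=0)      # step 1: low-pass (mean) filter
            p.grad = p.grad + lamb * slow                   # step 2: amplify the slow component
    return histories
```

Because the filter only rewrites each parameter's `.grad`, step 3 works with any standard optimizer unchanged.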

Empirical Validation

The authors rigorously validate their hypothesis across various tasks:

  1. Algorithmic Data (Modular Multiplication): Using a Transformer model, Grokfast reduced the number of training iterations needed to reach 95% validation accuracy by roughly 50× compared to the baseline.
  2. MNIST Classification: Applied to a three-layer MLP, Grokfast reduced the grokking delay by a factor of 22.0, improving final evaluation accuracy from 89.5% to 91.5%.
  3. QM9 Molecular Dataset: Training a graph convolutional network (GCNN) for molecular polarity prediction, Grokfast yielded faster convergence to a lower validation loss.
  4. IMDb Sentiment Analysis: Applied to a two-layer LSTM, Grokfast provided quicker generalization and improved validation performance.

Discussion and Implications

Transience and Parameter Space Dynamics: The authors interpret grokking as a traversal of parameter space through three states: initialized, overfitted, and generalized. Grokfast effectively shortens the parameter-space traversal between the overfitted and generalized states, thus accelerating generalization.

Compatibility with Weight Decay: A synergistic effect was observed when combining Grokfast with weight decay, further accelerating the grokking process. This joint application also resulted in reduced training instability.
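
As a usage illustration only (the toy model, data, and hyperparameters below are assumptions, and `grokfast_ma_step` refers to the sketch above), combining the filter with decoupled weight decay amounts to choosing AdamW as the optimizer in step 3:

```python
import torch

model = torch.nn.Linear(16, 1)                       # toy model purely for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
histories = {}

x, y = torch.randn(8, 16), torch.randn(8, 1)
for step in range(3):
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    histories = grokfast_ma_step(model, histories)   # amplify slow gradients
    optimizer.step()                                 # AdamW applies decoupled weight decay here
    optimizer.zero_grad()
```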

Memory Efficiency: An exponential moving average (EMA) filter brings significant memory efficiency, which is critical for large models, since only one running-average tensor per parameter needs to be stored. This adaptation retained the roughly 50× acceleration benefit within practical computational constraints.
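
A minimal sketch of such an EMA variant, in the same illustrative style as the earlier snippets (the defaults `alpha=0.98` and `lamb=2.0` are placeholder values, not the paper's tuned settings); memory overhead no longer grows with the filter's effective window length:

```python
def grokfast_ema_step(model, ema_grads, alpha=0.98, lamb=2.0):
    """Memory-light variant: a single exponential moving average per parameter
    replaces the gradient window. `ema_grads` is a dict persisting across steps."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach()
        if name not in ema_grads:
            ema_grads[name] = g.clone()
        else:
            ema_grads[name].mul_(alpha).add_(g, alpha=1 - alpha)  # EMA low-pass filter
        p.grad = p.grad + lamb * ema_grads[name]                  # amplify the slow component
    return ema_grads
```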

Theoretical Implications and Future Work: The work underlines the utility of frequency domain analyses in understanding and manipulating neural network training dynamics. Future research could explore adaptive filter designs, deeper theoretical investigations into model state transitions, and broader applications across different architectures and datasets.

Conclusion

The research by Jaerin Lee et al. delivers a compelling method to harness the grokking phenomenon more effectively. By amplifying slow gradients, their approach provides significant computational savings, enabling more practical deployment of models that otherwise experience delayed generalization. This contributes valuable insights and tools for both theoretical explorations and practical enhancement of machine learning training regimes.
