Grokking as the Transition from Lazy to Rich Training Dynamics (2310.06110v3)

Published 9 Oct 2023 in stat.ML, cond-mat.dis-nn, and cs.LG

Abstract: We propose that the grokking phenomenon, where the train loss of a neural network decreases much earlier than its test loss, can arise due to a neural network transitioning from lazy training dynamics to a rich, feature learning regime. To illustrate this mechanism, we study the simple setting of vanilla gradient descent on a polynomial regression problem with a two-layer neural network that exhibits grokking without regularization in a way that cannot be explained by existing theories. We identify sufficient statistics for the test loss of such a network, and tracking these over training reveals that grokking arises in this setting when the network first attempts to fit a kernel regression solution with its initial features, followed by late-time feature learning where a generalizing solution is identified after train loss is already low. We find that the key determinants of grokking are the rate of feature learning -- which can be controlled precisely by parameters that scale the network output -- and the alignment of the initial features with the target function $y(x)$. We argue this delayed generalization arises when (1) the top eigenvectors of the initial neural tangent kernel and the task labels $y(x)$ are misaligned, but (2) the dataset size is large enough so that it is possible for the network to generalize eventually, yet not so large that train loss perfectly tracks test loss at all epochs, and (3) the network begins training in the lazy regime and so does not learn features immediately. We conclude with evidence that this transition from lazy (linear model) to rich training (feature learning) can control grokking in more general settings, like on MNIST, one-layer Transformers, and student-teacher networks.
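To make the abstract's main knob concrete, below is a minimal sketch, not the paper's setup, of how an output-scale parameter can move a two-layer network between lazy and rich training on a polynomial regression task. The target function, width, dataset sizes, learning rate, and step count are illustrative assumptions; whether a clear grokking gap (train loss falling long before test loss) appears depends on the dataset size and on how misaligned the initial features are with the target, exactly as the abstract emphasizes.

```python
# Illustrative sketch only: two-layer net trained with full-batch gradient
# descent on an assumed quadratic target. `alpha` rescales the network output;
# large alpha pushes training toward the lazy/kernel regime, small alpha
# toward rich feature learning. All hyperparameters are guesses for exposition.
import torch

torch.manual_seed(0)
d, width, n_train, n_test = 30, 512, 200, 2000
X_train, X_test = torch.randn(n_train, d), torch.randn(n_test, d)

def target(x):
    return x[:, 0] * x[:, 1]  # assumed degree-2 polynomial target y(x)

y_train, y_test = target(X_train), target(X_test)

class TwoLayer(torch.nn.Module):
    def __init__(self, alpha):
        super().__init__()
        self.alpha = alpha
        self.hidden = torch.nn.Linear(d, width)
        self.readout = torch.nn.Linear(width, 1, bias=False)

    def forward(self, x):
        # alpha only rescales the output, but it changes how far the weights
        # must move to fit the data, which controls lazy vs. rich dynamics.
        return self.alpha * self.readout(torch.tanh(self.hidden(x))).squeeze(-1)

def run(alpha, steps=20_000, lr=0.01, log_every=1_000):
    model = TwoLayer(alpha)
    # Subtract the (frozen) prediction at initialization, a common convention
    # in lazy-training analyses, and scale the learning rate by 1/alpha^2 so
    # the function-space dynamics are comparable across different alphas.
    with torch.no_grad():
        f0_train, f0_test = model(X_train), model(X_test)
    opt = torch.optim.SGD(model.parameters(), lr=lr / alpha**2)
    for step in range(steps):
        opt.zero_grad()
        train_loss = ((model(X_train) - f0_train - y_train) ** 2).mean()
        train_loss.backward()
        opt.step()
        if step % log_every == 0:
            with torch.no_grad():
                test_loss = ((model(X_test) - f0_test - y_test) ** 2).mean()
            print(f"alpha={alpha:7.1f} step={step:6d} "
                  f"train={train_loss.item():.4f} test={test_loss.item():.4f}")

for alpha in (1.0, 100.0):  # small alpha: feature learning; large alpha: lazy
    run(alpha)
```

In this kind of sketch the large-alpha run behaves roughly like kernel regression on the initial features, while the small-alpha run can eventually adapt its features toward the target; reproducing the specific delayed-generalization curves in the paper would require matching its task, scaling, and dataset-size choices.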

Authors (4)
  1. Tanishq Kumar (6 papers)
  2. Blake Bordelon (27 papers)
  3. Samuel J. Gershman (25 papers)
  4. Cengiz Pehlevan (81 papers)
Citations (23)
