Deep Networks Always Grok and Here is Why (2402.15555v2)

Published 23 Feb 2024 in cs.LG, cs.AI, and cs.CV

Abstract: Grokking, or delayed generalization, is a phenomenon where generalization in a deep neural network (DNN) occurs long after achieving near zero training error. Previous studies have reported the occurrence of grokking in specific controlled settings, such as DNNs initialized with large-norm parameters or transformers trained on algorithmic datasets. We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings, such as training of a convolutional neural network (CNN) on CIFAR10 or a Resnet on Imagenette. We introduce the new concept of delayed robustness, whereby a DNN groks adversarial examples and becomes robust, long after interpolation and/or generalization. We develop an analytical explanation for the emergence of both delayed generalization and delayed robustness based on the local complexity of a DNN's input-output mapping. Our local complexity measures the density of so-called linear regions (aka, spline partition regions) that tile the DNN input space and serves as a utile progress measure for training. We provide the first evidence that, for classification problems, the linear regions undergo a phase transition during training whereafter they migrate away from the training samples (making the DNN mapping smoother there) and towards the decision boundary (making the DNN mapping less smooth there). Grokking occurs post phase transition as a robust partition of the input space thanks to the linearization of the DNN mapping around the training points. Website: https://bit.ly/grok-adversarial

Summary

  • The paper introduces delayed robustness, showing that deep neural networks can become robust to adversarial examples long after they have fit, and even generalized on, the training data.
  • It develops a local complexity (LC) measure that quantifies the density of linear regions and reveals their migration toward the decision boundary during training.
  • Extensive experiments across architectures and datasets demonstrate that delayed generalization and delayed robustness emerge from the same shift in the network's input-space partition.

Exploring the Phenomenon of Grokking in Deep Neural Networks

Introduction to Grokking

Grokking, a phenomenon of delayed generalization in deep neural networks (DNNs), has attracted considerable recent interest in machine learning research. In the classical setting, a DNN reaches near-zero training error yet initially generalizes poorly to unseen data; much later, and without any change to the training regimen, its test performance improves sharply. This late improvement is what is referred to as "grokking."
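
As an illustrative sketch (not code from the paper), grokking can be spotted by logging train and test accuracy over very long training runs and looking for the late jump in test accuracy. The `model`, data loaders, `optimizer`, `num_epochs`, and `train_one_epoch` helper below are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Fraction of correctly classified samples in a data loader."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Hypothetical training loop: grokking appears as a long plateau of
# near-perfect train accuracy followed, much later, by a jump in test accuracy.
history = []
for epoch in range(num_epochs):                      # num_epochs: placeholder
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    history.append((epoch,
                    accuracy(model, train_loader),
                    accuracy(model, test_loader)))
```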

Delayed Robustness: A Novel Form of Grokking

This paper introduces the concept of delayed robustness, expanding the understanding of grokking. Delayed robustness occurs when a DNN becomes robust to adversarial examples long after interpolation and, in some cases, long after generalization, a notable departure from the traditional view that robustness and generalization are competing objectives. The phenomenon was identified across various architectures, including fully connected networks, convolutional neural networks (CNNs), and ResNets, trained on datasets such as MNIST, CIFAR10, CIFAR100, and Imagenette.
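
As a rough illustration (not the authors' evaluation code), delayed robustness can be probed by measuring adversarial accuracy at successive training checkpoints. A single-step FGSM attack is used below for brevity, whereas stronger attacks such as PGD would typically be used in practice; `model`, `test_loader`, and the checkpoint list are hypothetical.

```python
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, eps=8 / 255, device="cpu"):
    """Accuracy under a single-step FGSM perturbation of size eps."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad, = torch.autograd.grad(loss, x)
        x_adv = (x + eps * grad.sign()).clamp(0, 1)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Hypothetical usage: evaluate saved checkpoints over training time.
# for step, path in checkpoints:
#     model.load_state_dict(torch.load(path))
#     print(step, fgsm_accuracy(model, test_loader))
```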

Local Complexity and Its Role in Grokking

Central to understanding and quantifying grokking is a novel measure dubbed local complexity (LC), which evaluates the density of linear regions (equivalently, spline partition regions) of the DNN's input-output mapping around a given point, and which is posited to be a meaningful progress measure for training. The paper observes that during training LC undergoes a phase transition, after which linear regions migrate away from the training samples and concentrate around the decision boundary, a process the authors call "region migration." This migration is crucial for the onset of grokking, as it yields a robust partition of the input space.
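
A minimal sketch of the intuition behind LC (not the paper's exact estimator) is to sample points in a small neighbourhood of an input and count how many distinct ReLU activation patterns, i.e. linear regions, that neighbourhood intersects. Here `hidden_preactivations` is an assumed helper that returns the concatenated pre-activations of the network's ReLU layers for a batch of inputs.

```python
import torch

@torch.no_grad()
def local_complexity(hidden_preactivations, x, radius=0.01, n_samples=256):
    """Approximate LC at x: number of distinct activation patterns
    (linear regions) hit by random points on a sphere of the given radius."""
    noise = torch.randn(n_samples, *x.shape)
    norms = noise.flatten(1).norm(dim=1).view(-1, *([1] * x.dim()))
    neighbours = x.unsqueeze(0) + radius * noise / norms
    preacts = hidden_preactivations(neighbours)            # (n_samples, n_units)
    patterns = (preacts > 0).flatten(1).to(torch.int8)     # one sign pattern per sample
    return torch.unique(patterns, dim=0).shape[0]
```

High LC around a point means the spline partition is dense there; the paper's central observation is that, after the phase transition, LC drops around training points and rises near the decision boundary.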

Observations and Contributions

This paper provides the first comprehensive documentation of delayed robustness as a variant of grokking and establishes a connection between LC dynamics and grokking events. Through extensive experiments, it demonstrates that both delayed generalization and delayed robustness emerge from the same shift in the network's partitioning of the input space, facilitated by the linearization of the DNN mapping around the training points.
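
One way to make "linearization around training points" concrete (again an illustrative assumption, not the paper's measure) is to check how far the network deviates from its first-order Taylor expansion in a small ball around a training input; a near-zero deviation indicates the point sits well inside a single linear region.

```python
import torch

def linearization_error(model, x, radius=0.01, n_samples=32):
    """Mean deviation of model outputs from the first-order Taylor
    expansion around x, over random perturbations of the given radius."""
    model.eval()
    f = lambda z: model(z.unsqueeze(0)).squeeze(0)
    y0 = f(x).detach()
    jac = torch.autograd.functional.jacobian(f, x).flatten(1)  # (out_dim, in_dim)
    errors = []
    for _ in range(n_samples):
        delta = torch.randn_like(x)
        delta = radius * delta / delta.norm()
        with torch.no_grad():
            exact = f(x + delta)
            taylor = y0 + jac @ delta.flatten()
        errors.append((exact - taylor).norm().item())
    return sum(errors) / len(errors)
```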

Implications for Future Research

The implications of this work are manifold, suggesting that DNN training continues to reshape the input-space partition long after standard accuracy metrics have plateaued. The observation that grokking can occur with respect to adversarial robustness opens new avenues for developing more resilient models. Furthermore, the introduction of LC as a tractable measure provides a practical way to monitor, and potentially steer, the training process toward desired outcomes.

Limitations and Future Directions

While this paper marks a significant advance in understanding grokking, it acknowledges limitations, including the absence of a theoretical framework that predicts the double-descent behavior of LC and its late occurrence during training. The work also raises open questions about LC dynamics under different optimizers and the potential relationship between region migration and phenomena such as neural collapse. Future work will explore these aspects, aiming to further unravel the complexities of DNN training and generalization.

Concluding Remarks

In conclusion, this paper sheds light on the previously underexplored landscape of grokking, specifically introducing the notion of delayed robustness. By developing a nuanced understanding of local complexity dynamics, it lays foundational work for future explorations into the intricacies of deep learning models, their training behaviors, and inherent capabilities for unexpected adaptation and learning.
