Deep Networks Always Grok and Here is Why (2402.15555v2)
Abstract: Grokking, or delayed generalization, is a phenomenon where generalization in a deep neural network (DNN) occurs long after achieving near-zero training error. Previous studies have reported grokking in specific controlled settings, such as DNNs initialized with large-norm parameters or transformers trained on algorithmic datasets. We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings, such as training a convolutional neural network (CNN) on CIFAR10 or a ResNet on Imagenette. We introduce the new concept of delayed robustness, whereby a DNN groks adversarial examples and becomes robust long after interpolation and/or generalization. We develop an analytical explanation for the emergence of both delayed generalization and delayed robustness based on the local complexity of a DNN's input-output mapping. Our local complexity measures the density of so-called linear regions (a.k.a. spline partition regions) that tile the DNN input space and serves as a useful progress measure for training. We provide the first evidence that, for classification problems, the linear regions undergo a phase transition during training, after which they migrate away from the training samples (making the DNN mapping smoother there) and towards the decision boundary (making the DNN mapping less smooth there). Grokking occurs after this phase transition, as a robust partition of the input space emerges thanks to the linearization of the DNN mapping around the training points. Website: https://bit.ly/grok-adversarial
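To make the local-complexity idea concrete, here is a minimal sketch, not the paper's exact estimator: for a toy two-layer ReLU MLP with hypothetical random weights, it approximates local complexity at a point by counting distinct ReLU activation sign patterns among samples drawn from a small neighborhood, since each pattern identifies one linear region of the spline partition. The network, `radius`, and `n_samples` values are illustrative assumptions.

```python
# Minimal sketch of a local-complexity proxy via activation-pattern counting.
# Assumption: a toy 2-hidden-layer ReLU MLP on R^2 with random weights,
# standing in for a trained DNN; not the paper's exact estimator.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network weights (16 units per hidden layer).
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def activation_pattern(x):
    """Concatenated on/off (sign) pattern of both ReLU layers at input x."""
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0.0) + b2
    return np.concatenate([h1 > 0, h2 > 0])

def local_complexity(x, radius=0.1, n_samples=2000):
    """Count distinct activation patterns inside a ball of `radius` around x.

    Each distinct pattern corresponds to a different linear region of the
    network's spline partition intersecting the neighborhood, so the count
    is a crude proxy for the density of linear regions near x.
    """
    patterns = set()
    for _ in range(n_samples):
        direction = rng.normal(size=x.shape)
        direction /= np.linalg.norm(direction)
        point = x + rng.uniform(0.0, radius) * direction  # random point in the ball
        patterns.add(activation_pattern(point).tobytes())
    return len(patterns)

x = np.array([0.5, -0.3])
print("estimated local complexity near x:", local_complexity(x))
```

Tracked over training near the training points, a quantity like this would, per the paper's claim, fall after the phase transition as linear regions migrate away from the data toward the decision boundary, with grokking following that transition.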