Deep Networks Always Grok and Here is Why (2402.15555v2)

Published 23 Feb 2024 in cs.LG, cs.AI, and cs.CV

Abstract: Grokking, or delayed generalization, is a phenomenon where generalization in a deep neural network (DNN) occurs long after achieving near zero training error. Previous studies have reported the occurrence of grokking in specific controlled settings, such as DNNs initialized with large-norm parameters or transformers trained on algorithmic datasets. We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings, such as training of a convolutional neural network (CNN) on CIFAR10 or a Resnet on Imagenette. We introduce the new concept of delayed robustness, whereby a DNN groks adversarial examples and becomes robust, long after interpolation and/or generalization. We develop an analytical explanation for the emergence of both delayed generalization and delayed robustness based on the local complexity of a DNN's input-output mapping. Our local complexity measures the density of so-called linear regions (aka, spline partition regions) that tile the DNN input space and serves as a utile progress measure for training. We provide the first evidence that, for classification problems, the linear regions undergo a phase transition during training whereafter they migrate away from the training samples (making the DNN mapping smoother there) and towards the decision boundary (making the DNN mapping less smooth there). Grokking occurs post phase transition as a robust partition of the input space thanks to the linearization of the DNN mapping around the training points. Website: https://bit.ly/grok-adversarial

Summary

  • The paper introduces delayed robustness, showing that deep neural networks can become robust to adversarial examples long after they have fit, and even generalized on, the training data.
  • It develops a local complexity (LC) measure that quantifies the density of linear regions and reveals their migration toward the decision boundary during training.
  • Extensive experiments across architectures and datasets demonstrate that delayed generalization and delayed robustness emerge from the same shift in the network's input-space partition.

Exploring the Phenomenon of Grokking in Deep Neural Networks

Introduction to Grokking

Grokking, a phenomenon of delayed generalization in deep neural networks (DNNs), has attracted considerable recent interest in machine learning research. In the classical setting, a DNN reaches near-zero training error yet initially generalizes poorly to unseen data; much later, and without any change to the training regimen, its test performance improves sharply. This late improvement is what is referred to as "grokking."
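
As an illustrative sketch (not code from the paper), grokking can be spotted by logging train and test accuracy over very long training runs and looking for the late jump in test accuracy. The `model`, data loaders, `optimizer`, `num_epochs`, and `train_one_epoch` helper below are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Fraction of correctly classified samples in a data loader."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Hypothetical training loop: grokking appears as a long plateau of
# near-perfect train accuracy followed, much later, by a jump in test accuracy.
history = []
for epoch in range(num_epochs):                      # num_epochs: placeholder
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    history.append((epoch,
                    accuracy(model, train_loader),
                    accuracy(model, test_loader)))
```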

Delayed Robustness: A Novel Form of Grokking

This paper introduces the concept of delayed robustness, expanding the understanding of grokking. Delayed robustness occurs when a DNN becomes robust to adversarial examples long after interpolation and, in some cases, long after generalization, a notable departure from the traditional view that robustness and generalization are competing objectives. The phenomenon was identified across various architectures, including fully connected networks, convolutional neural networks (CNNs), and ResNets, trained on datasets such as MNIST, CIFAR10, CIFAR100, and Imagenette.
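
As a rough illustration (not the authors' evaluation code), delayed robustness can be probed by measuring adversarial accuracy at successive training checkpoints. A single-step FGSM attack is used below for brevity, whereas stronger attacks such as PGD would typically be used in practice; `model`, `test_loader`, and the checkpoint list are hypothetical.

```python
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, eps=8 / 255, device="cpu"):
    """Accuracy under a single-step FGSM perturbation of size eps."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad, = torch.autograd.grad(loss, x)
        x_adv = (x + eps * grad.sign()).clamp(0, 1)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Hypothetical usage: evaluate saved checkpoints over training time.
# for step, path in checkpoints:
#     model.load_state_dict(torch.load(path))
#     print(step, fgsm_accuracy(model, test_loader))
```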

Local Complexity and Its Role in Grokking

Central to understanding and quantifying grokking is a novel measure dubbed local complexity (LC), which evaluates the density of linear regions (equivalently, spline partition regions) of the DNN's input-output mapping around a given point, and which is posited to be a meaningful progress measure for training. The paper observes that during training LC undergoes a phase transition, after which linear regions migrate away from the training samples and concentrate around the decision boundary, a process the authors call "region migration." This migration is crucial for the onset of grokking, as it yields a robust partition of the input space.
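
A minimal sketch of the intuition behind LC (not the paper's exact estimator) is to sample points in a small neighbourhood of an input and count how many distinct ReLU activation patterns, i.e. linear regions, that neighbourhood intersects. Here `hidden_preactivations` is an assumed helper that returns the concatenated pre-activations of the network's ReLU layers for a batch of inputs.

```python
import torch

@torch.no_grad()
def local_complexity(hidden_preactivations, x, radius=0.01, n_samples=256):
    """Approximate LC at x: number of distinct activation patterns
    (linear regions) hit by random points on a sphere of the given radius."""
    noise = torch.randn(n_samples, *x.shape)
    norms = noise.flatten(1).norm(dim=1).view(-1, *([1] * x.dim()))
    neighbours = x.unsqueeze(0) + radius * noise / norms
    preacts = hidden_preactivations(neighbours)            # (n_samples, n_units)
    patterns = (preacts > 0).flatten(1).to(torch.int8)     # one sign pattern per sample
    return torch.unique(patterns, dim=0).shape[0]
```

High LC around a point means the spline partition is dense there; the paper's central observation is that, after the phase transition, LC drops around training points and rises near the decision boundary.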

Observations and Contributions

This paper provides the first comprehensive documentation of delayed robustness as a variant of grokking and establishes a connection between LC dynamics and grokking events. Through extensive experiments, it demonstrates that both delayed generalization and delayed robustness emerge from the same shift in the network's partitioning of the input space, facilitated by the linearization of the DNN mapping around the training points.
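
One way to make "linearization around training points" concrete (again an illustrative assumption, not the paper's measure) is to check how far the network deviates from its first-order Taylor expansion in a small ball around a training input; a near-zero deviation indicates the point sits well inside a single linear region.

```python
import torch

def linearization_error(model, x, radius=0.01, n_samples=32):
    """Mean deviation of model outputs from the first-order Taylor
    expansion around x, over random perturbations of the given radius."""
    model.eval()
    f = lambda z: model(z.unsqueeze(0)).squeeze(0)
    y0 = f(x).detach()
    jac = torch.autograd.functional.jacobian(f, x).flatten(1)  # (out_dim, in_dim)
    errors = []
    for _ in range(n_samples):
        delta = torch.randn_like(x)
        delta = radius * delta / delta.norm()
        with torch.no_grad():
            exact = f(x + delta)
            taylor = y0 + jac @ delta.flatten()
        errors.append((exact - taylor).norm().item())
    return sum(errors) / len(errors)
```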

Implications for Future Research

The implications of this work are manifold, suggesting that DNN training continues to reshape the input-space partition long after standard accuracy metrics have plateaued. The observation that grokking can occur with respect to adversarial robustness opens new avenues for developing more resilient models. Furthermore, the introduction of LC as a tractable measure provides a practical way to monitor, and potentially steer, the training process toward desired outcomes.

Limitations and Future Directions

While this paper marks a significant advance in understanding grokking, it acknowledges limitations, including the absence of a theoretical framework that predicts the double-descent behavior of LC and its late occurrence during training. The work also raises open questions about LC dynamics under different optimizers and the potential relationship between region migration and phenomena such as neural collapse. Future work will explore these aspects, aiming to further unravel the complexities of DNN training and generalization.

Concluding Remarks

In conclusion, this paper sheds light on the previously underexplored landscape of grokking, specifically introducing the notion of delayed robustness. By developing a nuanced understanding of local complexity dynamics, it lays foundational work for future explorations into the intricacies of deep learning models, their training behaviors, and inherent capabilities for unexpected adaptation and learning.
