SGD on Neural Networks Learns Functions of Increasing Complexity (1905.11604v1)

Published 28 May 2019 in cs.LG, cs.NE, and stat.ML

Abstract: We perform an experimental study of the dynamics of Stochastic Gradient Descent (SGD) in learning deep neural networks for several real and synthetic classification tasks. We show that in the initial epochs, almost all of the performance improvement of the classifier obtained by SGD can be explained by a linear classifier. More generally, we give evidence for the hypothesis that, as iterations progress, SGD learns functions of increasing complexity. This hypothesis can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime. We also show that the linear classifier learned in the initial stages is "retained" throughout the execution even if training is continued to the point of zero training error, and complement this with a theoretical result in a simplified model. Key to our work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information.

Citations (222)

Summary

  • The paper demonstrates that SGD initially favors simple, near-linear classifiers before incorporating more complex functions, helping to explain neural networks' generalization.
  • It introduces a novel mutual information metric to assess how well a simpler classifier explains the performance of a more complex model during training.
  • The research offers practical insights for adjusting training routines and early stopping by emphasizing the significance of early learning dynamics in preventing overfitting.

An Overview of "SGD on Neural Networks Learns Functions of Increasing Complexity"

The paper "SGD on Neural Networks Learns Functions of Increasing Complexity" by Nakkiran et al. presents an empirical and theoretical investigation of the dynamics of Stochastic Gradient Descent (SGD) when training deep neural networks. The research seeks to explain the generalization of SGD-trained networks, particularly in the over-parameterized regime. The primary hypothesis is that SGD learns functions of increasing complexity over the course of training, which may elucidate why SGD-trained networks tend to generalize well despite their capacity to overfit.

Key Contributions and Findings

The paper introduces a hypothesis about the trajectory of learning followed by SGD:

  • Initially, SGD favors simpler functions.
  • During later training stages, it retains some attributes of these simple classifiers even as it learns more complex representations.

This hypothesis is supported by empirical studies on both real and synthetic datasets. A pivotal finding is that in the early phases of training, almost all of the classifier's performance gains can be attributed to a linear classifier. As training continues, more complex functions are incrementally integrated without discarding the simpler, effective ones learned earlier.
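This two-phase picture can be illustrated with a self-contained sketch (synthetic data and a tiny NumPy network of my own construction, not the authors' experimental setup): fit a linear baseline first, then record how often an SGD-trained one-hidden-layer network agrees with it after each epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary task: a linear rule plus a small nonlinear interaction term.
n, d = 2000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = ((X @ w_true + 0.5 * X[:, 0] * X[:, 1]) > 0).astype(int)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# Linear baseline: logistic regression trained by full-batch gradient descent.
w = np.zeros(d)
for _ in range(500):
    w -= 0.1 * X.T @ (sigmoid(X @ w) - y) / n
linear_pred = (X @ w > 0).astype(int)

# Small one-hidden-layer ReLU network trained with mini-batch SGD.
h = 64
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=h);      b2 = 0.0
lr, batch = 0.1, 32
agreement = []                     # fraction of inputs where net == linear baseline
for epoch in range(20):
    idx = rng.permutation(n)
    for s in range(0, n, batch):
        j = idx[s:s + batch]
        a1 = np.maximum(X[j] @ W1 + b1, 0)      # ReLU hidden activations
        p = sigmoid(a1 @ W2 + b2)
        g = (p - y[j]) / len(j)                 # dLoss/dlogit for cross-entropy
        gW2 = a1.T @ g
        ga1 = np.outer(g, W2) * (a1 > 0)        # backprop through ReLU
        W1 -= lr * X[j].T @ ga1; b1 -= lr * ga1.sum(0)
        W2 -= lr * gW2;          b2 -= lr * g.sum()
    net_pred = (np.maximum(X @ W1 + b1, 0) @ W2 + b2 > 0).astype(int)
    agreement.append((net_pred == linear_pred).mean())
```

Under the paper's hypothesis, `agreement` should start high in early epochs (the network behaves like the linear baseline) and remain substantial even as the network fits the nonlinear residue.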

Key contributions of the paper include:

  1. Empirical Observations: Through various experiments, the authors observe a distinct two-phase learning process:
    • Initial Phase: Performance improvements are largely explained by linear classifiers.
    • Intermediate and Final Phases: Even though the networks can represent complex functions, they retain—and leverage—the initial simpler classification abilities.
  2. Theoretical Insights: The paper presents theoretical explanations for how SGD might implicitly retain simpler models' performance during further training iterations, even as it attains zero training error with potentially non-linear boundaries.
  3. Novel Metric: The introduction of a new metric based on mutual information quantifies how well a simpler classifier (e.g., a linear classifier) explains the performance of a more complex model produced by SGD. This metric, focusing on conditional mutual information, is central to assessing the progression of learning complexity.
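Concretely, if F is the complex classifier's prediction, L the simple classifier's prediction, and Y the true label, a conditional mutual information of the form I(F; Y | L) near zero says that L already "explains" F's performance. A minimal empirical plug-in estimator over discrete predictions (a sketch under that reading; the authors' exact quantity and estimator may differ) could look like:

```python
import numpy as np

def conditional_mutual_information(f, l, y):
    """Plug-in estimate of I(F; Y | L) in bits from paired samples.

    f, l, y are equal-length 1-D integer arrays: the complex model's
    predictions, the simple model's predictions, and the true labels.
    A value near zero means the simple model already accounts for
    whatever the complex model knows about the labels.
    """
    f, l, y = map(np.asarray, (f, l, y))
    cmi = 0.0
    for lv in np.unique(l):
        mask = l == lv
        p_l = mask.mean()                    # P(L = lv)
        fs, ys = f[mask], y[mask]
        # Empirical I(F; Y) within the slice L = lv.
        for fv in np.unique(fs):
            for yv in np.unique(ys):
                p_fy = np.mean((fs == fv) & (ys == yv))
                if p_fy > 0:
                    p_f = np.mean(fs == fv)
                    p_y = np.mean(ys == yv)
                    cmi += p_l * p_fy * np.log2(p_fy / (p_f * p_y))
    return cmi
```

For instance, if the complex model's predictions exactly coincide with the simple model's (f equals l), the estimate is zero; if the simple model is uninformative while the complex model predicts a balanced binary label perfectly, the estimate equals the label entropy of one bit.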

Implications and Future Work

Theoretical Implications

The research points toward a better understanding of the implicit regularization mechanisms of SGD. The observation that SGD traverses functions of increasing complexity offers insight into how model capacity and generalization are balanced during training.

Practical Implications

For practitioners, the results emphasize that early learning dynamics are crucial: monitoring how much of a network's performance a simple baseline explains could inform early-stopping criteria and adjustments to training routines, reducing the risk of overfitting.

Future Directions

Future research could focus on:

  1. Formalizing a complexity measure that captures the essence of learning stages in SGD and correlates with generalization.
  2. Exploring initialization schemes or controlled interventions in SGD that might accelerate the early, simple phases of learning or enhance generalization beyond standard settings.
  3. Extending the analysis to different forms of neural architectures and alternative optimization strategies to generalize the findings beyond those settings initially explored.

In sum, this paper provides a valuable framework for understanding the layered competency acquisition by SGD on neural networks, establishing a foundation for deeper explorations into explaining neural networks' generalization capabilities.