The Early Phase of Neural Network Training (2002.10365v1)

Published 24 Feb 2020 in cs.LG, cs.NE, and stat.ML

Abstract: Recent studies have shown that many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the network undergoes a critical period (Achille et al., 2019). Here, we examine the changes that deep neural networks undergo during this early phase of training. We perform extensive measurements of the network state during these early iterations of training and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset. We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations. Despite this behavior, pre-training with blurred inputs or an auxiliary self-supervised task can approximate the changes in supervised networks, suggesting that these changes are not inherently label-dependent, though labels significantly accelerate this process. Together, these results help to elucidate the network changes occurring during this pivotal initial period of learning.

Citations (159)

Summary

  • The paper reveals that early training comprises distinct sub-phases: an initial period of rapid weight change, followed by stabilization and a gradual slowdown in improvement.
  • The paper demonstrates that preserving early weight configurations is crucial: re-initializing or permuting the weights substantially degrades the performance of deeper networks.
  • The paper shows that random labels fail to reproduce the beneficial early changes, whereas self-supervised pre-training can approximate them, albeit more slowly than training with true labels.

Insights into the Early Phase of Neural Network Training

The paper "The Early Phase of Neural Network Training" conducted by Jonathan Frankle, David J. Schwab, and Ari S. Morcos provides an extensive analysis of the dynamics of neural networks during their initial training stage. This stage, often overlooked in favor of later epochs where models achieve convergence, is shown to be critical for the overall learning process. The authors offer quantitative insights into how network weights evolve in the early training epochs and how this period shapes the later capacity of neural networks to generalize and perform effectively.

The paper employs Iterative Magnitude Pruning with rewinding (IMP), the framework behind the lottery ticket hypothesis, as a quantitative probe of how the weights change during early training. The lottery ticket hypothesis posits that a dense neural network contains a smaller sub-network that, when initialized appropriately, can be trained to perform on par with the full network. Initially validated on small networks, this framework is extended here to deeper architectures such as ResNets.
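To make the probing methodology concrete, below is a minimal sketch of IMP with rewinding. It is illustrative only: a toy MLP and random data stand in for the ResNets and datasets the paper actually studies, and the rewind step, pruning rate, and training lengths are placeholder values rather than the paper's settings.

```python
import copy
import torch
import torch.nn as nn

def make_model():
    # Toy stand-in for the deeper architectures (e.g., ResNets) studied in the paper.
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

def train(model, masks, steps=200):
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for _ in range(steps):
        with torch.no_grad():  # keep pruned weights at zero
            for p, m in zip(model.parameters(), masks):
                p.mul_(m)
        x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))  # random stand-in data
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

def prune_by_magnitude(model, masks, frac=0.2):
    # Remove the smallest-magnitude fraction of the weights that are still unpruned.
    scores = torch.cat([(p.abs() * m).flatten() for p, m in zip(model.parameters(), masks)])
    surviving = scores[scores > 0]
    threshold = surviving.kthvalue(int(frac * surviving.numel())).values
    return [m * (p.abs() > threshold).float() for p, m in zip(model.parameters(), masks)]

# IMP with rewinding to iteration k (after Frankle et al., 2019), in outline:
model = make_model()
masks = [torch.ones_like(p) for p in model.parameters()]
k_steps = 50                                  # "rewind point" early in training
train(model, masks, steps=k_steps)
theta_k = copy.deepcopy(model.state_dict())   # weights saved at iteration k

for _round in range(5):                       # iterative pruning rounds
    train(model, masks, steps=200)            # train the current sparse network further
    masks = prune_by_magnitude(model, masks)  # prune 20% of the remaining weights
    model.load_state_dict(theta_k)            # rewind surviving weights to their values at step k
```

Rewinding to iteration k (rather than to the original initialization) is what lets the method measure how much of the network's eventual capability is already determined by the state reached in the earliest iterations.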

Key Findings

  1. Early Phase Changes:
    • The research highlights three distinct sub-phases within the first few thousand iterations of training.
    • An initial phase of rapid weight change gives way to a second phase in which gradient magnitudes settle and accuracy climbs quickly.
    • In the final sub-phase, the network continues to learn, but the rate of improvement slows.
  2. Importance of Weight Configurations:
    • The paper finds that deeper networks are not robust to re-initialization with random weights even when the learned signs are preserved, in contrast to earlier findings for shallower networks.
    • Preserving the weight values reached in the early training iterations is crucial, suggesting that the configurations acquired during this period are instrumental in enabling sparsity and achieving high accuracy.
    • Permuting the weights after early training is similarly detrimental, demonstrating that the weights are not independently and identically distributed (a sketch of these perturbation probes follows this list).
  3. Data Dependence:
    • Pre-training with random labels fails to reproduce the favorable early network changes, while label-agnostic alternatives such as blurred inputs or a self-supervised pretext task can approximate them; accurate labels primarily accelerate the process.
    • Self-supervised pre-training can bring the network to a comparable state for effective early-phase learning, but it requires considerably more epochs than supervised training (an illustrative pretext-task sketch also follows this list).
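The re-initialization and permutation experiments referenced above can be sketched as two simple perturbations applied to a network that has been rewound to an early iteration. The exact re-initialization distribution and the particular perturbation variants used in the paper are simplified here for illustration.

```python
import torch

def reinit_keep_signs(params):
    """Sign-preserving re-initialization: draw fresh random magnitudes but keep
    each weight's sign from early training (illustrative re-init distribution)."""
    with torch.no_grad():
        for p in params:
            fresh = torch.randn_like(p) * p.std()
            p.copy_(fresh.abs() * torch.sign(p))

def permute_within_layers(params):
    """Shuffle the trained weights within each parameter tensor: the per-layer
    weight distribution is preserved, but dependencies between weights are lost."""
    with torch.no_grad():
        for p in params:
            flat = p.flatten()
            p.copy_(flat[torch.randperm(flat.numel())].view_as(p))
```

In the paper's framing, one rewinds a partially trained network to iteration k, applies a perturbation like these, resumes training, and compares the resulting accuracy against the unperturbed rewind; for deeper networks, both probes cost substantial accuracy.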

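For the self-supervised alternative, the sketch below uses rotation prediction as a representative pretext task; the specific task, architecture, and training lengths are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

def rotate_batch(x):
    # Build a 4-way rotation-prediction batch: each image is rotated by
    # 0/90/180/270 degrees and labeled with the rotation index.
    rotations = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rotations, dim=0), labels

# Tiny illustrative backbone; no class labels are used during this phase.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
rotation_head = nn.Linear(16, 4)
opt = torch.optim.SGD(list(backbone.parameters()) + list(rotation_head.parameters()), lr=0.1)

# Self-supervised pre-training phase; the paper reports that reaching a
# comparable network state this way takes many more epochs than supervised training.
for _ in range(100):
    images = torch.randn(32, 3, 32, 32)  # stand-in for unlabeled CIFAR-style images
    x, y = rotate_batch(images)
    loss = nn.functional.cross_entropy(rotation_head(backbone(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Afterward, the rotation head would be discarded and the backbone used to
# initialize supervised training, mirroring a pre-training-then-fine-tuning setup.
```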
Implications

These results deepen our understanding of the early stages of neural network training and challenge some existing assumptions about model initialization and robustness. Most notably, the importance of the early phase in determining model capabilities calls into question whether random re-initialization with sign preservation alone is sufficient for deeper networks, a finding directly relevant to optimization strategies for sparse sub-networks.

The observed non-independence of the weights after only a few hundred iterations has implications for simplifying models while maintaining performance: examining weight distributions and their dependencies in depth may offer leverage points for model efficiency, especially in larger-scale networks.

Future Directions

The findings pave the way for multiple research avenues:

  • Studying early-training dynamics in architectures beyond those explored here could reveal differing structural sensitivities.
  • Examining different pre-training tasks could provide further insight into the interplay between unsupervised or pretext tasks and the supervised learning phase.
  • Understanding how hyperparameters and other training choices influence these early stages could further optimize training protocols and model configurations.

In conclusion, this paper highlights a significant aspect of neural network training that warrants focused exploration to fully understand and efficiently harness the capabilities of deep learning models. The early phase is critical, not just as a prelude to effective model training but as an essential stage that determines the foundational structure of the final model.
