- The paper demonstrates that standard neural networks become stable to SGD noise early in training, at which point copies trained with different noise reach minima that are linearly connected, with no error barrier between them.
- The study employs instability analysis and linear interpolation to measure error barriers during training across datasets like MNIST, CIFAR-10, and ImageNet.
- The paper finds that lottery ticket subnetworks only reach full accuracy once the network has stabilized, suggesting opportunities for early pruning.
Linear Mode Connectivity and the Lottery Ticket Hypothesis
In the paper "Linear Mode Connectivity and the Lottery Ticket Hypothesis," Frankle et al. present a paper on how neural networks optimize under different samples of stochastic gradient descent (SGD) noise and the implications for sparse networks known as lottery tickets.
Summary
The research explores the stability of neural networks to SGD noise, asserting that standard vision models become stable early in training. It focuses on linear mode connectivity: whether the minima reached by two copies of a network trained under different samples of SGD noise are connected by a linear path along which error does not increase.
The paper employs a technique called instability analysis, which involves training two copies of the same network under different SGD noise and measuring the error along the linear interpolation of their weights. If the rise in error along this path is negligible, the networks are considered stable and their minima linearly connected. The paper finds that standard networks across various datasets, including MNIST, CIFAR-10, and ImageNet, achieve stability early in the training process.
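Concretely, if $w_1$ and $w_2$ are the weights of the two trained copies and $\mathcal{E}(w)$ denotes the error of the network at weights $w$, the quantity examined along the linear path can be formalized as follows (notation ours, chosen to match the description above):

$$
w(\alpha) = (1 - \alpha)\, w_1 + \alpha\, w_2, \qquad \alpha \in [0, 1]
$$

$$
\text{instability} \;=\; \max_{\alpha \in [0,1]} \mathcal{E}\bigl(w(\alpha)\bigr) \;-\; \tfrac{1}{2}\bigl(\mathcal{E}(w_1) + \mathcal{E}(w_2)\bigr)
$$

The network is considered stable, and the two minima linearly connected, when this instability is approximately zero.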
Implications for Lottery Ticket Hypothesis
The paper examines iterative magnitude pruning (IMP) in the context of the lottery ticket hypothesis. IMP identifies sparse subnetworks that could have been trained in isolation to the same accuracy as the full network. The paper finds that these lottery ticket subnetworks reach full accuracy only when they are stable to SGD noise. For smaller networks, such as those used on MNIST, stability holds at initialization; for larger architectures like ResNet-50 and Inception-v3 on ImageNet, it is reached only early in training, so IMP must rewind the surviving weights to that early point rather than to initialization.
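A minimal sketch of IMP with rewinding is given below. It assumes a user-supplied training loop `train_to(model, start, end, mask)` (a hypothetical helper, not the authors' code or a library API), and for simplicity treats all parameters as prunable, whereas the paper prunes only layer weights.

```python
import torch


def prune_lowest_magnitude(model, mask, fraction):
    """Globally zero out the `fraction` of still-surviving weights with the
    smallest magnitudes, returning an updated {name: 0/1 tensor} mask."""
    scores = torch.cat([(p.detach().abs() * mask[n]).flatten()
                        for n, p in model.named_parameters()])
    surviving = scores[scores > 0]
    k = max(int(fraction * surviving.numel()), 1)
    threshold = surviving.kthvalue(k).values
    return {n: mask[n] * (p.detach().abs() > threshold).float()
            for n, p in model.named_parameters()}


def imp_with_rewinding(model, train_to, rewind_iter, total_iters,
                       fraction=0.2, rounds=10):
    """Iterative magnitude pruning with rewinding: train, prune the
    smallest-magnitude weights, rewind survivors to their values at
    iteration `rewind_iter`, and repeat."""
    # Train to the rewind point and snapshot the weights used for rewinding.
    train_to(model, start=0, end=rewind_iter, mask=None)
    rewind = {n: p.detach().clone() for n, p in model.named_parameters()}
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train_to(model, start=rewind_iter, end=total_iters, mask=mask)
        mask = prune_lowest_magnitude(model, mask, fraction)
        with torch.no_grad():  # rewind the surviving weights
            for n, p in model.named_parameters():
                p.copy_(rewind[n] * mask[n])
    return mask, rewind
```

Setting `rewind_iter = 0` recovers the original lottery ticket procedure of rewinding to initialization, which the paper shows succeeds only on the smaller benchmarks.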
Methodology and Results
Key methodologies include:
- Instability Analysis: Measures a network's sensitivity to SGD noise by training two copies of the network from the same state under different data orders and augmentations and observing how the error changes when interpolating between their weights.
- Linear Interpolation: Evaluates the error barrier height along the line segment between the weights of the two trained copies; the copies are linearly mode connected, and the original network deemed stable, when this instability is close to zero (a minimal implementation sketch follows this list).
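The sketch below shows this measurement in PyTorch-style Python. The model, the state dicts of the two trained copies, the `eval_error` helper that returns a model's test error, and the 30-point interpolation grid are illustrative assumptions, not the authors' implementation.

```python
import copy


def interpolate(sd_a, sd_b, alpha):
    """Per-parameter linear interpolation (1 - alpha) * sd_a + alpha * sd_b.
    Integer buffers (e.g. BatchNorm's num_batches_tracked) are copied from sd_a."""
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k]
            if sd_a[k].is_floating_point() else sd_a[k]
            for k in sd_a}


def instability(model, sd_a, sd_b, eval_error, num_points=30):
    """Error barrier along the linear path between two trained weight settings:
    maximum interpolated error minus the mean of the two endpoint errors.
    A value near zero indicates linear mode connectivity (stability)."""
    def error_at(alpha):
        m = copy.deepcopy(model)
        m.load_state_dict(interpolate(sd_a, sd_b, alpha))
        return eval_error(m)  # assumed helper: returns the model's test error

    errors = [error_at(i / num_points) for i in range(num_points + 1)]
    return max(errors) - 0.5 * (errors[0] + errors[-1])
```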
Results indicate that for ImageNet, the networks become stable after roughly 20% of the total training epochs. The paper reports the stability point and the accuracies of the corresponding subnetworks for each benchmark, covering both small-scale and large-scale settings.
Broader Implications
The findings show that without early stability, iterative pruning fails to identify subnetworks that match the accuracy of the unpruned network. This extends the lottery ticket hypothesis by suggesting that training dynamics produce matching sparse subnetworks only after some amount of training. A practical implication for training efficiency is that pruning could be applied early in training rather than only at the end.
Conclusions and Future Directions
The results offer both theoretical and practical contributions, emphasizing the role of SGD noise in shaping neural network convergence. Future work might explore changing optimization strategies once stability is reached, such as altering learning rate schedules, to potentially improve performance. The implications for the lottery ticket hypothesis may also guide new strategies for early pruning, improving the efficiency of neural network training.
Frankle et al. elucidate a crucial aspect of neural network training dynamics and provide a scientific tool, instability analysis, for further exploration of both the robustness and the optimization of deep learning models.