
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (1803.03635v5)

Published 9 Mar 2018 in cs.LG, cs.AI, and cs.NE

Abstract: Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.

An Essay on "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks"

The paper "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks," authored by Jonathan Frankle and Michael Carbin, proposes a paradigm-shifting perspective on neural network pruning and initialization, termed as the "Lottery Ticket Hypothesis" (LTH). This hypothesis suggests that within a large, randomly-initialized neural network, there exist smaller subnetworks—referred to as "winning tickets"—that can be trained in isolation to achieve performance comparable to the original network.

Summary and Key Findings

The central question addressed by the paper is motivated by the widespread practice of pruning neural networks after training. Such pruning substantially reduces the parameter count, often by more than 90%, without sacrificing accuracy. However, one perplexing observation is that the pruned architectures train poorly when re-initialized randomly and trained from scratch, which suggests that the specific combination of connectivity and initial weight values inherited from the original network is what makes these subnetworks trainable.

Lottery Ticket Hypothesis

The Lottery Ticket Hypothesis posits:

A randomly-initialized, dense neural network contains a subnetwork that—when trained in isolation—can match the test accuracy of the original network after training for at most the same number of iterations.

To formalize, consider a dense feed-forward neural network $f(x; \theta)$ with initial parameters $\theta = \theta_0 \sim \mathcal{D}_\theta$. When trained with SGD, the network reaches minimum validation loss $l$ at iteration $j$ with test accuracy $a$. The hypothesis states that there exists a mask $m \in \{0, 1\}^{|\theta|}$ such that the subnetwork $f(x; m \odot \theta_0)$ can be trained to match the original network's accuracy $a$ within at most $j$ iterations, while $\|m\|_0 \ll |\theta|$.
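
The mask formalism can be made concrete in a few lines. The following is a minimal, illustrative PyTorch sketch (not the authors' code); the class name MaskedLinear and its structure are assumptions made for illustration. It shows a layer whose weights are elementwise-multiplied by a frozen 0/1 mask, so that training only updates the surviving connections of $f(x; m \odot \theta_0)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer whose weight is multiplied elementwise by a frozen 0/1 mask."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # The mask is a buffer, not a parameter: it is never updated by SGD.
        self.register_buffer("mask", torch.ones_like(self.linear.weight))

    def forward(self, x):
        # Computes f(x; m ⊙ θ): pruned connections (mask == 0) contribute nothing
        # to the output and receive zero loss gradient, so they remain untrained.
        return F.linear(x, self.linear.weight * self.mask, self.linear.bias)

# Usage sketch: keep a copy of the initial state so survivors can later be
# reset to θ0 (the shapes here are illustrative placeholders).
layer = MaskedLinear(784, 300)
theta0 = {k: v.clone() for k, v in layer.state_dict().items()}
```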

Methodology

The authors devised an algorithm to identify winning tickets through an iterative pruning strategy:

  1. Randomly initialize the network $f(x; \theta_0)$, where $\theta_0 \sim \mathcal{D}_\theta$.
  2. Train the network for $j$ iterations, arriving at parameters $\theta_j$.
  3. Prune $p\%$ of the weights with the lowest magnitudes in each layer, producing a mask $m$.
  4. Reset the remaining weights to their values in $\theta_0$ and repeat from step 2, pruning a further $p\%$ of the survivors each round.

An alternative one-shot approach, in which the network is trained, pruned once, and reset, was also evaluated; however, iterative pruning proved more effective at identifying small, performant winning tickets. A minimal sketch of the iterative loop is given below.
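
Putting the four steps together, the loop below is a hedged sketch of iterative magnitude pruning with resetting to $\theta_0$, not the authors' implementation. The function names, the `train` callback (an ordinary training loop that applies the masks, e.g. via layers like the MaskedLinear sketch above), and the per-round prune fraction are illustrative assumptions.

```python
import copy
import torch

def find_winning_ticket(model, train, rounds=5, prune_fraction=0.2):
    """Iterative magnitude pruning with rewinding of survivors to their initial values."""
    theta0 = copy.deepcopy(model.state_dict())                  # step 1: save θ0
    masks = {name: torch.ones_like(p)                           # start fully dense
             for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train(model, masks)                                     # step 2: train the masked network
        for name, param in model.named_parameters():
            if name not in masks:
                continue                                        # skip biases, etc.
            # step 3: prune the lowest-magnitude weights that are still alive
            magnitudes = param.detach().abs() * masks[name]
            alive = magnitudes[masks[name].bool()]
            threshold = torch.quantile(alive, prune_fraction)
            masks[name] = (magnitudes > threshold).float() * masks[name]
        model.load_state_dict(theta0)                           # step 4: reset survivors to θ0
    return masks
```

One-shot pruning corresponds to a single round with a larger prune fraction; the paper finds that repeating smaller rounds (e.g. pruning roughly 20% of the survivors per round) yields smaller winning tickets at comparable accuracy.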

Experimental Validation

Extensive experiments support the hypothesis across a range of architectures and datasets. The experiments encompassed:

  • Fully-Connected Networks on MNIST: Winning tickets were consistently found that maintained accuracy while being significantly smaller than the original network. For instance, a network pruned to 3.6% of its original size still performed on par with the original.
  • Convolutional Networks on CIFAR10: Similar patterns emerged for Conv-2, Conv-4, and Conv-6 architectures, with winning tickets achieving faster learning and comparable or even superior test accuracy.
  • Large Networks like VGG-19 and Resnet-18 on CIFAR10: The hypothesis extended to deeper networks, albeit with nuances. At the higher learning rates typically used to train these models, iterative pruning alone did not uncover winning tickets; finding them required either a lower learning rate or a learning-rate warmup at the start of training (see the brief sketch after this list).
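
The warmup used for the deeper networks is simply a ramp from near zero to the base learning rate over the first several thousand iterations. A minimal sketch of such a schedule is shown below; the model, optimizer settings, and step counts are illustrative placeholders, not the paper's exact configuration.

```python
import torch

model = torch.nn.Linear(10, 10)                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Linearly scale the learning rate from ~0 up to the base value over
# `warmup_steps`, then hold it constant; scheduler.step() runs once per iteration.
warmup_steps = 10000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))
```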

Implications

The findings suggest profound implications for the theoretical understanding and practical deployment of neural networks:

  1. Training Efficiency: Identifying winning tickets can potentially reduce training costs, both in terms of time and computational resources. This could lead to more efficient training paradigms where the network is pruned and reset early in the training process.
  2. Architecture Design: The specific structures of winning tickets could inspire new architectures that are inherently sparse and efficient, leveraging the inductive biases revealed by the winning tickets.
  3. Generalization and Optimization: Winning tickets offer a lens to examine the generalization capabilities of neural networks. The ability of smaller subnetworks within overparameterized models to achieve superior generalization aligns with existing theories linking network complexity and generalization ability. Moreover, understanding why certain initializations lead to effective optimization could advance our comprehension of the loss landscape in neural networks.
  4. Transfer Learning: Winning tickets identified for a particular task might serve as a robust initialization for related tasks, facilitating transfer learning and multi-task learning scenarios.

Future Work

Future research could focus on several directions:

  • Scalability: Efficient algorithms for identifying winning tickets in large-scale datasets like ImageNet.
  • Initialization Studies: Investigating the properties of winning ticket initializations to distinguish them from others.
  • Pruning Techniques: Exploring non-magnitude-based and structured pruning methods to find even smaller and more hardware-efficient winning tickets.

Conclusion

The Lottery Ticket Hypothesis sheds light on the inherent capacity of overparameterized neural networks to house smaller, trainable subnetworks. By leveraging iterative pruning and the fortuitous initial weights, the authors uncover a paradigm that not only enhances our understanding of neural network training dynamics but also opens avenues for more efficient and optimized model design.

Authors (2)
  1. Jonathan Frankle (37 papers)
  2. Michael Carbin (45 papers)
Citations (3,174)