"Proving the Lottery Ticket Hypothesis: Pruning is All You Need"
Key Takeaways
- The paper proves that an over-parameterized, randomly initialized network contains a weight-subnetwork approximating any target ReLU network of bounded size.
- It demonstrates that pruning is as effective as weight optimization, showing that both weight- and neuron-subnetworks can match target performance.
- It highlights the computational hardness of finding such subnetworks, paralleling the known hardness of training neural networks.
Introduction
The paper "Proving the Lottery Ticket Hypothesis: Pruning is All You Need" addresses the hypothesis originally proposed by Frankle and Carbin, which posits that large, randomly initialized neural networks contain small subnetworks that perform as well as the full network. This work extends the hypothesis by proving that such subnetworks exist in over-parameterized networks without any further training. The authors distinguish two kinds of subnetworks, weight-subnetworks (obtained by pruning individual weights) and neuron-subnetworks (obtained by pruning entire neurons), and analyze the conditions under which each can match the performance of a target network.
Theoretical Contributions
The main theoretical contribution is a proof that, under mild assumptions, a sufficiently large randomly initialized neural network contains a subnetwork achieving accuracy comparable to that of a given target network, with no training of the surviving weights. Specifically, for any target ReLU network with bounded weights, a weight-subnetwork that closely approximates the target's function can be found inside a larger randomly initialized network.
The proof constructs the approximation by pruning a random network of depth twice that of the target, emulating each target layer with a pruned pair of random layers. For shallow (depth-two) networks, neuron-subnetworks are shown to be exactly as powerful as the random features model, a notable finding given the known limitations of random features in capturing complex functions.
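The core idea behind the layer-doubling construction can be illustrated with a toy experiment: a single target weight is approximated by the product of two random weights along one surviving path, with all other paths pruned away. The sketch below (the target value `w_target`, the widths, and the uniform initialization are illustrative assumptions, and the intermediate ReLU is ignored for clarity) shows how the best product match improves as the random network gets wider.

```python
import numpy as np

w_target = 0.73  # hypothetical target weight to approximate

def best_pruned_product(width, seed=0):
    """Approximate w_target by pruning all but one path through
    two random layers, keeping the path whose weight product
    u[i] * v[i] is closest to the target."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-1, 1, width)  # random first-layer weights
    v = rng.uniform(-1, 1, width)  # random second-layer weights
    products = u * v
    i = np.argmin(np.abs(products - w_target))
    return abs(products[i] - w_target)

for width in (10, 100, 1000, 10000):
    print(width, best_pruned_product(width))
```

The over-parameterization is what makes pruning powerful here: with more random paths to choose from, some product lands very close to the target weight, mirroring the paper's polynomial-width requirement.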
Results on Neural Network Pruning
The authors conclude that pruning individual weights or entire neurons from an over-parameterized network is as effective as optimizing the network's weights for approximating target functions. In terms of expressive power, pruning is therefore an adequate substitute for training. The required size of the network before pruning depends only polynomially on the problem parameters.
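The "pruning instead of training" viewpoint can be made concrete: the weights stay at their random initialization, and the only free parameters are binary masks. A minimal sketch (the layer sizes are arbitrary, and the magnitude-based mask is just a stand-in heuristic; the paper's result is an existence statement, not an algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Fixed random weights: these are never trained.
W1 = rng.normal(size=(16, 2))
W2 = rng.normal(size=(1, 16))

def forward(x, mask1, mask2):
    # A weight-subnetwork: the binary masks are the only choice made.
    return (W2 * mask2) @ relu((W1 * mask1) @ x)

x = np.array([1.0, -0.5])
dense = forward(x, np.ones_like(W1), np.ones_like(W2))

# Illustrative pruning rule: keep the larger-magnitude half per layer.
mask1 = (np.abs(W1) >= np.median(np.abs(W1))).astype(float)
mask2 = (np.abs(W2) >= np.median(np.abs(W2))).astype(float)
sparse = forward(x, mask1, mask2)
print(dense, sparse)
```

Searching over masks replaces gradient descent over weights; the paper shows this search space is expressive enough, even though exploring it optimally is hard.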
Computational Implications
While demonstrating this expressive power, the paper also establishes a computational barrier: finding an optimal subnetwork is computationally hard, just as training a neural network is. This aligns with established hardness results for learning even simple networks.
The equivalence between neuron-subnetworks and random features also implies inherent limitations: such methods cannot efficiently approximate even a function as simple as a single ReLU neuron, a known barrier for random features.
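For context, the random features model that neuron-subnetworks are equated with freezes a random first layer and fits only the linear output layer. A minimal sketch (the feature count, sample size, and the one-dimensional target `sin(3x)` are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_samples = 200, 500
W = rng.normal(size=(n_features, 1))        # frozen random weights
b = rng.uniform(-1, 1, size=(n_features,))  # frozen random biases

X = rng.uniform(-1, 1, size=(n_samples, 1))
y = np.sin(3 * X[:, 0])  # hypothetical smooth 1-D target

# ReLU random features: only the output weights `a` are fitted,
# here by ordinary least squares.
Phi = np.maximum(X @ W.T + b, 0.0)
a, *_ = np.linalg.lstsq(Phi, y, rcond=None)
mse = np.mean((Phi @ a - y) ** 2)
print(mse)
```

Smooth low-dimensional targets like this are easy for random features; the limitation cited above is that matching a single ReLU neuron to high accuracy in high dimensions requires exponentially many such features, and the equivalence transfers that barrier to neuron-subnetworks.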
Practical Implications and Future Work
Practically, the results motivate the development of more sophisticated pruning algorithms that are both theoretically sound and empirically effective, particularly for reducing inference costs and for offering alternatives to gradient-based optimization. Future research directions include improving the computational efficiency of finding good subnetworks and extending the results to other architectures, e.g., convolutional networks and ResNets.
Conclusion
The paper strengthens the theoretical foundation of the Lottery Ticket Hypothesis, making a strong case for pruning as a viable and potent method of neural network approximation. It opens pathways for further exploration of the pruning paradigm, which, backed by rigorous theoretical grounding, may also offer computational advantages.