Proving the Lottery Ticket Hypothesis: Pruning is All You Need (2002.00585v1)

Published 3 Feb 2020 in cs.LG and stat.ML

Abstract: The lottery ticket hypothesis (Frankle and Carbin, 2018), states that a randomly-initialized network contains a small subnetwork such that, when trained in isolation, can compete with the performance of the original network. We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training.

Proving the Lottery Ticket Hypothesis: Pruning is All You Need

This paper rigorously addresses and extends the "lottery ticket hypothesis" initially posited by Frankle and Carbin in 2018. The paper's primary contribution is proving the existence of high-performing subnetworks within over-parameterized, randomly-initialized neural networks, confirming conjectures made by Ramanujan et al. in 2019. These theoretical advancements provide significant insights into neural network pruning, delineating conditions under which specific subnetworks can match the accuracy of a target network without additional training.

Summary of Findings

The authors establish that within any sufficiently large neural network, initialized with random weights, there exists a subnetwork capable of achieving performance levels comparable to a target network with bounded weights, under arbitrary bounded input distributions. This subnetwork achieves competitive accuracy without training. The paper distinguishes between two types of subnetworks: weight-subnetworks, which involve pruning specific weights, and neuron-subnetworks, where entire neurons are pruned.
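
To make the distinction concrete, the following minimal NumPy sketch (an illustration only; the layer sizes and variable names are arbitrary choices, not taken from the paper) applies the two pruning modes to a small two-layer ReLU network: a weight-subnetwork zeroes out individual entries of the weight matrices, whereas a neuron-subnetwork removes entire hidden units, i.e., whole rows of the first weight matrix together with the matching columns of the second.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random two-layer ReLU network: x -> W2 @ relu(W1 @ x).
d_in, d_hidden, d_out = 4, 8, 1
W1 = rng.standard_normal((d_hidden, d_in))
W2 = rng.standard_normal((d_out, d_hidden))

def forward(x, A, B):
    return B @ np.maximum(A @ x, 0.0)

x = rng.standard_normal(d_in)

# Weight-subnetwork: prune individual weights with binary masks of the same shape.
M1 = (rng.random(W1.shape) < 0.5).astype(float)   # keep roughly half of the entries
M2 = (rng.random(W2.shape) < 0.5).astype(float)
weight_pruned_out = forward(x, W1 * M1, W2 * M2)

# Neuron-subnetwork: prune entire hidden neurons, i.e. whole rows of W1
# together with the matching columns of W2.
keep = rng.random(d_hidden) < 0.5                 # keep roughly half of the hidden units
neuron_pruned_out = forward(x, W1[keep], W2[:, keep])

print(weight_pruned_out, neuron_pruned_out)
```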

Significant results include:

  1. Weight-Subnetworks: The authors show that for any target ReLU network of depth $l$, a sufficiently wide random network of depth $2l$ contains a weight-subnetwork that approximates the target network efficiently. They establish that the number of parameters in such a subnetwork is comparable to that of the target network, up to a constant factor, demonstrating the expressive power of weight-pruned networks. Two consequences follow for universality and computational complexity: random networks pruned at the level of individual weights are universal approximators, and finding a good weight-subnetwork is computationally intractable, just as learning dense networks is. A toy sketch of this pruning idea appears after this list.
  2. Neuron-Subnetworks and Random-Features Equivalence: The paper demonstrates that neuron pruning is equivalent to the random features model: for certain distributions and networks, selecting a neuron-subnetwork of a random network is as effective as optimizing a random-features model. This equivalence exposes the limitations of neuron pruning; for example, neuron-subnetworks cannot efficiently approximate even a single ReLU neuron under the standard Gaussian distribution, in contrast to the greater power of weight pruning.
  3. Practical Implications of Neuron Pruning: Despite these limitations, neuron pruning retains practical value. The authors show that appropriately pruned neuron-subnetworks can fit finite data samples or approximate functions in the corresponding reproducing kernel Hilbert space (RKHS) without training the full network, highlighting the computational savings available in practical machine learning applications.
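
To convey the flavor of the weight-subnetwork result in item 1, the snippet below is a toy sketch (not the paper's construction or its quantitative guarantees): it approximates a single target weight by pruning an over-parameterized random two-layer ReLU network down to one surviving path whose product of random weights is close to the target.

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: a single weight w_star that the pruned random network should realize.
w_star = 0.73

# Over-parameterized random network: hidden unit i contributes v[i] * relu(u[i] * x).
# For x >= 0 this path computes (v[i] * u[i]) * x, so keeping only the unit whose
# product v[i] * u[i] is closest to w_star, and pruning all the others, approximates
# the target neuron on nonnegative inputs.  A full construction would also need to
# handle x < 0 (e.g. via the identity x = relu(x) - relu(-x)); that is omitted here.
n = 10_000
u = rng.uniform(-1.0, 1.0, n)
v = rng.uniform(-1.0, 1.0, n)

best = int(np.argmin(np.abs(u * v - w_star)))  # index of the single path we keep
mask = np.zeros(n)
mask[best] = 1.0                               # prune every other path

def pruned_net(x):
    return float(np.sum(mask * v * np.maximum(u * x, 0.0)))

for x in (0.5, 1.0, 2.0):
    print(x, w_star * x, pruned_net(x))
```

With enough random units, some product lands close to the target weight, which is the intuition behind why over-parameterization plus pruning can substitute for training in this setting.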

Implications and Future Directions

The theoretical framework offered by this paper provides a clear foundation for the development of more efficient neural network architectures. Pruning—particularly weight pruning—emerges as a promising focal point for designing algorithms that reduce the size and enhance the computational efficiency of neural networks without compromising performance. Although the computational hardness of finding an optimal pruned network is discussed, insights derived from the analysis suggest potential heuristic or approximate methods that could be viable in practical scenarios.
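
As one illustration of what such a heuristic could look like, the sketch below runs a naive greedy search over a binary weight mask on a fixed random network, never updating the weights themselves. This is not an algorithm proposed in the paper; the architecture, data, and search procedure are arbitrary choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy setup: fit a small regression dataset by searching for a
# weight mask over a fixed random two-layer ReLU network; only the binary
# mask changes, the random weights are frozen.
d, h, n = 5, 64, 200
X = rng.standard_normal((n, d))
y = np.sin(X @ rng.standard_normal(d))         # arbitrary target signal

W1 = rng.standard_normal((h, d))
w2 = rng.standard_normal(h)

def loss(m1, m2):
    hidden = np.maximum(X @ (W1 * m1).T, 0.0)  # (n, h)
    pred = hidden @ (w2 * m2)                  # (n,)
    return float(np.mean((pred - y) ** 2))

# Greedy bit-flip search over the masks: accept a flip only if it lowers the loss.
m1 = np.ones((h, d))
m2 = np.ones(h)
best = loss(m1, m2)
for _ in range(2000):
    if rng.random() < 0.5:
        i, j = rng.integers(h), rng.integers(d)
        m1[i, j] = 1.0 - m1[i, j]
        trial = loss(m1, m2)
        if trial < best:
            best = trial
        else:
            m1[i, j] = 1.0 - m1[i, j]          # revert the flip
    else:
        i = rng.integers(h)
        m2[i] = 1.0 - m2[i]
        trial = loss(m1, m2)
        if trial < best:
            best = trial
        else:
            m2[i] = 1.0 - m2[i]                # revert the flip

print("mask-only search, final MSE:", best)
```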

Future research could focus on improving pruning algorithms, examining the bounds on network size necessary for guaranteeing the existence of strong subnetworks, and extending these results to other neural architectures, including convolutional layers and ResNets. Additionally, exploring algorithms that leverage this pruning paradigm might lead to robust methodologies beyond traditional gradient-based training, potentially averting issues associated with standard optimization techniques.

In conclusion, this paper substantiates the lottery ticket hypothesis within a solid theoretical framework, revealing pruned networks as efficient, scalable alternatives to fully-trained models. The work significantly enriches our understanding of neural topology and encourages further exploration of pruned architectures in both theory and practice.

Authors (4)
  1. Eran Malach (37 papers)
  2. Gilad Yehudai (26 papers)
  3. Shai Shalev-Shwartz (67 papers)
  4. Ohad Shamir (110 papers)
Citations (256)