- The paper shows that a randomly initialized network only a logarithmic factor larger than a target network contains a pruned subnetwork matching the target's performance.
- The authors introduce binary weight decomposition and product weight sampling to approximate each target weight precisely with few sampled neurons.
- The research outlines a batch sampling strategy for recycling sampled neurons across approximation tasks, offering a theoretical framework for efficient, cost-effective network pruning.
Logarithmic Pruning is All You Need
The paper "Logarithmic Pruning is All You Need" presents a significant advancement in the understanding and implementation of neural network pruning, particularly in relation to the Lottery Ticket Hypothesis. This hypothesis suggests that within large neural networks there exist smaller subnetworks, or "winning tickets," that can achieve performance comparable to the original large network when trained in isolation. The authors provide tighter theoretical bounds on the required size of the large network, demonstrating, under assumptions weaker than in prior work, that large randomly initialized networks contain pruned subnetworks that perform well without any training.
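In practice, a pruned subnetwork is just the original weights multiplied by a binary mask. As a generic illustration of this idea (magnitude-based masking, not the paper's own construction; the function name is hypothetical):

```python
import numpy as np

def prune_by_magnitude(W, keep_frac):
    """Zero out all but the largest-magnitude fraction of weights.

    A pruned 'subnetwork' is the original weight matrix times a 0/1 mask.
    """
    k = max(1, int(W.size * keep_frac))
    # Threshold at the k-th largest absolute value (flatten with axis=None)
    thresh = np.sort(np.abs(W), axis=None)[-k]
    mask = (np.abs(W) >= thresh).astype(W.dtype)
    return W * mask, mask

W = np.array([[0.9, -0.1], [0.05, -0.8]])
pruned, mask = prune_by_magnitude(W, 0.5)  # keeps 0.9 and -0.8
```

The Lottery Ticket Hypothesis asks whether such a mask, found after training, identifies a subnetwork that would have trained well on its own; this paper asks how large the unpruned network must be for a good mask to exist at all.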
Theoretical Framework and Contributions
The authors build upon prior work which has established the existence of such subnetworks within overparameterized neural networks. Previous work by Ramanujan et al. and Malach et al. relied on polynomial bounds for the size of the overparameterized network and made certain assumptions about the input and weight norms. In contrast, this paper demonstrates that the necessary network size can be reduced significantly, requiring only a logarithmic number of neurons per target weight in every factor except the depth. Three core techniques enable this improvement:
- Binary Decomposition of Weights: The authors decompose weights into binary components, allowing them to be more precisely approximated using a logarithmic number of neurons. This technique leverages a hyperbolic distribution for weight sampling which has high density near zero, thereby improving the likelihood of accurate weight approximation using fewer samples.
- Product Weights: By utilizing an induced product distribution for weights, the authors avoid the need to fix individual weights explicitly. Instead, they take advantage of the cumulative probability mass from different weight combinations, which reduces the required number of sampled weights and leads to more efficient pruning.
- Batch Sampling: Rather than individually sampling neurons for each target weight, which would discard most samples, the authors propose a method where sampled weights can be 'recycled' across different approximation tasks within a layer, maximizing the usage of each sampled neuron.
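To see why a logarithmic number of neurons per weight can suffice, note that greedily writing a target weight as a signed sum of powers of two gives accuracy 2^-n with only n terms, so accuracy ε needs only about log2(1/ε) terms. A minimal sketch of this binary decomposition (illustrative only; the function name is hypothetical, and the paper's construction realizes each term with sampled neurons):

```python
def binary_decompose(w, n_bits):
    """Greedy binary expansion of |w| < 1: approximation error is below 2**-n_bits."""
    sign = 1.0 if w >= 0 else -1.0
    frac, approx, terms = abs(w), 0.0, []
    for k in range(1, n_bits + 1):
        # Keep the 2**-k term only if it does not overshoot the target
        if approx + 2.0 ** -k <= frac:
            approx += 2.0 ** -k
            terms.append(2.0 ** -k)
    return sign * approx, terms

approx, terms = binary_decompose(0.6875, 4)  # 0.6875 = 1/2 + 1/8 + 1/16
```

Doubling the precision of the approximation therefore costs only one extra term, which is the source of the logarithmic (rather than polynomial) dependence on the accuracy.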
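The advantage of product weights can be checked empirically: the product of two uniform samples on (0, 1) has density -ln(z), which concentrates near zero, so small target weights are hit far more often than with a single uniform sample. A quick Monte Carlo sketch (illustrative only; function names are hypothetical):

```python
import random

def hit_prob(sampler, target, tol, trials=200_000, seed=0):
    """Estimate P(|sample - target| <= tol) by Monte Carlo."""
    rng = random.Random(seed)
    hits = sum(abs(sampler(rng) - target) <= tol for _ in range(trials))
    return hits / trials

uniform = lambda rng: rng.uniform(0.0, 1.0)
product = lambda rng: rng.uniform(0.0, 1.0) * rng.uniform(0.0, 1.0)

p_unif = hit_prob(uniform, target=0.01, tol=0.005)  # roughly 2 * tol = 0.01
p_prod = hit_prob(product, target=0.01, tol=0.005)  # several times higher
```

Because the product distribution places more probability mass near zero, fewer sampled neurons are needed before one lands close to a small target weight.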
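Batch sampling can likewise be illustrated with a toy sketch: draw one shared pool of candidate weights, then let every target weight select its nearest candidate, so a single sampled value may be 'recycled' by several targets instead of being discarded after one use. This is a deliberate simplification of the paper's construction, with hypothetical names:

```python
import random

def batch_approximate(targets, pool_size, seed=0):
    """Sample one shared pool of weights and reuse its entries across all targets."""
    rng = random.Random(seed)
    pool = [rng.uniform(-1.0, 1.0) for _ in range(pool_size)]
    # Nearest pool entry per target; one entry may serve several targets.
    chosen = [min(pool, key=lambda w: abs(w - t)) for t in targets]
    errors = [abs(w - t) for w, t in zip(chosen, targets)]
    return chosen, errors

targets = [0.3, -0.7, 0.31, 0.05]
chosen, errors = batch_approximate(targets, pool_size=1000)
```

Amortizing one pool over many targets is what keeps the total number of sampled neurons per layer small, rather than paying the full sampling cost once per weight.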
Practical Implications
This research implies that large networks can effectively be pruned down to logarithmically larger-than-target sizes without a significant loss in accuracy, challenging the necessity of large-scale overparameterization in certain neural network applications. This result could significantly cut the computational costs associated with training and deploying large networks, making them feasible for more applications with limited resources. Furthermore, the proposed approach can inform the initial phases of network design and training, reducing the dependency on expensive training cycles with large networks.
Future Directions
The findings of this paper invite further exploration into several areas:
- Extension to Other Network Structures: While the paper primarily considers fully connected ReLU networks, an immediate avenue for research is the application and validation of these pruning techniques in networks with convolutional, recurrent, or transformer architectures.
- Experimental Validation: Although the theoretical groundwork is robust, practical experiments demonstrating these principles across diverse datasets and network architectures would provide valuable validation.
- Exploration of Pruning in Training Optimization: A natural next step is to study how these pruning methods could be integrated with stochastic gradient descent or other training algorithms for greater efficiency and speed.
Conclusion
The paper successfully demonstrates that a well-structured pruning approach can substantially reduce the complexity and size of neural networks without compromising performance. By leveraging a combination of innovative weight decomposition, distribution-based sampling, and efficient neuron utilization, the research lays the groundwork for more efficient and cost-effective neural network models. The theoretical strides made here open the door to broader applications, potentially transforming how neural network architectures are scaled and deployed in practice.