The State of Sparsity in Deep Neural Networks: A Comprehensive Evaluation
The paper "The State of Sparsity in Deep Neural Networks" conducted by Trevor Gale, Erich Elsen, and Sara Hooker rigorously investigates the performance and applicability of three state-of-the-art sparsity-inducing techniques: variational dropout, l0-regularization, and magnitude pruning. These methods were evaluated across two large-scale tasks: Transformer trained on WMT 2014 English-to-German and ResNet-50 trained on ImageNet. The paper stands apart by its scaled application of these techniques, seeking to address inconsistencies and establish a clear benchmark within the context of model compression.
Key Contributions and Findings
- Comprehensive Experimentation and Comparison: The authors conducted thousands of experiments to explore the effectiveness of the studied techniques, finding that simpler magnitude pruning often achieves results comparable or superior to more complex methods such as variational dropout and l0-regularization. This conclusion challenges the prevailing assumption that more sophisticated techniques are necessarily better in large-scale applications.
- Success of Magnitude Pruning: While variational dropout and l0-regularization have shown promise on smaller datasets, their performance on large-scale tasks was inconsistent. Magnitude pruning, a comparatively straightforward technique (its core operation is sketched after this list), not only offered comparable results but also established a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 on ImageNet, a notable achievement given the historical focus on more complex approaches.
- Reevaluation of Foundational Hypotheses: The paper also revisits the "lottery-ticket hypothesis" and related theories proposing that sparse architectures discovered through pruning can be retrained from scratch to achieve similar performance. The authors' findings present strong counterexamples: especially at higher sparsity levels, retraining from scratch fails to match the performance of models that are pruned jointly with optimization (the control experiment is sketched after this list).
- Practical Insights and Baselines: By open-sourcing their code, model checkpoints, and hyperparameter settings, the authors provide invaluable resources for the research community. This openness encourages reproducibility and establishes rigorous benchmarks for future research in the field of sparsification and model compression.
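To make the comparison concrete, here is a minimal sketch of the core operation behind magnitude pruning: removing the smallest-magnitude weights of a tensor until a target sparsity is reached. This is an illustrative simplification rather than the authors' code; the paper's released implementation applies pruning gradually over the course of training instead of in a single shot.

```python
import torch

def magnitude_prune(weights: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so roughly `sparsity` fraction are zero."""
    k = int(sparsity * weights.numel())  # number of weights to remove
    if k == 0:
        return weights.clone()
    threshold = weights.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    return weights * (weights.abs() > threshold)

# Example: prune a random 512x512 layer to 90% sparsity.
w = torch.randn(512, 512)
w_sparse = magnitude_prune(w, 0.9)
print(1.0 - w_sparse.count_nonzero().item() / w_sparse.numel())  # ~0.9
```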
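The lottery-ticket-style control can be summarized in a similar sketch: keep only the sparsity pattern (a binary mask) of a pruned model, re-initialize the remaining weights randomly, and train with the mask held fixed. The code below is a hedged illustration assuming a PyTorch model and data loader supplied by the caller; it is not the authors' experimental pipeline, and the optimizer settings are placeholders.

```python
import torch
import torch.nn as nn

def extract_masks(pruned_model: nn.Module) -> dict:
    """Binary masks marking which weights survived pruning (nonzero entries)."""
    return {name: (p != 0).float() for name, p in pruned_model.named_parameters()}

def train_from_scratch_with_mask(fresh_model: nn.Module, masks: dict,
                                 data_loader, epochs: int = 1) -> nn.Module:
    """Train a freshly initialized model while keeping pruned weights at zero."""
    optimizer = torch.optim.SGD(fresh_model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(fresh_model(inputs), targets)
            loss.backward()
            optimizer.step()
            # Re-apply the fixed sparsity pattern after every update.
            with torch.no_grad():
                for name, param in fresh_model.named_parameters():
                    if name in masks:
                        param.mul_(masks[name])
    return fresh_model
```

Under this setup, the paper reports that models trained from scratch with a fixed mask trail models whose pruning was interleaved with training, particularly at high sparsity.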
Numerical Results and Analysis
- Transformer on WMT 2014 English-to-German:
Magnitude pruning achieved a BLEU-sparsity trade-off curve that consistently outperformed variational dropout and l0-regularization, particularly at higher sparsity levels. For example, at 90% sparsity, magnitude-pruned models maintained a superior BLEU score relative to models sparsified with the other two techniques, underscoring the efficacy of simple magnitude pruning even when the majority of weights are removed.
- ResNet-50 on ImageNet:
Variational dropout showed strong performance at low sparsity levels, but magnitude pruning ultimately achieved the best results across most sparsity levels. Particularly notable is that, with a heuristic distribution of sparsity across layers (e.g., keeping the first convolutional layer dense, as sketched below), magnitude pruning achieved state-of-the-art performance with minimal accuracy compromise even at high sparsity (e.g., 98%).
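A hedged sketch of that heuristic allocation follows, assuming a torchvision ResNet-50 (the module name conv1 for the first convolution is torchvision's, and the uniform per-layer sparsity used here is a simplification of the paper's exact allocation): every weight matrix is magnitude-pruned to the target sparsity except the first convolution, which stays dense.

```python
import torch
from torchvision.models import resnet50

def prune_with_dense_first_layer(model: torch.nn.Module, sparsity: float) -> None:
    """Magnitude-prune every weight tensor in place, except the first convolution."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            # Skip the first convolution and 1-D parameters (biases, batch-norm scales).
            if name.startswith("conv1") or param.dim() == 1:
                continue
            k = int(sparsity * param.numel())
            if k == 0:
                continue
            threshold = param.abs().flatten().kthvalue(k).values
            param.mul_((param.abs() > threshold).float())

model = resnet50(weights=None)  # randomly initialized here; trained weights would be pruned the same way
prune_with_dense_first_layer(model, sparsity=0.98)
```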
Implications and Future Directions
The findings of this paper have significant practical and theoretical implications. Practically, magnitude pruning provides a computationally efficient yet effective method for sparsifying large-scale models, making it appealing for deployment in resource-constrained environments. Theoretically, the paper raises important questions about the scalability and generalizability of more complex sparsity-inducing techniques.
The counterexamples provided against the lottery-ticket hypothesis suggest that treating pruning as a form of architecture search may not transfer straightforwardly to large-scale, complex tasks. Future research might focus on hybrid methods that combine the simplicity and efficiency of magnitude pruning with the theoretical grounding of variational techniques, or explore entirely new approaches.
Conclusion
This paper rigorously evaluates the state-of-the-art methods for inducing sparsity in deep neural networks, providing crucial insights and establishing new benchmarks in the field. The demonstration that simple magnitude pruning can outperform more complex techniques on large-scale tasks challenges current assumptions and paves the way for future innovations. The open-sourcing of tools and benchmarks facilitates further advancements and ensures that future research is grounded in robust, reproducible results.
The paper makes clear the need for large-scale benchmarks in sparsification research and sets a high standard for future work in model compression and neural network optimization.