Sparse Networks from Scratch: Analyzing Accelerated Training through Sparse Momentum
The paper "Sparse Networks from Scratch: Faster Training without Losing Performance," authored by Tim Dettmers and Luke Zettlemoyer, introduces an innovative method termed "sparse momentum" for training neural networks with sparse configurations from inception without requiring the initial dense network phase followed by pruning and re-training. This paper contributes to the ongoing discourse on sparse neural networks, particularly emphasizing the efficiency gains both in terms of performance and training speed.
Core Contributions
The authors propose sparse momentum, an algorithm that uses exponentially smoothed gradients (momentum) to identify the weights and layers that contribute most to reducing the training error. Sparse momentum distinguishes itself by maintaining sparsity throughout training, redistributing pruned weights to the layers where they are most likely to be effective.
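In one common formulation (the paper's exact smoothing and dampening constants may differ), the smoothed gradient M_i for the weights W_i of layer i is

```latex
M_i^{(t)} = \beta \, M_i^{(t-1)} + (1 - \beta) \, \frac{\partial E}{\partial W_i}
```

where E is the training error and beta is the smoothing constant. Layers are then ranked by the mean of |M_i| over their active weights, and zero-valued positions within a layer are ranked element-wise by the same momentum magnitude.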
- Algorithmic Innovation: Sparse momentum follows a cyclical process: pruning low-magnitude weights, redistributing the freed weights across layers according to each layer's mean momentum magnitude, and growing weights at the zero-valued positions with the largest momentum magnitude (a simplified sketch of this cycle appears after this list). This approach not only matches dense performance levels but also enables training speedups of up to 5.61x.
- Empirical Validation: The method's robustness and efficacy are validated through experiments on MNIST, CIFAR-10, and ImageNet. Sparse momentum improves mean error rates over existing sparse-learning algorithms by a relative 8%, 15%, and 6% on these datasets, respectively.
- Structural Insights: The paper offers an in-depth analysis showing that momentum-based redistribution and growth become increasingly important as networks grow deeper and larger, a key insight for designing efficient sparse networks.
- Hyperparameter Sensitivity: Sparse momentum exhibits minimal sensitivity to its hyperparameters, indicating its robustness and ease of use across different model architectures.
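The following is a minimal NumPy sketch of the prune-redistribute-grow cycle described above. It is not the authors' reference implementation; the function name `sparse_momentum_step`, the fixed `prune_rate`, and the uniform fallback for the redistribution shares are illustrative assumptions.

```python
import numpy as np

def sparse_momentum_step(weights, momenta, masks, prune_rate=0.2):
    """One prune / redistribute / grow cycle over per-layer weight arrays.

    weights    -- list of np.ndarray, current (masked) weights per layer
    momenta    -- list of np.ndarray, exponentially smoothed gradients per layer
    masks      -- list of bool np.ndarray, True where a weight is active
    prune_rate -- fraction of active weights to prune in this cycle
    """
    # 1) Prune: drop the smallest-magnitude active weights in every layer.
    total_regrow = 0
    for w, mask in zip(weights, masks):
        active = np.flatnonzero(mask)
        k = int(prune_rate * active.size)
        if k > 0:
            smallest = active[np.argsort(np.abs(w.flat[active]))[:k]]
            mask.flat[smallest] = False
            w.flat[smallest] = 0.0
            total_regrow += k

    # 2) Redistribute: split the regrowth budget across layers in proportion
    #    to each layer's mean momentum magnitude over its active weights.
    scores = np.array([np.abs(m[mask]).mean() if mask.any() else 0.0
                       for m, mask in zip(momenta, masks)])
    shares = (scores / scores.sum() if scores.sum() > 0
              else np.full(len(scores), 1.0 / len(scores)))

    # 3) Grow: re-enable the inactive positions with the largest momentum
    #    magnitude in each layer; regrown weights start at zero.
    for w, m, mask, share in zip(weights, momenta, masks, shares):
        inactive = np.flatnonzero(~mask)
        grow = min(int(round(share * total_regrow)), inactive.size)
        if grow > 0:
            best = inactive[np.argsort(-np.abs(m.flat[inactive]))[:grow]]
            mask.flat[best] = True
            w.flat[best] = 0.0  # grown weights are initialized to zero
    return weights, masks
```

In the paper, this cycle runs alongside standard SGD with momentum, with the prune rate annealed over the course of training; the sketch above omits that scheduling for brevity.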
Numerical Results and Comparisons
Sparse momentum outperforms traditional dense-to-sparse conversion methods, consistently achieving state-of-the-art sparse-learning performance while needing significantly fewer weights to match dense models. For instance, on CIFAR-10, AlexNet and VGG16 variants reach dense-comparable error rates with only 35-50% and 5-10% of the weights, respectively. On ImageNet with ResNet-50, sparse momentum achieves competitive Top-1 and Top-5 accuracy while operating in a fully sparse setting.
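To make such percentages concrete, the realized density of a trained model can be checked by counting nonzero weight entries. The sketch below assumes a PyTorch model; the helper `weight_density` is illustrative and not part of the paper's code release.

```python
import torch

def weight_density(model: torch.nn.Module) -> float:
    """Fraction of weight entries that are nonzero across the model."""
    nonzero, total = 0, 0
    for p in model.parameters():
        if p.dim() > 1:  # count weight matrices/filters only; biases and
            nonzero += int((p != 0).sum().item())  # norm parameters are
            total += p.numel()                     # usually kept dense
    return nonzero / total

# A VGG16 variant trained at roughly 5% density should report about 0.05 here.
```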
Theoretical and Practical Implications
The capacity to train networks with sparse configurations from scratch has profound implications. Theoretically, it suggests new paradigms in neural network architecture and optimization, potentially challenging the necessity of dense, over-parameterized initial models. Practically, it aligns with the goals of reducing computational costs and energy consumption, which are critical for deploying AI models on resource-constrained devices like mobile and IoT devices.
Future Developments
Sparse momentum's promise paves the way for further work on optimized sparse convolution algorithms and specialized hardware accelerators that exploit sparse matrix operations efficiently. Advances in these areas would close the gap between the theoretical and currently realizable speedups of sparse momentum.
In conclusion, Dettmers and Zettlemoyer's paper articulates a compelling case for sparse networks, blending rigorous algorithmic development with comprehensive empirical validation. It marks a significant step toward more resource-efficient deep learning models that do not sacrifice performance. As the AI field progresses, contributions like this will be instrumental in charting new directions for sustainable AI technologies.