Extreme Adaptive Sparse Training (EAST): A Method for Pruning at Extreme Sparsity Levels
The paper "Pushing the Limits of Sparsity: A Bag of Tricks for Extreme Pruning" explores an advanced methodology for training deep neural networks with extreme sparsity, reaching up to 99.99% sparsity on ResNet architectures. The authors introduce Extreme Adaptive Sparse Training (EAST), a novel approach that combines several nuanced techniques to manage sparsity without a significant drop in network performance, even at exceedingly high levels of compression.
Background and Problem Statement
Network pruning has long been recognized as a pivotal method for improving the efficiency of deep neural networks by reducing their parameter count. Traditional pruning techniques, however, become ineffective or require recalibration at extreme sparsity levels (around 99.90% to 99.99%) due to challenges such as fragile gradient flow and layer collapse. Maintaining network robustness and performance at such extreme sparsity therefore remains underexplored, even as the growing need to deploy neural networks on edge devices with minimal computational resources makes it increasingly important.
Methodological Innovations
EAST consolidates three innovative strategies to stabilize the performance of pruned networks at extreme sparsity thresholds:
- Dynamic ReLU Phasing: Training starts with Dynamic ReLU (DyReLU) activations, whose coefficients adapt to the input and allow broad parameter exploration during the initial training phases. As training progresses, DyReLU is gradually phased out and replaced by standard ReLU, so the network benefits from the richer activation early on and then stabilizes (see the first sketch after this list).
- Weight Sharing: Within each residual layer, parameters are shared across blocks. This design strengthens gradient flow by routing multiple propagation paths through the same weights while staying within the parameter budget, which helps stabilize gradients in highly sparse networks (see the second sketch after this list).
- Cyclic Sparsity Scheduling: Unlike a static sparsity level, this schedule adjusts the sparsity pattern dynamically during training, letting parameters move between active and pruned states and thereby encouraging thorough parameter exploration (see the third sketch after this list).
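To make the phasing concrete, the sketch below blends a simplified DyReLU-A-style activation with plain ReLU and anneals the blend weight to zero over training. The coefficient network, the blend schedule, and the name `PhasedDyReLU` are illustrative assumptions, not the paper's exact design.

```python
# A minimal sketch, assuming a simplified DyReLU-A-style activation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhasedDyReLU(nn.Module):
    """Blend an input-adaptive DyReLU-A-style activation with plain ReLU."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Tiny network mapping a pooled descriptor to two slopes and two offsets.
        hidden = max(channels // reduction, 4)
        self.coef_net = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 4),
        )
        self.mix = 1.0  # 1.0 = pure DyReLU, 0.0 = pure ReLU

    def set_mix(self, mix: float) -> None:
        self.mix = float(min(max(mix, 0.0), 1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.mix == 0.0:
            return F.relu(x)
        # One set of piecewise-linear coefficients per sample, predicted from a
        # globally pooled descriptor of the input (DyReLU-A style).
        theta = self.coef_net(x.mean(dim=(2, 3)))            # (N, 4)
        a1, a2, b1, b2 = [t.view(-1, 1, 1, 1) for t in theta.chunk(4, dim=1)]
        dy = torch.maximum((1.0 + a1) * x + b1, a2 * x + b2)
        # Linearly anneal toward plain ReLU as `mix` is driven to zero.
        return self.mix * dy + (1.0 - self.mix) * F.relu(x)


# Example phase-out (assumed schedule): drive `mix` from 1 to 0 over the
# first `phase_epochs` epochs.
# act = PhasedDyReLU(channels=64)
# act.set_mix(max(0.0, 1.0 - epoch / phase_epochs))
```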
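The second sketch illustrates one way to share convolution weights across the blocks of a residual layer while keeping per-block BatchNorm statistics; this particular tying scheme and the name `SharedResidualStage` are assumptions for illustration, not necessarily the paper's exact design.

```python
# A minimal sketch of weight sharing across the blocks of one residual stage.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedResidualStage(nn.Module):
    """Reuse the same pair of 3x3 convolutions in every block of the stage."""

    def __init__(self, channels: int, num_blocks: int):
        super().__init__()
        # One shared set of convolution parameters for the whole stage.
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        # Separate normalization per block so running statistics stay distinct.
        self.bn1 = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(num_blocks))
        self.bn2 = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(num_blocks))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for bn1, bn2 in zip(self.bn1, self.bn2):
            out = F.relu(bn1(self.conv1(x)))
            out = bn2(self.conv2(out))
            # Every block routes gradients through the same shared convolutions.
            x = F.relu(x + out)
        return x
```

Because each block's residual branch reuses the same parameters, gradients from every block accumulate on one weight tensor, which is the stabilizing effect the bullet above describes.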
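The third sketch shows a cyclic sparsity schedule paired with a simple mask update. The cosine cycle, the magnitude-based pruning, and the random regrowth criterion are assumptions for illustration; EAST's exact prune/regrow rules may differ.

```python
# A minimal sketch of cyclic sparsity scheduling with a prune/regrow mask update.
import math
import torch


def cyclic_sparsity(step: int, period: int, s_min: float, s_max: float) -> float:
    """Target sparsity oscillates between s_min and s_max over each period."""
    phase = 0.5 * (1.0 + math.cos(2.0 * math.pi * step / period))
    return s_min + (s_max - s_min) * phase


@torch.no_grad()
def update_mask(weight: torch.Tensor, mask: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Grow or shrink the active set so it matches the target sparsity."""
    target_active = int(round(weight.numel() * (1.0 - sparsity)))
    active = int(mask.sum().item())
    if target_active < active:
        # Budget shrank: keep only the largest-magnitude currently active weights.
        scores = (weight.abs() * mask).view(-1)
        keep = torch.topk(scores, target_active).indices
        mask = torch.zeros_like(mask)
        mask.view(-1)[keep] = 1.0
    elif target_active > active:
        # Budget grew: randomly reactivate some currently pruned positions.
        mask = mask.clone()
        pruned = (mask.view(-1) == 0).nonzero(as_tuple=True)[0]
        chosen = pruned[torch.randperm(pruned.numel())[: target_active - active]]
        mask.view(-1)[chosen] = 1.0
    weight.mul_(mask)  # keep pruned weights at exactly zero
    return mask


# Example (assumed hyper-parameters): refresh the mask every few hundred steps.
# sparsity = cyclic_sparsity(step, period=4000, s_min=0.999, s_max=0.9999)
# mask = update_mask(layer.weight, mask, sparsity)
```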
Empirical Evaluation
The authors report extensive experiments on CIFAR-10, CIFAR-100, and ImageNet using ResNet-34 and ResNet-50 architectures. EAST consistently outperforms prior methods such as SynFlow and RigL at extreme sparsity levels, retaining substantially more accuracy. For instance, at 99.99% sparsity on CIFAR-10 with ResNet-34, EAST maintains an accuracy of 70.57%, whereas competing models collapse to near-random accuracy under the same conditions.
Moreover, the experiments demonstrate that EAST scales to larger datasets such as ImageNet, where it outperforms state-of-the-art methods such as DCTpS under equivalent training constraints.
Theoretical and Practical Implications
EAST is significant for several reasons. The method not only addresses the vanishing gradients and layer collapse prevalent in extreme sparsity settings, but also provides a scalable solution that can be readily integrated into existing frameworks. Given that dynamic sparse training (DST) methods typically struggle at these sparsity extremes, EAST's ability to avoid such performance degradation broadens the horizons for future research and for practical network pruning.
The paper sets the stage for further work on cyclic sparsity schedules and dynamic activations, and may inspire future efforts to improve performance in resource-constrained AI applications.
Conclusion
In conclusion, the paper makes measurable advances in sparse neural network training. By combining diverse and adaptive techniques in EAST, the authors demonstrate a significant leap in training efficacy at extreme sparsities, with promising potential for deployment on edge devices. The work enriches the underlying theory of sparse learning and offers a practical, adaptable framework that could yield substantial efficiency gains in neural network deployments.