Faster CNNs with Direct Sparse Convolutions and Guided Pruning: A Summary
This paper introduces methods to improve the computational efficiency of Convolutional Neural Networks (CNNs) through direct sparse convolutions and guided pruning. The authors target the convolution layers that dominate CNN processing time: pruning can remove a large fraction of a network's parameters, but the resulting sparsity does not translate into faster inference unless the sparse convolutions themselves are executed efficiently.
Key Contributions
- Direct Sparse Convolutions: The core contribution is a direct sparse convolution that reformulates the operation as a sparse-matrix-times-dense-matrix product without first lowering the input tensor to a matrix, a step that replicates input data and reduces arithmetic intensity. The pruned kernel is kept in a compressed sparse format while the input tensor is read in place as a "virtual" dense matrix, preserving high arithmetic intensity and data reuse, especially when many input channels are involved (a minimal sketch appears after this list).
- Performance Modelling: A roofline-based performance model predicts the speedup attainable from sparsity and is used to guide pruning. Given the non-zero density of a sparse convolution kernel and the compute and bandwidth characteristics of a target processor, the model estimates how much faster the sparse layer can run than its dense counterpart. Notably, it shows that moderate sparsity, around 70%, is already enough for substantial speedups with the proposed method (see the projection sketch after this list).
- Guided Sparsity Learning (GSL): GSL is a pruning algorithm that uses the performance model to concentrate pruning on the layers, and within the sparsity ranges, where a tangible speedup is predicted. Unlike conventional pruning, GSL stops pruning layers that fall outside their effective sparsity range and redirects the effort to layers where the projected gain is largest (see the GSL sketch after this list).
- Empirical Validation: The methods are validated on AlexNet and GoogLeNet across several computational platforms (Intel Atom, Xeon, and Xeon Phi processors), showing speedups of up to 7.3× for AlexNet on the Atom processor without compromising model accuracy.
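
To make the direct sparse convolution concrete, the following is a minimal NumPy sketch rather than the authors' optimized kernels. The pruned weights are stored in CSR-style arrays whose column indices are pre-converted into offsets into the flattened input tensor, so the input is read in place as a "virtual" dense matrix and no lowered (im2col) buffer is ever built. The unit stride, lack of padding, and single-image layout are simplifying assumptions.

```python
import numpy as np

def kernel_to_csr(W, H, Wd):
    """Flatten a dense kernel of shape (OC, IC, KH, KW) into CSR-style arrays.
    Column indices are pre-converted into offsets into a flattened (IC, H, Wd)
    input tensor, which is what lets the input act as a 'virtual' dense matrix."""
    OC, IC, KH, KW = W.shape
    values, offsets, rowptr = [], [], [0]
    for oc in range(OC):
        for ic in range(IC):
            for kh in range(KH):
                for kw in range(KW):
                    w = W[oc, ic, kh, kw]
                    if w != 0.0:
                        values.append(w)
                        # Offset of this tap relative to the top-left corner
                        # of the sliding window, in the flattened input.
                        offsets.append(ic * H * Wd + kh * Wd + kw)
        rowptr.append(len(values))
    return np.asarray(values), np.asarray(offsets), np.asarray(rowptr)

def direct_sparse_conv(X, values, offsets, rowptr, KH, KW):
    """Stride-1, no-padding sparse convolution over X of shape (IC, H, W).
    Work is proportional to the number of non-zero weights."""
    IC, H, Wd = X.shape
    OH, OW = H - KH + 1, Wd - KW + 1
    OC = len(rowptr) - 1
    Xflat = X.reshape(-1)
    ys, xs = np.meshgrid(np.arange(OH), np.arange(OW), indexing="ij")
    base = (ys * Wd + xs).reshape(-1)          # top-left offset of each window
    Y = np.zeros((OC, OH * OW), dtype=X.dtype)
    for oc in range(OC):
        for j in range(rowptr[oc], rowptr[oc + 1]):
            Y[oc] += values[j] * Xflat[offsets[j] + base]
    return Y.reshape(OC, OH, OW)
```

Because the inner loop touches only the stored non-zeros, the arithmetic scales with the kernel's density while every read reuses the original input layout; the authors' optimized implementation goes considerably further than this sketch.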
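The roofline-style performance model can be sketched in the same spirit: a layer's execution time is taken as the larger of its compute time and its memory-traffic time, for both the dense and the pruned version, and the ratio of the two gives the projected speedup. The throughput, bandwidth, and byte-count figures below are illustrative placeholders, not the paper's measured machine parameters, and the factor-of-two assumption for sparse index overhead is likewise ours.

```python
def projected_speedup(density, flops_dense, bytes_acts, bytes_weights,
                      peak_flops_dense, peak_flops_sparse, bandwidth):
    """Roofline-style projection of one layer's sparse-over-dense speedup.
    density: fraction of kernel weights that remain non-zero; byte counts are
    the activation and (dense) weight traffic of the layer; the peaks and
    bandwidth describe the target processor."""
    t_dense = max(flops_dense / peak_flops_dense,
                  (bytes_acts + bytes_weights) / bandwidth)
    # FLOPs scale with density; CSR roughly doubles the bytes per surviving
    # weight because an index is stored alongside each value (assumption).
    t_sparse = max(density * flops_dense / peak_flops_sparse,
                   (bytes_acts + 2.0 * density * bytes_weights) / bandwidth)
    return t_dense / t_sparse

# Illustrative numbers only: a compute-bound 3x3 layer on a hypothetical chip
# where sparse kernels reach roughly half the throughput of dense ones.
for sparsity in (0.5, 0.7, 0.9):
    s = projected_speedup(density=1.0 - sparsity,
                          flops_dense=4.5e8, bytes_acts=2.0e6, bytes_weights=1.2e6,
                          peak_flops_dense=6.0e10, peak_flops_sparse=3.0e10,
                          bandwidth=2.5e10)
    print(f"{sparsity:.0%} sparsity -> projected speedup {s:.2f}x")
```

With these placeholder numbers the layer stays compute-bound and the projected speedup grows as density shrinks; for layers with low arithmetic intensity (such as 1×1 convolutions) the bandwidth term dominates instead and caps the attainable gain, which is the regime the guided pruning below is meant to avoid.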
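Finally, Guided Sparsity Learning can be read as a pruning loop that consults such a projection: each layer keeps being pruned only while the model predicts a worthwhile speedup at the next sparsity level, and pruning effort is withdrawn from it otherwise. The sketch below is a simplification under assumed hyper-parameters (fixed magnitude-pruning steps, a single speedup threshold, no interleaved fine-tuning) rather than the paper's exact schedule.

```python
import numpy as np

def guided_sparsity_learning(layers, speedup_model, min_speedup=1.2,
                             prune_step=0.05, max_sparsity=0.95):
    """GSL-style loop: keep pruning only those layers whose projected speedup
    justifies further sparsification.  `layers` maps a layer name to its weight
    array; `speedup_model(name, s)` returns the projected speedup at sparsity s."""
    active = set(layers)
    while active:
        for name in list(active):
            W = layers[name]
            target = float(np.mean(W == 0.0)) + prune_step   # next sparsity level
            if target > max_sparsity or speedup_model(name, target) < min_speedup:
                # Projected gain is no longer worthwhile: withdraw pruning effort
                # from this layer and leave its remaining weights untouched.
                active.discard(name)
                continue
            # Magnitude pruning toward the target: zero the smallest weights.
            k = int(target * W.size)
            cutoff = np.partition(np.abs(W).ravel(), k)[k]
            W[np.abs(W) <= cutoff] = 0.0
            # (The full algorithm would interleave fine-tuning between steps.)
    return {name: float(np.mean(W == 0.0)) for name, W in layers.items()}
```

In use, `speedup_model` would wrap a per-layer projection such as `projected_speedup` above, with that layer's FLOP and byte counts filled in.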
Implications and Future Directions
This research has several implications:
- From a theoretical perspective, this work expands on the potential of direct sparse computation in deep learning frameworks, effectively bridging the gap between pruning-induced model size reduction and actual inference speedup.
- Practically, the proposed methodologies align with current trends towards deploying CNNs on resource-constrained environments, such as mobile and edge computing, where computational efficiency is paramount.
- Future Work: The current implementation focuses on direct sparse convolution; the authors point to extensions incorporating Winograd- and FFT-based convolution algorithms, and note that 1×1 convolutions, whose inherently low arithmetic intensity limits the benefit of sparsity, remain a target for further optimization.
Overall, this paper makes substantive contributions towards more computationally efficient CNN implementations, establishing a practical approach for systematically leveraging model sparsity for faster inference while maintaining a theoretical underpinning through performance modelling. Through continued applications and optimizations, these advancements promise to significantly enhance the deployment capabilities of deep learning models across an expanded array of hardware platforms.