Overview of "Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications"
This paper addresses the problem of deploying deep convolutional neural networks (CNNs) on mobile devices, which have limited computational power, battery life, and memory. The authors propose "one-shot whole network compression", a scheme that compresses an entire CNN at once to make it suitable for mobile environments. The method consists of three steps: rank selection with variational Bayesian matrix factorization (VBMF), Tucker decomposition of each layer's kernel tensor, and fine-tuning to recover the lost accuracy.
Key Contributions
- One-Shot Whole Network Compression Scheme: The authors introduce a simple three-step pipeline for compressing CNNs (a code sketch of the decomposition step follows this list):
  - Rank Selection: VBMF determines the rank of each layer analytically.
  - Tucker Decomposition: The kernel tensor of each layer is compressed via Tucker decomposition at the selected ranks.
  - Fine-Tuning: Brief retraining recovers the accumulated loss of accuracy.
- Practical Implementation: Each step uses publicly available tools: VBMF for rank determination, the Tucker tensor toolbox for decomposition, and Caffe for fine-tuning, which makes the approach easy to reproduce and adopt.
- Empirical Evaluation: The scheme's effectiveness is demonstrated on several popular CNN architectures, including AlexNet, VGG-S, GoogLeNet, and VGG-16, evaluated on both a high-performance GPU (Titan X) and a smartphone (Samsung Galaxy S6). The results show substantial reductions in model size, runtime, and energy consumption with minimal accuracy loss.
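To make the decomposition step concrete, below is a minimal NumPy sketch of a Tucker-2 approximation of a convolutional kernel, initialized via truncated HOSVD. The `energy_rank` helper is a simplified energy-threshold stand-in for the paper's VBMF rank selection, and the layer shape at the bottom is hypothetical; the paper itself uses analytic VBMF and the Tucker tensor toolbox.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def energy_rank(matrix, threshold=0.9):
    """Stand-in for VBMF: the smallest rank keeping `threshold` of the
    squared singular-value energy of the unfolding."""
    s = np.linalg.svd(matrix, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, threshold)) + 1

def tucker2_kernel(K, R3=None, R4=None):
    """Tucker-2 approximation of a conv kernel K of shape (d, d, S, T):
    only the input-channel (S) and output-channel (T) modes are decomposed;
    the small spatial modes are left intact."""
    K_s, K_t = unfold(K, 2), unfold(K, 3)
    R3 = R3 or energy_rank(K_s)
    R4 = R4 or energy_rank(K_t)
    U_s = np.linalg.svd(K_s, full_matrices=False)[0][:, :R3]  # (S, R3)
    U_t = np.linalg.svd(K_t, full_matrices=False)[0][:, :R4]  # (T, R4)
    # Core tensor: contract K with the factor matrices along modes 2 and 3.
    core = np.einsum('hwst,sr,tq->hwrq', K, U_s, U_t)         # (d, d, R3, R4)
    return core, U_s, U_t

# Hypothetical layer: 3x3 kernel, 256 input channels, 512 output channels.
K = np.random.randn(3, 3, 256, 512)
core, U_s, U_t = tucker2_kernel(K, R3=128, R4=128)
```

The three factors map directly onto three lighter layers: a 1x1 convolution (S -> R3) from U_s, a d x d convolution (R3 -> R4) from the core, and a 1x1 convolution (R4 -> T) from U_t. This replacement is why 1x1 convolutions dominate the compressed networks discussed in the analysis below.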
Experimental Results
The results for the four networks are as follows:
- AlexNet: Achieved a 5.46x reduction in model size, 2.67x reduction in FLOPs, and 2.72x improvement in runtime on a smartphone, with a 1.70% accuracy loss.
- VGG-S: Presented a 7.40x reduction in model size, 4.80x reduction in FLOPs, and 3.68x runtime improvement on a smartphone, with a mere 0.55% accuracy loss.
- GoogLeNet: Demonstrated a 1.28x reduction in model size, 2.06x reduction in FLOPs, and 1.42x improvement in runtime on a mobile device, with a 0.24% accuracy loss.
- VGG-16: Achieved a 1.09x reduction in model size, 4.93x reduction in FLOPs, and 3.34x runtime speed-up on a smartphone, with a 0.50% accuracy loss.
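These reductions follow directly from parameter counting: a d x d convolution with S input and T output channels has d²ST weights, while its Tucker-2 replacement has S·R3 + d²·R3·R4 + T·R4. A quick check for the hypothetical layer from the sketch above (values chosen for illustration, not taken from the paper):

```python
d, S, T, R3, R4 = 3, 256, 512, 128, 128
original   = d * d * S * T                      # 1,179,648 weights
compressed = S * R3 + d * d * R3 * R4 + T * R4  # 245,760 weights
print(f"per-layer compression: {original / compressed:.1f}x")  # ~4.8x
```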
Additionally, fine-tuning quickly recovered the accuracy lost to compression, with most of the recovery achieved within the first epoch.
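The fine-tuning itself is ordinary end-to-end training at a small learning rate, starting from the decomposed weights. The paper uses Caffe; the following PyTorch sketch is only an illustrative translation, with a single 1x1 -> 3x3 -> 1x1 triple and dummy data standing in for a real compressed network and ImageNet.

```python
import torch
import torch.nn as nn

# One compressed block: the 1x1 -> dxd -> 1x1 triple produced by Tucker-2
# (S=256, R3=R4=128, T=512, matching the hypothetical layer above).
block = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1, bias=False),
    nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(128, 512, kernel_size=1),
)
model = nn.Sequential(block, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(512, 1000))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Dummy batch in place of ImageNet; one pass = one fine-tuning step.
images = torch.randn(8, 256, 14, 14)
labels = torch.randint(0, 1000, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```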
Analysis
The layer-wise analysis shows that the compression yields larger speed-ups on the smartphone than on the Titan X GPU, which the authors attribute to reduced cache conflicts and memory latencies on the mobile platform. The effect is especially pronounced for fully-connected layers, where the smaller weight matrices fit the cache far better. The paper also highlights the 1x1 convolution, which dominates both GoogLeNet's inception modules and the compressed models, yet is notably cache-inefficient; the sketch below illustrates why.
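One way to see the issue: a 1x1 convolution is exactly a matrix multiplication over the channel dimension. Every input pixel contributes to only one spatial output location, so there is no spatial reuse of data, and the operation tends to be bound by memory traffic rather than arithmetic. A small NumPy illustration (shapes hypothetical):

```python
import numpy as np

C_in, C_out, H, W = 256, 128, 14, 14
x = np.random.randn(C_in, H, W)
w = np.random.randn(C_out, C_in)  # a 1x1 kernel is just a (C_out, C_in) matrix

# 1x1 convolution == GEMM over the channel dimension: unlike a dxd kernel,
# each input pixel is never reused across spatial output locations.
y = (w @ x.reshape(C_in, H * W)).reshape(C_out, H, W)
```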
Implications and Future Directions
The implications of this research are both practical and theoretical. Practically, this compression scheme provides an efficient method to deploy deep learning models on resource-constrained mobile devices without substantial accuracy loss. Theoretically, it presents a streamlined approach to whole network compression, combining VBMF and Tucker decomposition in a practical framework.
Future research could explore the following:
- Optimal Rank Selection: Further investigation into whether the ranks chosen by VBMF are in fact optimal, and whether adaptive selection techniques could improve on them.
- Improving Cache Efficiency: Developing strategies to enhance the cache performance of 1x1 convolutions.
- Alternative Initialization and Regularization Methods: Exploring other initialization methods and integrating batch normalization to further improve the training of compressed models from scratch.
Conclusion
The proposed one-shot whole network compression scheme represents a significant step towards making deep CNNs more viable for mobile applications. The method achieves substantial improvements in model size, runtime, and energy consumption with minimal loss in accuracy. This approach sets the stage for further advancements in the efficient deployment of deep learning models in resource-constrained environments.