Overview of "Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications"
This paper addresses the problem of deploying deep convolutional neural networks (CNNs) on mobile devices, which have limited computational power, battery life, and memory. The authors propose "one-shot whole network compression", a scheme that compresses an entire CNN at once to make it suitable for mobile environments. The method consists of three steps: rank selection with variational Bayesian matrix factorization (VBMF), Tucker decomposition of each layer's kernel tensor, and fine-tuning to recover the lost accuracy.
Key Contributions
- One-Shot Whole Network Compression Scheme: The authors introduce a simple three-step pipeline for compressing CNNs (a code sketch of the decomposition step follows this list):
  - Rank Selection: VBMF determines the rank of each layer analytically.
  - Tucker Decomposition: The kernel tensor of each layer is compressed via Tucker decomposition at the selected ranks.
  - Fine-Tuning: Brief retraining recovers the accumulated loss of accuracy.
- Practical Implementation: Each step uses publicly available tools: VBMF for rank determination, the Tucker tensor toolbox for decomposition, and Caffe for fine-tuning, which makes the approach easy to reproduce and adopt.
- Empirical Evaluation: The scheme's effectiveness is demonstrated on several popular CNN architectures, including AlexNet, VGG-S, GoogLeNet, and VGG-16, evaluated on both a high-performance GPU (Titan X) and a smartphone (Samsung Galaxy S6). The results show substantial reductions in model size, runtime, and energy consumption with minimal accuracy loss.
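To make the decomposition step concrete, below is a minimal NumPy sketch of a Tucker-2 approximation of a convolutional kernel, initialized via truncated HOSVD. The `energy_rank` helper is a simplified energy-threshold stand-in for the paper's VBMF rank selection, and the layer shape at the bottom is hypothetical; the paper itself uses analytic VBMF and the Tucker tensor toolbox.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def energy_rank(matrix, threshold=0.9):
    """Stand-in for VBMF: the smallest rank keeping `threshold` of the
    squared singular-value energy of the unfolding."""
    s = np.linalg.svd(matrix, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, threshold)) + 1

def tucker2_kernel(K, R3=None, R4=None):
    """Tucker-2 approximation of a conv kernel K of shape (d, d, S, T):
    only the input-channel (S) and output-channel (T) modes are decomposed;
    the small spatial modes are left intact."""
    K_s, K_t = unfold(K, 2), unfold(K, 3)
    R3 = R3 or energy_rank(K_s)
    R4 = R4 or energy_rank(K_t)
    U_s = np.linalg.svd(K_s, full_matrices=False)[0][:, :R3]  # (S, R3)
    U_t = np.linalg.svd(K_t, full_matrices=False)[0][:, :R4]  # (T, R4)
    # Core tensor: contract K with the factor matrices along modes 2 and 3.
    core = np.einsum('hwst,sr,tq->hwrq', K, U_s, U_t)         # (d, d, R3, R4)
    return core, U_s, U_t

# Hypothetical layer: 3x3 kernel, 256 input channels, 512 output channels.
K = np.random.randn(3, 3, 256, 512)
core, U_s, U_t = tucker2_kernel(K, R3=128, R4=128)
```

The three factors map directly onto three lighter layers: a 1x1 convolution (S -> R3) from U_s, a d x d convolution (R3 -> R4) from the core, and a 1x1 convolution (R4 -> T) from U_t. This replacement is why 1x1 convolutions dominate the compressed networks discussed in the analysis below.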
Experimental Results
The results for the four networks are as follows:
- AlexNet: Achieved a 5.46x reduction in model size, 2.67x reduction in FLOPs, and 2.72x improvement in runtime on a smartphone, with a 1.70% accuracy loss.
- VGG-S: Presented a 7.40x reduction in model size, 4.80x reduction in FLOPs, and 3.68x runtime improvement on a smartphone, with a mere 0.55% accuracy loss.
- GoogLeNet: Demonstrated a 1.28x reduction in model size, 2.06x reduction in FLOPs, and 1.42x improvement in runtime on a mobile device, with a 0.24% accuracy loss.
- VGG-16: Achieved a 1.09x reduction in model size, 4.93x reduction in FLOPs, and 3.34x runtime speed-up on a smartphone, with a 0.50% accuracy loss.
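These reductions follow directly from parameter counting: a d x d convolution with S input and T output channels has d²ST weights, while its Tucker-2 replacement has S·R3 + d²·R3·R4 + T·R4. A quick check for the hypothetical layer from the sketch above (values chosen for illustration, not taken from the paper):

```python
d, S, T, R3, R4 = 3, 256, 512, 128, 128
original   = d * d * S * T                      # 1,179,648 weights
compressed = S * R3 + d * d * R3 * R4 + T * R4  # 245,760 weights
print(f"per-layer compression: {original / compressed:.1f}x")  # ~4.8x
```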
Additionally, fine-tuning quickly recovered the accuracy lost to compression, with most of the recovery achieved within the first epoch.
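The fine-tuning itself is ordinary end-to-end training at a small learning rate, starting from the decomposed weights. The paper uses Caffe; the following PyTorch sketch is only an illustrative translation, with a single 1x1 -> 3x3 -> 1x1 triple and dummy data standing in for a real compressed network and ImageNet.

```python
import torch
import torch.nn as nn

# One compressed block: the 1x1 -> dxd -> 1x1 triple produced by Tucker-2
# (S=256, R3=R4=128, T=512, matching the hypothetical layer above).
block = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1, bias=False),
    nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(128, 512, kernel_size=1),
)
model = nn.Sequential(block, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(512, 1000))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Dummy batch in place of ImageNet; one pass = one fine-tuning step.
images = torch.randn(8, 256, 14, 14)
labels = torch.randint(0, 1000, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```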
Analysis
The layer-wise analysis shows that the compression yields larger speed-ups on the smartphone than on the Titan X GPU, which the authors attribute to reduced cache conflicts and memory latencies on the mobile platform. The effect is especially pronounced for fully-connected layers, where the smaller weight matrices fit the cache far better. The paper also highlights the 1x1 convolution, which dominates both GoogLeNet's inception modules and the compressed models, yet is notably cache-inefficient; the sketch below illustrates why.
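One way to see the issue: a 1x1 convolution is exactly a matrix multiplication over the channel dimension. Every input pixel contributes to only one spatial output location, so there is no spatial reuse of data, and the operation tends to be bound by memory traffic rather than arithmetic. A small NumPy illustration (shapes hypothetical):

```python
import numpy as np

C_in, C_out, H, W = 256, 128, 14, 14
x = np.random.randn(C_in, H, W)
w = np.random.randn(C_out, C_in)  # a 1x1 kernel is just a (C_out, C_in) matrix

# 1x1 convolution == GEMM over the channel dimension: unlike a dxd kernel,
# each input pixel is never reused across spatial output locations.
y = (w @ x.reshape(C_in, H * W)).reshape(C_out, H, W)
```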
Implications and Future Directions
The implications of this research are both practical and theoretical. Practically, this compression scheme provides an efficient method to deploy deep learning models on resource-constrained mobile devices without substantial accuracy loss. Theoretically, it presents a streamlined approach to whole network compression, combining VBMF and Tucker decomposition in a practical framework.
Future research could explore the following:
- Optimal Rank Selection: Further investigation into whether the ranks chosen by VBMF are in fact optimal, and whether adaptive selection techniques could improve on them.
- Improving Cache Efficiency: Developing strategies to enhance the cache performance of 1x1 convolutions.
- Alternative Initialization and Regularization Methods: Exploring other initialization methods and integrating batch normalization to further improve the training of compressed models from scratch.
Conclusion
The proposed one-shot whole network compression scheme represents a significant step towards making deep CNNs more viable for mobile applications. The method achieves substantial improvements in model size, runtime, and energy consumption with minimal loss in accuracy. This approach sets the stage for further advancements in the efficient deployment of deep learning models in resource-constrained environments.