Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers (1802.00124v2)

Published 1 Feb 2018 in cs.LG

Abstract: Model pruning has become a useful technique that improves the computational efficiency of deep learning, making it possible to deploy solutions in resource-limited scenarios. A widely-used practice in relevant work assumes that a smaller-norm parameter or feature plays a less informative role at the inference time. In this paper, we propose a channel pruning technique for accelerating the computations of deep convolutional neural networks (CNNs) that does not critically rely on this assumption. Instead, it focuses on direct simplification of the channel-to-channel computation graph of a CNN without the need of performing a computationally difficult and not-always-useful task of making high-dimensional tensors of CNN structured sparse. Our approach takes two stages: first to adopt an end-to-end stochastic training method that eventually forces the outputs of some channels to be constant, and then to prune those constant channels from the original neural network by adjusting the biases of their impacting layers such that the resulting compact model can be quickly fine-tuned. Our approach is mathematically appealing from an optimization perspective and easy to reproduce. We experimented our approach through several image learning benchmarks and demonstrate its interesting aspects and competitive performance.

An Insightful Review of "Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers"

The paper "Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers" offers a significant deviation from traditional channel pruning methodologies in the optimization of deep convolutional neural networks (CNNs). Rather than adhering to the commonly held assumption that smaller-norm parameters are less informative, it introduces a two-stage approach focusing on pruning channels based on their contribution to the overall computation graph efficiency.

Summary of the Approach

The proposed method follows a two-stage procedure. First, an end-to-end stochastic training method forces the outputs of certain channels to become constant. These constant channels are then pruned from the computation graph, and their constant contributions are absorbed into the biases of the layers they feed. Because this adjustment largely preserves the behaviour of the remaining network, the resulting compact model can be quickly fine-tuned.
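
To make the bias-absorption step concrete, the following is a minimal sketch, assuming PyTorch (a framework choice not dictated by the paper). It folds the constant output of a pruned channel, typically the batch-normalization shift β passed through the activation, into the bias of the convolution that consumes it. The function name and the neglect of padding and boundary effects are illustrative simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

def absorb_constant_channel(next_conv: nn.Conv2d, channel_idx: int, constant_value: float) -> None:
    """Fold a pruned channel's constant output into the bias of the consuming conv layer.

    If the pruned input channel always emits `constant_value` (e.g. ReLU(beta) when the
    batch-norm scale gamma is zero), its contribution at every output position is a fixed
    offset: the sum of the kernel weights reading that channel, times the constant.
    Padding/boundary effects are ignored in this sketch.
    """
    with torch.no_grad():
        # Kernel slice that consumes the pruned input channel: shape [out_channels, kH, kW].
        w_k = next_conv.weight[:, channel_idx, :, :]
        offset = constant_value * w_k.sum(dim=(1, 2))   # one offset per output channel
        if next_conv.bias is None:
            next_conv.bias = nn.Parameter(offset.clone())
        else:
            next_conv.bias.add_(offset)
        # After absorbing the offset, the channel can be dropped from both layers' tensors.
```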

Key Methodological Elements:

  1. End-to-End Stochastic Training: Channels to prune are selected by driving their outputs toward constant values during training, rather than by inspecting parameter norms.
  2. Bias Adjustment: Channels that produce constant outputs are removed, and their constant contributions are folded into the biases of the layers they feed.
  3. ISTA-Based Optimization: The Iterative Shrinkage-Thresholding Algorithm (ISTA) is used to update the γ scaling parameters of batch normalization, promoting exact sparsity (see the sketch after this list).
  4. γ-W Rescaling Trick: A joint rescaling of the batch-normalization γ parameters and the associated convolution weights W helps reach a sparse solution more quickly, expediting the pruning process.
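
The sketch referenced in item 3 is below: a minimal, hedged illustration of one ISTA step on a batch-normalization layer's γ parameters, assuming PyTorch and that gradients have already been computed by a backward pass. The function name, learning rate, and penalty strength are illustrative choices, not the authors' code; the update is the proximal gradient step γ ← soft-threshold(γ − μ∇ℓ(γ), μλ), which drives some entries of γ exactly to zero so that the corresponding channels emit constant outputs.

```python
import torch
import torch.nn as nn

def ista_step_on_bn_gamma(bn: nn.BatchNorm2d, lr: float, penalty: float) -> None:
    """One ISTA (proximal gradient) update of a BatchNorm layer's scale parameters.

    Assumes loss.backward() has already populated bn.weight.grad. The soft-thresholding
    step corresponds to an l1 penalty of strength `penalty` on gamma and produces exact
    zeros, i.e. channels whose outputs collapse to a constant.
    """
    with torch.no_grad():
        gamma = bn.weight                        # gamma, one scale per channel
        z = gamma - lr * gamma.grad              # plain gradient step on the task loss
        # Proximal operator of the l1 penalty: soft-thresholding.
        gamma.copy_(torch.sign(z) * torch.clamp(z.abs() - lr * penalty, min=0.0))

# Illustrative usage inside a training loop (loop scaffolding assumed):
#   loss.backward()
#   for m in model.modules():
#       if isinstance(m, nn.BatchNorm2d):
#           ista_step_on_bn_gamma(m, lr=0.01, penalty=1e-4)
```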

The introduction of these elements contributes to a compelling framework for resource-constrained deployment scenarios of CNNs, achieving a competitive balance between model compactness and performance.

Evaluation and Results

Empirical evaluation on standard image classification benchmarks, CIFAR-10 and ILSVRC2012, demonstrates the method's competitiveness: it retains high accuracy while substantially reducing parameter counts and computational cost.

Numerical Highlights:

  • On CIFAR-10, significant model parameter savings are achieved whilst minimally affecting accuracy. For instance, in the case of model B, a reduction from approximately 1.99 million to around 208 thousand parameters resulted in a minor accuracy drop from 89.0% to 87.6%.
  • For ILSVRC2012, high efficacy was retained (with only 0.5% increase in Top-5 error rate) while reducing the parameter count substantially from the baseline ResNet-101 model.

The experiments also apply the method to an inception-like segmentation model, a practical setting in which pruning not only saves resources but, intriguingly, improves mean Intersection over Union (mIOU) scores on most of the benchmarks tested.

Theoretical and Practical Implications

The mathematical grounding of the method lends credibility to its optimization procedure and addresses potential numerical issues directly. By avoiding the smaller-norm-less-informative assumption, the work prompts a reevaluation of traditional pruning approaches whose reliance on parameter norms can introduce inherent inefficiencies.

Future Directions

The authors open a promising direction in model compression, laying the groundwork for future exploration of more efficient information-flow management in CNNs. Subsequent work may build on this foundation with adaptive channel-pruning techniques whose decisions are more data-driven and dynamic, reflecting how channel importance varies across deployment contexts.

In conclusion, the presented approach not only enriches the channel-pruning discourse but also advances thinking on CNN efficiency optimization. As such, it constitutes a valuable addition to the existing body of knowledge on computational improvements in deep learning architectures.

Authors (4)
  1. Jianbo Ye (17 papers)
  2. Xin Lu (165 papers)
  3. Zhe Lin (163 papers)
  4. James Z. Wang (36 papers)
Citations (398)