Pruning and Quantization for Deep Neural Network Acceleration: An Overview
The paper by Tailin Liang et al. presents a comprehensive survey of network compression techniques, specifically pruning and quantization, for accelerating deep neural networks (DNNs). The work is timely given the proliferation of complex DNN architectures that must run in real time under tight computational budgets.
Pruning Techniques
Pruning reduces the number of network parameters by removing those deemed redundant, in a way that does not significantly affect accuracy. The paper categorizes pruning into static and dynamic types:
- Static pruning is performed offline, using criteria such as magnitude-based and penalty-based approaches that retain only the most important connections (a minimal magnitude-based sketch follows this list). Penalty terms such as LASSO and Group LASSO can drive structured pruning, which maps efficiently onto hardware because the pruned network retains a regular structure.
- Dynamic pruning occurs at runtime, adaptively deciding which elements to prune based on the input data. This flexibility can preserve more of the network's capability without sacrificing accuracy, but it adds overhead because pruning decisions must be made on the fly.
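To make the magnitude-based criterion concrete, the following is a minimal sketch (ours, not the paper's implementation) of unstructured magnitude pruning in NumPy; the function name and the 70% sparsity target are illustrative assumptions.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries until `sparsity` of them are pruned."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

# Example: prune 70% of the weights of a random fully connected layer.
w = np.random.randn(256, 128).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.7)
print(f"achieved sparsity: {np.mean(w_pruned == 0):.2f}")
```

In a structured variant, the same idea is applied to whole rows, columns, or channels (e.g., ranking filters by their L1 norm) so that the surviving network stays dense and hardware-friendly.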
The authors' practical guidance includes combining static and dynamic pruning, retraining (fine-tuning) the network after pruning to recover accuracy, and exploiting parameter redundancy deliberately, as in the iterative schedule sketched below.
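As an illustration of the retrain-after-pruning strategy, here is a short sketch using PyTorch's torch.nn.utils.prune on a toy model with synthetic data; the model size, the three rounds, and the 30% per-round amount are assumptions made for this example, not values from the survey.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model and synthetic data stand in for a real network and dataset.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(512, 64), torch.randint(0, 10, (512,))

for round_idx in range(3):                       # three prune/retrain rounds
    # Prune 30% of the remaining weights in every Linear layer by L1 magnitude.
    for m in model.modules():
        if isinstance(m, nn.Linear):
            prune.l1_unstructured(m, name="weight", amount=0.3)
    # Brief fine-tuning to recover accuracy; the pruning masks stay fixed.
    for _ in range(20):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

# Fold the masks into the weights so the pruned model can be exported as-is.
for m in model.modules():
    if isinstance(m, nn.Linear):
        prune.remove(m, "weight")
```

The same loop structure applies regardless of the pruning criterion; only the call that computes the mask changes.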
Quantization Approaches
Quantization reduces the precision of weights, biases, and activations, typically converting FP32 parameters to low-bit integer representations such as INT8. The survey distinguishes between:
- Quantization-Aware Training (QAT): trains the network with quantized values so that it learns to compensate for the information loss and maintains accuracy. Techniques such as stochastic quantization of gradients are emphasized for preserving performance even at lower bit-widths.
- Post-Training Quantization (PTQ): quantizes an already-trained model, typically using a calibration step to choose scales while maintaining accuracy. Symmetric and asymmetric quantization schemes are discussed, with emphasis on the trade-off between implementation complexity and precision; both schemes are sketched below.
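The difference between the two schemes is easiest to see in code. Below is a minimal NumPy sketch, written for this summary rather than taken from the survey, of per-tensor symmetric and asymmetric (affine) INT8 quantization with the corresponding dequantization step.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, num_bits: int = 8):
    """Symmetric: scale from max |x|, zero point fixed at 0, signed integers."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def quantize_asymmetric(x: np.ndarray, num_bits: int = 8):
    """Asymmetric (affine): maps [min, max] onto the full unsigned range via a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1              # [0, 255] for INT8
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int = 0) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(1000).astype(np.float32)
q_s, s_s = quantize_symmetric(x)
q_a, s_a, zp = quantize_asymmetric(x)
print("symmetric  mean abs error:", np.abs(x - dequantize(q_s, s_s)).mean())
print("asymmetric mean abs error:", np.abs(x - dequantize(q_a, s_a, zp)).mean())
```

Symmetric quantization is cheaper to implement (no zero-point arithmetic in the integer kernels), while the asymmetric scheme uses the integer range more fully when the data are not centered around zero; this is the complexity-versus-precision trade-off mentioned above.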
The authors also survey frameworks such as TensorFlow-Lite and platform-specific libraries, giving a useful overview of the tools available for deploying low-precision networks efficiently.
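As a hedged example of that tooling, the standard post-training quantization flow in TensorFlow Lite looks roughly as follows; the SavedModel path, input shape, and calibration generator are placeholders for a real model and representative data.

```python
import tensorflow as tf

# Convert a trained SavedModel to an 8-bit TensorFlow Lite model using
# post-training quantization with a small calibration dataset.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # Yield a handful of representative inputs so the converter can
    # calibrate activation ranges (scales and zero points).
    for _ in range(100):
        yield [tf.random.normal([1, 224, 224, 3])]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```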
Numerical Results and Implications
Quantitative comparisons across these techniques support several conclusions. For instance, 8-bit quantization typically yields a 2-3x speedup while keeping accuracy losses small (often below 1%). Pruning combined with retraining can drastically reduce the number of parameters without significant accuracy degradation, enabling deployment in resource-constrained environments.
Theoretical and Practical Implications
The survey has implications for both theory and practice. Theoretically, it highlights the redundancy within neural networks, suggesting that substantial compression is achievable without compromising performance. Practically, it serves as a guide for deploying efficient DNNs on edge and mobile devices where computational resources are limited.
Future Developments in AI
The paper concludes with directions for future research, including automated compression techniques and the application of these methods to neural network types beyond CNNs. It also emphasizes the need for hardware-software co-design to maximize efficiency, potentially paving the way for architectures optimized for low-precision execution.
Overall, Liang et al. provide a detailed examination of pruning and quantization and chart clear paths for accelerating DNNs while maintaining accuracy, making the survey a useful reference for deploying AI in practical applications.