Post-training 4-bit quantization of convolution networks for rapid-deployment (1810.05723v3)

Published 2 Oct 2018 in cs.CV

Abstract: Convolutional neural networks require significant memory bandwidth and storage for intermediate computations, apart from substantial computing resources. Neural network quantization has significant benefits in reducing the amount of intermediate results, but it often requires the full dataset and time-consuming fine-tuning to recover the accuracy lost after quantization. This paper introduces the first practical 4-bit post-training quantization approach: it does not involve training the quantized model (fine-tuning), nor does it require the availability of the full dataset. We target the quantization of both activations and weights and suggest three complementary methods for minimizing quantization error at the tensor level, two of which obtain a closed-form analytical solution. Combining these methods, our approach achieves accuracy that is just a few percent less than the state-of-the-art baseline across a wide range of convolutional models. The source code to replicate all experiments is available on GitHub: \url{https://github.com/submission2019/cnn-quantization}.

Post Training 4-bit Quantization of Convolutional Networks for Rapid Deployment

The paper introduces a novel method for 4-bit post-training quantization of convolutional networks, addressing the significant computational costs that accompany deep learning models, particularly Convolutional Neural Networks (CNNs). The focus is on reducing memory and power consumption while maintaining near state-of-the-art accuracy without retraining and without access to the full dataset, a common constraint in practical deployment due to data privacy or unavailable training resources.

Core Contributions and Methodologies

The research presents three primary methods for minimizing quantization error in the post-training setting, where a trained model is adjusted without any further retraining; brief illustrative code sketches of each follow the list:

  1. Analytical Clipping for Integer Quantization (ACIQ): This method involves optimizing the clipping of activation values. By clipping these values to a computed optimal threshold, this approach aims to minimize the mean-squared quantization error for tensor distributions, which often exhibit a bell-curve shape. The application of ACIQ shows an average improvement of 3.2% over baseline methods.
  2. Per-channel Bit Allocation: This approach introduces a novel policy for allocating bit-widths across channels based on the statistical properties of the input distribution. Minimizing the overall mean-squared error yields a closed-form rule: the quantization step size of each channel should be proportional to the 2/3-power of that channel's range. This results in an average improvement of 6.3% over baselines for weight quantization and 2.85% for activation quantization.
  3. Bias Correction: The paper analyzes and corrects inherent biases in quantized weights. By adjusting for the mean and variance distortions post-quantization, the authors report an improvement of approximately 6.0% in validation accuracy over 4-bit baselines.
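As a rough illustration of these three ideas, the sketches below use plain NumPy and invented helper names (`quantize_uniform`, `aciq_like_clipping`, `per_channel_bit_allocation`, `bias_correction`); they are simplified stand-ins for the paper's closed-form derivations and released code, not excerpts from it.

The first sketch approximates ACIQ numerically: instead of the paper's analytical threshold for Laplace- or Gaussian-shaped tensors, it searches over candidate clipping values and keeps the one with the lowest quantization MSE.

```python
import numpy as np

def quantize_uniform(x, num_bits, alpha):
    """Uniformly quantize x after clipping it to [-alpha, alpha]."""
    x_clipped = np.clip(x, -alpha, alpha)
    scale = (2 * alpha) / (2 ** num_bits - 1)      # quantization step size
    q = np.round((x_clipped + alpha) / scale)      # map onto the integer grid
    return q * scale - alpha                       # dequantize back to floats

def aciq_like_clipping(x, num_bits=4, num_candidates=100):
    """Numerically pick the clipping threshold with minimal quantization MSE.

    The paper derives this threshold in closed form for bell-shaped
    (Laplace/Gaussian) tensors; a grid search reaches the same optimum.
    """
    max_abs = np.abs(x).max()
    candidates = np.linspace(max_abs / num_candidates, max_abs, num_candidates)
    mse = [np.mean((x - quantize_uniform(x, num_bits, a)) ** 2) for a in candidates]
    return candidates[int(np.argmin(mse))]
```

The second sketch turns the 2/3-power rule into concrete bit-widths. With a uniform quantizer the step size is the channel range divided by the number of levels, so a step proportional to range^(2/3) implies bits_i = const + (1/3)·log2(range_i), with the constant fixed by the average bit budget.

```python
def per_channel_bit_allocation(channel_ranges, avg_bits=4):
    """Allocate per-channel bit-widths so each channel's quantization step
    is proportional to its range raised to the 2/3 power."""
    ranges = np.asarray(channel_ranges, dtype=np.float64)
    log_term = np.log2(ranges) / 3.0
    const = avg_bits - log_term.mean()             # enforce the average bit budget
    bits = const + log_term
    # Hardware needs small positive integers; rounding perturbs the budget slightly.
    return np.clip(np.round(bits), 2, 8).astype(int)
```

The third sketch performs a simple per-channel bias correction: after quantization, each channel of the weight tensor is rescaled and shifted so that its mean and standard deviation match those of the original float weights.

```python
def bias_correction(w_float, w_quant):
    """Match the per-channel mean and std of quantized weights to the float ones.

    Assumes the output-channel axis is axis 0 (the usual Conv2d layout).
    """
    corrected = np.empty_like(w_quant)
    for c in range(w_float.shape[0]):
        wf, wq = w_float[c].ravel(), w_quant[c].ravel()
        scale = wf.std() / (wq.std() + 1e-12)      # undo the variance distortion
        shift = wf.mean() - scale * wq.mean()      # undo the mean (bias) shift
        corrected[c] = scale * w_quant[c] + shift
    return corrected
```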

Integration of Methods and Results

In their experimental evaluation, the authors combine ACIQ, per-channel bit allocation, and bias correction to quantize both weights and activations. The combined method recovers much of the accuracy loss typically associated with 4-bit quantization, achieving near-floating-point performance on well-known ImageNet models such as ResNet, VGG, and Inception. Importantly, the methods avoid costly retraining, thereby facilitating rapid deployment.
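As a purely illustrative composition of the sketches above (toy data, hypothetical function names, not the paper's evaluation pipeline), weights are quantized per channel with the allocated bit-widths and then bias-corrected, while the activation clipping threshold is estimated once from a small calibration batch and reused at inference time:

```python
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 3, 3, 3))                    # toy Conv2d weight tensor
calib_acts = rng.laplace(size=100_000)                # stand-in calibration activations

bits = per_channel_bit_allocation(np.ptp(w.reshape(64, -1), axis=1), avg_bits=4)
w_q = np.stack([quantize_uniform(w[c], int(bits[c]), np.abs(w[c]).max())
                for c in range(64)])
w_q = bias_correction(w, w_q)                         # repair mean/variance distortion
act_clip = aciq_like_clipping(calib_acts, num_bits=4) # threshold reused at inference
```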

Implications and Future Work

The implications of these findings are significant for the practical deployment of deep learning models, enabling developers to effectively utilize pre-trained models within resource-constrained environments such as edge devices without the need for data-heavy retraining processes. This addresses a crucial gap in industries reliant on off-the-shelf deep learning models where data accessibility is often restricted.

Moving forward, one potential area of exploration is precision below four bits, examining whether the proposed methods remain effective as bit-widths shrink further. Additionally, extending these techniques to Recurrent Neural Networks (RNNs) or Transformer models could have wide-reaching implications across various domains.

In conclusion, the paper makes substantial contributions, providing practical, theoretically grounded solutions to post-training quantization and paving the way for more efficient AI model deployment in real-world applications. The techniques balance computational efficiency with performance, a balance essential for contemporary AI advancements.

Authors (4)
  1. Ron Banner (20 papers)
  2. Yury Nahshan (6 papers)
  3. Elad Hoffer (23 papers)
  4. Daniel Soudry (76 papers)
Citations (87)