Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights (1702.03044v2)

Published 10 Feb 2017 in cs.CV, cs.AI, and cs.NE

Abstract: This paper presents incremental network quantization (INQ), a novel method for efficiently converting any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version whose weights are constrained to be either powers of two or zero. Unlike existing methods, which struggle with noticeable accuracy loss, our INQ has the potential to resolve this issue thanks to two innovations. On one hand, we introduce three interdependent operations, namely weight partition, group-wise quantization and re-training. A well-proven measure is employed to divide the weights in each layer of a pre-trained CNN model into two disjoint groups. The weights in the first group are responsible for forming a low-precision base, and thus they are quantized by a variable-length encoding method. The weights in the other group are responsible for compensating for the accuracy loss from the quantization, and thus they are the ones to be re-trained. On the other hand, these three operations are repeated on the latest re-trained group in an iterative manner until all the weights are converted into low-precision ones, acting as an incremental network quantization and accuracy enhancement procedure. Extensive experiments on the ImageNet classification task using almost all known deep CNN architectures, including AlexNet, VGG-16, GoogleNet and ResNets, testify to the efficacy of the proposed method. Specifically, at 5-bit quantization, our models achieve improved accuracy over the 32-bit floating-point references. Taking ResNet-18 as an example, we further show that our quantized models with 4-bit, 3-bit and 2-bit ternary weights achieve improved or very similar accuracy compared with the 32-bit floating-point baseline. Besides, impressive results from the combination of network pruning and INQ are also reported. The code is available at https://github.com/Zhouaojun/Incremental-Network-Quantization.

Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

In the paper titled "Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights," Aojun Zhou et al. introduce a technique termed Incremental Network Quantization (INQ). The method efficiently transforms pre-trained full-precision convolutional neural networks (CNNs) into low-precision counterparts whose weights are constrained to be either powers of two or zero. This low-precision format enables substantial computational efficiency gains, since the original floating-point multiplications can be replaced by binary bit-shift operations on dedicated hardware such as FPGAs.
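
To make the hardware argument concrete, the toy sketch below (not taken from the paper; the fixed-point activation assumption and the helper function are purely illustrative) shows how a multiply by a weight of the form ±2^n reduces to a shift plus a sign flip:

```python
def shift_multiply(activation: int, exponent: int, sign: int) -> int:
    """Compute activation * (sign * 2**exponent) without a hardware multiplier.

    Illustrative only: assumes fixed-point (integer) activations; a right
    shift stands in for division by 2**|exponent| when the exponent is negative.
    """
    if sign == 0:          # weight quantized to zero: the product vanishes
        return 0
    shifted = activation << exponent if exponent >= 0 else activation >> -exponent
    return shifted if sign > 0 else -shifted

assert shift_multiply(12, 3, +1) == 12 * 2 ** 3   # 96, via a left shift
assert shift_multiply(12, -2, -1) == -3           # -(12 >> 2), via a right shift
```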

Methodology and Innovations

INQ stands out by employing a three-pronged approach: weight partition, group-wise quantization, and re-training. These operations are executed iteratively until all weights are quantized, keeping accuracy loss minimal, which is a significant challenge for existing quantization methods. A prominent feature of INQ is the use of a pruning-inspired measure to divide the weights of each layer into two groups: one forms the low-precision base via variable-length encoding, while the other compensates for the quantization-induced accuracy loss through re-training. Repeating these operations on the remaining full-precision weights lets the quantized model recover its accuracy at every iteration.
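
A minimal NumPy sketch of one such iteration is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the partition fraction, the 1e-32 floor, and the "round the exponent in the log domain" rule are choices made here, while the paper defines its own quantization boundaries and re-training schedule.

```python
import numpy as np

def quantize_pow2(w, n1, n2):
    """Snap weights to the nearest value in {0, ±2^n2, ..., ±2^n1} (n2 <= n1).

    Sketch only: the exponent is rounded in the log domain, and weights too
    small to reach the lowest power of two are quantized to zero.
    """
    exp = np.clip(np.round(np.log2(np.maximum(np.abs(w), 1e-32))), n2, n1)
    q = np.sign(w) * 2.0 ** exp
    q[np.abs(w) < 2.0 ** (n2 - 1)] = 0.0
    return q

def inq_step(weights, quantized_mask, fraction, n1, n2):
    """One INQ iteration: partition the still-free weights, quantize the
    larger-magnitude group, and return the updated weights and mask."""
    free = ~quantized_mask
    k = int(fraction * free.sum())              # how many free weights to quantize now
    to_quantize = np.zeros_like(free)
    if k > 0:
        magnitudes = np.where(free, np.abs(weights), -np.inf)
        idx = np.argsort(magnitudes, axis=None)[-k:]   # pruning-inspired: largest first
        to_quantize.flat[idx] = True
    new_weights = np.where(to_quantize, quantize_pow2(weights, n1, n2), weights)
    return new_weights, quantized_mask | to_quantize
```

In this reading, the SGD updates performed between calls to inq_step are applied only at positions where the returned mask is still False, so the already-quantized base stays frozen while the remaining full-precision weights absorb the quantization error.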

Experimental Results

Extensive experiments validate INQ's efficacy on a breadth of deep CNN architectures including AlexNet, VGG-16, GoogleNet, and ResNets, tested on the ImageNet dataset. Notably, at 5-bit quantization, models converted using INQ consistently demonstrate improved accuracy compared to their 32-bit floating-point baselines. For instance, AlexNet sees a decrease in top-1 error rate from 42.76% to 42.61%, and VGG-16 shows an even more remarkable improvement from 31.46% to 29.18% in top-1 error rate and from 11.35% to 9.70% in top-5 error rate.

Analysis of Partition Strategies

The analysis also compares two weight partition strategies: random partitioning and pruning-inspired partitioning. Pruning-inspired partitioning, which accounts for weight importance (largely based on magnitude), outperforms random partitioning. For example, on ResNet-18 the pruning-inspired strategy yields a top-1 error rate of 31.02%, compared with 32.11% for random partitioning.

Bit-width vs. Model Accuracy Trade-off

The authors explore how far the bit-width can be reduced, with detailed results showing that even at 3-bit and 2-bit ternary weights, INQ maintains competitive accuracy relative to full-precision models. For ResNet-18 in particular, the INQ-based 4-bit, 3-bit and 2-bit ternary models remain close to the 32-bit floating-point baseline, with minimal accuracy loss.
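
For context, the candidate weight values at a given bit-width form a small set of signed powers of two plus zero. The sketch below follows the variable-length encoding described in the paper (one code reserved for zero, the remaining bits indexing signed powers of two); the exponent-bound formulas are offered here as assumptions about the paper's construction rather than a verbatim reproduction.

```python
import math

def pow2_candidates(max_abs_weight: float, bits: int):
    """Candidate values {0, ±2^n2, ..., ±2^n1} for a target bit-width.

    Assumed construction: one code is reserved for zero, and the remaining
    bits-1 bits index 2**(bits-1) signed powers of two.
    """
    n1 = math.floor(math.log2(4.0 * max_abs_weight / 3.0))   # largest exponent, from the layer's max |w|
    n2 = n1 + 1 - 2 ** (bits - 1) // 2                       # smallest exponent, from the bit budget
    values = [0.0]
    for n in range(n2, n1 + 1):
        values += [2.0 ** n, -(2.0 ** n)]
    return sorted(values)

# e.g. a layer whose largest weight magnitude is 0.9, quantized to 5 bits:
print(pow2_candidates(0.9, 5))   # 17 values: 0 and ±2^-7 ... ±2^0
```

Under this construction, bits=2 collapses the set to {-2^n1, 0, +2^n1}, which is exactly the ternary case discussed above, so dropping from 5 bits to 2 bits shrinks the per-layer codebook from 17 values to 3.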

Implications in Network Compression

INQ's ability to integrate with network pruning methods for enhanced compression is discussed, and results show significant compression improvements compared to state-of-the-art deep compression techniques. For example, when combined with Dynamic Network Surgery (DNS), INQ achieves a compression ratio of 53x with a meager 0.08% increase in top-1 error, marking a substantial improvement over previous methods.

Future Directions

The primary future direction includes extending INQ to constrain not just weights but also activations and gradients, aiming for low-bit representations in these components as well. The paper already indicates initial success in applying INQ to weights and activations, suggesting potential significant advances in both network efficiency and performance.

Conclusion

INQ represents a notable advancement in the field of CNN quantization, offering a robust, iterative approach that preserves model accuracy while dramatically reducing bit-width requirements. The method's effectiveness across various architectures and its ease of convergence underscore its practical utility for deploying deep learning models on resource-constrained devices. Future work may extend INQ’s principles to other network components and hardware implementations, further enhancing its applicability and impact in efficient deep learning model deployment.

Authors (5)
  1. Aojun Zhou
  2. Anbang Yao
  3. Yiwen Guo
  4. Lin Xu
  5. Yurong Chen
Citations (1,025)