Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights
In the paper "Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights," Aojun Zhou et al. introduce Incremental Network Quantization (INQ), a method for efficiently converting pre-trained full-precision convolutional neural networks (CNNs) into low-precision counterparts whose weights are constrained to be either powers of two or zero. This format enables substantial computational savings, since the original floating-point multiplications can be replaced by binary bit-shift operations on dedicated hardware such as FPGAs, as the short example below illustrates.
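To make the hardware argument concrete, here is a minimal Python illustration; the values and the integer fixed-point framing are assumptions for demonstration, not taken from the paper. Multiplying a fixed-point activation by a power-of-two weight reduces to a shift:

# An activation in integer fixed-point form (illustrative value).
activation = 200

# Multiplying by the power-of-two weight 2**-3 ...
product_mul = activation * 2 ** -3    # 25.0, needs a multiplier circuit

# ... is equivalent to an arithmetic right shift by 3, which is far
# cheaper in hardware (exact here because 200 is divisible by 8).
product_shift = activation >> 3       # 25

assert product_mul == product_shift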
Methodology and Innovations
INQ rests on three operations: weight partition, group-wise quantization, and re-training. These are executed iteratively until all weights are quantized, keeping accuracy loss minimal, which remains a significant challenge for existing quantization methods. A key feature is the use of a pruning-inspired measure to split each layer's weights into two groups: one is quantized immediately to form the low-precision base (stored via variable-length encoding), while the other stays in full precision and is re-trained to compensate for the accuracy loss the quantization introduces. Repeating this cycle over progressively larger portions of the weights lets the model recover its accuracy at every step; a minimal sketch of the cycle follows.
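The NumPy sketch below runs the partition/quantize/re-train cycle on a single weight matrix, under stated assumptions: the magnitude-based partition and the 50/75/87.5/100% schedule follow the paper's description, but the nearest-level rounding is a simplification of the paper's quantization rule, and the re-training step (full-network SGD that updates only the not-yet-quantized weights) is left as a comment.

import numpy as np

def inq_quantize(weights, bits=5, portions=(0.5, 0.75, 0.875, 1.0)):
    w = weights.astype(float).copy()
    frozen = np.zeros(w.shape, dtype=bool)   # weights already quantized

    # Bounds of the power-of-two set {0} U {+/-2^n : n2 <= n <= n1};
    # n1 follows the paper's floor(log2(4s/3)) rule, with s = max |w|.
    s = np.max(np.abs(w))
    n1 = int(np.floor(np.log2(4.0 * s / 3.0)))
    n2 = n1 + 1 - 2 ** (bits - 2)            # one code is reserved for zero
    exps = np.arange(n2, n1 + 1)
    levels = np.concatenate(([0.0], 2.0 ** exps, -(2.0 ** exps)))

    for portion in portions:
        need = int(round(portion * w.size)) - int(frozen.sum())
        if need > 0:
            # Weight partition: among the not-yet-quantized weights, take
            # the `need` largest magnitudes (the pruning-inspired measure).
            cand = np.where(~frozen.ravel())[0]
            pick = cand[np.argsort(np.abs(w.ravel()[cand]))[::-1][:need]]
            new = np.zeros(w.shape, dtype=bool)
            new.flat[pick] = True

            # Group-wise quantization: snap each newly frozen weight to
            # its nearest level (a simplification of the paper's rule).
            vals = w[new]
            w[new] = levels[np.argmin(np.abs(levels[:, None] - vals), axis=0)]
            frozen |= new

        # Re-training would happen here: SGD over the whole network,
        # updating only the weights where `frozen` is False.
    return w

# After the final 100% portion, only zero and powers of two remain:
wq = inq_quantize(np.random.default_rng(0).normal(size=(64, 64)))
print(np.unique(np.abs(wq)))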
Experimental Results
Extensive experiments validate INQ's efficacy on a range of deep CNN architectures, including AlexNet, VGG-16, GoogLeNet, and ResNets, evaluated on the ImageNet dataset. Notably, at 5-bit quantization, models converted with INQ consistently match or improve upon their 32-bit floating-point baselines. For instance, AlexNet's top-1 error rate drops from 42.76% to 42.61%, and VGG-16 improves even more markedly: from 31.46% to 29.18% in top-1 error and from 11.35% to 9.70% in top-5 error.
Analysis of Partition Strategies
The analysis also compares two weight partition strategies: random partitioning and pruning-inspired partitioning. Pruning-inspired partitioning, which ranks weights by importance (approximated by their magnitude) and quantizes the most important ones first, clearly outperforms the random alternative: on ResNet-18 it yields a top-1 error rate of 31.02%, versus 32.11% with random partitioning. A sketch of the two rules appears below.
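The two strategies differ only in how the frozen group is chosen, as this small NumPy sketch makes explicit (the function names and the boolean-mask interface are my own framing, not the paper's):

import numpy as np

def pruning_inspired_partition(w, portion):
    # Freeze the largest-magnitude fraction `portion` of the weights,
    # treating magnitude as a proxy for importance (the winning strategy).
    k = int(round(portion * w.size))
    mask = np.zeros(w.shape, dtype=bool)
    mask.flat[np.argsort(np.abs(w), axis=None)[::-1][:k]] = True
    return mask

def random_partition(w, portion, seed=0):
    # Freeze a uniformly random subset of the same size (the baseline
    # the paper finds inferior).
    k = int(round(portion * w.size))
    mask = np.zeros(w.shape, dtype=bool)
    mask.flat[np.random.default_rng(seed).choice(w.size, k, replace=False)] = True
    return mask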
Bit-width vs. Model Accuracy Trade-off
The authors also probe how far the bit-width can be pushed. Even when reduced to 3-bit weights, and to 2-bit ternary weights (each weight taking one of only three values), INQ maintains accuracy competitive with the full-precision models. For ResNet-18 in particular, the 3-bit and 2-bit INQ models show remarkable resilience, preserving the model's efficacy with minimal accuracy loss. The sketch below shows how quickly the candidate value set shrinks with bit-width.
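Enumerating the candidate weight values makes the trade-off tangible. This helper assumes the same conventions as the earlier sketch (one code reserved for zero, n1 set by the layer's largest weight; the example value n1 = -1 is arbitrary):

import numpy as np

def pow2_value_set(bits, n1):
    # All representable weight values {0} U {+/-2^n : n2 <= n <= n1}.
    n2 = n1 + 1 - 2 ** (bits - 2)
    exps = np.arange(n2, n1 + 1)
    return np.sort(np.concatenate(([0.0], 2.0 ** exps, -(2.0 ** exps))))

print(len(pow2_value_set(5, -1)))   # 17 candidate values at 5 bits
print(len(pow2_value_set(3, -1)))   # 5 candidate values at 3 bits
print(pow2_value_set(2, -1))        # ternary: [-0.5  0.   0.5]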
Implications in Network Compression
INQ also combines naturally with network pruning for greater compression, and the joint results surpass state-of-the-art deep compression techniques. For example, combined with Dynamic Network Surgery (DNS), INQ achieves a 53x compression ratio at the cost of only a 0.08% increase in top-1 error, a substantial improvement over previous methods. A back-of-envelope accounting of such a ratio is sketched below.
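To see roughly where a ratio of that magnitude can come from, here is a back-of-envelope calculation. All numbers in it (the surviving-weight density, the per-weight index overhead) are illustrative assumptions of mine, not the paper's storage scheme:

def compression_ratio(density, bits, index_bits=0):
    # Ratio of dense 32-bit storage to pruned low-precision storage,
    # where `density` is the fraction of weights kept after pruning and
    # `index_bits` approximates the cost of encoding sparse positions.
    return 32.0 / (density * (bits + index_bits))

# e.g. ~10% of weights kept, 5-bit INQ codes, 1 bit of index overhead:
print(round(compression_ratio(0.10, 5, index_bits=1), 1))   # 53.3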
Future Directions
The primary future direction is to extend INQ beyond weights, constraining activations and gradients to low-bit representations as well. The paper already reports initial success in applying INQ jointly to weights and activations, which suggests significant further gains in both network efficiency and performance are within reach.
Conclusion
INQ represents a notable advance in CNN quantization: a robust, iterative approach that preserves model accuracy while dramatically reducing bit-width requirements. Its effectiveness across diverse architectures and its ease of convergence underscore its practical utility for deploying deep learning models on resource-constrained devices. Future work may extend INQ's principles to other network components and to hardware implementations, further broadening its applicability and impact in efficient model deployment.