Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Abstract
The paper presents a quantization scheme aimed at enabling efficient, integer-only arithmetic inference in deep learning models. This method addresses the limitations posed by floating-point operations, particularly in resource-constrained environments such as mobile devices. The authors combine this quantization scheme with a tailored training procedure that ensures minimal accuracy loss, thereby achieving a notably improved balance between model accuracy and on-device latency. This improvement is demonstrated on MobileNets, a family of models optimized for runtime efficiency, and validated using ImageNet classification and COCO detection tasks on common CPUs.
Introduction
Convolutional Neural Networks (CNNs) have revolutionized various tasks in computer vision, but their deployment on mobile devices remains challenging due to resource constraints. The paper targets this challenge by proposing an efficient quantization approach that translates models into integer-only arithmetic suitable for commonly available hardware. By leveraging integer operations, which are naturally supported and optimized on many CPUs, the authors demonstrate notable reductions in both computational cost and memory usage.
Quantization Scheme
The core of the proposed method is an affine quantization scheme mapping real-valued model parameters and activations to integer values. Specifically, a real value r is represented as r = S(q − Z),
where q is the quantized integer, S is a real-valued scale, and Z is the zero-point, i.e. the integer to which the real value 0 maps exactly. Exact representability of zero is crucial because zero-padding is ubiquitous in convolutional layers. Both weights and activations are quantized to 8-bit integers, while bias vectors are quantized to 32-bit integers to avoid precision loss during accumulation.
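To make the scheme concrete, here is a minimal NumPy sketch of affine quantization and dequantization. The helper names (choose_quant_params, quantize, dequantize) are illustrative, not the authors' reference implementation; note how the representable range is widened to include 0 so that the zero-point is an exact integer.

```python
import numpy as np

def choose_quant_params(r_min, r_max, num_bits=8):
    """Pick scale S and zero-point Z so that the real value 0.0
    maps exactly to an integer (illustrative helper, not from the paper)."""
    # Widen the range to include 0: zero-padding must be exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    q_min, q_max = 0, 2 ** num_bits - 1
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, max(q_min, min(q_max, zero_point))

def quantize(r, scale, zero_point, num_bits=8):
    # q = round(r / S) + Z, clamped to the representable integer range.
    q = np.round(r / scale) + zero_point
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # r = S * (q - Z)
    return scale * (q.astype(np.float32) - zero_point)
```

Biases follow the same affine scheme but with a 32-bit integer range, a scale equal to the product of the input and weight scales, and a zero-point fixed to 0, so they can be added directly into the int32 accumulator.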
Efficiency of Integer-Only Arithmetic
Integer-only inference requires efficient integer implementations of core operations such as convolution and matrix multiplication. The paper dissects the computational flow and shows that all arithmetic can be kept in integers while matching the original model's behavior: the only real-valued quantity that survives quantization, the multiplier M = S1*S2/S3 combining the input, weight, and output scales, is precomputed offline and applied at runtime as a fixed-point integer multiply followed by a bit shift. On ARM architectures, the authors use NEON SIMD instructions to accelerate these integer kernels.
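The NumPy sketch below illustrates this pipeline for a quantized matrix multiply, following the paper's identity q3 = Z3 + M * sum((q1 − Z1)(q2 − Z2)) with M = S1*S2/S3, which is typically in (0, 1). The decomposition of M into an int32 fixed-point multiplier and a right shift mirrors the approach used by gemmlowp, but the function names here are hypothetical.

```python
import numpy as np

def quantize_multiplier(m):
    """Decompose the real multiplier M = S1*S2/S3, assumed in (0, 1),
    into m0 * 2**-shift with m0 a Q0.31 fixed-point value in [0.5, 1)."""
    assert 0.0 < m < 1.0
    shift = 0
    while m < 0.5:            # normalize into [0.5, 1)
        m *= 2.0
        shift += 1
    m0 = int(round(m * (1 << 31)))
    return m0, shift

def int_matmul(q1, z1, q2, z2, z3, m0, shift, bias=None):
    """Integer-only matmul: int32 accumulation of (q1 - Z1)(q2 - Z2),
    optional int32 bias add, then fixed-point rescale and requantize."""
    acc = (q1.astype(np.int32) - z1) @ (q2.astype(np.int32) - z2)
    if bias is not None:
        acc = acc + bias      # bias pre-quantized to int32 with scale S1*S2, Z=0
    # Rounding fixed-point multiply: (acc * m0) / 2**31, then >> shift.
    prod = (acc.astype(np.int64) * m0 + (1 << 30)) >> 31
    if shift > 0:
        prod = (prod + (1 << (shift - 1))) >> shift
    return np.clip(prod + z3, 0, 255).astype(np.uint8)
```

Everything at runtime reduces to integer adds, multiplies, and shifts; no floating-point unit is needed.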
Training with Simulated Quantization
To maintain high accuracy after quantization, the paper trains with simulated quantization: the forward pass rounds weights and activations to the 8-bit grid exactly as integer inference would, while backpropagation still uses floating-point arithmetic. This lets the model parameters adapt to quantization effects and substantially narrows the accuracy gap between the floating-point and quantized models.
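A minimal PyTorch sketch of this "fake quantization" is shown below; the paper's implementation uses TensorFlow's fake-quantization ops, so the class and variable names here are purely illustrative. The backward pass uses the straight-through estimator, treating the rounding step as the identity.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Forward: round to the 8-bit grid exactly as integer inference would.
    Backward: straight-through estimator (round() treated as identity)."""

    @staticmethod
    def forward(ctx, x, scale, zero_point):
        q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255)
        return scale * (q - zero_point)   # de-quantized value used downstream

    @staticmethod
    def backward(ctx, grad_output):
        # Gradients flow unchanged to x; scale and zero_point receive none
        # (in the paper they come from observed min/max ranges, not gradients).
        return grad_output, None, None

# Example: fake-quantize a weight tensor from its observed min/max range.
w = torch.randn(64, 128, requires_grad=True)
r_min, r_max = min(float(w.min()), 0.0), max(float(w.max()), 0.0)
scale = (r_max - r_min) / 255.0
zero_point = round(-r_min / scale)
w_q = FakeQuant.apply(w, scale, zero_point)
w_q.sum().backward()   # w.grad is populated via the straight-through estimator
```

In the paper, weight ranges are taken from the current min/max of each tensor, while activation ranges are smoothed with an exponential moving average collected during training.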
Experimental Results
The quantization method was tested rigorously on several benchmarks:
- ImageNet Classification: Quantized MobileNets were benchmarked against their floating-point counterparts on Qualcomm Snapdragon CPU cores. Results showed a clearly improved latency-accuracy tradeoff, with especially strong gains on the Snapdragon 835 LITTLE cores.
- COCO Detection: Quantized MobileNet SSD models were employed for real-time object detection, showing substantial latency reductions (up to 50%) with minimal drops in accuracy.
- Face Detection and Attribute Classification: Quantized models yielded near real-time performance on mobile devices, demonstrating the potential for practical deployment in real-time applications.
Implications and Future Work
The experimental success of quantized MobileNets underscores the practical implications of this research in real-world applications. The reduction in computational cost and memory footprint facilitates the deployment of deep learning models on edge devices, extending the scope of their applications in mobile and embedded systems. The approach also opens avenues for future research in further optimizing the quantization process, exploring mixed-precision arithmetic, and integrating similar techniques in more sophisticated model architectures and tasks beyond computer vision.
In conclusion, the proposed quantization scheme and the novel training approach present a significant step forward in improving the efficiency of neural network inference on resource-constrained platforms. The detailed experimental analysis offers strong evidence of the method’s effectiveness, which could inspire further optimization and adoption in various deployment scenarios.