
Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation (2004.09602v1)

Published 20 Apr 2020 in cs.LG and stat.ML

Abstract: Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of quantization parameters and evaluate their choices on a wide range of neural network models for different application domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration by processors with high-throughput integer math pipelines. We also present a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.

An Evaluation of Integer Quantization in Deep Learning Inference

Integer quantization has emerged as a pivotal mechanism for optimizing deep learning inference, enabling the deployment of neural networks at reduced precision while improving computational efficiency. This paper undertakes a comprehensive study of integer quantization, focusing primarily on 8-bit quantization, across a diverse set of deep learning models covering image classification, object detection, segmentation, translation, speech recognition, and language modeling.

Mathematical Foundations and Practical Implications

The paper meticulously outlines the mathematical underpinnings of integer quantization processes, covering both affine and scale quantization methods. Affine quantization includes a zero-point offset, enabling an asymmetric mapping between floating-point numbers and integers. In contrast, scale quantization is a symmetric approach, which more directly facilitates the use of integer-only pipelines in matrix multiplications and convolutions.
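To make the distinction concrete, below is a minimal NumPy sketch of the two mappings for int8. The function names and the restriction of the symmetric case to [-127, 127] are illustrative choices consistent with the paper's description, not its reference code.

```python
import numpy as np

def affine_quantize(x, x_min, x_max, num_bits=8):
    """Asymmetric (affine) quantization: real range [x_min, x_max] -> int8."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def scale_quantize(x, amax, num_bits=8):
    """Symmetric (scale) quantization: [-amax, amax] -> [-127, 127], no zero-point."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale, zero_point=0):
    """Recover an approximation of the original real values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(16).astype(np.float32)
q, s = scale_quantize(x, amax=float(np.abs(x).max()))
print(np.abs(x - dequantize(q, s)).max())  # error stays within scale / 2
```

Because scale quantization has no zero-point, integer matrix multiplies need no cross-term corrections, which is why it maps more directly onto integer math pipelines.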

An essential aspect of the paper is its empirical evaluation across a broad range of neural network architectures, applying quantization techniques to understand the trade-offs between accuracy and computational performance. The authors argue in favor of scale quantization with a symmetric integer representation for weights, calibrated with the maximum absolute value (max calibration). With per-channel granularity, this approach is found to maintain accuracy across the networks studied.
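A rough sketch of per-channel weight quantization with max calibration, in the spirit of that recommendation (helper and variable names here are hypothetical):

```python
import numpy as np

def quantize_weights_per_channel(weight, num_bits=8):
    """Symmetric int8 quantization with one scale per output channel (axis 0)."""
    qmax = 2 ** (num_bits - 1) - 1
    # Max calibration: use the largest absolute weight in each output channel.
    amax = np.abs(weight.reshape(weight.shape[0], -1)).max(axis=1)
    scales = amax / qmax
    # Broadcast each channel's scale over the remaining dimensions.
    s = scales.reshape((-1,) + (1,) * (weight.ndim - 1))
    q = np.clip(np.round(weight / s), -qmax, qmax).astype(np.int8)
    return q, scales

w = np.random.randn(64, 3, 3, 3).astype(np.float32)  # e.g. a conv kernel (K, C, R, S)
qw, scales = quantize_weights_per_channel(w)
```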

Post Training Quantization and Activation Calibration

The paper evaluates post-training quantization (PTQ), reviewing several activation calibration techniques, including max, entropy, and various percentile-based calibrations. The analysis reveals that while PTQ preserves accuracy well for many models, more difficult networks such as MobileNets, EfficientNets, Transformers, and BERT suffer accuracy drops that can exceed 1%.

The choice of activation calibration method is underscored as crucial for achieving competitive accuracy. Entropy calibration and high-percentile calibrations such as 99.99% or 99.999% are recommended where max calibration falls short, although no single method is best across all network architectures.
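As an illustration, percentile calibration chooses the activation clipping threshold from the distribution of absolute values observed on calibration data rather than from the raw maximum; the entropy method follows the same collect-then-choose pattern but minimizes the information loss between the original and quantized distributions. A hedged sketch of the percentile variant:

```python
import numpy as np

def percentile_amax(activation_samples, percentile=99.99):
    """Pick the clipping threshold as a high percentile of |activations|."""
    return np.percentile(np.abs(activation_samples), percentile)

# Collect activations by running calibration batches through the FP32 model,
# then derive the int8 activation scale from the chosen threshold.
acts = np.random.randn(100_000).astype(np.float32)
amax = percentile_amax(acts, percentile=99.99)
scale = amax / 127.0
```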

Enhancements through Quantization-Aware Training

The paper turns to quantization-aware training (QAT) to remedy the accuracy loss seen with PTQ. By simulating quantization effects during training, QAT allows networks to adapt to low precision, often recovering, and in some cases slightly exceeding, the original floating-point accuracy. The straight-through estimator (STE) addresses the challenge of computing gradients through the discrete quantization operation by treating it as the identity during backpropagation.
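A minimal sketch of fake quantization with an STE backward pass, written here in PyTorch (class and variable names are assumptions for illustration, not the paper's code):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize in the forward pass; straight-through gradient in backward."""

    @staticmethod
    def forward(ctx, x, scale):
        qmax = 127  # symmetric int8 range
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return q * scale  # simulated int8 value, still stored as float

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat the round/clamp as identity and pass the gradient through.
        return grad_output, None

x = torch.randn(4, requires_grad=True)
y = FakeQuantSTE.apply(x, torch.tensor(0.05))
y.sum().backward()
print(x.grad)  # all ones: the quantization step is transparent to the gradient
```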

Quantization-aware training improves accuracy across most networks studied. Notably, PACT, which learns activation ranges during QAT, offers potential benefits, though its advantages over fixed calibration are small when activation ranges are already carefully initialized.
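For reference, the PACT idea can be sketched as a learnable clipping parameter applied before quantization; the module below uses the standard differentiable rewrite of clamp(x, 0, alpha) so gradients reach alpha, and is an illustrative reconstruction rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class PACTClip(nn.Module):
    """Clip activations to [0, alpha], where alpha is learned during QAT."""

    def __init__(self, init_alpha=6.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(float(init_alpha)))

    def forward(self, x):
        # Equivalent to clamp(x, 0, alpha) but differentiable w.r.t. alpha:
        # the gradient w.r.t. alpha is 1 where x > alpha and 0 elsewhere.
        return 0.5 * (x.abs() - (x - self.alpha).abs() + self.alpha)
```

The learned alpha then serves as the activation quantization range, playing the role that a fixed calibration threshold plays in PTQ.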

Recommendations for Integer Quantization Workflow

The authors propose a detailed workflow for implementing 8-bit integer quantization in neural network inference, emphasizing scale quantization of weights with per-channel granularity, careful calibration of activations, and quantization-aware training where post-training quantization falls short. The workflow balances ease of implementation, computational efficiency, and model accuracy, as demonstrated through empirical results across a wide array of deep learning models.

Conclusion and Future Directions

Through this rigorous evaluation, the paper provides actionable insights into quantization strategies for deploying efficient deep learning models in resource-constrained environments. The findings point to avenues for future investigation, particularly in lower-bit quantization and more sophisticated quantization-aware training techniques. As models grow in complexity, quantization methods must evolve to deliver performance gains without sacrificing accuracy.

Authors (5)
  1. Hao Wu (623 papers)
  2. Patrick Judd (9 papers)
  3. Xiaojie Zhang (14 papers)
  4. Mikhail Isaev (36 papers)
  5. Paulius Micikevicius (9 papers)
Citations (285)