HAWQV3: Dyadic Neural Network Quantization (2011.10680v3)

Published 20 Nov 2020 in cs.CV

Abstract: Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values. This hidden cost limits the latency improvement realized by quantizing Neural Networks. To address this, we present HAWQV3, a novel mixed-precision integer-only quantization framework. The contributions of HAWQV3 are the following: (i) An integer-only inference where the entire computational graph is performed only with integer multiplication, addition, and bit shifting, without any floating point operations or even integer division; (ii) A novel hardware-aware mixed-precision quantization method where the bit-precision is calculated by solving an integer linear programming problem that balances the trade-off between model perturbation and other constraints, e.g., memory footprint and latency; (iii) Direct hardware deployment and open source contribution for 4-bit uniform/mixed-precision quantization in TVM, achieving an average speed up of $1.45\times$ for uniform 4-bit, as compared to uniform 8-bit for ResNet50 on T4 GPUs; and (iv) extensive evaluation of the proposed methods on ResNet18/50 and InceptionV3, for various model compression levels with/without mixed precision. For ResNet50, our INT8 quantization achieves an accuracy of $77.58\%$, which is $2.68\%$ higher than prior integer-only work, and our mixed-precision INT4/8 quantization can reduce INT8 latency by $23\%$ and still achieve $76.73\%$ accuracy. Our framework and the TVM implementation have been open sourced.

An Expert Analysis of "HAWQ-V3: Dyadic Neural Network Quantization"

The paper "HAWQ-V3: Dyadic Neural Network Quantization" introduces an innovative approach to low-precision quantization of neural networks (NNs) by implementing a mixed-precision, integer-only quantization framework. The framework, termed HAWQ-V3, addresses the hidden computational costs associated with floating point operations in existing quantization methods by eliminating floating point operations entirely from the inference process. This methodology is significant given the increasing deployment of deep learning models in resource-constrained environments, such as edge devices and low-power hardware.

Key Contributions

The primary contributions of the HAWQ-V3 framework are fourfold:

  1. Integer-Only Inference: HAWQ-V3 establishes an integer-only inference paradigm in which the entire computational graph is executed using only integer multiplication, addition, and bit shifting, with no floating point operations and no integer division. This extends to components traditionally reliant on floating point arithmetic, such as batch normalization (BN) layers and residual connections, ensuring compatibility with integer-only hardware; the dyadic requantization trick behind this is sketched in the first code example after this list.
  2. Hardware-Aware Mixed-Precision Quantization: The framework determines layer-specific bit precision by solving an integer linear programming (ILP) problem that balances model perturbation against hardware constraints such as memory footprint and latency, leading to more efficient deployments. A minimal sketch of such an ILP appears after this list.
  3. Direct Hardware Deployment and Open Sourcing: HAWQ-V3 includes direct deployment on hardware, demonstrated on NVIDIA's T4 GPU using Apache TVM. This yields a practical average speedup of 1.45× for uniform 4-bit over uniform 8-bit quantization on ResNet50, indicating potential for real-world applications.
  4. Extensive Evaluation: The paper reports comprehensive evaluations on popular architectures such as ResNet18/50 and InceptionV3. Notably, INT8 quantization achieves 77.58% accuracy on ResNet50, surpassing prior integer-only work by 2.68%, while mixed-precision INT4/8 quantization reduces INT8 latency by 23% and still achieves 76.73% accuracy.
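
The core device behind integer-only inference is dyadic requantization: the real-valued rescaling factor S_w * S_x / S_out between the INT32 accumulator and the quantized output is approximated by a dyadic number b / 2^c (with b and c integers), so rescaling reduces to one integer multiply and one bit shift. The Python sketch below illustrates the idea; the choice of c = 16 and the rounding convention are illustrative assumptions, not the paper's exact implementation.

```python
def dyadic_approx(real_scale: float, c: int = 16):
    """Approximate a real rescaling factor (e.g. S_w * S_x / S_out) by a
    dyadic number b / 2**c, so no floating point or division is needed
    at inference time. c = 16 is an arbitrary choice for illustration."""
    b = round(real_scale * (1 << c))
    return b, c

def int_only_requantize(acc: int, b: int, c: int) -> int:
    """Rescale an integer accumulator using only integer multiply, add,
    and right shift: computes round(acc * b / 2**c) without division."""
    return (acc * b + (1 << (c - 1))) >> c

# Example: rescale the accumulator value 12345 by a real factor 0.0473
b, c = dyadic_approx(0.0473)
print(int_only_requantize(12345, b, c))  # 584, i.e. ~round(12345 * 0.0473)
```

Because b and c are fixed offline, the inference-time path contains no floating point at all, which is what permits deployment on integer-only hardware.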

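To make the ILP of contribution (2) concrete, here is a minimal sketch using the PuLP solver: each layer selects one precision from {4, 8} bits to minimize total model perturbation subject to size and latency budgets. All per-layer numbers are invented placeholders; in HAWQ-V3 the perturbations come from second-order (Hessian) sensitivity analysis and the latencies from direct hardware measurement.

```python
import pulp

bit_choices = [4, 8]
layers = range(3)
# Hypothetical per-layer statistics, indexed [layer][bit choice]:
perturbation = [[0.9, 0.1], [0.5, 0.05], [1.2, 0.2]]  # sensitivity Omega_i
size_mb      = [[0.5, 1.0], [1.0, 2.0],  [0.8, 1.6]]  # parameter size
latency_ms   = [[1.0, 1.4], [2.0, 2.9],  [1.5, 2.1]]  # measured kernel latency

prob = pulp.LpProblem("mixed_precision", pulp.LpMinimize)
# x[i][j] == 1 iff layer i is quantized to bit_choices[j] bits
x = [[pulp.LpVariable(f"x_{i}_{j}", cat="Binary") for j in range(2)]
     for i in layers]

# Objective: minimize total model perturbation
prob += pulp.lpSum(perturbation[i][j] * x[i][j] for i in layers for j in range(2))
# Exactly one precision per layer
for i in layers:
    prob += pulp.lpSum(x[i]) == 1
# Hardware constraints: total model size and end-to-end latency budgets
prob += pulp.lpSum(size_mb[i][j] * x[i][j] for i in layers for j in range(2)) <= 4.0
prob += pulp.lpSum(latency_ms[i][j] * x[i][j] for i in layers for j in range(2)) <= 5.5

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([bit_choices[j] for i in layers for j in range(2) if x[i][j].value() == 1])
# -> [8, 4, 8]: the layer with the smallest 4-bit perturbation is pushed
#    to 4 bits so that both budgets are met at minimal accuracy cost
```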
Implications and Future Developments

The implications of this research are multifaceted. Practically, the integer-only approach greatly enhances the feasibility of deploying deep learning models on devices with limited computational resources, potentially benefiting industries ranging from healthcare to autonomous vehicles. Theoretically, the proposed hardware-aware bit precision computation via ILP may stimulate further research into optimization frameworks that align model architectures with hardware capabilities, possibly within an automated neural architecture search context.

Future developments in AI are likely to explore combining similar quantization approaches with other model compression techniques, such as pruning and knowledge distillation, to further optimize deployment on a variety of hardware platforms. Additionally, extending the framework to more diverse NN architectures, such as transformer-based models, could further propel this line of research.

In conclusion, HAWQ-V3 represents a substantive step forward in neural network quantization, offering a sophisticated yet practical approach to achieving efficient model deployment. The open sourcing of both the framework and its hardware implementation underscores its potential for adoption and adaptation within the broader research community. As deep learning continues to advance, such frameworks will become increasingly critical for bridging the gap between state-of-the-art model performance and practical, real-world applications.

Authors (11)
  1. Zhewei Yao (64 papers)
  2. Zhen Dong (87 papers)
  3. Zhangcheng Zheng (1 paper)
  4. Amir Gholami (60 papers)
  5. Jiali Yu (5 papers)
  6. Eric Tan (2 papers)
  7. Leyuan Wang (15 papers)
  8. Qijing Huang (14 papers)
  9. Yida Wang (62 papers)
  10. Michael W. Mahoney (233 papers)
  11. Kurt Keutzer (199 papers)
Citations (82)