HAQ: Hardware-Aware Automated Quantization with Mixed Precision (1811.08886v3)

Published 21 Nov 2018 in cs.CV

Abstract: Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a great challenge to find the optimal bitwidth for each layer: it requires domain experts to explore the vast design space trading off among accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. Conventional quantization algorithm ignores the different hardware architectures and quantizes all the layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework which leverages the reinforcement learning to automatically determine the quantization policy, and we take the hardware accelerator's feedback in the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with negligible loss of accuracy compared with the fixed bitwidth (8 bits) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy and model size) are drastically different. We interpreted the implication of different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.

Hardware-Aware Automated Quantization with Mixed Precision

The paper "HAQ: Hardware-Aware Automated Quantization with Mixed Precision," co-authored by Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han from the Massachusetts Institute of Technology, introduces a novel framework for the automated quantization of deep neural networks (DNNs). This framework, referred to as HAQ, utilizes reinforcement learning (RL) to determine optimal quantization policies that cater specifically to different hardware architectures, reducing both latency and energy consumption without significant loss of accuracy.

Overview

Quantization is widely recognized for its efficacy in compressing and accelerating DNN inference. Traditional quantization methods apply a uniform precision across all DNN layers, which often leads to sub-optimal performance because layers differ in their computational and memory-access characteristics. Meanwhile, modern hardware accelerators increasingly support mixed-precision (1-8 bit) arithmetic, presenting an opportunity to optimize the bitwidth of each layer individually. However, the vast design space and the need for domain expertise make this process challenging. The HAQ framework addresses these challenges by employing RL to automate the quantization process, incorporating direct feedback from hardware simulators to optimize latency and energy.
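
Per-layer mixed precision builds on ordinary linear (uniform) quantization. The minimal PyTorch sketch below illustrates k-bit linear quantization of a weight tensor; the clipping threshold `clip` is taken as a given parameter here, whereas HAQ selects it per layer (the paper describes choosing it by matching the original and quantized weight distributions). Function and variable names are illustrative, not taken from the paper's code.

```python
import torch

def linear_quantize(w: torch.Tensor, bits: int, clip: float) -> torch.Tensor:
    """Sketch of k-bit linear (uniform) quantization of a tensor within [-clip, clip]."""
    scale = clip / (2 ** (bits - 1) - 1)           # step size of the signed k-bit grid
    w_clamped = torch.clamp(w, -clip, clip)        # truncate outliers to the clip range
    return torch.round(w_clamped / scale) * scale  # snap values to the quantization grid

# Example: quantize a layer's weights to 4 bits with a clip threshold of 1.0
w = torch.randn(256, 128)
w_q = linear_quantize(w, bits=4, clip=1.0)
```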

Methodology

The HAQ framework models the quantization problem as an RL task using an actor-critic model with a Deep Deterministic Policy Gradient (DDPG) agent. The RL agent processes each layer of the neural network sequentially, determining the bitwidth for both weights and activations based on a ten-dimensional feature vector capturing the layer's characteristics.
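
As an illustration of this setup, the sketch below shows one plausible mapping from the agent's continuous action to a discrete bitwidth, together with an example of the kind of per-layer observation vector described above. The bitwidth range and the concrete feature names are assumptions for illustration rather than the paper's exact implementation.

```python
import numpy as np

B_MIN, B_MAX = 2, 8   # assumed bitwidth search range for each layer

def action_to_bitwidth(a: float) -> int:
    """Map the actor's continuous action a in [0, 1] to a discrete bitwidth."""
    return int(np.round(B_MIN - 0.5 + a * (B_MAX - B_MIN + 1)))

def layer_observation(k: int, layer: dict, prev_action: float) -> np.ndarray:
    """Ten-dimensional feature vector for layer k (field names are illustrative)."""
    return np.array([
        k,                       # layer index
        layer["c_in"],           # input channels
        layer["c_out"],          # output channels
        layer["kernel"],         # kernel size
        layer["stride"],         # stride
        layer["feat_size"],      # input feature-map size
        layer["n_params"],       # number of parameters
        layer["is_depthwise"],   # depthwise-convolution indicator
        layer["is_weight"],      # quantizing weights (1) or activations (0)
        prev_action,             # action taken for the previous layer
    ], dtype=np.float32)

# The agent visits layers sequentially: each observation feeds the actor, the
# resulting action becomes that layer's bitwidth, and the action is also fed
# into the next layer's observation so earlier decisions condition later ones.
```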

To tailor quantization policies to different hardware, HAQ incorporates direct feedback from hardware simulators, allowing it to optimize for measured latency and energy rather than relying on proxy signals such as FLOPs or model size. This feedback loop is critical because it captures hardware-specific effects, such as cache locality and memory bandwidth, that proxy metrics cannot reflect.
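
The sketch below shows one way such direct feedback can be used to enforce a resource budget. It assumes a `simulate_cost` callable standing in for the hardware simulator, and mirrors the paper's described strategy of sequentially decreasing layer bitwidths until the constraint (latency, energy, or model size) is satisfied; the details are a simplification, not the authors' exact procedure.

```python
def enforce_budget(bitwidths, budget, simulate_cost, min_bits=2):
    """Reduce layer bitwidths until the simulated cost fits the budget.

    `simulate_cost(bits)` stands in for the hardware simulator's direct
    latency / energy / model-size feedback.
    """
    bits = list(bitwidths)
    while simulate_cost(bits) > budget:
        reduced = False
        for i in range(len(bits)):        # walk layers in order
            if bits[i] > min_bits:
                bits[i] -= 1              # take one bit away from this layer
                reduced = True
                if simulate_cost(bits) <= budget:
                    return bits
        if not reduced:
            break                         # all layers at min_bits: budget infeasible
    return bits

# Toy usage: cost modeled as total weight bits (a real run queries BISMO/BitFusion).
layer_params = [1000, 5000, 2000]
cost = lambda bits: sum(b * p for b, p in zip(bits, layer_params))
print(enforce_budget([8, 8, 8], budget=40000, simulate_cost=cost))

# After the constraint is enforced, the agent's reward depends only on accuracy,
# e.g. reward = lam * (acc_quantized - acc_original), so the search concentrates
# on preserving accuracy within the hardware budget.
```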

Results

The effectiveness of HAQ has been established through extensive experiments performed on MobileNets (V1 and V2) and ResNet-50 using the ImageNet dataset. HAQ was evaluated under various constraints including latency, energy, and model size. Key findings include:

  • Latency-Constrained Quantization: HAQ reduced latency by a factor of 1.4 to 1.95× compared to fixed 8-bit quantization, without significant accuracy loss. The evaluated accelerators, the Bit-Serial Matrix Multiplication Overlay (BISMO) in both edge and cloud configurations and the BitFusion architecture, benefited differently from mixed precision; notably, HAQ produced distinct bitwidth allocations tailored to the operational characteristics of edge versus cloud hardware.
  • Energy-Constrained Quantization: The framework reduced energy consumption by 1.9× while maintaining accuracy, outperforming fixed-bitwidth methods such as PACT.
  • Model Size-Constrained Quantization: HAQ demonstrated superior performance compared to rule-based methods like Deep Compression, particularly under high compression ratios. For instance, with similar model sizes, HAQ preserved considerably higher accuracy by adjusting bitwidths dynamically across layers.

Implications and Future Directions

The implications of HAQ are substantial for both theoretical and practical developments in AI. The ability to automate quantization with hardware-aware optimization enables more efficient DNN deployment across a range of devices, from edge to cloud. The framework’s flexibility and adaptability suggest potential extensions, including:

  • Scaling to more complex models and diverse hardware architectures.
  • Integration with other optimization techniques like pruning and neural architecture search (NAS) to further streamline DNN deployment.
  • Refining RL-based approaches for even more granular hardware feedback, potentially leveraging emerging AI accelerators.

Conclusion

The HAQ framework represents a significant step forward in the automation of DNN quantization, effectively bridging the gap between model efficiency and hardware performance. By embedding hardware feedback into the RL-driven quantization process, HAQ provides specialized, optimized bitwidth policies that enhance both inference speed and energy efficiency across various hardware platforms. The nuanced insights derived from HAQ’s quantization policies offer valuable guidance for future co-design of neural network architectures and hardware accelerators.
