HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
The paper "HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision" presents a novel approach to mixed-precision quantization which improves the efficiency of neural network deployment without substantially sacrificing accuracy. Authored by researchers from the University of California, Berkeley, the paper investigates the computational and storage benefits of employing quantization in deep learning models, particularly focusing on leveraging the Hessian matrix for more informed precision allocation across neural layers.
Methodology
The core contribution of this paper is the HAWQ algorithm, which introduces a Hessian-based method for determining each layer's sensitivity to quantization. Because forming the full Hessian is intractable for modern networks, the authors instead estimate the top Hessian eigenvalue of each block using matrix-free techniques: blocks whose loss surface has large curvature are more sensitive to perturbations of their weights, while blocks with small eigenvalues can tolerate reduced precision without significant impact on overall model performance. This sensitivity-guided approach enables an adaptive quantization scheme in which the most critical layers retain higher precision and less critical layers are compressed more aggressively.
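To make this concrete, the sketch below shows how a block's dominant Hessian eigenvalue can be estimated without ever materializing the Hessian, via power iteration built on Hessian-vector products (Pearlmutter's trick). This is a minimal PyTorch illustration of the general technique, not the authors' code; the function name, iteration count, and tolerance are our own choices.

```python
import torch

def top_hessian_eigenvalue(loss, params, n_iter=20, tol=1e-4):
    """Estimate the dominant Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration, using Hessian-vector products instead of the
    explicit Hessian (which would be infeasible to materialize)."""
    # First-order gradients; create_graph=True lets us differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit starting vector, stored as one tensor per parameter.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]
    eigenvalue = None
    for _ in range(n_iter):
        # Hessian-vector product: Hv = d(g . v)/dw (Pearlmutter's trick).
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v approximates the top eigenvalue.
        new_eig = sum((h * u).sum() for h, u in zip(hv, v)).item()
        # Renormalize Hv to get the next iterate.
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
        if eigenvalue is not None and abs(new_eig - eigenvalue) <= tol * (abs(eigenvalue) + 1e-12):
            eigenvalue = new_eig
            break
        eigenvalue = new_eig
    return eigenvalue
```

In practice one would call this once per layer or block on a mini-batch loss, e.g. `top_hessian_eigenvalue(criterion(model(x), y), list(block.parameters()))`, and use the resulting eigenvalues as the sensitivity scores discussed above.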
The authors employ mixed-precision quantization, a strategy that assigns different bit-widths to different parts of the model. The assignment is informed by the Hessian analysis and aims to optimize the trade-off between model size and inference accuracy: the implementation pairs a layer-wise sensitivity analysis with a subsequent allocation of bit-widths chosen to minimize the loss in model accuracy.
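The paper's actual assignment is guided by the relative sensitivity of each block (roughly, its Hessian eigenvalue weighed against its parameter count). The toy greedy allocator below is a hypothetical illustration of that idea, not the paper's procedure; the layer names, sensitivity values, and size budget are all invented for the example.

```python
def assign_bitwidths(layers, budget_bits, palette=(2, 4, 8)):
    """Toy greedy allocator (illustrative, not the paper's exact rule):
    start every layer at the lowest precision, then upgrade layers in
    order of Hessian sensitivity per parameter while the budget allows."""
    ranked = sorted(layers, key=lambda l: l["eig"] / l["n_params"], reverse=True)
    bits = {l["name"]: palette[0] for l in layers}
    used = sum(l["n_params"] * palette[0] for l in layers)
    for layer in ranked:                 # most sensitive layers first
        for width in palette[1:]:        # try upgrading: 2 -> 4 -> 8 bits
            extra = layer["n_params"] * (width - bits[layer["name"]])
            if used + extra > budget_bits:
                break
            bits[layer["name"]] = width
            used += extra
    return bits

# Invented three-layer example: the most curvature-sensitive layer
# (highest eig / n_params) ends up with the highest precision.
layers = [
    {"name": "conv1", "n_params": 1_000_000, "eig": 50.0},
    {"name": "conv2", "n_params": 4_000_000, "eig": 10.0},
    {"name": "fc",    "n_params": 2_000_000, "eig": 0.5},
]
print(assign_bitwidths(layers, budget_bits=30_000_000))
# -> {'conv1': 8, 'conv2': 4, 'fc': 2}
```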
Results
The paper reports strong empirical results demonstrating the effectiveness of the HAWQ quantization approach. It is shown to consistently outperform fixed-precision quantization baselines across several benchmark networks and datasets, including ResNet20 on CIFAR-10 and Inception-V3 and SqueezeNext on ImageNet. The results indicate that HAWQ achieves accuracy comparable to full-precision models while significantly reducing the memory footprint and computational cost; some configurations reduce model size by well over 50% with minimal impact on accuracy.
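As a back-of-the-envelope illustration of where such savings come from (the layer sizes and bit-widths below are invented, not the paper's): replacing 32-bit weights with a mixed 8/4/2-bit assignment shrinks weight storage multiplicatively.

```python
# Invented parameter counts and bit-widths, for intuition only:
# (n_params, bits) per layer under a hypothetical mixed assignment.
layers = {"conv1": (1_000_000, 8), "conv2": (4_000_000, 4), "fc": (2_000_000, 2)}
fp32_bits  = sum(n * 32 for n, _ in layers.values())   # 224M bits
mixed_bits = sum(n * b for n, b in layers.values())    #  28M bits
print(f"compression: {fp32_bits / mixed_bits:.1f}x")   # -> 8.0x
```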
Additionally, the paper includes ablation experiments assessing the impact of the individual components of the HAWQ algorithm, further supporting the robustness of the approach. The findings illustrate how the Hessian-guided strategy effectively prioritizes precision where it is most needed, underscoring the technique's potential to improve neural network optimization pipelines.
Implications and Future Work
The proposed HAWQ approach has significant implications for both the practical deployment and the theoretical understanding of quantization in deep learning. Practically, the method provides a pathway to deploying high-performance models in resource-constrained environments, extending the reach of state-of-the-art networks to mobile and edge devices. Theoretically, the paper contributes to the growing body of work on efficient neural representations, suggesting more nuanced ways to exploit network structure for improved efficiency.
Future work may integrate HAWQ with hardware-specific optimizations, such as FPGA or ASIC implementations that could amplify the benefits of mixed-precision quantization. Hessian-aware strategies could also be examined in other areas of neural network optimization, such as pruning or neural architecture search (NAS), potentially setting a precedent for more adaptive, data-driven optimization frameworks in machine learning.