Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip (2405.00645v2)

Published 1 May 2024 in cs.LG and physics.ins-det

Abstract: Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, a straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision than others without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent. This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations with an arbitrary number of bits, such as FPGAs and ASICs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
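To make the core idea concrete, below is a minimal sketch in TensorFlow/Keras of how per-weight precision can be made optimizable through gradient descent: each weight gets its own trainable fractional bit width, a straight-through rounding estimator lets gradients reach both the weights and the bit widths, and an added loss term penalizes the total number of bits so training trades accuracy against hardware cost. This is not the authors' HGQ implementation; the layer name HGQDense, the softplus parameterisation, the beta penalty, and the LSQ-style surrogate gradient are illustrative assumptions.

```python
# Minimal, illustrative sketch of gradient-trainable per-weight bit widths.
# NOT the authors' HGQ implementation: HGQDense, the softplus parameterisation,
# the `beta` resource penalty, and the straight-through surrogate are assumptions.
import tensorflow as tf


def round_ste(x):
    # Round in the forward pass; behave as identity in the backward pass.
    return x + tf.stop_gradient(tf.round(x) - x)


class HGQDense(tf.keras.layers.Layer):
    """Dense layer whose per-weight fractional bit widths are optimised by gradient descent."""

    def __init__(self, units, beta=1e-4, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.beta = beta  # weight of the bit-width (resource) penalty in the loss

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(in_dim, self.units),
                                 initializer="glorot_uniform", trainable=True)
        # One trainable fractional bit width per weight, initialised near 8 bits.
        self.raw_bits = self.add_weight(name="raw_bits", shape=(in_dim, self.units),
                                        initializer=tf.keras.initializers.Constant(8.0),
                                        trainable=True)

    def call(self, x):
        bits = tf.nn.softplus(self.raw_bits)   # keep the effective bit width positive
        step = tf.pow(2.0, -bits)              # per-weight quantization step 2^(-bits)
        # Quantize weights; the straight-through round lets gradients reach both
        # the weights and (via `step`) the bit widths.
        w_q = round_ste(self.w / step) * step
        # Penalise total bits so training trades accuracy against hardware cost.
        self.add_loss(self.beta * tf.reduce_sum(bits))
        return tf.matmul(x, w_q)


if __name__ == "__main__":
    # Toy regression run showing that both weights and bit widths receive gradients.
    model = tf.keras.Sequential([tf.keras.Input(shape=(16,)), HGQDense(4)])
    model.compile(optimizer="adam", loss="mse")
    model.fit(tf.random.normal((256, 16)), tf.random.normal((256, 4)),
              epochs=2, verbose=0)
```

The full method also learns per-activation precision and targets deployment on hardware with arbitrary-width arithmetic such as FPGAs and ASICs; the sketch above covers only the weight side for brevity.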

Authors (5)
  1. Chang Sun
  2. Thea K. Årrestad
  3. Vladimir Loncar
  4. Jennifer Ngadiuba
  5. Maria Spiropulu
