F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization (2202.05239v1)

Published 10 Feb 2022 in cs.CV, cs.AI, cs.AR, cs.LG, and cs.NE

Abstract: Neural network quantization is a promising compression technique that reduces memory footprint and energy consumption, potentially enabling real-time inference. However, a performance gap remains between quantized and full-precision models. To narrow it, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization, which introduces a noticeable cost in memory, speed, and energy. To tackle these issues, we present F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication. To derive our method, we first discuss the advantages of fixed-point multiplication with different fixed-point number formats and study the statistical behavior of the associated fixed-point values. Second, based on this statistical and algorithmic analysis, we apply different fixed-point formats to the weights and activations of different layers, and introduce a novel algorithm that automatically determines the right format for each layer during training. Third, we analyze a previous quantization algorithm, parameterized clipping activation (PACT), and reformulate it using fixed-point arithmetic. Finally, we unify a recently proposed quantization fine-tuning method with our fixed-point approach to show the potential of our method. We verify F8Net on ImageNet for MobileNet V1/V2 and ResNet18/50. Our approach achieves performance comparable to or better than existing quantization techniques that use INT32 multiplication or floating-point arithmetic, as well as the full-precision counterparts, achieving state-of-the-art results.
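
To make the core arithmetic concrete, here is a minimal NumPy sketch (not the paper's implementation): a value x is stored as a signed 8-bit integer q with a per-layer fractional length f, so x ≈ q / 2^f, and requantization after a multiply is a rounding bit-shift rather than an INT32 or floating-point rescaling multiplication. The function names and the specific fractional lengths below are illustrative assumptions; F8Net determines the fractional lengths automatically during training.

```python
import numpy as np

def quantize_fixed_point(x, frac_len, bits=8):
    # Signed fixed-point: x ~= q / 2**frac_len, with q an 8-bit integer.
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = np.clip(np.round(x * (1 << frac_len)), qmin, qmax)
    return q.astype(np.int32)  # int32 container holding 8-bit values

def fixed_point_matmul(q_act, f_act, q_wgt, f_wgt, f_out, bits=8):
    # 8-bit multiplies accumulate into a wider integer; the raw product
    # carries f_act + f_wgt fractional bits.
    acc = q_act @ q_wgt
    shift = f_act + f_wgt - f_out
    # Requantize with a rounding bit-shift instead of an INT32 or
    # floating-point rescaling multiply.
    if shift > 0:
        acc = (acc + (1 << (shift - 1))) >> shift
    else:
        acc = acc << (-shift)
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(acc, qmin, qmax).astype(np.int32)

# Toy usage with hand-picked fractional lengths (assumed values).
rng = np.random.default_rng(0)
x = rng.normal(scale=0.5, size=(2, 4))
w = rng.normal(scale=0.3, size=(4, 3))
f_a, f_w, f_o = 5, 6, 5
q_y = fixed_point_matmul(quantize_fixed_point(x, f_a), f_a,
                         quantize_fixed_point(w, f_w), f_w, f_o)
y_approx = q_y / 2.0 ** f_o  # dequantize only to check accuracy
```

The final np.clip also plays the role of the clipping in the paper's fixed-point reformulation of PACT: the clipping level is absorbed into the choice of fractional length, so inference needs no separate floating-point clipping threshold.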

Authors (9)
  1. Qing Jin
  2. Jian Ren
  3. Richard Zhuang
  4. Sumant Hanumante
  5. Zhengang Li
  6. Zhiyu Chen
  7. Yanzhi Wang
  8. Kaiyuan Yang
  9. Sergey Tulyakov
Citations (41)