
Exploring the Potential of Flexible 8-bit Format: Design and Algorithm (2310.13513v2)

Published 20 Oct 2023 in cs.PF

Abstract: Neural network quantization is widely used to reduce model inference complexity in real-world deployments. However, traditional integer quantization suffers from accuracy degradation when adapting to various dynamic ranges. Recent research has focused on a new 8-bit format, FP8, with hardware support for both training and inference of neural networks, but guidance for hardware design is lacking. In this paper, we analyze the benefits of FP8 quantization and provide a comprehensive comparison of FP8 with INT quantization. We then propose a flexible mixed-precision quantization framework that supports various number systems, enabling selection of the most appropriate quantization format for different neural network architectures. Experimental results demonstrate that our proposed framework achieves competitive performance compared to full precision on various tasks, including image classification, object detection, segmentation, and natural language understanding. Our work furnishes critical insights into the tangible benefits and feasibility of employing FP8 quantization, paving the way for heightened neural network efficiency in real-world scenarios. Our code is available in the supplementary material.
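The key contrast the abstract draws is between integer quantization's fixed dynamic range and FP8's exponent-driven range. As a rough illustration of the FP8 side, the sketch below simulates rounding a value to an E4M3-like format (sign, 4 exponent bits, 3 mantissa bits, saturating at the maximum magnitude). This is an assumed, simplified simulation for intuition only, not the paper's framework or code:

```python
import math

def quantize_fp8_e4m3(x: float) -> float:
    """Round x to the nearest value representable in a simplified
    E4M3-like FP8 format. Illustrative sketch, not the paper's code."""
    man_bits = 3          # mantissa bits
    bias = 7              # exponent bias for E4M3
    max_val = 448.0       # largest E4M3 magnitude (one code reserved for NaN)

    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = abs(x)
    if mag > max_val:
        return sign * max_val  # saturate instead of overflowing

    # Exponent of the leading bit, clamped so tiny values fall
    # into the subnormal region (smallest step 2**-9).
    e = max(math.floor(math.log2(mag)), 1 - bias)

    # Round the mantissa to man_bits fractional bits at this exponent.
    step = 2.0 ** (e - man_bits)
    return sign * round(mag / step) * step
```

For example, 0.3 lands on 0.3125 (binary 1.010 × 2⁻²), and anything above 448 saturates to 448, whereas an INT8 quantizer would need a per-tensor scale chosen in advance to cover the same spread of magnitudes.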

Authors (9)
  1. Zhuoyi Zhang (4 papers)
  2. Yunchen Zhang (14 papers)
  3. Gonglei Shi (4 papers)
  4. Yu Shen (56 papers)
  5. Ruihao Gong (40 papers)
  6. Xiaoxu Xia (1 paper)
  7. Qi Zhang (784 papers)
  8. Lewei Lu (55 papers)
  9. Xianglong Liu (128 papers)
Citations (1)