Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs (2311.12359v3)

Published 21 Nov 2023 in cs.CV, cs.AI, cs.AR, cs.LG, and cs.PF

Abstract: Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits, and their accuracy-hardware cost trade-offs relative to integers, remain unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer-based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.
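To make the minifloat idea concrete, the sketch below rounds tensor values to the nearest number representable in a configurable sign/exponent/mantissa format. It is an illustrative assumption, not the paper's operator library: it uses an IEEE-style exponent bias, a simple saturating clamp, and no inf/NaN encodings, any of which may differ from the formats actually evaluated in the paper.

```python
import numpy as np

def quantize_minifloat(x, exp_bits=4, mant_bits=3, bias=None):
    """Round x to the nearest value representable with `exp_bits` exponent
    bits and `mant_bits` mantissa bits (plus a sign bit).

    Illustrative sketch only: assumes no inf/NaN encodings, round-to-nearest,
    and saturation at the largest representable magnitude.
    """
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1          # conventional IEEE-style bias

    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)

    # Largest representable magnitude and smallest normal exponent.
    max_exp = 2 ** exp_bits - 1 - bias
    min_exp = 1 - bias
    max_val = (2.0 - 2.0 ** (-mant_bits)) * 2.0 ** max_exp
    mag = np.minimum(mag, max_val)              # saturate instead of overflowing

    # Per-element exponent, clipped into the normal/subnormal range.
    exp = np.floor(np.log2(np.maximum(mag, np.finfo(np.float64).tiny)))
    exp = np.clip(exp, min_exp, max_exp)

    # Round the mantissa to `mant_bits` fractional bits at that exponent.
    scale = 2.0 ** (exp - mant_bits)
    return sign * np.round(mag / scale) * scale

# Example: an 8-bit E4M3-style format versus a 6-bit E3M2-style format.
w = np.array([0.013, -0.75, 3.2, 450.0])
print(quantize_minifloat(w, exp_bits=4, mant_bits=3))
print(quantize_minifloat(w, exp_bits=3, mant_bits=2))
```

Sweeping `exp_bits` and `mant_bits` in this fashion mirrors the kind of design-space exploration the paper performs across 3- to 8-bit weight and activation formats.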

Authors (7)
  1. Shivam Aggarwal (8 papers)
  2. Alessandro Pappalardo (7 papers)
  3. Hans Jakob Damsgaard (2 papers)
  4. Giuseppe Franco (3 papers)
  5. Thomas B. Preußer (11 papers)
  6. Michaela Blott (31 papers)
  7. Tulika Mitra (27 papers)
Citations (2)