Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models (2305.12356v1)

Published 21 May 2023 in cs.LG and cs.AI

Abstract: Efficient deployment of LLMs necessitates low-bit quantization to minimize model size and inference cost. While low-bit integer formats (e.g., INT8/INT4) have been the conventional choice, emerging low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU. However, the superiority of low-bit INT versus FP formats for quantization of LLMs remains unclear. In this study, we conduct a comparative analysis of INT and FP quantization with the same bit-width, revealing that the optimal quantization format varies across different layers due to the complexity and diversity of tensor distributions. Consequently, we advocate the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis. This simple yet effective approach achieves state-of-the-art results in both weight-only (W-only) and weight-activation (WA) post-training quantization scenarios when tested on LLaMA across various tasks. In 4-bit W-only quantization, MoFQ surpasses GPTQ without complex hyperparameter tuning and with an order of magnitude faster quantization speed. In 8-bit WA quantization, MoFQ significantly outperforms INT/FP-only methods, achieving performance close to the full-precision model. Notably, MoFQ incurs no hardware overhead compared to INT/FP-only quantization, as the bit-width remains unchanged.
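
The abstract describes MoFQ as choosing, per layer, whichever same-bit-width format (INT or FP) represents that layer's tensors better. The sketch below illustrates that idea with simulated 4-bit quantization and a mean-squared-error selection criterion; the FP4 (E2M1) value grid, the MSE metric, and the layer names are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

# Representable magnitudes of an FP4 (E2M1) format -- an assumption for this
# sketch; the paper's exact FP4 encoding may differ.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_int4(w):
    """Simulated symmetric per-tensor INT4 quantization (fake quant)."""
    scale = np.abs(w).max() / 7.0 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale

def quantize_fp4(w):
    """Simulated FP4 (E2M1) quantization via nearest-grid-value lookup."""
    scale = np.abs(w).max() / FP4_GRID.max() + 1e-12
    mag = np.abs(w) / scale
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(w) * FP4_GRID[idx] * scale

def select_format(w):
    """Pick the format with lower reconstruction error for this layer (MoFQ idea)."""
    err_int = np.mean((w - quantize_int4(w)) ** 2)
    err_fp = np.mean((w - quantize_fp4(w)) ** 2)
    return "INT4" if err_int <= err_fp else "FP4"

# Example: decide a format per layer of a toy model (hypothetical layer names).
rng = np.random.default_rng(0)
layers = {
    "attn.q_proj": rng.normal(size=(256, 256)),          # roughly Gaussian weights
    "mlp.up_proj": rng.standard_t(df=3, size=(256, 256)) # heavier-tailed weights
}
for name, w in layers.items():
    print(name, "->", select_format(w))
```

Because the bit-width is identical for both candidates, this per-layer choice adds no storage or hardware cost; only the decision of which 4-bit encoding to use changes from layer to layer.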

Authors (9)
  1. Yijia Zhang (24 papers)
  2. Lingran Zhao (3 papers)
  3. Shijie Cao (20 papers)
  4. Wenqiang Wang (10 papers)
  5. Ting Cao (100 papers)
  6. Fan Yang (877 papers)
  7. Mao Yang (62 papers)
  8. Shanghang Zhang (172 papers)
  9. Ningyi Xu (16 papers)
Citations (12)