AFPQ: Asymmetric Floating Point Quantization for LLMs (2311.01792v1)

Published 3 Nov 2023 in cs.CL and cs.AI

Abstract: LLMs show strong performance across a wide range of tasks, but their deployment is constrained by limited memory capacity and bandwidth. Low-bit weight quantization can save memory and accelerate inference. Although floating-point (FP) formats perform well in LLM quantization, they tend to degrade with small group sizes or sub-4-bit precision. We find that the cause is the absence of asymmetry in previous FP quantization, which makes it ill-suited to the asymmetric value distributions of LLM weight tensors. In this work, we propose asymmetric FP quantization (AFPQ), which sets separate scales for positive and negative values. Our method yields large accuracy improvements and can be easily plugged into other quantization methods, including GPTQ and AWQ, for better performance. In addition, it requires no extra storage compared with asymmetric integer (INT) quantization. The code is available at https://github.com/zhangsichengsjtu/AFPQ.
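
The core idea, as described in the abstract, is to use two scales per weight group, one fitted to the positive values and one to the negative values, before snapping to a low-bit FP grid. The following is a minimal NumPy sketch of that idea, not the authors' implementation (which is at the repository above); the FP4 (E2M1) level grid, the group size of 128, and the function names are illustrative assumptions.

```python
import numpy as np

# Illustrative FP4 (E2M1) magnitude grid; the exact format used in the paper may differ.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_to_grid(x, scale):
    """Divide magnitudes by the scale, snap to the nearest FP level, and rescale."""
    scaled = np.abs(x) / np.maximum(scale, 1e-12)
    idx = np.argmin(np.abs(scaled[..., None] - FP4_LEVELS), axis=-1)
    return np.sign(x) * FP4_LEVELS[idx] * scale

def afpq_groupwise(weights, group_size=128):
    """Asymmetric FP quantization sketch: separate scales for positive and
    negative values, computed independently within each weight group."""
    w = weights.reshape(-1, group_size)
    out = np.empty_like(w)
    max_level = FP4_LEVELS[-1]
    for i, g in enumerate(w):
        pos, neg = g[g > 0], g[g < 0]
        scale_pos = (pos.max() / max_level) if pos.size else 1.0
        scale_neg = (np.abs(neg).max() / max_level) if neg.size else 1.0
        out[i] = np.where(g >= 0,
                          quantize_to_grid(g, scale_pos),
                          quantize_to_grid(g, scale_neg))
    return out.reshape(weights.shape)

# Example: quantize a skewed (asymmetric) weight tensor and check reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(loc=0.02, scale=0.1, size=(4, 128)).astype(np.float32)
w_q = afpq_groupwise(w, group_size=128)
print("mean abs error:", np.abs(w - w_q).mean())
```

Under these assumptions, the only change relative to symmetric FP quantization is the second scale, so the storage overhead matches asymmetric INT quantization (one extra scalar per group) as the abstract notes.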

Authors (7)
  1. Yijia Zhang (24 papers)
  2. Sicheng Zhang (9 papers)
  3. Shijie Cao (20 papers)
  4. Dayou Du (11 papers)
  5. Jianyu Wei (5 papers)
  6. Ting Cao (100 papers)
  7. Ningyi Xu (16 papers)
Citations (4)