FPTQ: Fine-grained Post-Training Quantization for Large Language Models (2308.15987v1)

Published 30 Aug 2023 in cs.CL, cs.AI, and cs.LG

Abstract: In the era of large language models (LLMs), the substantial parameter size poses significant challenges for deployment. As a prevalent compression technique, quantization has emerged as the mainstream remedy, mainly centered on two recipes: W8A8 and W4A16 (i.e., weights and activations in those bit widths). In this study, we propose a novel W4A8 post-training quantization method for available open-source LLMs that combines the advantages of both recipes: the I/O benefit of 4-bit weight quantization and the acceleration of 8-bit matrix computation. Nevertheless, W4A8 is known to suffer from severe performance degradation. As a remedy, we introduce layerwise activation quantization strategies featuring a novel logarithmic equalization for the most intractable layers, and we combine them with fine-grained weight quantization. Without bells and whistles, we eliminate the need for further fine-tuning and obtain state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks. We confirm that W4A8 quantization is achievable for the deployment of LLMs, fostering widespread real-world applications.
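
The abstract describes the W4A8 recipe only at a high level; the sketch below illustrates what per-group (fine-grained) 4-bit weight quantization combined with per-token 8-bit activation quantization might look like in PyTorch. This is an illustrative sketch under stated assumptions, not the paper's implementation: the group size of 128, the symmetric rounding scheme, and the helper names are assumptions, and the paper's logarithmic activation equalization step is omitted.

```python
import torch


def quantize_weights_w4_per_group(w: torch.Tensor, group_size: int = 128):
    """Per-group (fine-grained) symmetric 4-bit weight quantization.

    w: [out_features, in_features]. Each contiguous `group_size` slice along
    the input dimension gets its own scale, which is the "fine-grained" part
    of W4A8 schemes. Group size 128 is an assumption for illustration.
    """
    out_f, in_f = w.shape
    assert in_f % group_size == 0, "in_features must be divisible by group_size"
    w_grouped = w.reshape(out_f, in_f // group_size, group_size)
    # Symmetric scale per group: map the group's max |w| to the INT4 limit (7).
    scales = w_grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w_grouped / scales), -8, 7)
    return q.to(torch.int8), scales  # INT4 values stored in int8 containers


def quantize_activations_a8_per_token(x: torch.Tensor):
    """Per-token symmetric 8-bit activation quantization.

    x: [tokens, features]; one scale per token row.
    """
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scales), -128, 127)
    return q.to(torch.int8), scales


def dequantize_weights(q: torch.Tensor, scales: torch.Tensor):
    """Reconstruct a float weight matrix from per-group INT4 values and scales."""
    out_f = q.shape[0]
    return (q.float() * scales).reshape(out_f, -1)


# Hypothetical usage on a random linear-layer weight and activation batch.
w = torch.randn(4096, 4096)
x = torch.randn(16, 4096)
qw, w_scales = quantize_weights_w4_per_group(w)
qx, x_scales = quantize_activations_a8_per_token(x)
# A real W4A8 kernel would unpack INT4 weights into INT8 and run an INT8 GEMM;
# here we only check the weight reconstruction error in float.
err = (dequantize_weights(qw, w_scales) - w).abs().mean()
```

The per-group scales are what distinguish this from coarse per-tensor W4 quantization: outliers in one 128-column group no longer inflate the quantization step for the rest of the row, which is the intuition behind the fine-grained weight scheme the abstract refers to.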

Authors (9)
  1. Qingyuan Li (11 papers)
  2. Yifan Zhang (245 papers)
  3. Liang Li (297 papers)
  4. Peng Yao (16 papers)
  5. Bo Zhang (633 papers)
  6. Xiangxiang Chu (62 papers)
  7. Yerui Sun (4 papers)
  8. Li Du (72 papers)
  9. Yuchen Xie (12 papers)
Citations (11)