I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models (2405.17849v2)

Published 28 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Post-training quantization (PTQ) serves as a potent technique to accelerate the inference of LLMs. Nonetheless, existing works still necessitate a considerable number of floating-point (FP) operations during inference, including additional quantization and de-quantization, as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on edge and cloud devices. In this paper, we identify that the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights; (2) to alleviate degradation caused by inter-token variations, we introduce Dynamic Integer-only MatMul (DI-MatMul), which enables dynamic quantization in fully-integer matrix multiplication by quantizing the inputs and outputs with integer-only operations; and (3) we design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which use bit shifts to execute non-linear operators efficiently while maintaining accuracy. Experiments show that I-LLM achieves accuracy comparable to the FP baseline and outperforms non-integer quantization methods. For example, I-LLM can operate at W4A4 with negligible loss of accuracy. To our knowledge, we are the first to bridge the gap between integer-only quantization and LLMs. We have published our code on anonymous.4open.science, aiming to contribute to the advancement of this field.
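The core idea behind DI-MatMul, as described in the abstract, is to perform both the matrix multiplication and its dynamic per-token requantization entirely in integer arithmetic. The snippet below is a minimal NumPy sketch of that idea, not the authors' implementation: the function names, the shift-search loop, and the use of a pure power-of-two scale are simplifying assumptions made for illustration.

```python
import numpy as np

def dynamic_requant_int_only(acc_int32, bits=8):
    """Requantize an int32 accumulator to `bits`-bit integers per token,
    using only integer comparisons and bit shifts (no FP scales).
    Illustrative stand-in for DI-MatMul's dynamic output quantization."""
    qmax = 2 ** (bits - 1) - 1
    # Per-token (last-axis) running maximum magnitude.
    amax = np.abs(acc_int32).max(axis=-1, keepdims=True).astype(np.int64)
    shift = np.zeros_like(amax)
    # Grow each token's shift until its largest magnitude fits the target range.
    while np.any((amax >> shift) > qmax):
        shift = np.where((amax >> shift) > qmax, shift + 1, shift)
    q = acc_int32 >> shift                       # integer-only rescale
    q = np.clip(q, -qmax - 1, qmax).astype(np.int8)
    return q, shift                              # shift acts as a dynamic scale

def di_matmul_sketch(a_int8, b_int8):
    """int8 x int8 -> int32 accumulation, then integer-only requantization."""
    acc = a_int8.astype(np.int32) @ b_int8.astype(np.int32)
    return dynamic_requant_int_only(acc)

# Toy usage with random int8 operands (hypothetical shapes).
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(4, 16), dtype=np.int8)
b = rng.integers(-128, 128, size=(16, 8), dtype=np.int8)
out_int8, out_shift = di_matmul_sketch(a, b)
```

Restricting the dynamic scale to a power of two keeps the whole path in shifts and adds; the paper's actual operators (including DI-ClippedSoftmax, DI-Exp, and DI-Normalization) are more elaborate but follow the same integer-only principle.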

Authors (7)
  1. Xing Hu (122 papers)
  2. Dawei Yang (61 papers)
  3. Sifan Zhou (24 papers)
  4. Zhihang Yuan (45 papers)
  5. Jiangyong Yu (13 papers)
  6. Chen Xu (186 papers)
  7. Yuan Cheng (70 papers)
Citations (2)