I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models (2405.17849v2)

Published 28 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Post-training quantization (PTQ) serves as a potent technique to accelerate the inference of LLMs. Nonetheless, existing works still necessitate a considerable number of floating-point (FP) operations during inference, including additional quantization and de-quantization, as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on edge and cloud devices. In this paper, we identify that the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights; (2) to alleviate degradation caused by inter-token variations, we introduce Dynamic Integer-only MatMul (DI-MatMul), which enables dynamic quantization in fully-integer matrix multiplication by quantizing the inputs and outputs with integer-only operations; and (3) we design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which use bit shifts to execute non-linear operators efficiently while maintaining accuracy. Experiments show that I-LLM achieves accuracy comparable to the FP baseline and outperforms non-integer quantization methods. For example, I-LLM can operate at W4A4 with negligible loss of accuracy. To our knowledge, we are the first to bridge the gap between integer-only quantization and LLMs. We have published our code on anonymous.4open.science, aiming to contribute to the advancement of this field.
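The core idea behind DI-MatMul, as described in the abstract, is to perform both the matrix multiplication and its dynamic per-token requantization entirely in integer arithmetic. The snippet below is a minimal NumPy sketch of that idea, not the authors' implementation: the function names, the shift-search loop, and the use of a pure power-of-two scale are simplifying assumptions made for illustration.

```python
import numpy as np

def dynamic_requant_int_only(acc_int32, bits=8):
    """Requantize an int32 accumulator to `bits`-bit integers per token,
    using only integer comparisons and bit shifts (no FP scales).
    Illustrative stand-in for DI-MatMul's dynamic output quantization."""
    qmax = 2 ** (bits - 1) - 1
    # Per-token (last-axis) running maximum magnitude.
    amax = np.abs(acc_int32).max(axis=-1, keepdims=True).astype(np.int64)
    shift = np.zeros_like(amax)
    # Grow each token's shift until its largest magnitude fits the target range.
    while np.any((amax >> shift) > qmax):
        shift = np.where((amax >> shift) > qmax, shift + 1, shift)
    q = acc_int32 >> shift                       # integer-only rescale
    q = np.clip(q, -qmax - 1, qmax).astype(np.int8)
    return q, shift                              # shift acts as a dynamic scale

def di_matmul_sketch(a_int8, b_int8):
    """int8 x int8 -> int32 accumulation, then integer-only requantization."""
    acc = a_int8.astype(np.int32) @ b_int8.astype(np.int32)
    return dynamic_requant_int_only(acc)

# Toy usage with random int8 operands (hypothetical shapes).
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(4, 16), dtype=np.int8)
b = rng.integers(-128, 128, size=(16, 8), dtype=np.int8)
out_int8, out_shift = di_matmul_sketch(a, b)
```

Restricting the dynamic scale to a power of two keeps the whole path in shifts and adds; the paper's actual operators (including DI-ClippedSoftmax, DI-Exp, and DI-Normalization) are more elaborate but follow the same integer-only principle.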

Authors (7)
  1. Xing Hu (122 papers)
  2. Dawei Yang (61 papers)
  3. Sifan Zhou (24 papers)
  4. Zhihang Yuan (45 papers)
  5. Jiangyong Yu (13 papers)
  6. Chen Xu (186 papers)
  7. Yuan Cheng (70 papers)
Citations (2)