Introduction
The development of large language models (LLMs) has catalyzed significant advances in natural language processing. Despite their capabilities, deploying these models remains challenging because of their enormous parameter counts and computational demands. Binarization offers a compelling compression solution by reducing model weights to a single bit, sharply cutting compute and memory requirements. However, existing post-training quantization (PTQ) methods struggle to preserve performance at such low bit-widths. This paper introduces BiLLM, a 1-bit PTQ framework designed for LLMs that exploits the weight distribution of LLMs to guide quantization.
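To make the starting point concrete, the snippet below is a minimal sketch of plain 1-bit weight binarization, not BiLLM's full pipeline: each weight row is replaced by a sign matrix plus one per-row scaling factor, the standard closed-form least-squares choice. The PyTorch usage and the function name binarize_row are illustrative assumptions.

```python
import torch

def binarize_row(W):
    """Approximate W (out_features x in_features) as alpha * sign(W),
    with one scale per output row (the closed-form least-squares choice)."""
    B = torch.sign(W)
    B[B == 0] = 1.0                              # map exact zeros to +1 so every entry is +/-1
    alpha = W.abs().mean(dim=1, keepdim=True)    # per-row scaling factor
    return alpha, B

W = torch.randn(4, 8)
alpha, B = binarize_row(W)
print(torch.norm(W - alpha * B))                 # reconstruction error of the 1-bit approximation
```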
Analyzing Weight Distribution
BiLLM's approach begins with an empirical analysis of LLM weight distributions, finding that a small minority of weights significantly impacts model output while the majority are largely redundant. Furthermore, the non-salient weights follow a bell-shaped distribution. These observations underpin BiLLM's strategy: a split binarization process that treats salient and non-salient weights with tailored methods.
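As a hedged illustration of how Hessian information can flag salient weights, the sketch below uses the common OBS/GPTQ-style proxy score w^2 / [H^-1]_jj^2, with the Hessian approximated from calibration activations; BiLLM's exact metric and its structured selection of salient columns may differ, and all names here are illustrative.

```python
import torch

def salience_scores(W, X, damp=0.01):
    """W: (out, in) weight matrix; X: (n_samples, in) calibration activations."""
    H = X.T @ X                                              # proxy for the layer Hessian
    H += damp * H.diagonal().mean() * torch.eye(H.shape[0])  # damping so H is invertible
    h_inv_diag = torch.linalg.inv(H).diagonal()
    return W.pow(2) / h_inv_diag.pow(2)                      # larger score = more salient

W = torch.randn(16, 64)
X = torch.randn(256, 64)
scores = salience_scores(W, X)
salient_cols = scores.sum(dim=0).topk(4).indices  # e.g. keep a few columns as the salient group
```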
Innovative Binarization Techniques
For salient weights, BiLLM uses a Hessian-based metric to guide their structured selection and applies a binary residual approximation that compresses them with minimal information loss. Non-salient weights are handled with an "optimal splitting" binarization, which divides them according to their distribution so that binarization error is minimized. Error compensation mechanisms further improve accuracy, and the overall process remains time-efficient: a 7-billion-parameter LLM can be binarized within 0.5 hours on a single GPU.
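The sketch below illustrates both ideas under simple assumptions: residual binarization applies the closed-form binarizer twice (once to the weights, once to the remaining residual), and the splitting step grid-searches a break point on |w| and binarizes the concentrated and sparse groups of the bell-shaped distribution separately. This is not BiLLM's exact algorithm, which also involves structured grouping and error compensation; it is only a minimal illustration.

```python
import torch

def binarize(w):
    """Closed-form 1-bit approximation of a weight group: mean(|w|) * sign(w)."""
    if w.numel() == 0:
        return w
    return w.abs().mean() * torch.sign(w)

def residual_binarize(w):
    """Salient weights: binarize, then binarize the leftover residual
    (roughly 2 bits per salient weight)."""
    first = binarize(w)
    second = binarize(w - first)
    return first + second

def split_binarize(w, n_grid=50):
    """Non-salient weights: search a break point p on |w|, binarize the
    concentrated (|w| <= p) and sparse (|w| > p) groups separately, and
    keep the split with the lowest reconstruction error."""
    best_err, best_rec = float("inf"), None
    grid = torch.linspace(w.abs().min().item(), w.abs().max().item(), n_grid)
    for p in grid[1:]:
        mask = w.abs() <= p
        rec = torch.empty_like(w)
        rec[mask] = binarize(w[mask])
        rec[~mask] = binarize(w[~mask])
        err = torch.norm(w - rec).item()
        if err < best_err:
            best_err, best_rec = err, rec
    return best_rec

w = torch.randn(4096)
print(torch.norm(w - binarize(w)), torch.norm(w - split_binarize(w)))
```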
Validated by Extensive Experiments
The paper backs its claims with extensive numerical evidence. On benchmarks such as WikiText2, BiLLM achieves perplexity scores that surpass those of prior low-bit quantization methods, indicating limited degradation despite the extreme compression. BiLLM applies across multiple LLM families, and the experiments show state-of-the-art results on several metrics. A notable instance is BiLLM reaching a perplexity of 8.41 on LLaMA2-70B with only 1.08-bit weights, outperforming current LLM quantization methods.
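For intuition about where an average near 1 bit per weight comes from, here is a back-of-the-envelope calculation; the 10% salient fraction is an assumption for this sketch, not a figure from the paper. Salient weights cost roughly 2 bits each because of the residual pass, while the rest cost 1 bit, before the small per-group overhead for scales and split masks.

```python
# Illustrative arithmetic only; the salient fraction is assumed, not reported.
salient_frac = 0.10
avg_bits = salient_frac * 2 + (1 - salient_frac) * 1
print(avg_bits)   # ~1.1 bits/weight, in line with the reported ~1.08-bit average
```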
Conclusion and Outlook
BiLLM stands out as a leading framework for LLM post-training quantization, achieving state-of-the-art performance at an average of about 1.08 bits per weight across different LLM families. Its ability to maintain high accuracy after such extreme quantization represents a breakthrough for deploying LLMs on resource-constrained systems. The approach sets a new benchmark for LLM quantization efficiency and opens avenues for further research and practical adoption in the wider deployment of LLMs.