Introduction
The development of large language models (LLMs) has catalyzed significant advances in natural language processing. Despite their capabilities, deploying these models remains challenging because of their enormous parameter counts and computational demands. Binarization offers a compelling compression solution by reducing model weights to a single bit, sharply cutting compute and memory requirements. However, existing post-training quantization (PTQ) methods struggle to preserve performance at such low bit-widths. This paper introduces BiLLM, a 1-bit PTQ framework designed for LLMs that exploits the weight distribution of LLMs to guide quantization.
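To make the starting point concrete, the snippet below is a minimal sketch of plain 1-bit weight binarization, not BiLLM's full pipeline: each weight row is replaced by a sign matrix plus one per-row scaling factor, the standard closed-form least-squares choice. The PyTorch usage and the function name binarize_row are illustrative assumptions.

```python
import torch

def binarize_row(W):
    """Approximate W (out_features x in_features) as alpha * sign(W),
    with one scale per output row (the closed-form least-squares choice)."""
    B = torch.sign(W)
    B[B == 0] = 1.0                              # map exact zeros to +1 so every entry is +/-1
    alpha = W.abs().mean(dim=1, keepdim=True)    # per-row scaling factor
    return alpha, B

W = torch.randn(4, 8)
alpha, B = binarize_row(W)
print(torch.norm(W - alpha * B))                 # reconstruction error of the 1-bit approximation
```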
Analyzing Weight Distribution
BiLLM's approach begins with an empirical analysis of LLM weight distributions, finding that a small minority of weights significantly impacts model output while the majority are largely redundant. Furthermore, the non-salient weights follow a bell-shaped distribution. These observations underpin BiLLM's strategy: a split binarization process that treats salient and non-salient weights with tailored methods.
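As a hedged illustration of how Hessian information can flag salient weights, the sketch below uses the common OBS/GPTQ-style proxy score w^2 / [H^-1]_jj^2, with the Hessian approximated from calibration activations; BiLLM's exact metric and its structured selection of salient columns may differ, and all names here are illustrative.

```python
import torch

def salience_scores(W, X, damp=0.01):
    """W: (out, in) weight matrix; X: (n_samples, in) calibration activations."""
    H = X.T @ X                                              # proxy for the layer Hessian
    H += damp * H.diagonal().mean() * torch.eye(H.shape[0])  # damping so H is invertible
    h_inv_diag = torch.linalg.inv(H).diagonal()
    return W.pow(2) / h_inv_diag.pow(2)                      # larger score = more salient

W = torch.randn(16, 64)
X = torch.randn(256, 64)
scores = salience_scores(W, X)
salient_cols = scores.sum(dim=0).topk(4).indices  # e.g. keep a few columns as the salient group
```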
Innovative Binarization Techniques
For salient weights, BiLLM uses a Hessian-based metric to guide their structured selection and applies a binary residual approximation that compresses them with minimal information loss. Non-salient weights are handled with an "optimal splitting" binarization, which divides them according to their distribution so that binarization error is minimized. Error compensation mechanisms further improve accuracy, and the overall process remains time-efficient: a 7-billion-parameter LLM can be binarized within 0.5 hours on a single GPU.
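The sketch below illustrates both ideas under simple assumptions: residual binarization applies the closed-form binarizer twice (once to the weights, once to the remaining residual), and the splitting step grid-searches a break point on |w| and binarizes the concentrated and sparse groups of the bell-shaped distribution separately. This is not BiLLM's exact algorithm, which also involves structured grouping and error compensation; it is only a minimal illustration.

```python
import torch

def binarize(w):
    """Closed-form 1-bit approximation of a weight group: mean(|w|) * sign(w)."""
    if w.numel() == 0:
        return w
    return w.abs().mean() * torch.sign(w)

def residual_binarize(w):
    """Salient weights: binarize, then binarize the leftover residual
    (roughly 2 bits per salient weight)."""
    first = binarize(w)
    second = binarize(w - first)
    return first + second

def split_binarize(w, n_grid=50):
    """Non-salient weights: search a break point p on |w|, binarize the
    concentrated (|w| <= p) and sparse (|w| > p) groups separately, and
    keep the split with the lowest reconstruction error."""
    best_err, best_rec = float("inf"), None
    grid = torch.linspace(w.abs().min().item(), w.abs().max().item(), n_grid)
    for p in grid[1:]:
        mask = w.abs() <= p
        rec = torch.empty_like(w)
        rec[mask] = binarize(w[mask])
        rec[~mask] = binarize(w[~mask])
        err = torch.norm(w - rec).item()
        if err < best_err:
            best_err, best_rec = err, rec
    return best_rec

w = torch.randn(4096)
print(torch.norm(w - binarize(w)), torch.norm(w - split_binarize(w)))
```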
Validated by Extensive Experiments
The paper backs its claims with extensive numerical evidence. On benchmarks such as WikiText2, BiLLM achieves perplexity scores that surpass those of prior low-bit quantization methods, indicating limited degradation despite the extreme compression. BiLLM applies across multiple LLM families, and the experiments show state-of-the-art results on several metrics. A notable instance is BiLLM reaching a perplexity of 8.41 on LLaMA2-70B with only 1.08-bit weights, outperforming current LLM quantization methods.
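For intuition about where an average near 1 bit per weight comes from, here is a back-of-the-envelope calculation; the 10% salient fraction is an assumption for this sketch, not a figure from the paper. Salient weights cost roughly 2 bits each because of the residual pass, while the rest cost 1 bit, before the small per-group overhead for scales and split masks.

```python
# Illustrative arithmetic only; the salient fraction is assumed, not reported.
salient_frac = 0.10
avg_bits = salient_frac * 2 + (1 - salient_frac) * 1
print(avg_bits)   # ~1.1 bits/weight, in line with the reported ~1.08-bit average
```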
Conclusion and Outlook
BiLLM stands out as a leading framework for LLM post-training quantization, achieving state-of-the-art performance at an average of about 1.08 bits per weight across different LLM families. Its ability to maintain high accuracy after such extreme quantization represents a breakthrough for deploying LLMs on resource-constrained systems. The approach sets a new benchmark for LLM quantization efficiency and opens avenues for further research and practical adoption in the wider deployment of LLMs.