
OneBit: Towards Extremely Low-bit Large Language Models (2402.11295v6)

Published 17 Feb 2024 in cs.CL

Abstract: Model quantization uses low bit-width values to represent the weight matrices of existing models, a promising approach for reducing both the storage and computational overhead of deploying highly anticipated LLMs. However, current quantization methods suffer severe performance degradation when the bit-width is extremely reduced, and thus tend to rely on 4-bit or 8-bit values. This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for extremely low bit-width deployment of LLMs. To this end, we introduce a 1-bit model compression framework named OneBit, including a novel 1-bit parameter representation method to better quantize LLMs, as well as an effective parameter initialization method based on matrix decomposition to improve the convergence speed of the quantization framework. Extensive experimental results indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes when only using 1-bit weight matrices.

Exploring the Frontiers of 1-bit Quantization for LLMs with OneBit

Introduction to Quantization in LLMs

The deployment of LLMs in practical applications has been hampered by their substantial computational and memory requirements. Quantization, specifically post-training quantization (PTQ) and quantization-aware training (QAT), has emerged as a promising approach to alleviate these issues by compressing model weights into lower bit-width formats. While these techniques have shown success in compressing weights to 4 bits without significant loss in model capabilities, the extreme quantization to 1-bit presents considerable challenges, primarily due to the drastic precision loss.
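To ground these ideas, here is a minimal NumPy sketch of symmetric round-to-nearest (RTN) post-training quantization. It illustrates the generic bit-width reduction this section describes, not OneBit's method; the function names and the per-row scaling scheme are illustrative choices.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 4):
    """Symmetric round-to-nearest (RTN) post-training quantization.

    Each row of w is scaled onto the signed integer grid
    [-(2**(bits-1) - 1), 2**(bits-1) - 1] and rounded.
    """
    qmax = 2 ** (bits - 1) - 1
    # Per-row scale: map the largest-magnitude weight in each row onto qmax.
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover an approximate floating-point weight matrix.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_rtn(w, bits=4)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

At 1 bit this grid degenerates to a single level per sign, which is why naive RTN collapses at extreme bit-widths and a dedicated representation such as OneBit's is needed.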

OneBit: A Novel 1-bit Quantization Framework

To address the limitations of existing quantization methods at extremely low bit-widths, the paper introduces OneBit, a pioneering 1-bit quantization-aware training framework for LLMs. OneBit combines a novel 1-bit parameter representation with an effective parameter initialization strategy based on matrix decomposition, aiming to mitigate the performance degradation that accompanies such a drastic reduction in precision. The key contributions of OneBit include:

  • A 1-bit model architecture that enhances time and space efficiency during model inference while ensuring more stable quantization of LLMs.
  • The introduction of Sign-Value-Independent Decomposition (SVID), a novel approach that decomposes a high-bit weight matrix into a 1-bit sign matrix and two floating-point value vectors, providing an effective initialization and faster convergence (see the sketch after this list).
  • Demonstrated effectiveness of OneBit across various models and tasks, showing compelling performance retention at the 1-bit quantization level, with results indicating at least 83% of the non-quantized performance alongside robust training processes.
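To make the decomposition concrete, the following NumPy sketch shows an SVID-style initialization and the corresponding 1-bit linear forward pass, following the paper's description of splitting W into a sign matrix and two value vectors. The helper names (svid_init, onebit_linear) are hypothetical; the rank-1 factorization here uses SVD, one of the options the paper discusses.

```python
import numpy as np

def svid_init(w: np.ndarray):
    """SVID-style initialization (illustrative, not the paper's exact code):
    split w into a {-1, +1} sign matrix and a rank-1 approximation of its
    magnitude, |w| ~= outer(a, b)."""
    w_sign = np.where(w >= 0, 1.0, -1.0)            # the 1-bit component
    u, s, vt = np.linalg.svd(np.abs(w), full_matrices=False)
    a = u[:, 0] * np.sqrt(s[0])                     # value vector over output dims
    b = vt[0, :] * np.sqrt(s[0])                    # value vector over input dims
    if a.sum() < 0:                                 # resolve SVD sign ambiguity
        a, b = -a, -b
    return w_sign, a, b

def onebit_linear(x: np.ndarray, w_sign, a, b):
    """Forward pass y = x @ W.T with W ~= w_sign * outer(a, b):
    scale inputs by b, multiply by the sign matrix, scale outputs by a."""
    return ((x * b) @ w_sign.T) * a

w = np.random.randn(16, 32)                         # a full-precision weight matrix
w_sign, a, b = svid_init(w)
x = np.random.randn(4, 32)
err = np.linalg.norm(onebit_linear(x, w_sign, a, b) - x @ w.T) / np.linalg.norm(x @ w.T)
print(f"relative error of the sign-plus-rank-1 reconstruction: {err:.3f}")
```

In the full framework, SVID only supplies the starting point; the sign matrix and value vectors are then trained further under quantization-aware training, with the full-precision model guiding the quantized one.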

Comparative Performance and Advantages

Experimental evaluations show that OneBit outperforms existing quantization methods even when those baselines operate at 2-bit precision. Across a range of models and tasks, OneBit exhibits comparatively small performance degradation, demonstrating the feasibility and efficacy of 1-bit quantization for LLM deployment. Moreover, the framework strikes a favorable balance between reduced model size and retained performance, opening new avenues for deploying advanced language processing capabilities on resource-constrained platforms.
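As a back-of-the-envelope illustration of the storage side of this balance (the 4096x4096 layer size is an assumed example, not a figure from the paper):

```python
# Storage for one 4096x4096 linear layer: FP16 weights vs.
# 1-bit signs plus two FP16 value vectors (assumed example dimensions).
rows = cols = 4096
fp16_bytes = rows * cols * 2                            # 16 bits per weight
onebit_bytes = rows * cols // 8 + (rows + cols) * 2     # 1-bit signs + two FP16 vectors
print(f"FP16: {fp16_bytes / 2**20:.1f} MiB, "
      f"1-bit: {onebit_bytes / 2**20:.2f} MiB "
      f"({fp16_bytes / onebit_bytes:.1f}x smaller)")
```

Because the two value vectors are tiny relative to the sign matrix, the per-layer footprint approaches the ideal 16x reduction over FP16.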

Theoretical and Practical Implications

The development and validation of OneBit hold significant implications for both the theoretical understanding and practical deployment of LLMs. Theoretically, it challenges the conventional limitations perceived in extreme quantization scenarios, offering insights into the resilience of LLM architectures to precision loss. Practically, OneBit provides a viable pathway to deploying sophisticated LLMs in constrained environments, thereby broadening the accessibility and applicability of these models.

Future Directions in AI and Quantization

Looking ahead, the advancements presented in this work serve as a foundation for further exploration into ultra-low-bit quantization techniques. Future research may delve into combining 1-bit weight quantization with activation quantization, exploring alternative decomposition strategies, and extending the application of such quantization techniques beyond LLMs to other domains of deep learning. Furthermore, addressing the trade-offs between quantization-induced efficiency gains and performance metrics remains a critical area for continued innovation.

In conclusion, OneBit represents a significant stride toward making LLMs more accessible and versatile. It encourages the pursuit of even more ambitious quantization frameworks that could one day enable the deployment of AI systems across a spectrum of devices and platforms, from high-end GPUs to everyday mobile devices. Through such innovations, the field edges closer to realizing the full potential of AI in a wide array of applications and contexts.

Authors (8)
  1. Yuzhuang Xu (12 papers)
  2. Xu Han (270 papers)
  3. Zonghan Yang (23 papers)
  4. Shuo Wang (382 papers)
  5. Qingfu Zhu (39 papers)
  6. Zhiyuan Liu (433 papers)
  7. Weidong Liu (46 papers)
  8. Wanxiang Che (152 papers)