A Survey of Low-bit LLMs: Basics, Systems, and Algorithms
Introduction
Large language models (LLMs) have substantially advanced natural language processing, achieving strong performance across a wide range of tasks. Yet their large memory footprint and computational demands pose substantial challenges for deployment, particularly in resource-constrained environments. This paper presents a comprehensive survey of low-bit quantization methods tailored to LLMs that aim to alleviate these challenges. It systematically explores the principles, system implementations, and algorithmic strategies essential for low-bit quantization of LLMs.
Basics of Low-bit LLMs
Reducing bit-width is central to making LLMs more efficient. The survey provides an overview of low-bit number formats, quantization granularity, and dynamic versus static quantization.
Low-bit Number Formats
The paper outlines various low-bit formats and their efficacy in representing data within LLMs. The widely adopted formats include INT4, INT8, FP8, FP16, and BF16, each providing different trade-offs between range and precision. Custom formats like NormalFloat (NF), Adaptive Biased Float (Abfloat), and Student Float (SF) offer tailored solutions for specific quantization challenges in LLMs.
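To make the contrast concrete, below is a minimal sketch (our own illustration, not code from the survey) comparing a uniform signed 4-bit grid with a non-uniform, normal-quantile-based codebook in the spirit of NormalFloat; the exact NF4 construction differs in detail, and the level placement here is purely illustrative.

```python
import torch

def uniform_int4_levels():
    # 16 evenly spaced representable values in [-1, 1] (symmetric, illustrative).
    return torch.linspace(-1.0, 1.0, 16)

def normalfloat_like_levels():
    # Illustrative non-uniform codebook: quantiles of a standard normal,
    # rescaled to [-1, 1]; the real NF4 construction differs in detail.
    probs = torch.linspace(0.02, 0.98, 16)
    q = torch.distributions.Normal(0.0, 1.0).icdf(probs)
    return q / q.abs().max()

def quantize_to_codebook(x, levels):
    # Map each value to the index of its nearest representable level.
    idx = (x.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return idx, levels[idx]

x = torch.randn(1024) * 0.5
for name, lv in [("uniform", uniform_int4_levels()), ("NF-like", normalfloat_like_levels())]:
    _, xq = quantize_to_codebook(x, lv)
    print(name, "mean abs error:", (x.clamp(-1, 1) - xq).abs().mean().item())
```

Because weights are roughly normally distributed, the quantile-based codebook typically yields lower average error than the uniform grid at the same bit-width.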
Quantization Granularity
Quantization granularity determines how finely quantization parameters are assigned during the quantization process. Categories include tensor-wise, token-wise, channel-wise, group-wise, and element-wise quantization. Each offers a different balance between storage efficiency and model accuracy: finer granularities retain more information but require storing more scale factors and incur higher computational cost.
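The following sketch (an assumption of ours, not taken from the survey) shows how the number of scale factors grows with finer granularity for a weight matrix `W` of shape `[out_features, in_features]`; the divisor 7 corresponds to the maximum magnitude of a signed 4-bit integer.

```python
import torch

def scales_tensor_wise(W):
    # One scale for the whole tensor.
    return W.abs().max() / 7.0

def scales_channel_wise(W):
    # One scale per output channel (row).
    return W.abs().amax(dim=1, keepdim=True) / 7.0

def scales_group_wise(W, group_size=128):
    # One scale per contiguous group of `group_size` input elements within a row.
    out_f, in_f = W.shape
    Wg = W.reshape(out_f, in_f // group_size, group_size)
    return Wg.abs().amax(dim=-1, keepdim=True) / 7.0   # [out_f, n_groups, 1]

W = torch.randn(256, 1024)
print(scales_tensor_wise(W).shape)   # torch.Size([])      -> 1 scale
print(scales_channel_wise(W).shape)  # torch.Size([256, 1]) -> 256 scales
print(scales_group_wise(W).shape)    # torch.Size([256, 8, 1]) -> 2048 scales
```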
Dynamic and Static Quantization
Dynamic quantization, which computes quantization parameters at runtime, and static quantization, which reuses parameters pre-calibrated on representative data, are critically analyzed. Dynamic quantization adapts to the actual activation distribution at the cost of extra runtime computation, whereas static quantization avoids this overhead but depends on the quality of the calibration set; each offers specific advantages depending on the deployment scenario.
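A minimal sketch of the difference, under our own simplifying assumptions (symmetric INT8, max-based calibration): dynamic quantization derives the scale from the current tensor, while static quantization reuses a scale computed once on calibration data.

```python
import torch

def quantize_int8(x, scale):
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def dynamic_quantize(x):
    # Scale computed on the fly from this tensor's observed range.
    scale = x.abs().max() / 127.0
    return quantize_int8(x, scale), scale

def calibrate_static_scale(calibration_batches):
    # Offline pass: record the maximum magnitude seen on calibration data.
    max_val = max(b.abs().max() for b in calibration_batches)
    return max_val / 127.0

# Calibration happens once; inference then reuses the fixed scale.
calib = [torch.randn(16, 512) for _ in range(8)]
static_scale = calibrate_static_scale(calib)

x = torch.randn(16, 512)
xq_dyn, dyn_scale = dynamic_quantize(x)   # extra work per batch
xq_sta = quantize_int8(x, static_scale)   # no runtime statistics needed
```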
Frameworks and System Support
The survey details various frameworks and systems that facilitate the deployment of low-bit LLMs. It lists several prominent inference frameworks such as TensorRT-LLM, ONNX-Runtime, Transformers (HuggingFace), DeepSpeed-MII, and others. Each framework's support for different quantization algorithms, bitwidths, target platforms, and model families is explored, underlining the landscape’s diversity and the specific strengths of each system.
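As a usage illustration (API details vary by library version, and the model name is a placeholder), the HuggingFace Transformers stack can load a model with 4-bit weight-only quantization through a bitsandbytes configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any supported causal LM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # weight-only 4-bit storage
    bnb_4bit_quant_type="nf4",              # NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```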
Weight-only Quantization
Weight-only quantization is pivotal for reducing memory traffic across the memory hierarchy, a significant bottleneck in LLM inference. Techniques such as sparsity-aware quantization and lookup tables for efficient dequantization (e.g., FP6-LLM, SpQR) are examined, demonstrating substantial reductions in memory usage and inference time.
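A schematic sketch of the general pattern (our own simplification, not the FP6-LLM or SpQR kernels): 4-bit indices into a small codebook are stored in place of FP16 weights, and the weights are dequantized via table lookup right before the matmul, cutting weight memory traffic roughly fourfold.

```python
import torch

def pack_weight_only(W, codebook):
    # Store only per-row scales and 4-bit codebook indices.
    scales = W.abs().amax(dim=1, keepdim=True)
    idx = ((W / scales).unsqueeze(-1) - codebook).abs().argmin(dim=-1)
    return idx.to(torch.uint8), scales        # uint8 stands in for packed 4-bit

def weight_only_matmul(x, idx, scales, codebook):
    # Dequantize via table lookup, then run the matmul in the activation dtype.
    W_hat = codebook[idx.long()] * scales
    return x @ W_hat.T

codebook = torch.linspace(-1.0, 1.0, 16)      # 16-entry lookup table
W = torch.randn(512, 512)
idx, scales = pack_weight_only(W, codebook)
x = torch.randn(4, 512)
y = weight_only_matmul(x, idx, scales, codebook)
```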
Weight and Activation Quantization
Quantizing both weights and activations introduces additional computational steps, such as runtime quantization of activations and low-bit matrix multiplication. Custom implementations that accelerate these processes are highlighted, showcasing the advancements in specialized kernels and datatype conversions.
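The following W8A8 sketch (an assumed formulation, not a specific kernel from the survey) shows the extra steps involved: activations are quantized per-token at runtime, weights per-channel offline, the matmul accumulates in integer arithmetic, and the result is rescaled back to floating point. A production kernel would map the integer matmul onto hardware INT8 units.

```python
import torch

def quant_per_token(x):
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8), scale

def quant_per_channel(W):
    scale = W.abs().amax(dim=1, keepdim=True) / 127.0
    return torch.clamp(torch.round(W / scale), -128, 127).to(torch.int8), scale

def w8a8_matmul(x, W):
    xq, sx = quant_per_token(x)          # runtime activation quantization
    wq, sw = quant_per_channel(W)        # offline weight quantization
    acc = xq.to(torch.int32) @ wq.to(torch.int32).T   # integer accumulation
    return acc.to(torch.float32) * sx * sw.T          # rescale to floating point

x, W = torch.randn(4, 512), torch.randn(256, 512)
y = w8a8_matmul(x, W)                    # shape [4, 256]
```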
KV Cache Quantization
The survey further explores techniques for compressing the KV cache, a critical component because of its substantial memory consumption during long-sequence processing. Methods such as 2-bit and 4-bit KV-cache quantization and outlier mitigation are discussed, reflecting ongoing efforts to reduce memory usage without compromising performance.
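An illustrative sketch under our own assumptions (asymmetric 4-bit, per-group along the head dimension; actual methods differ in layout and outlier handling): cached keys and values are quantized as tokens are appended and dequantized just before attention.

```python
import torch

GROUP = 32  # elements per quantization group

def quantize_kv_4bit(kv):                        # kv: [tokens, head_dim]
    t, d = kv.shape
    g = kv.reshape(t, d // GROUP, GROUP)
    mn = g.amin(dim=-1, keepdim=True)
    mx = g.amax(dim=-1, keepdim=True)
    scale = (mx - mn).clamp_min(1e-8) / 15.0     # 4-bit asymmetric: 16 levels
    q = torch.round((g - mn) / scale).to(torch.uint8)
    return q, scale, mn

def dequantize_kv_4bit(q, scale, mn, shape):
    return (q.float() * scale + mn).reshape(shape)

k = torch.randn(128, 128)                        # 128 cached tokens, head_dim 128
q, s, mn = quantize_kv_4bit(k)
k_hat = dequantize_kv_4bit(q, s, mn, k.shape)
print("max abs error:", (k - k_hat).abs().max().item())
```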
Quantization Strategies for Efficient LLM Training
The survey categorizes quantization strategies for training into quantization-aware training (QAT) and parameter-efficient fine-tuning (PEFT). QAT integrates quantization into the training phase, optimizing the model to perform well under low-bit constraints; EfficientQAT and BitDistiller are examples that substantially reduce training cost and memory usage. PEFT methods such as QLoRA and OmniQuant aim to fine-tune LLMs effectively within resource constraints, often yielding deployable quantized models with minimal performance loss.
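A minimal sketch of the core QAT mechanism (our own illustration, not a specific method from the survey): weights are fake-quantized in the forward pass, while the straight-through estimator lets gradients flow to the full-precision latent weights during backpropagation.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    def __init__(self, in_f, out_f, bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        self.qmax = 2 ** (bits - 1) - 1

    def fake_quant(self, w):
        scale = w.abs().amax(dim=1, keepdim=True) / self.qmax
        wq = torch.clamp(torch.round(w / scale), -self.qmax - 1, self.qmax) * scale
        # Straight-through estimator: forward uses wq, backward sees identity.
        return w + (wq - w).detach()

    def forward(self, x):
        return x @ self.fake_quant(self.weight).T

layer = FakeQuantLinear(512, 256)
x = torch.randn(8, 512)
loss = layer(x).pow(2).mean()
loss.backward()                      # gradients reach the latent FP weights
print(layer.weight.grad.shape)       # torch.Size([256, 512])
```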
Quantization Algorithms for Efficient LLM Inference
Post-training quantization (PTQ), a practical approach that requires no retraining of pre-trained models, is extensively detailed. Key algorithms employing equivalent transformations (shifting, scaling, and rotation) and compensation strategies are examined. Techniques such as GPTQ, which compensates for quantization error using second-order information, demonstrate superior performance in low-bit settings. Mixed-precision methods and advanced quantization forms such as vector quantization are also detailed, revealing the nuanced balance between precision and performance.
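Below is a hedged sketch of the scaling-based equivalent transformation (in the spirit of SmoothQuant, with details simplified and the alpha heuristic assumed): per-input-channel scales move activation outliers into the weights, leaving the product X @ W.T mathematically unchanged while making both operands easier to quantize.

```python
import torch

def smooth_scales(act_max, w_max, alpha=0.5):
    # Balance activation and weight ranges per input channel.
    return (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp_min(1e-5)

X = torch.randn(64, 512)
X[:, 7] *= 30.0                        # injected activation outlier channel
W = torch.randn(256, 512)

s = smooth_scales(X.abs().amax(dim=0), W.abs().amax(dim=0))
X_s, W_s = X / s, W * s                # equivalent transformation

# The transformed pair produces the same output up to floating-point rounding.
print((X @ W.T - X_s @ W_s.T).abs().max().item())
```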
Future Trends and Directions
Future advancements in quantization of LLMs are anticipated in several areas:
- Quantization Techniques: Research is needed to uncover the origins of outliers within LLMs to push the boundaries of minimal bit representation further. Unified strategies for mixed-bit quantization and compression of KV caches are also crucial.
- Model Architecture: Innovations will likely focus on quantizing multi-modal models and new model structures such as Mixture of Experts (MoE). Exploring the relationship between model size and quantization can lead to further optimizations.
- Hardware Design: Co-design of hardware and quantization algorithms is expected to yield systems supporting extremely low-bit quantization and training with lower precision, enhancing both efficiency and performance.
Conclusion
This survey articulates the multifaceted challenges and innovative solutions in low-bit quantization of LLMs. By synthesizing fundamental concepts, system support, and algorithmic strategies, it provides a valuable resource for advancing the development and application of quantization techniques in LLMs.
With the continuous evolution of LLMs and the growing demand for efficient deployment, the integration of cutting-edge quantization methods will play a critical role in enabling widespread access to advanced NLP capabilities.