
A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms (2409.16694v2)

Published 25 Sep 2024 in cs.AI, cs.CL, and cs.LG

Abstract: LLMs have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant challenges for their practical deployment. Low-bit quantization has emerged as a critical approach to mitigate these challenges by reducing the bit-width of model parameters, activations, and gradients, thus decreasing memory usage and computational demands. This paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs, covering the fundamental principles, system implementations, and algorithmic strategies. An overview of basic concepts and new data formats specific to low-bit LLMs is first introduced, followed by a review of frameworks and systems that facilitate low-bit LLMs across various hardware platforms. Then, we categorize and analyze techniques and toolkits for efficient low-bit training and inference of LLMs. Finally, we conclude with a discussion of future trends and potential advancements of low-bit LLMs. Our systematic overview from basic, system, and algorithm perspectives can offer valuable insights and guidelines for future works to enhance the efficiency and applicability of LLMs through low-bit quantization.

A Survey of Low-bit LLMs: Basics, Systems, and Algorithms

Introduction

LLMs have substantially improved natural language processing, attaining high performance across a wide range of tasks. Yet, their significant memory and computational requirements pose serious challenges for deployment, particularly in resource-constrained environments. This paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs to mitigate these challenges, systematically exploring the fundamental principles, system implementations, and algorithmic strategies that underpin low-bit quantization.

Basics of Low-bit LLMs

Reducing the bit-width of model parameters and activations is central to improving the efficiency of LLMs. The survey provides an overview of low-bit number formats, quantization granularity, and dynamic versus static quantization.

Low-bit Number Formats

The paper outlines various low-bit formats and their efficacy in representing data within LLMs. The widely adopted formats include INT4, INT8, FP8, FP16, and BF16, each providing different trade-offs between range and precision. Custom formats like NormalFloat (NF), Adaptive Biased Float (Abfloat), and Student Float (SF) offer tailored solutions for specific quantization challenges in LLMs.
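
To make the range-precision trade-off concrete, the following NumPy sketch (an illustrative example, not code from the survey) applies uniform symmetric quantization at 8 and 4 bits and compares reconstruction error; the function names are hypothetical.

```python
import numpy as np

def symmetric_quantize(x: np.ndarray, n_bits: int):
    """Uniform symmetric quantization of a tensor to n_bits integers."""
    qmax = 2 ** (n_bits - 1) - 1              # 127 for INT8, 7 for INT4
    scale = np.abs(x).max() / qmax            # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(4096).astype(np.float32)
for bits in (8, 4):
    q, s = symmetric_quantize(x, bits)
    print(f"INT{bits}: mean abs error = {np.abs(x - dequantize(q, s)).mean():.4f}")
```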

Quantization Granularity

Quantization granularity determines the level of detail captured during the quantization process. Categories include tensor-wise, token-wise, channel-wise, group-wise, and element-wise quantization. Each method provides a different balance between storage efficiency and model accuracy, with finer granularities retaining more information but incurring higher computational costs.
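
The difference between these granularities comes down to how many quantization scales are stored and over which slice of the tensor each is computed. The NumPy sketch below (illustrative only; the group size of 16 is an arbitrary choice) contrasts tensor-wise, channel-wise, and group-wise scale computation for a toy weight matrix.

```python
import numpy as np

W = np.random.randn(8, 64).astype(np.float32)   # toy weight matrix [out_channels, in_channels]
qmax = 7                                        # symmetric INT4 range

# Tensor-wise: a single scale shared by the entire matrix.
scale_tensor = np.abs(W).max() / qmax                                   # shape ()

# Channel-wise: one scale per output channel (row).
scale_channel = np.abs(W).max(axis=1, keepdims=True) / qmax             # shape (8, 1)

# Group-wise: one scale per group of 16 consecutive input elements.
group_size = 16
scale_group = np.abs(W.reshape(8, -1, group_size)).max(axis=2) / qmax   # shape (8, 4)

print(np.shape(scale_tensor), scale_channel.shape, scale_group.shape)
```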

Dynamic and Static Quantization

Dynamic quantization, which calculates quantization parameters in real-time, and static quantization, which uses pre-calibrated parameters, are critically analyzed. These strategies differ in computational overhead and inference speed, with each method offering specific advantages depending on the scenario.
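
The sketch below (illustrative NumPy, with hypothetical function names) contrasts the two: static quantization fixes its scale from an offline calibration set, while dynamic quantization recomputes the scale from each incoming batch at runtime.

```python
import numpy as np

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Static: the scale is pre-computed once on calibration data and reused at inference.
calibration_batches = [np.random.randn(32, 512) for _ in range(16)]
static_scale = max(np.abs(b).max() for b in calibration_batches) / 127

# Dynamic: the scale is recomputed from each batch, trading extra runtime
# work for ranges that fit the current activations more tightly.
def quantize_dynamic(x):
    scale = np.abs(x).max() / 127
    return quantize_int8(x, scale), scale

x = np.random.randn(32, 512)
q_static = quantize_int8(x, static_scale)
q_dynamic, dynamic_scale = quantize_dynamic(x)
```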

Frameworks and System Support

The survey details various frameworks and systems that facilitate the deployment of low-bit LLMs. It lists several prominent inference frameworks such as TensorRT-LLM, ONNX-Runtime, Transformers (HuggingFace), DeepSpeed-MII, and others. Each framework's support for different quantization algorithms, bitwidths, target platforms, and model families is explored, underlining the landscape’s diversity and the specific strengths of each system.
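
As one illustration of this kind of framework support, the snippet below sketches how a 4-bit model might be loaded through the Transformers/bitsandbytes integration. It assumes both libraries are installed, the model identifier is a placeholder, and option names may differ across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative configuration: quantize weights to 4-bit NormalFloat on load,
# while keeping computation in FP16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "org/model-name"   # placeholder model identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config)
```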

Weight-only Quantization

Weight-only quantization is pivotal for reducing data transmission costs within hierarchical cache structures, a significant bottleneck in LLM inference. Techniques such as sparsity-aware quantization and lookup tables for efficient dequantization (e.g., FP6-LLM, SpQR) are examined, demonstrating substantial reductions in memory usage and inference time.
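
A minimal NumPy sketch of group-wise, weight-only INT4 quantization (illustrative only, not the kernel design of FP6-LLM or SpQR): weights are stored in low-bit form with per-group scales and dequantized on the fly at matmul time, so memory traffic is dominated by the compact weights.

```python
import numpy as np

GROUP = 32   # number of weight elements sharing one scale

def quantize_weights_int4(W):
    """Group-wise symmetric INT4 quantization along the input dimension."""
    out_dim, in_dim = W.shape
    Wg = W.reshape(out_dim, in_dim // GROUP, GROUP)
    scales = np.abs(Wg).max(axis=2, keepdims=True) / 7              # qmax = 7
    Wq = np.clip(np.round(Wg / scales), -8, 7).astype(np.int8)      # packed to 4 bits in practice
    return Wq, scales

def matmul_weight_only(x, Wq, scales):
    """Dequantize weights just before the matmul; they stay low-bit in memory."""
    W_deq = (Wq.astype(np.float32) * scales).reshape(Wq.shape[0], -1)
    return x @ W_deq.T

W = np.random.randn(128, 256).astype(np.float32)
x = np.random.randn(4, 256).astype(np.float32)
Wq, scales = quantize_weights_int4(W)
y = matmul_weight_only(x, Wq, scales)   # approximates x @ W.T
```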

Weight and Activation Quantization

Quantizing both weights and activations introduces additional computational steps, such as runtime quantization of activations and low-bit matrix multiplication. Custom implementations that accelerate these processes are highlighted, showcasing the advancements in specialized kernels and datatype conversions.
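
The sketch below (illustrative NumPy, not a real kernel) shows the extra steps this entails: activations are quantized per token at runtime, the matmul is carried out in integer arithmetic with int32 accumulation, and the result is rescaled back to floating point.

```python
import numpy as np

def quantize_per_token_int8(x):
    """Dynamic per-token (per-row) symmetric INT8 quantization of activations."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8), scale

def quantize_per_channel_int8(W):
    """Per-output-channel symmetric INT8 quantization of weights."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127
    return np.clip(np.round(W / scale), -128, 127).astype(np.int8), scale

def int8_matmul(xq, x_scale, Wq, w_scale):
    acc = xq.astype(np.int32) @ Wq.astype(np.int32).T    # integer matmul, int32 accumulation
    return acc.astype(np.float32) * x_scale * w_scale.T  # rescale back to float

x = np.random.randn(4, 256).astype(np.float32)
W = np.random.randn(128, 256).astype(np.float32)
xq, xs = quantize_per_token_int8(x)
Wq, ws = quantize_per_channel_int8(W)
y = int8_matmul(xq, xs, Wq, ws)   # approximates x @ W.T
```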

KV Cache Quantization

The survey further explores techniques for compressing the KV cache, a critical component due to its significant memory consumption during sequence processing. Methods such as 2-bit and 4-bit KV-cache quantization and outlier mitigation are discussed, reflecting ongoing efforts to reduce memory usage without compromising performance.
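
A minimal sketch of per-token KV-cache quantization (illustrative NumPy; real systems also handle outliers and bit packing) is shown below: each cached token's key or value vector is stored as 4-bit integers plus a single scale.

```python
import numpy as np

QMAX = 7   # symmetric INT4 range

def quantize_kv(t):
    """Per-token symmetric INT4 quantization of a cached K or V tensor
    shaped [seq_len, num_heads, head_dim]."""
    flat = t.reshape(t.shape[0], -1)
    scale = np.abs(flat).max(axis=1, keepdims=True) / QMAX
    q = np.clip(np.round(flat / scale), -8, QMAX).astype(np.int8)
    return q.reshape(t.shape), scale

def dequantize_kv(q, scale):
    return (q.reshape(q.shape[0], -1).astype(np.float32) * scale).reshape(q.shape)

k = np.random.randn(1024, 32, 128).astype(np.float32)   # cached keys for 1024 tokens
kq, k_scale = quantize_kv(k)
print("mean abs error:", np.abs(k - dequantize_kv(kq, k_scale)).mean())
```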

Quantization Strategies for Efficient LLM Training

The survey categorizes quantization strategies for training into quantization-aware training (QAT) and parameter-efficient fine-tuning (PEFT). QAT integrates quantization during the training phase, optimizing the model to perform well under low-bit constraints; EfficientQAT and BitDistiller are examples that substantially reduce training cost and memory usage. PEFT methods such as QLoRA and OmniQuant aim to fine-tune LLMs effectively within resource constraints, often yielding deployable quantized models with minimal performance loss.
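
The mechanism underlying QAT is "fake" quantization with a straight-through estimator (STE), so gradients can flow through the non-differentiable rounding step. The PyTorch sketch below is a generic illustration of this idea, not the specific recipe of EfficientQAT, BitDistiller, or QLoRA.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize weights in the forward pass; pass gradients
    straight through the rounding in the backward pass."""

    @staticmethod
    def forward(ctx, w, n_bits):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None      # identity gradient w.r.t. w, none for n_bits

class QATLinear(torch.nn.Linear):
    def forward(self, x):
        w_q = FakeQuantSTE.apply(self.weight, 4)   # quantize weights on the fly
        return torch.nn.functional.linear(x, w_q, self.bias)

layer = QATLinear(256, 128)
loss = layer(torch.randn(4, 256)).pow(2).mean()
loss.backward()   # gradients reach the full-precision master weights
```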

Quantization Algorithms for Efficient LLM Inference

Post-Training Quantization (PTQ) techniques, a practical approach for pre-trained models, are extensively detailed. Key algorithms employing equivalent transformations (shifting, scaling, and rotation) and compensation strategies are examined. Techniques such as GPTQ, which compensates quantization errors using second-order information, demonstrate superior performance in low-bit settings. Mixed-precision methods and advanced quantization forms such as vector quantization are also detailed, revealing the nuanced balance between precision and performance.
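
To illustrate the scaling form of equivalent transformation, the NumPy sketch below migrates activation outliers into the weights with a per-channel scale s, so that (X diag(s)^-1)(diag(s) W) equals XW in full precision but is friendlier to quantize. This is a generic SmoothQuant-style example under assumed shapes, not the survey's or GPTQ's exact algorithm.

```python
import numpy as np

def smooth_scale(X, W, alpha=0.5):
    """Per-input-channel scale balancing activation and weight magnitudes."""
    act_max = np.abs(X).max(axis=0)          # per-input-channel activation range
    w_max = np.abs(W).max(axis=1)            # per-input-channel weight range (W is [in, out])
    return (act_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)

X = np.random.randn(16, 256).astype(np.float32)
X[:, :4] *= 50                               # inject a few outlier channels
W = np.random.randn(256, 128).astype(np.float32)

s = smooth_scale(X, W)
X_smooth, W_smooth = X / s, W * s[:, None]   # X diag(s)^-1 and diag(s) W

# The transformation is mathematically equivalent before quantization,
# but X_smooth has a much flatter per-channel range than X.
assert np.allclose(X @ W, X_smooth @ W_smooth, rtol=1e-3, atol=1e-3)
```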

Future Trends and Directions

Future advancements in quantization of LLMs are anticipated in several areas:

  • Quantization Techniques: Research is needed to uncover the origins of outliers within LLMs to push the boundaries of minimal bit representation further. Unified strategies for mixed-bit quantization and compression of KV caches are also crucial.
  • Model Architecture: Innovations will likely focus on quantizing multi-modal models and new model structures such as Mixture of Experts (MoE). Exploring the relationship between model size and quantization can lead to further optimizations.
  • Hardware Design: Co-design of hardware and quantization algorithms is expected to yield systems supporting extremely low-bit quantization and training with lower precision, enhancing both efficiency and performance.

Conclusion

This survey articulates the multifaceted challenges and innovative solutions in low-bit quantization of LLMs. By synthesizing fundamental concepts, system support, and algorithmic strategies, it provides a valuable resource for advancing the development and application of quantization techniques in LLMs.

With the continuous evolution of LLMs and the growing demand for efficient deployment, the integration of cutting-edge quantization methods will play a critical role in enabling widespread access to advanced NLP capabilities.

Authors (10)
  1. Ruihao Gong
  2. Yifu Ding
  3. Zining Wang
  4. Chengtao Lv
  5. Xingyu Zheng
  6. Jinyang Du
  7. Haotong Qin
  8. Jinyang Guo
  9. Michele Magno
  10. Xianglong Liu