Evaluating Quantized Large Language Models (2402.18158v2)

Published 28 Feb 2024 in cs.CL and cs.AI

Abstract: Post-training quantization (PTQ) has emerged as a promising technique to reduce the cost of LLMs. Specifically, PTQ can effectively mitigate memory consumption and reduce computational overhead in LLMs. To meet the requirements of both high efficiency and performance across diverse scenarios, a comprehensive evaluation of quantized LLMs is essential to guide the selection of quantization methods. This paper presents a thorough evaluation of these factors by evaluating the effect of PTQ on Weight, Activation, and KV Cache on 11 model families, including OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba, with parameters ranging from 125M to 180B. The evaluation encompasses five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. Moreover, we also evaluate the state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on the extensive experiments, we systematically summarize the effect of quantization, provide recommendations to apply quantization techniques, and point out future directions. The code can be found in https://github.com/thu-nics/qLLM-eval.

An Expert Evaluation of Quantized LLMs

The rapid growth in the size and capability of LLMs has yielded significant advances in NLP. However, deploying these models is computationally expensive, primarily because of their sheer size and resource demands. Post-training quantization (PTQ) has emerged as a promising approach to alleviating these burdens by reducing a model's memory footprint and computational cost. This paper meticulously evaluates PTQ's efficacy across model families and task types, offering valuable insights into the practical applicability of quantization techniques.
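
To ground the memory argument, the short sketch below (illustrative arithmetic, not figures from the paper) estimates how much storage model weights alone require at different bit-widths; activations, the KV cache, and quantization metadata such as scales are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for model weights alone, ignoring activations,
    KV cache, and quantization metadata such as scales."""
    return num_params * bits_per_weight / 8 / 1e9

# Illustrative model sizes spanning the paper's 125M-180B range.
for params in (125e6, 7e9, 70e9, 180e9):
    row = ", ".join(f"W{b}: {weight_memory_gb(params, b):8.2f} GB"
                    for b in (16, 8, 4, 3))
    print(f"{params / 1e9:7.3f}B params -> {row}")
```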

Comprehensive Scope of Evaluation

The authors explore the effects of PTQ on 11 model families (OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba), with parameter counts spanning 125M to 180B. These models undergo evaluation across five distinct task categories: basic NLP tasks, emergent abilities, trustworthiness, dialogue, and long-context challenges. This breadth enables a thorough understanding of how different quantization strategies influence model performance across a spectrum of real-world applications.

Insights into Model and Task Sensitivity

The paper reveals intricate details about the sensitivity of LLMs to quantization across various tensor types—Weights, Activations, and KV Cache. An intriguing observation is that larger models tend to tolerate Weight-only and KV Cache quantization better than smaller ones, an insight that can guide model deployment strategies in resource-constrained environments. In contrast, Activation quantization appears less forgiving in larger models, suggesting a need for differentiated approaches based on model size and target applications.
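
For readers unfamiliar with what quantizing these tensors involves, the following is a minimal sketch of round-to-nearest symmetric quantization with a single per-tensor scale; it is a generic PTQ baseline for illustration, not the specific quantization kernels or granularities used in the paper's experiments.

```python
import torch

def fake_quantize_symmetric(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Quantize to n_bits with one symmetric scale for the whole tensor,
    then dequantize so the introduced error can be inspected directly."""
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 7 for signed 4-bit
    scale = x.abs().max() / qmax                  # per-tensor scale
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_int * scale

weight = torch.randn(4096, 4096) * 0.02           # stand-in weight matrix
for bits in (8, 4, 3):
    err = (fake_quantize_symmetric(weight, bits) - weight).abs().mean()
    print(f"W{bits}: mean absolute quantization error {err:.5f}")
```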

Additionally, the evaluation highlights disparate impacts of quantization on task performance. While Weight-only and KV Cache quantization generally maintain performance across tasks, Activation quantization tends to degrade capabilities, especially in tasks involving emergent abilities and complex reasoning. This insight could be pivotal in optimizing models for specific applications that prioritize different aspects of performance, such as dialogue coherence or ethical reasoning.

Practical Guidelines and Algorithmic Advancements

Drawing upon the extensive experimental data, the paper offers actionable recommendations for applying quantization techniques to LLMs. For instance, it suggests that quantizing to W4, W4A8, and KV4 can broadly preserve performance in most tasks, providing a baseline for efficient deployment without significant accuracy loss. For memory-constrained scenarios, deploying a larger model with more aggressive low-bit weight quantization (e.g., W3) could be advantageous, underscoring the importance of task and context specificity in deployment decisions.
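
The KV4 recommendation is easiest to appreciate in long-context settings, where the KV cache rather than the weights dominates memory. The sketch below uses an illustrative LLaMA2-7B-like shape (32 layers, 32 KV heads, head dimension 128) and a hypothetical 32k-token context; the numbers are back-of-the-envelope estimates, not measurements from the paper.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bits: int, batch: int = 1) -> float:
    """Approximate KV cache size: keys and values for every layer and token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elems * bits / 8 / 1e9

# Illustrative LLaMA2-7B-like shape at a hypothetical 32k-token context.
for bits in (16, 8, 4):
    gb = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128,
                     seq_len=32_768, bits=bits)
    print(f"KV{bits:<2} cache at 32k tokens: {gb:5.1f} GB")
```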

State-of-the-art quantization methods such as AWQ (weight-only) and SmoothQuant (weight-activation) are rigorously evaluated, revealing their potential to partially mitigate performance loss at moderately low bit-widths (such as W3) while also highlighting their limitations under extreme low-bit quantization. These findings point to future directions for improving quantization algorithms toward near-lossless performance, expanding the applicability of PTQ to more demanding use cases.
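
As background on what these methods do, the sketch below illustrates the per-channel scale-migration idea behind SmoothQuant, which flattens activation outliers by shifting them into the weights before quantization. It follows the published formulation in spirit only and is not the implementation evaluated in the paper; AWQ applies an analogous activation-aware scaling on the weight side.

```python
import torch

def smoothing_scales(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """Per-input-channel scales in the spirit of SmoothQuant:
    s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha)."""
    act_max = x.abs().amax(dim=0)      # per channel, from calibration activations
    wgt_max = w.abs().amax(dim=1)      # per input channel of the weight
    return (act_max ** alpha) / (wgt_max ** (1 - alpha))

# Calibration activations (tokens x channels) with a few outlier channels,
# and a linear weight laid out as (in_features x out_features).
x = torch.randn(512, 4096) * (1 + 10 * (torch.rand(4096) > 0.99))
w = torch.randn(4096, 11008) * 0.02

s = smoothing_scales(x, w)
x_smooth = x / s                       # activations become easier to quantize
w_smooth = w * s.unsqueeze(1)          # weights absorb the migrated scales

# The transformation is mathematically exact before quantization:
print("max |XW - X'W'| =", (x @ w - x_smooth @ w_smooth).abs().max().item())
```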

Theoretical and Future Implications

The implications of this research extend beyond immediate practical applications. The paper prompts a re-evaluation of assumptions regarding LLM deployment and advocates for a nuanced understanding of how quantization interacts with model architecture and task demands. The results open avenues for further exploration into adaptive quantization strategies that align quantization granularity and method with specific task requirements or model characteristics.

Looking ahead, the paper's insights could catalyze the development of hybrid approaches that incorporate both quantization and other model compression techniques to achieve optimal performance-resource trade-offs. The complexity of emergent abilities and instruction-following tasks underlines the necessity for continuous innovation in designing models that balance efficiency with the sophistication required by advanced NLP tasks.

In conclusion, this detailed evaluation of quantized LLMs sheds light on the nuanced interplay between model performance, computational efficiency, and task specificity. It equips researchers and practitioners with a robust framework for leveraging PTQ to enhance the accessibility and usability of LLMs, paving the way for broader deployment across diverse, resource-constrained settings.

Authors (9)
  1. Shiyao Li
  2. Xuefei Ning
  3. Luning Wang
  4. Tengxuan Liu
  5. Xiangsheng Shi
  6. Shengen Yan
  7. Guohao Dai
  8. Huazhong Yang
  9. Yu Wang
Citations (28)