A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs: An Experimental Analysis up to 405B
The manuscript "A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs: An Experimental Analysis up to 405B," authored by Jemin Lee et al., provides a meticulous analysis of how instruction-tuned LLMs perform when subjected to various quantization methods. The paper examines quantization techniques such as GPTQ, AWQ, SmoothQuant, and FP8, applied to models ranging from 7B to an unprecedented 405B parameters and evaluated across 13 datasets spanning six task categories: commonsense QA, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue.
Key Insights
The evaluation presented in this paper offers several significant findings:
- Quantized Models vs. Smaller Models: Quantizing a larger LLM down to roughly the memory footprint of a smaller FP16 LLM generally yields better performance on most benchmarks. For instance, a 4-bit quantized Llama-2-13B outperforms the original FP16 Llama-2-7B on most tasks while occupying a comparable or smaller amount of memory (see the sketch after this list). However, the paper notes exceptions in hallucination detection and instruction following, where smaller FP16 models perform better.
- Divergent Performance Across Quantization Methods: The efficacy of quantization methods varies with model size and bit-width. Weight-only quantization methods tend to yield superior results compared to methods that quantize both weights and activations, particularly in larger models such as the 405B-parameter Llama. Within the weight-only family, AWQ displays less performance degradation than GPTQ.
- Task Difficulty and Accuracy Degradation: The paper finds no significant correlation between task difficulty and the extent of accuracy degradation caused by quantization; harder tasks did not suffer larger drops than simpler ones.
- MT-Bench Evaluation: The MT-Bench evaluation method shows limited discriminatory power among recent high-performing LLMs. While 7B models see a performance uptick after quantization, larger models (such as 13B and 70B) exhibit reduced scores.
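The footprint comparison behind the first insight can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: it counts weight storage alone and ignores the KV cache, activations, and quantization metadata such as scales and zero-points, none of which are specified in the review above.

```python
def approx_weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Rough weight-only memory estimate: parameters * bits / 8, in gigabytes."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Illustrative comparison: FP16 Llama-2-7B vs. 4-bit Llama-2-13B
print(f"Llama-2-7B  @ FP16 : ~{approx_weight_memory_gb(7, 16):.1f} GB")   # ~14.0 GB
print(f"Llama-2-13B @ 4-bit: ~{approx_weight_memory_gb(13, 4):.1f} GB")   # ~6.5 GB
```

Even allowing a few percent of overhead for per-group scales, the 4-bit 13B model fits in roughly half the memory of the FP16 7B model, which is why the "bigger-but-quantized" configuration is a practical option at equal or lower cost.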
Detailed Experimental Setup
The paper outlines a robust evaluation pipeline spanning multiple LLM families, including Vicuna, Gemma, and Llama. These models, ranging from 2B to 405B parameters, were subjected to four distinct quantization methods and evaluated on a multi-node GPU cluster using a carefully designed pipeline built on libraries such as vLLM and Hugging Face Accelerate. A summary table in the paper details the 13 datasets, which cover evaluation abilities ranging from knowledge and language understanding to mathematical reasoning and dialogue quality.
Key evaluation tools include lm-eval for standardized benchmarking and MT-Bench for multi-turn conversation evaluations. The results from these experiments provide a wealth of data under consistent conditions, ensuring rigour in comparisons.
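To make the serving side of this pipeline concrete, the sketch below shows how a pre-quantized checkpoint might be loaded and queried with vLLM. It is a minimal illustration rather than the authors' actual configuration; the checkpoint name, sampling settings, and prompt are assumptions introduced here for demonstration.

```python
# Minimal sketch: serving an AWQ-quantized model with vLLM (illustrative only).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # hypothetical pre-quantized checkpoint
    quantization="awq",                     # tell vLLM the weights are AWQ-quantized
)
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["Explain weight-only quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

In the paper's setting, this kind of inference backend can then be driven by lm-eval for the standardized benchmarks and by MT-Bench's judge-based protocol for the multi-turn dialogue evaluation.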
Practical and Theoretical Implications
Practical Implications
From a practical perspective, this paper furnishes critical insights into deploying LLMs in resource-constrained environments. Given the substantial memory and computational demands of LLMs with billions of parameters, effective quantization methodologies such as those discussed (e.g., AWQ and FP8) can be pivotal in enabling high-performing, scalable AI applications. These findings are particularly salient for real-world applications where computational efficiency is paramount, such as edge devices and large-scale AI deployments.
Theoretical Implications
Theoretically, the paper advances understanding of the trade-offs involved in LLM quantization. It underscores the relative robustness of weight-only quantization and the challenges inherent in effectively handling high activation ranges in models as large as Llama-3.1-405B. The granular exploration of performance across varying bit-widths and model sizes contributes valuable empirical data to the discourse on neural network quantization, informing future research on optimizing quantization techniques for increasingly large and sophisticated models.
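To ground the point about activation ranges, the snippet below sketches the core idea behind SmoothQuant as described in its original paper: per-channel scales migrate quantization difficulty from the activations, which contain outlier channels, onto the weights, while leaving the layer's output mathematically unchanged. The shapes, the alpha value, and the synthetic outlier channel are assumptions chosen here for illustration.

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5, eps: float = 1e-5):
    """SmoothQuant-style smoothing for a linear layer Y = X @ W.

    X: activations, shape [tokens, in_features]
    W: weights,     shape [in_features, out_features]
    """
    act_max = np.abs(X).max(axis=0)   # per-channel activation range
    w_max = np.abs(W).max(axis=1)     # per-channel weight range
    # Smoothing factor: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    s = np.clip(act_max, eps, None) ** alpha / np.clip(w_max, eps, None) ** (1 - alpha)
    X_smooth = X / s                  # activation outliers are tamed ...
    W_smooth = W * s[:, None]         # ... and the weights absorb the scale
    return X_smooth, W_smooth

# The transformation is exact: (X / s) @ (diag(s) @ W) == X @ W
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
X[:, 3] *= 50.0                       # simulate an outlier activation channel
W = rng.normal(scale=0.02, size=(16, 32))
Xs, Ws = smooth(X, W)
assert np.allclose(Xs @ Ws, X @ W)
```

The paper's observation is that even with such smoothing, the extreme activation ranges of a model as large as Llama-3.1-405B remain difficult to handle, which is where the reported accuracy drops for weight-and-activation quantization originate.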
Future Directions
The authors hint at several avenues for future research. Refinements to quantization algorithms that better handle the high activation ranges of very large LLMs could mitigate the notable accuracy drops observed with methods like SmoothQuant. Comprehensive frameworks that dynamically apply the most effective quantization strategy based on task and model characteristics are another promising area of exploration.
Lastly, further investigations into more fine-grained evaluation metrics beyond accuracy—such as robustness, fairness, and long-term stability of quantized LLMs—would be critical. Integrating these aspects could lead to the development of more holistic evaluation protocols that better capture the multifaceted demands placed on LLMs in dynamic, real-world scenarios.
Conclusion
This paper's extensive evaluation of quantized instruction-tuned LLMs illuminates crucial dynamics in the performance landscape of modern AI models. By rigorously probing various quantization methods across diverse tasks and model scales, the authors provide a foundational reference that will inform both the deployment and further enhancement of quantization techniques in large-scale AI. The results are poised to impact not just the immediate efficacy of deploying LLMs in computationally constrained contexts, but also the theoretical frameworks that guide the evolution of these quantization methods.