Insights on Quantized Reasoning Models: Evaluating Performance and Challenges
The paper "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models" offers a thorough analysis of how quantization techniques apply to reasoning language models (RLMs), investigating how these compression methods affect model performance with a particular focus on reasoning capabilities.
Key Methodological Contributions
The authors undertake a comprehensive empirical study of various state-of-the-art quantization algorithms applied to different pre-trained and distilled reasoning models across numerous benchmarks. The focus lies on weight-only, weight-activation, and KV cache quantization, with bit-widths ranging from 3 to 16 bits and model sizes from 1.5B to 70B parameters. The models are rigorously evaluated on a suite of reasoning tasks ranging from mathematical problems to programming challenges.
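To make these three quantization granularities concrete, the sketch below shows simulated ("fake") round-to-nearest quantization in PyTorch, assuming symmetric per-channel scales for weights and per-token scales for activations. It is an illustrative baseline rather than the paper's pipeline; the W{b}A{b} notation used throughout (e.g. W4A16, W8A8) simply names the weight and activation bit-widths.

```python
import torch

def fake_quant_symmetric(x: torch.Tensor, bits: int, dim: int) -> torch.Tensor:
    """Round-to-nearest symmetric fake quantization, one scale per slice along `dim`."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit, 127 for 8-bit
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                                  # dequantize so the rest of the model stays in BF16/FP16

# Weight-only quantization (a W4A16-style setting): one scale per output channel.
W = torch.randn(4096, 4096)                           # [out_features, in_features]
W_4bit = fake_quant_symmetric(W, bits=4, dim=1)

# Weight-activation quantization (a W8A8-style setting): also quantize activations per token.
X = torch.randn(16, 4096)                             # a batch of 16 token activations
Y = fake_quant_symmetric(X, bits=8, dim=1) @ fake_quant_symmetric(W, bits=8, dim=1).T
```

Methods such as AWQ, FlatQuant, and QuaRot refine this round-to-nearest baseline (e.g. via channel scaling or rotations) rather than replace it, which makes it a useful mental model for the bit-width comparisons that follow.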
Principal Findings
- Lossless Quantization: The study identifies W8A8 as a configuration that is effectively lossless across all tasks, retaining performance comparable to the original BF16 models. W4A16 is likewise found to be near-lossless for specific models and tasks.
- Task Difficulty and Model Size: The impact of quantization is more pronounced on difficult tasks such as AIME-120 than on simpler benchmarks such as GSM8K. Smaller models, such as those in the 1.5B-parameter range, are more susceptible to degradation under aggressive quantization.
- Quantization Algorithms: Among the algorithms assessed, AWQ and FlatQuant emerge as the preferred choices for weight-only and weight-activation quantization, respectively. For KV cache quantization, QuaRot shows favorable results except in the presence of extreme activation outliers, as observed in some Qwen models (a per-token KV cache sketch follows this list).
- Model Origins: Distilled models exhibit greater resilience to quantization than models trained with reinforcement learning, offering insight into how reasoning capabilities are retained after compression.
- Output Length: Surprisingly, quantized models generate outputs of comparable length to their full-precision counterparts, countering concerns that reduced precision might inadvertently inflate the number of reasoning steps.
- Scaling Effects: The paper documents beneficial scaling effects, with larger quantized models achieving superior accuracy-latency trade-offs. It also affirms the value of test-time scaling, albeit with diminishing returns as generated sequences grow longer.
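As a companion to the KV cache point above, here is a minimal sketch of per-token asymmetric KV cache quantization, again using simulated quantization and assumed tensor shapes. It is not QuaRot itself; rotation-based methods such as QuaRot additionally apply Hadamard-style transforms to smooth out the activation outliers mentioned above before quantizing.

```python
import torch

def fake_quant_per_token(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Asymmetric round-to-nearest fake quantization, one scale/zero-point per token."""
    qmax = 2 ** bits - 1                               # 15 for 4-bit
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((x - lo) / scale), 0, qmax)
    return q * scale + lo                              # dequantize for the attention matmuls

# Cached keys/values for one layer: [batch, heads, seq_len, head_dim]
K = torch.randn(1, 32, 1024, 128)
V = torch.randn(1, 32, 1024, 128)

# 4-bit KV cache: every cached token keeps its own scale/zero-point, so cache memory
# shrinks roughly 4x versus 16-bit while queries and attention math stay in full precision.
K_q = fake_quant_per_token(K, bits=4)
V_q = fake_quant_per_token(V, bits=4)

q = torch.randn(1, 32, 1, 128)                         # current decoding step's query
attn = torch.softmax((q @ K_q.transpose(-1, -2)) / 128 ** 0.5, dim=-1)
out = attn @ V_q
```

Per-token scales also illustrate why extreme outliers hurt: a single outlier value inflates that token's scale and erases the resolution left for the remaining entries, which is precisely what rotation-based preprocessing aims to prevent.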
Implications and Forward-Looking Discussion
The findings have practical implications for deploying reasoning models in resource-constrained environments. Near-lossless quantization preserves accuracy while reducing compute and memory costs, which is vital for real-world applications where efficiency is paramount.
Theoretically, the research illuminates the balance between model size, computational precision, and task complexity, underscoring that different models may require customized quantization strategies for optimal performance. The stability of output length under quantization also hints at aspects of the models' internal representations that are not yet fully understood.
For future work, understanding the causal mechanisms behind quantization's effects on reasoning remains a rich, largely unexplored avenue. More advanced quantization techniques that mitigate, or even exploit, these effects could reshape how RLMs are deployed across sectors.
In conclusion, the paper sheds light on important dynamics of quantized reasoning models, providing a strong foundation for future research aimed at improving efficiency without compromising accuracy. Its combined insights on model origins, task difficulty, and compression strategies offer a practical guide for deploying reasoning models in real-world scenarios.