PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs (2410.05265v1)

Published 7 Oct 2024 in cs.LG and cs.CL

Abstract: Quantization is essential for deploying LLMs by enhancing memory efficiency and inference speed. Existing methods for activation quantization mainly address channel-wise outliers, often neglecting token-wise outliers, leading to reliance on costly per-token dynamic quantization. To address this, we introduce PrefixQuant, a novel technique that isolates outlier tokens offline without re-training. Specifically, PrefixQuant identifies high-frequency outlier tokens and prefixes them in the KV cache, preventing the generation of outlier tokens during inference and simplifying quantization. To our knowledge, PrefixQuant is the first to enable efficient per-tensor static quantization to outperform expensive per-token dynamic quantization. For instance, in W4A4KV4 (4-bit weight, 4-bit activation, and 4-bit KV cache) Llama-3-8B, PrefixQuant with per-tensor static quantization achieves a 7.43 WikiText2 perplexity and 71.08% average accuracy on 5 common-sense reasoning tasks, outperforming previous per-token dynamic quantization methods like QuaRot with 0.98 perplexity improvement and +5.98 points accuracy. Additionally, the inference speed of W4A4 quantized models using PrefixQuant is 1.60x to 2.81x faster than FP16 models and exceeds QuaRot models by 1.2x to 1.3x. Our code is available at https://github.com/ChenMnZ/PrefixQuant.

An Overview of PrefixQuant: Static Quantization in LLMs

The paper "PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs" presents a novel approach to quantization in LLMs by proposing a technique known as PrefixQuant. Quantization is a critical process for reducing the memory usage and enhancing the inference speed of LLMs, which are characterized by large parameters and computational demands. Existing quantization techniques often focus on channel-wise outliers, disregarding token-wise outliers, which has resulted in a reliance on costly per-token dynamic quantization methods.

Key Contributions

PrefixQuant introduces a static quantization approach that identifies and isolates outlier tokens offline, eliminating the need for re-training. It prefixes high-frequency outlier tokens into the KV cache so that they are not regenerated during inference, which simplifies quantization and enables per-tensor static quantization, a computationally cheaper scheme than per-token dynamic quantization.
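
A minimal sketch of how such a pipeline could look with a Hugging Face-style causal LM is shown below; the function names, the norm-ratio heuristic for spotting outlier tokens, and the calibration loop are assumptions for illustration, not the authors' exact procedure.

```python
import torch
from collections import Counter

@torch.no_grad()
def find_outlier_tokens(model, tokenizer, calib_texts, top_k=3, ratio=20.0):
    """Offline calibration: count which token ids repeatedly produce extreme hidden-state norms."""
    counter = Counter()
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        hidden = model(ids, output_hidden_states=True).hidden_states[-1][0]  # [seq_len, dim]
        norms = hidden.norm(dim=-1)
        for pos in torch.nonzero(norms > ratio * norms.median()).flatten().tolist():
            counter[ids[0, pos].item()] += 1
    return [tok_id for tok_id, _ in counter.most_common(top_k)]

@torch.no_grad()
def build_prefixed_kv_cache(model, outlier_token_ids):
    """Run the outlier tokens once and keep their key/value entries as a fixed prefix."""
    prefix = torch.tensor([outlier_token_ids])
    out = model(prefix, use_cache=True)
    return out.past_key_values  # reused for every request; later tokens no longer regenerate the outliers
```

At inference time, the precomputed prefix cache would be passed along with each new prompt, so attention still sees the outlier tokens while the remaining activations stay free of extreme values and can be quantized with fixed per-tensor scales.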

Numerical Results

The paper presents strong numerical results demonstrating the efficacy of PrefixQuant. In the W4A4KV4 setting for Llama-3-8B, PrefixQuant achieves a WikiText2 perplexity of 7.43 and an average accuracy of 71.08% across five common-sense reasoning tasks, surpassing previous per-token dynamic quantization methods such as QuaRot by 0.98 points of perplexity and 5.98 points of accuracy. PrefixQuant also delivers substantial inference speedups: W4A4 quantized models run 1.60x to 2.81x faster than FP16 models and 1.2x to 1.3x faster than QuaRot models.

Implications and Future Directions

The introduction of PrefixQuant has significant implications for the deployment of LLMs. By strengthening static quantization, the approach improves inference efficiency and reduces computational overhead, which is particularly beneficial for real-time applications. Its ability to outperform dynamic quantization methods without additional training underscores its potential for broader applicability in LLM compression and optimization research.

Additionally, PrefixQuant improves training stability by minimizing the influence of large outliers on Mean Squared Error (MSE) loss calculations, positioning it as a plug-and-play component that can strengthen existing optimization-based quantization methods.
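
As a rough illustration of that point, the sketch below grid-searches a per-tensor clipping scale by minimizing quantization MSE on calibration activations; once the outlier token is excluded (it lives in the prefixed KV cache instead), the loss is no longer dominated by a few extreme values. The grid search itself is a generic heuristic assumed here, not the paper's optimization procedure.

```python
import torch

def search_static_scale(x: torch.Tensor, n_bits: int = 4, grid: int = 100):
    """Grid-search a per-tensor clipping scale that minimizes quantization MSE."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = x.abs().max()
    best_scale, best_err = max_abs / qmax, float("inf")
    for i in range(1, grid + 1):
        scale = max_abs * i / (grid * qmax)  # candidate clipping range
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        err = (q - x).pow(2).mean().item()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err

# Hypothetical calibration activations where the first token is an outlier.
acts = torch.randn(64, 4096)
acts[0] *= 50.0
_, mse_with_outlier = search_static_scale(acts)         # loss dominated by one extreme token
_, mse_without_outlier = search_static_scale(acts[1:])  # outlier handled by the prefixed KV cache
print(mse_with_outlier, mse_without_outlier)
```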

Future research avenues may explore further refinements to PrefixQuant's token isolation strategies and investigate its integration with other model compression techniques to maximize efficiency across diverse LLM architectures. The technique also opens possibilities for applying static quantization gains in other domains beyond LLMs.

In summary, PrefixQuant addresses a critical bottleneck in the deployment of LLMs by intelligently managing outlier tokens, thus paving the way for more efficient static quantization processes. This contribution represents a meaningful step towards optimizing the performance of LLMs in resource-constrained environments.

Authors (6)
  1. Mengzhao Chen
  2. Yi Liu
  3. Jiahao Wang
  4. Yi Bin
  5. Wenqi Shao
  6. Ping Luo