Outlier Suppression in Low-bit Transformer LLMs
This paper addresses a critical challenge in transformer quantization: suppressing the structured outliers that limit the performance of low-bit transformer LLMs. The growing use of transformer architectures for NLP tasks brings significant computational and memory demands, which motivates efficient deployment strategies such as model quantization. Quantizing transformers at low bit-widths, however, is difficult precisely because of these structured outliers; they can degrade performance so severely that even 8-bit models suffer notable accuracy drops.
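To make the failure mode concrete, the sketch below (an illustration of standard uniform quantization, not code from the paper) quantizes a synthetic activation vector at 6 bits. A single aggressive outlier dictates the quantization scale and wipes out the resolution available to the ordinary values; clipping it first restores that resolution.

```python
# Illustrative only: symmetric per-tensor uniform quantization of a synthetic
# activation vector, showing how one outlier inflates the step size for everyone.
import torch

def quantize_dequantize(x: torch.Tensor, n_bits: int = 6) -> torch.Tensor:
    qmax = 2 ** (n_bits - 1) - 1                 # e.g. 31 for 6-bit
    scale = x.abs().max() / qmax                 # scale is set by the largest magnitude
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

torch.manual_seed(0)
acts = torch.randn(1024) * 0.5                   # ordinary activations
acts[0] = 60.0                                   # one aggressive structured outlier

ordinary = acts[1:]                              # the values that carry most of the signal
err_outlier_scale = (quantize_dequantize(acts)[1:] - ordinary).pow(2).mean().item()
err_clipped_scale = (quantize_dequantize(acts.clamp(-4, 4))[1:] - ordinary).pow(2).mean().item()
print(f"MSE on ordinary values, scale set by outlier: {err_outlier_scale:.6f}")
print(f"MSE on ordinary values, outlier clipped:      {err_clipped_scale:.6f}")
```

The catch is knowing which outliers can be clipped safely, which is exactly what the paper's analysis addresses.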
Key Contributions:
- Outlier Analysis: The paper first dissects where the outliers in transformer models come from. It identifies the scaling parameter (gamma) of LayerNorm as the component that amplifies them, and shows that the outliers concentrate in a small number of embedding dimensions and a few specific tokens. Crucially, not all outliers matter equally: some of the most aggressive ones contribute little to the final prediction and can be clipped without performance degradation, an insight that directly informs how quantization ranges should be set. A small diagnostic sketch follows this list.
- Gamma Migration: To remove the amplification effect, Gamma Migration relocates the scaling parameter into the modules that consume the LayerNorm output, yielding a quantization-friendly model at no extra computational cost. The migrated model exhibits much smaller outliers in the tensors being quantized, which shows up as higher cosine similarity between full-precision and quantized activations, i.e., lower quantization error than conventional schemes. A folding sketch is given after this list.
- Token-Wise Clipping: Exploiting the fact that the most extreme activations come from a small number of rare, unimportant tokens, Token-Wise Clipping determines the clipping range with a coarse-to-fine pipeline. A fast coarse search at token granularity skips the large but inessential part of the range occupied by those few tokens, and a finer stage then refines the range to minimize the quantization loss. The coarse stage is sketched after this list.
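The diagnostic mentioned in the Outlier Analysis bullet can be as simple as ranking embedding dimensions and tokens by their worst-case magnitude at the LayerNorm output. The sketch below is a minimal version of such an inspection (my own construction, not the paper's code):

```python
# Minimal outlier inspection (illustrative): given LayerNorm outputs collected
# from a calibration batch, rank embedding dimensions and tokens by their largest
# absolute activation to see where the outliers concentrate.
import torch

def locate_outliers(ln_out: torch.Tensor, top_k: int = 5):
    """ln_out: (num_tokens, hidden_dim) activations from one calibration batch."""
    per_dim_max = ln_out.abs().max(dim=0).values    # worst case per embedding dimension
    per_tok_max = ln_out.abs().max(dim=1).values    # worst case per token
    top_dims = torch.topk(per_dim_max, top_k).indices.tolist()
    top_toks = torch.topk(per_tok_max, top_k).indices.tolist()
    return top_dims, top_toks

# Synthetic example: outliers concentrated in dimension 42 and token 7.
ln_out = torch.randn(128, 768)
ln_out[:, 42] *= 30.0
ln_out[7, :] *= 10.0
print(locate_outliers(ln_out))
```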
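The sketch referenced in the Gamma Migration bullet shows the core folding step under a simplifying assumption of my own: the LayerNorm output feeds exactly one linear layer. In a real Transformer block the output also feeds the residual shortcut, which would need a matching gamma factor; the function names here are illustrative, not the paper's implementation.

```python
# A minimal sketch of the gamma-migration idea: fold LayerNorm's scaling
# parameter gamma into the weight of the linear layer that consumes its output,
# so the tensor that actually gets quantized no longer carries the
# outlier-amplifying scale. The composed computation is unchanged.
import torch
import torch.nn as nn

@torch.no_grad()
def migrate_gamma(layer_norm: nn.LayerNorm, next_linear: nn.Linear) -> None:
    gamma = layer_norm.weight.clone()
    # next_linear sees gamma * z + beta; since W @ (gamma * z) == (W * gamma) @ z,
    # the scale can live in the weight instead of the activation.
    next_linear.weight.mul_(gamma)               # scale each input column by gamma
    layer_norm.bias.div_(gamma)                  # LN now outputs z + beta / gamma
    layer_norm.weight.fill_(1.0)                 # "non-scaling" LayerNorm

# Sanity check: the output of the pair is numerically unchanged.
ln, fc = nn.LayerNorm(16), nn.Linear(16, 8)
x = torch.randn(4, 16)
before = fc(ln(x))
migrate_gamma(ln, fc)
print(torch.allclose(before, fc(ln(x)), atol=1e-5))
```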
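Finally, the coarse stage referenced in the Token-Wise Clipping bullet can be viewed as a small grid search over how many extreme tokens to ignore when setting the clipping range. The sketch below is my own simplification: `loss_fn` stands in for whatever calibration objective scores a candidate range (the paper evaluates the effect on the model; the toy usage here just measures error on the non-extreme values), and the ratio grid is arbitrary.

```python
# Coarse token-wise clipping, illustrative only: pick the clipping range from
# quantiles of per-token extrema rather than the global min/max, so a handful of
# outlier tokens cannot blow up the quantization step.
import torch
from typing import Callable

def coarse_token_wise_clip(acts: torch.Tensor,
                           loss_fn: Callable[[torch.Tensor], float],
                           n_bits: int = 6,
                           ratios=(1.0, 0.95, 0.9, 0.7, 0.5)):
    """acts: (num_tokens, hidden_dim) calibration activations entering the quantizer.
    loss_fn: scores the fake-quantized activations; returns the best (lo, hi) range."""
    token_max = acts.max(dim=-1).values          # one upper extreme per token
    token_min = acts.min(dim=-1).values          # one lower extreme per token
    qmax = 2 ** n_bits - 1
    best = (float("inf"), None)
    for r in ratios:
        hi = torch.quantile(token_max, r)        # ignore the top (1 - r) fraction of tokens
        lo = torch.quantile(token_min, 1.0 - r)
        scale = (hi - lo) / qmax
        q = torch.clamp(torch.round((acts - lo) / scale), 0, qmax) * scale + lo
        best = min(best, (loss_fn(q), (lo.item(), hi.item())), key=lambda t: t[0])
    return best[1]

# Toy usage with a stand-in objective (the paper would score the task loss here).
acts = torch.randn(512, 768)
acts[7] *= 40.0                                  # one rare token with extreme activations
bulk = acts.abs() < 10.0                         # crude mask for the values that matter
proxy_loss = lambda q: (q - acts)[bulk].pow(2).mean().item()
print(coarse_token_wise_clip(acts, proxy_loss))
```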
Experimental Validation:
The authors run extensive experiments with BERT, RoBERTa, and BART across a range of NLP tasks, including text classification, question answering, and summarization. The proposed methods consistently outperform established quantization techniques:
- For BERT and RoBERTa, 6-bit quantization reaches accuracy close to, and in some cases above, the full-precision baseline.
- The approach sets new state-of-the-art results for 6-bit post-training quantization and 4-bit quantization-aware training, making it a practical path for deploying these models on resource-constrained devices without sacrificing accuracy.
Implications and Future Work:
This paper reframes transformer quantization around outlier suppression, offering insights that can guide future work on reducing the computational cost of NLP models. The migration and clipping frameworks also suggest promising avenues in other domains, such as computer vision, where similar quantization challenges arise. Future work could examine other pretrained models that exhibit similar outlier behavior, or modify the pre-training process itself so that the resulting models are inherently easier to quantize.
In conclusion, this paper marks a substantial advance in understanding and mitigating the bottleneck that structured outliers create for transformer quantization, paving the way for more efficient low-bit NLP deployments.