Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models (2209.13325v3)

Published 27 Sep 2022 in cs.LG

Abstract: Transformer architecture has become the fundamental element of the widespread natural language processing (NLP) models. With the trends of large NLP models, the increasing memory and computation costs hinder their efficient deployment on resource-limited devices. Therefore, transformer quantization attracts wide research interest. Recent work recognizes that structured outliers are the critical bottleneck for quantization performance. However, their proposed methods increase the computation overhead and still leave the outliers there. To fundamentally address this problem, this paper delves into the inherent inducement and importance of the outliers. We discover that $\boldsymbol \gamma$ in LayerNorm (LN) acts as a sinful amplifier for the outliers, and the importance of outliers varies greatly where some outliers provided by a few tokens cover a large area but can be clipped sharply without negative impacts. Motivated by these findings, we propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping. The Gamma Migration migrates the outlier amplifier to subsequent modules in an equivalent transformation, contributing to a more quantization-friendly model without any extra burden. The Token-Wise Clipping takes advantage of the large variance of token range and designs a token-wise coarse-to-fine pipeline, obtaining a clipping range with minimal final quantization loss in an efficient way. This framework effectively suppresses the outliers and can be used in a plug-and-play mode. Extensive experiments prove that our framework surpasses the existing works and, for the first time, pushes the 6-bit post-training BERT quantization to the full-precision (FP) level. Our code is available at https://github.com/wimh966/outlier_suppression.

Outlier Suppression in Low-bit Transformer LLMs

This paper addresses a critical challenge in transformer quantization: suppressing structured outliers so that low-bit transformer LLMs can move past existing performance limits. The growing use of transformer architectures for NLP brings significant computational and memory demands, making efficient deployment strategies such as model quantization necessary. However, quantizing transformers at low bit-widths is hampered by structured activation outliers, which can severely degrade accuracy; even at comparatively high precisions such as 8-bit, prior methods suffer notable accuracy drops.
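
To see why a few extreme values are so damaging, note that min-max uniform quantization sets its step size from the largest magnitude in the tensor, so a single outlier coarsens the grid for every other value. The snippet below is a minimal illustration of this effect, not the paper's code; the magnitudes, bit-width, and clipping threshold are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    inliers = rng.normal(0.0, 1.0, 1023)          # hypothetical well-behaved activations
    x = np.concatenate([inliers, [60.0]])         # plus a single structured outlier

    def step_size(t, n_bits=6):
        # Symmetric min-max quantization: the grid resolution is set by max |t|.
        return np.abs(t).max() / (2 ** (n_bits - 1) - 1)

    def quantize(t, s):
        return np.round(t / s) * s

    s_full = step_size(x)                          # the outlier dictates a coarse grid
    s_clip = step_size(np.clip(x, -4.0, 4.0))      # clipping it restores a fine grid

    # Quantization error on the ordinary activations under each grid:
    err_full = np.mean((inliers - quantize(inliers, s_full)) ** 2)
    err_clip = np.mean((inliers - quantize(inliers, s_clip)) ** 2)
    print(s_full, s_clip, err_full, err_clip)      # the clipped grid is far more accurate on the inliers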

Key Contributions:

  1. Outlier Analysis: The paper first dissects the source of outliers in transformer models. It identifies the scaling parameter $\boldsymbol \gamma$ in LayerNorm as a pivotal element that amplifies these outliers, which concentrate in certain embedding dimensions and at specific tokens. The importance of individual outliers varies markedly: some aggressive outliers are relatively unimportant and can be clipped without performance degradation, a crucial insight for quantization.
  2. Gamma Migration: To mitigate the amplification effect, Gamma Migration repositions the scaling parameter into subsequent modules through an equivalent transformation, yielding a more quantization-friendly model without additional computational cost (a simplified sketch follows this list). It significantly reduces outlier magnitudes, as evidenced by improved cosine similarity between quantized and full-precision activations compared with conventional methods.
  3. Token-Wise Clipping: Leveraging the large variance in token-wise activation ranges, Token-Wise Clipping determines clipping ranges efficiently through a coarse-to-fine pipeline (a second sketch follows this list). It quickly skips past the extreme ranges contributed by a small subset of tokens, arriving at a clipping range that minimizes the final quantization loss.
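
The equivalence behind Gamma Migration is easiest to see on a LayerNorm-to-Linear path in isolation. The PyTorch sketch below is a simplified illustration, not the authors' implementation: $\gamma$ is removed from the LayerNorm, $\beta$ is rescaled to $\beta / \gamma$, and $\gamma$ is folded column-wise into the next Linear layer's weights. The paper applies the same idea to every consumer of the LayerNorm output, including the shortcut branch, so the transformation stays exact for the full network.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d = 16

    # A hypothetical LayerNorm -> Linear pair with a "loud" gamma (values are arbitrary).
    ln = nn.LayerNorm(d)
    ln.weight.data = torch.rand(d) * 4 + 1       # pretend gamma has learned large, outlier-amplifying values
    ln.bias.data = torch.randn(d)
    fc = nn.Linear(d, 8)

    x = torch.randn(4, d)
    ref = fc(ln(x))                              # original computation: fc(gamma * norm(x) + beta)

    # Gamma Migration: strip gamma from the LayerNorm and fold it into the next layer's weights.
    gamma = ln.weight.data.clone()
    ln_ns = nn.LayerNorm(d)                      # "non-scaling" LayerNorm
    ln_ns.weight.data = torch.ones(d)
    ln_ns.bias.data = ln.bias.data / gamma       # so that gamma * ln_ns(x) == ln(x)

    fc_m = nn.Linear(d, 8)
    fc_m.weight.data = fc.weight.data * gamma    # absorb gamma column-wise (per input feature)
    fc_m.bias.data = fc.bias.data.clone()

    out = fc_m(ln_ns(x))
    print(torch.allclose(ref, out, atol=1e-5))   # True: equivalent model, but the LN output no longer carries gamma's amplification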

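For Token-Wise Clipping, the key observation is that only a small fraction of tokens produces the extreme ranges, so the coarse stage can search over token-level statistics instead of individual elements. The sketch below departs from the paper in two ways: tensor-level quantization MSE stands in for the final-loss objective the authors optimize, and the fine-grained (gradient-based) refinement stage is omitted; the data and search grid are synthetic.

    import torch

    torch.manual_seed(0)

    acts = torch.randn(128, 768)                     # hypothetical activations: 128 tokens x 768 channels
    acts[torch.randint(0, 128, (3,)), :2] = 50.0     # a few tokens carry large outliers in a couple of channels

    def quant_mse(x, clip_val, n_bits=6):
        # Symmetric uniform quantization with saturation at +/- clip_val.
        qmax = 2 ** (n_bits - 1) - 1
        scale = clip_val / qmax
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        return torch.mean((x - q) ** 2).item()

    token_max = acts.abs().amax(dim=1)               # one statistic per token, not per element

    # Coarse stage: grid-search a quantile of the token-wise maxima as the clipping value.
    best = min(
        ((alpha, quant_mse(acts, torch.quantile(token_max, alpha)))
         for alpha in [i / 20 for i in range(10, 21)]),    # alpha in 0.50 ... 1.00
        key=lambda t: t[1],
    )
    print(best)   # with these synthetic outliers the best alpha falls below 1.0, clipping the rare extreme tokens
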
Experimental Validation:

The authors conduct extensive empirical benchmarks on BERT, RoBERTa, and BART across NLP tasks including text classification, question answering, and summarization. The proposed framework consistently outperforms established quantization techniques:

  • For BERT and RoBERTa, 6-bit quantization achieves accuracy close to, or even surpassing, the full-precision models.
  • In particular, the framework sets new results for 6-bit post-training quantization and 4-bit quantization-aware training, making it a viable path for deploying models on resource-constrained devices without sacrificing accuracy.

Implications and Future Work:

Fundamentally, this paper reframes the approach to transformer quantization, offering insights that could guide future research on reducing the computational burden of NLP models. The migration and clipping techniques also suggest promising avenues in other domains, such as computer vision, where similar quantization challenges arise. Future work could extend the analysis to other pretrained models that exhibit similar outlier characteristics, or investigate changes to the pre-training process that ease quantization directly.

In conclusion, this paper provides a substantial advancement in understanding and mitigating the bottlenecks caused by structured outliers in transformer quantization, paving the way for more efficient low-bit NLP model deployments.

Authors (8)
  1. Xiuying Wei (10 papers)
  2. Yunchen Zhang (14 papers)
  3. Xiangguo Zhang (6 papers)
  4. Ruihao Gong (40 papers)
  5. Shanghang Zhang (172 papers)
  6. Qi Zhang (784 papers)
  7. Fengwei Yu (23 papers)
  8. Xianglong Liu (128 papers)
Citations (115)