Overview of Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
The paper "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing" addresses a critical issue in the field of AI—specifically, the quantile-friendly adaptation of transformer models. Transformer architectures have gained immense popularity due to their substantial impact across various domains. However, their computational demands have surged with their capabilities, posing significant challenges for efficient deployment, especially in resource-constrained environments.
Quantization Challenges
Quantization, a technique that reduces the bit-width used for neural network computations, is highly effective at cutting memory and compute overhead. Despite its advantages, quantizing transformer models remains problematic because of outlier values in the activations, which inflate the dynamic range that must be represented and thus degrade quantized accuracy. The paper ties these outliers to a specific behavior of attention heads: many heads try to leave their input representation (largely) unchanged, in effect learning a "no-operation" (no-op) or partial update. The sketch below illustrates why even a single outlier is so damaging for low-bit quantization.
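To make the dynamic-range problem concrete, the following sketch uses a simple symmetric per-tensor INT8 scheme (an illustrative simplification, not any particular toolkit's quantizer) to show how one activation outlier inflates the quantization scale and, with it, the rounding error on all the well-behaved values:

```python
import torch

def int8_quantize(x: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor INT8 fake-quantization: map to 256 levels, then dequantize."""
    scale = x.abs().max() / 127.0
    return torch.round(x / scale).clamp(-128, 127) * scale

# Well-behaved activations: the 256 INT8 levels cover the range densely.
x = torch.randn(10_000)
err_normal = (x - int8_quantize(x)).abs().mean().item()

# Same activations plus a single outlier: the scale is now set by the outlier,
# so the "normal" values share far fewer effective levels and the error grows.
x_outlier = torch.cat([x, torch.tensor([60.0])])
err_outlier = (x_outlier - int8_quantize(x_outlier)).abs().mean().item()

print(f"mean abs quantization error without outlier: {err_normal:.4f}")
print(f"mean abs quantization error with outlier:    {err_outlier:.4f}")
```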
Through analysis of trained models, the authors show that to represent such a no-update, attention heads must push their softmax inputs to extreme values, and that this in turn produces large activation outliers elsewhere in the network, further complicating quantization. Prior remedies have relied on higher bit-width formats or intricate post-training modifications, which are less efficient and harder to deploy. The short example below shows why a (near-)zero attention weight forces very large logits.
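A small illustration of why this happens: plain softmax only approaches zero asymptotically, so attention weights close enough to zero to act as a no-update require very large logit gaps (the specific numbers below are arbitrary):

```python
import torch

# For a head to contribute (almost) nothing, its attention weights on the
# "real" tokens must approach zero, e.g. by dumping all mass on a single
# non-informative token. Plain softmax never reaches exact zero for finite
# inputs, so the required logit gap (and hence the pre-softmax activations)
# keeps growing as the head tries to get closer to a true no-op.
for gap in [5.0, 10.0, 20.0, 40.0]:
    logits = torch.tensor([gap, 0.0, 0.0, 0.0])  # one dominant "no-op" token
    probs = torch.softmax(logits, dim=-1)
    print(f"logit gap {gap:5.1f} -> smallest attention weight {probs.min().item():.2e}")
```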
Proposed Solutions
In response to these quantization challenges, the paper introduces two modifications to the attention mechanism: clipped softmax and gated attention.
- Clipped Softmax: This variant stretches the softmax output slightly beyond [0, 1] and clips it back, so exact zeros (and ones) can be produced from finite, moderate inputs. Because a head no longer needs extreme softmax inputs to express a no-op, the pressure that creates activation outliers is removed (see the first sketch after this list).
- Gated Attention: This method adds a small learnable gate to the attention output that can conditionally nullify a head's update to the representation, again removing the need for exceedingly large values to achieve the same effect (see the second sketch after this list).
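As a rough illustration of the clipped softmax idea, the sketch below stretches the softmax output beyond [0, 1] and clips it back; the particular zeta and gamma values are illustrative rather than the paper's tuned hyperparameters:

```python
import torch

def clipped_softmax(logits: torch.Tensor, zeta: float = 1.03, gamma: float = -0.03) -> torch.Tensor:
    """Stretch-and-clip softmax: rescale softmax output to [gamma, zeta]
    (with gamma < 0 and zeta > 1), then clip back to [0, 1], so exact zeros
    and ones are reachable from finite, moderate logits."""
    probs = torch.softmax(logits, dim=-1)
    return torch.clamp((zeta - gamma) * probs + gamma, min=0.0, max=1.0)

# With a modest logit gap, ordinary softmax leaves small non-zero mass on every
# position, while the clipped variant assigns exact zeros to unattended ones.
logits = torch.tensor([4.0, 0.0, 0.0, 0.0])
print(torch.softmax(logits, dim=-1))
print(clipped_softmax(logits))
```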
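Similarly, here is a minimal sketch of gated attention for a single head, assuming a per-token linear gate followed by a sigmoid; the paper's exact gating function and placement may differ:

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Single-head self-attention with a learned, input-conditioned gate on the
    output update, so the head can 'do nothing' by driving the gate toward zero
    instead of producing extreme softmax inputs."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 1)  # small gating head, sigmoid-activated
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        update = self.out(attn @ v)
        g = torch.sigmoid(self.gate(x))  # (batch, seq, 1), in (0, 1)
        return g * update                # gate near 0 => (near) no-op update

x = torch.randn(2, 16, 64)
print(GatedAttention(64)(x).shape)  # torch.Size([2, 16, 64])
```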
Both methods empirically yield models with considerably smaller outliers while maintaining, and sometimes improving, floating-point task performance. Significantly, they enable full INT8 quantization without additional custom adaptations, as demonstrated on language models such as BERT and OPT and on vision transformers (ViT).
Results and Implications
The authors present comprehensive experiments showing reduced outliers and improved post-quantization performance. By encouraging quantization-friendly behavior from the outset of training, these methods avoid the post-processing complexities that often hinder practical deployment. They also offer interesting insights into network regularization and architectural design, with potential implications for training efficiency and model generalization.
Quantization of neural networks, and of transformers in particular, is pivotal for moving inference onto edge devices, and it enables privacy-preserving applications by reducing dependence on cloud-based inference. Although the proposed methods require training from scratch rather than post-hoc patching, they scale to larger models, which suggests broad applicability.
As models continue to scale, research like this provides foundational methods for aligning transformer architectures with practical deployment constraints without compromising performance. The further exploration of activation regularization, network design, and fine-tuning strategies that these findings encourage is likely to shape future AI development.