I-BERT: Integer-only BERT Quantization (2101.01321v3)

Published 5 Jan 2021 in cs.CL

Abstract: Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this, previous work on quantizing Transformer based models uses floating-point arithmetic during inference, which cannot efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4-4.0x for INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has been open-sourced.

Insights into I-BERT: Integer-only BERT Quantization

The paper "I-BERT: Integer-only BERT Quantization" introduces an effective quantization scheme for Transformer-based models, such as BERT and RoBERTa, to enable exclusive integer arithmetic during inference. This development addresses significant challenges faced by contemporary large-scale models concerning memory consumption, inference latency, and power usage. The I-BERT approach is both timely and relevant given the pressing need for deploying such models efficiently at the edge and even in constrained data center environments.

Core Contributions

The authors propose an integer-only quantization scheme, departing from previous methods that keep part of the computation in floating point during inference. All operations are performed with integer arithmetic, enabling deployment on hardware with integer-only processing units such as ARM Cortex-M processors and NVIDIA Turing Tensor Cores.
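
For orientation, the sketch below shows the kind of symmetric, uniform quantization that an integer-only scheme like this builds on: each tensor is mapped to signed integers plus a single floating-point scale, and matrix multiplications then run entirely on the integer values while the scales are tracked separately. The function and variable names are illustrative rather than the released code, which also handles requantization between layers.

    import numpy as np

    def symmetric_quantize(x, num_bits=8):
        # Map a float tensor to signed integers with one static, symmetric scale.
        qmax = 2 ** (num_bits - 1) - 1                  # 127 for int8
        scale = np.abs(x).max() / qmax
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def int_matmul(q_a, s_a, q_b, s_b):
        # Integer-only matrix multiply: accumulate in int32 and combine the scales.
        acc = q_a.astype(np.int32) @ q_b.astype(np.int32)
        return acc, s_a * s_b                           # real value ~= acc * (s_a * s_b)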

Key contributions of the I-BERT framework include:

  1. Polynomial Approximation: Lightweight integer-only approximations for the non-linear operations in Transformer architectures: low-order polynomials for GELU and Softmax and an iterative integer square root for Layer Normalization. These keep accuracy close to the floating-point operations while using integer arithmetic only (see the sketch after this list).
  2. Implementation and Deployment: The I-BERT framework is implemented in PyTorch and has been open-sourced, promoting transparency and reproducibility. The integer-only quantization is applied to RoBERTa-Base/Large and evaluated on the GLUE downstream tasks (a loading sketch also follows the list).
  3. Accuracy Retention: I-BERT matches the accuracy of the full-precision baselines, scoring slightly higher in some cases with minimal degradation in others.
  4. Latency Improvements: The integer-only implementation achieves a 2.4-4.0x speedup over FP32 inference on a T4 GPU system, a considerable gain in efficiency.
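
To make the first contribution concrete, the sketch below follows the paper's recipe for an integer-only GELU: erf(x) is approximated by sgn(x) * [a(min(|x|, -b) + b)^2 + 1] with the fitted constants a = -0.2888 and b = -1.769, and the polynomial is evaluated directly on integer values q together with their scale S (so x is approximately q * S). This is a simplified illustration; the released implementation additionally handles requantization and the Softmax and Layer Normalization kernels.

    import numpy as np

    def int_poly(q, scale, a, b, c):
        # Integer-only evaluation of a*(x + b)^2 + c for x = q * scale.
        q = np.asarray(q, dtype=np.int64)
        q_b = int(np.floor(b / scale))
        q_c = int(np.floor(c / (a * scale ** 2)))
        return (q + q_b) ** 2 + q_c, a * scale ** 2     # (new integers, new scale)

    def int_erf(q, scale):
        # Second-order polynomial approximation of erf; a, b are the paper's fit.
        a, b, c = -0.2888, -1.769, 1.0
        q = np.asarray(q, dtype=np.int64)
        q_clip = np.minimum(np.abs(q), int(-b / scale)) # saturate outside the fitted range
        q_poly, s_poly = int_poly(q_clip, scale, a, b, c)
        return np.sign(q) * q_poly, s_poly

    def int_gelu(q, scale):
        # GELU(x) = x/2 * (1 + erf(x / sqrt(2))), computed on (integer, scale) pairs.
        q = np.asarray(q, dtype=np.int64)
        q_erf, s_erf = int_erf(q, scale / np.sqrt(2))
        q_one = int(np.floor(1.0 / s_erf))
        return q * (q_erf + q_one), scale * s_erf / 2

Multiplying the returned integers by the returned scale recovers an approximation of GELU that stays within a small error of the floating-point value over the range covered by the fit.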

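The open-sourced framework was later integrated into Hugging Face Transformers; assuming that integration, a checkpoint can be loaded roughly as below. The class name IBertForSequenceClassification, the quant_mode flag, and the kssteven/ibert-roberta-base checkpoint name are taken from that integration and should be checked against the released repository.

    # Assumes the Hugging Face Transformers integration of I-BERT.
    from transformers import AutoTokenizer, IBertForSequenceClassification

    model_name = "kssteven/ibert-roberta-base"          # assumed checkpoint name
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = IBertForSequenceClassification.from_pretrained(
        model_name,
        quant_mode=True,   # run the forward pass with the integer-only kernels
    )

    inputs = tokenizer("I-BERT runs BERT inference with integer arithmetic.",
                       return_tensors="pt")
    logits = model(**inputs).logits

In practice, the paper's recipe first fine-tunes the full-precision model on the downstream task and then applies quantization-aware fine-tuning with the integer kernels enabled; loading the base checkpoint for sequence classification as above initializes a fresh task head.
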
Implications and Future Directions

The implications of this research are multifaceted. Practically, integer-only quantization makes it feasible to deploy Transformer models on edge devices where memory, compute, and energy consumption are tightly constrained. This widens the range of real-world scenarios in which such models can be applied.

Theoretically, the paper challenges the necessity of floating-point precision in maintaining model performance across a range of NLP tasks. The success of I-BERT's polynomial approximation in preserving the functional aspects of non-linear operations could inform further research into low-resource model deployment strategies.

As AI technologies develop, this work opens up several avenues for exploration. Future research may look into optimized training procedures that incorporate integer-only arithmetic from the start to enhance the end-to-end efficiency of model deployment. Additionally, exploring the application of integer-only techniques beyond NLP to other domains of machine learning could provide new insights and innovations.

In conclusion, the paper presents a significant advance in the quantization of Transformer models. It meets the demand for efficient AI inference and makes fuller use of existing integer hardware without compromising accuracy, a balance between efficiency and effectiveness that matters as these models are deployed more widely.

Authors (5)
  1. Sehoon Kim (30 papers)
  2. Amir Gholami (60 papers)
  3. Zhewei Yao (64 papers)
  4. Michael W. Mahoney (233 papers)
  5. Kurt Keutzer (199 papers)
Citations (304)