
Q8BERT: Quantized 8Bit BERT (1910.06188v2)

Published 14 Oct 2019 in cs.CL and cs.LG

Abstract: Recently, pre-trained Transformer-based LLMs such as BERT and GPT have shown great improvement in many NLP tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models, such as GPT2 and Megatron, suggests a trend toward large pre-trained Transformer models. However, using these large models in production environments is a complex task requiring a large amount of compute, memory and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference speed if it is optimized for 8-bit-integer-supporting hardware.

Overview

Deploying large pre-trained Transformer-based LLMs, such as BERT, in production environments necessitates approaches that reduce their computational and memory demands. The paper "Q8BERT: Quantized 8Bit BERT" addresses this challenge by presenting a method for compressing BERT to enable efficient inference with minimal accuracy loss. The authors propose quantization-aware training during the fine-tuning phase, achieving a compression factor of four by replacing 32-bit floating-point arithmetic with 8-bit integer arithmetic.
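
The $4\times$ figure follows directly from the change in bit width (a sanity check on the claim rather than a result reported in the paper):

$$\frac{32\ \text{bits per FP32 weight}}{8\ \text{bits per INT8 weight}} = 4\times \ \text{reduction in weight storage.}$$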

Methodology

The researchers employ symmetric linear quantization to convert BERT's weights and activations into 8-bit integers. The quantization scheme is designed to minimize the memory footprint and to accelerate inference on hardware that supports integer arithmetic. By simulating quantized inference during training (fake quantization), the quantization-aware training approach lets the model adapt to quantization error, with gradients propagated through the non-differentiable rounding step via a straight-through estimator; a sketch of this scheme follows.
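
The following is a minimal PyTorch sketch (not the authors' released implementation) of symmetric linear quantization, $q = \mathrm{Clamp}(\lfloor x \cdot S^x \rceil, -127, 127)$ with a per-tensor scale $S^x = 127 / \max|x|$, together with a fake-quantization wrapper that simulates 8-bit inference during fine-tuning. The max-based per-tensor scale shown here is a simplification: for activations, the paper tracks ranges with an exponential moving average during training, which is omitted here for brevity.

```python
import torch

def symmetric_quantize(x: torch.Tensor, num_bits: int = 8):
    """Symmetric linear quantization: map x to signed integers in [-M, M]."""
    M = 2 ** (num_bits - 1) - 1                 # 127 for 8-bit
    scale = M / x.abs().max().clamp(min=1e-8)   # per-tensor scale factor
    q = torch.clamp(torch.round(x * scale), -M, M)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover a floating-point approximation of the original tensor."""
    return q / scale

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in the forward pass; identity gradient in the
    backward pass (straight-through estimator), so the model learns to
    tolerate quantization error during fine-tuning."""

    @staticmethod
    def forward(ctx, x, num_bits=8):
        q, scale = symmetric_quantize(x, num_bits)
        return dequantize(q, scale)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # pass gradients through unchanged

# Usage: wrap the weights/activations of Linear layers during fine-tuning.
x = torch.randn(4, 16, requires_grad=True)
y = FakeQuant.apply(x)   # behaves like x, but carries 8-bit rounding error
y.sum().backward()       # gradients flow via the straight-through estimator
```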

Results and Evaluation

The paper reports empirical results from applying the quantization method to several NLP tasks from the GLUE benchmark and to SQuADv1.1. The quantization-aware trained models (denoted QAT) achieved accuracy within 1% of the floating-point baselines for most tasks, significantly outperforming dynamically quantized models (DQ), which suffered higher accuracy degradation; a sketch of such a DQ baseline is given below. Notably, the factor-of-four reduction in model size was achieved with negligible loss in performance.
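
For context, a DQ baseline corresponds to post-training dynamic quantization applied to an already fine-tuned model with no retraining. The snippet below is a generic PyTorch illustration of such a baseline, not the authors' evaluation setup; the checkpoint name is a placeholder.

```python
import torch
from transformers import BertForSequenceClassification

# Placeholder checkpoint: substitute a BERT model already fine-tuned on the
# target GLUE task or SQuAD before quantizing.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Post-training dynamic quantization: Linear-layer weights are stored as
# int8 and activations are quantized on the fly at inference time.
# Because no quantization-aware fine-tuning is performed, the model never
# adapts to the quantization error, which is why DQ degrades more than QAT.
dq_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```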

Implications and Future Directions

The quantization approach detailed in the paper has significant implications for deploying NLP models in resource-constrained environments. By reducing model size and enabling faster inference through 8-bit integer arithmetic, the method supports NLP applications with low-latency requirements on a wide range of computational platforms.

Future directions suggested by the authors include further exploration of model compression techniques that could complement quantization, potentially offering additional reductions in memory usage and power consumption. Such advancements are crucial for deploying models like BERT in scenarios with strict constraints on computational resources.

Integrated into NLP systems, these efficiency gains may also contribute to the broader field of AI by enabling more sustainable and accessible deployment of advanced LLMs, facilitating their use across diverse applications worldwide.

Authors (4)
  1. Ofir Zafrir (5 papers)
  2. Guy Boudoukh (5 papers)
  3. Peter Izsak (10 papers)
  4. Moshe Wasserblat (22 papers)
Citations (480)