Compressing Large-Scale Transformer-Based Models: A Case Study on BERT (2002.11985v2)

Published 27 Feb 2020 in cs.LG and stat.ML

Abstract: Pre-trained Transformer-based models have achieved state-of-the-art performance for various NLP tasks. However, these models often have billions of parameters, and, thus, are too resource-hungry and computation-intensive to suit low-capability devices or applications with strict latency requirements. One potential remedy for this is model compression, which has attracted a lot of research attention. Here, we summarize the research in compressing Transformers, focusing on the especially popular BERT model. In particular, we survey the state of the art in compression for BERT, we clarify the current best practices for compressing large-scale Transformer models, and we provide insights into the workings of various methods. Our categorization and analysis also shed light on promising future research directions for achieving lightweight, accurate, and generic NLP models.

Overview of Model Compression in Large-Scale Transformers: A Case Study on BERT

The paper "Compressing Large-Scale Transformer-Based Models: A Case Study on BERT" presents an analysis of various methods for reducing the size and computational requirements of BERT and similar Transformer-based models. These models have achieved remarkable success across various NLP tasks, but their extensive parameter count—often in the billions—creates barriers for deployment in environments with resource constraints. The focus of this paper is model compression, aiming to make these models more feasible for low-capability devices and applications with strict latency requirements.

Breakdown of BERT Components

The paper begins with a detailed breakdown of BERT's architecture, highlighting the embedding layer and the Transformer backbone composed of encoder units. Each unit comprises self-attention and feed-forward sub-layers, supported by residual connections. Profiling BERT's resource consumption demonstrates that the feed-forward network (FFN) sub-units are the primary contributors to memory and computational overhead, suggesting potential targets for optimization.
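To see where the parameters sit, one can simply count them per component. The sketch below is not from the paper; it assumes the Hugging Face transformers package and the bert-base-uncased checkpoint, and it groups parameters into embedding, self-attention, and FFN buckets. For BERT-base, such a tally attributes the largest share of encoder parameters to the FFN sub-layers, consistent with the paper's profiling.

```python
# Minimal sketch: group BERT-base parameters by component to see where the
# memory goes. Assumes the Hugging Face `transformers` package and the
# `bert-base-uncased` checkpoint; the paper's own profiling is more detailed.
from collections import Counter
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

counts = Counter()
for name, param in model.named_parameters():
    if name.startswith("embeddings"):
        counts["embeddings"] += param.numel()
    elif ".attention." in name:
        counts["self-attention"] += param.numel()
    elif ".intermediate." in name or ".output.dense" in name:
        counts["feed-forward (FFN)"] += param.numel()
    else:
        counts["other (pooler, layer norms, ...)"] += param.numel()

total = sum(counts.values())
for component, n in counts.most_common():
    print(f"{component:>30}: {n / 1e6:6.1f}M params ({100 * n / total:4.1f}%)")
```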

Techniques for Model Compression

The authors evaluate several model compression techniques applicable to BERT:

  • Quantization: This involves reducing the precision of model weights and activations, which directly shrinks model size; surveyed studies report up to a 90% reduction in size with minimal accuracy loss. Realizing an actual inference speedup from low-precision arithmetic, however, typically requires hardware support for it (a generic sketch follows this list).
  • Pruning: The paper distinguishes between unstructured and structured pruning. Unstructured pruning removes individual weights and can make the weight matrices highly sparse, but it does not improve runtime unless execution is optimized for sparsity. In contrast, structured pruning, which removes architectural elements such as attention heads or entire encoder units, offers more tangible reductions in model size and inference latency (see the pruning sketch below).
  • Knowledge Distillation: In this approach, a smaller student model learns from a larger teacher model. Distillation from output logits, encoder outputs, and attention maps enables the creation of simpler students, such as BiLSTMs or CNNs, without heavy reliance on the original Transformer backbone, yielding substantial efficiency gains (the distillation loss is sketched below).
  • Matrix Decomposition and Dynamic Inference: These methods reduce computational cost by decomposing large weight matrices into smaller factors or by adjusting the amount of computation performed at inference time based on the input (a low-rank factorization sketch appears below).
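
As a concrete, generic illustration of quantization (not one of the specific schemes benchmarked in the survey), PyTorch's post-training dynamic quantization converts the weights of all linear layers to 8-bit integers. The transformers package and the bert-base-uncased checkpoint used here are assumptions for the sketch, and the achievable speedup depends on backend support.

```python
# Minimal sketch of post-training dynamic quantization (one of several
# quantization schemes discussed in the survey). Assumes PyTorch and the
# Hugging Face `transformers` package; int8 speedups depend on backend support.
import io
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Replace nn.Linear weights with int8 versions; activations are quantized
# dynamically at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m: torch.nn.Module) -> float:
    """Approximate serialized size of a model in megabytes."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 model: {size_mb(model):.1f} MB")
print(f"int8 model: {size_mb(quantized):.1f} MB")
```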
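
Unstructured pruning can likewise be illustrated with a simple magnitude-based baseline using torch.nn.utils.prune; the 40% sparsity level below is an arbitrary illustrative choice, not a recommendation from the paper. Note that the zeroed weights reduce memory and latency only once sparse storage and kernels are used, which is exactly the caveat raised above.

```python
# Minimal sketch of unstructured magnitude pruning with torch.nn.utils.prune.
# Assumes PyTorch and the Hugging Face `transformers` package.
import torch
import torch.nn.utils.prune as prune
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        # Zero out the 40% of weights with the smallest magnitude (L1 criterion).
        prune.l1_unstructured(module, name="weight", amount=0.4)
        prune.remove(module, "weight")  # make the pruning mask permanent

zeros = sum((p == 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"sparsity: {100 * zeros / total:.1f}% of parameters are zero")
```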
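
The core of logit-based knowledge distillation is a temperature-softened KL term between teacher and student predictions, combined with the usual supervised loss. The sketch below is a generic formulation; the temperature and mixing weight are illustrative hyperparameters, not values prescribed by the paper.

```python
# Minimal sketch of the logit-distillation objective: a temperature-softened
# KL term against the teacher plus the usual cross-entropy on hard labels.
# `temperature` and `alpha` are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between softened distributions, scaled by T^2 so its
    # gradient magnitude stays comparable to the hard-label term.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Usage inside a training loop (teacher and student are hypothetical models):
# loss = distillation_loss(student(batch).logits, teacher(batch).logits, labels)
```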
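
Matrix decomposition can be sketched as a truncated SVD that replaces one dense projection with two thinner ones; the rank used below is an illustrative choice, and dynamic-inference methods (e.g., early exit) are not shown.

```python
# Minimal sketch of low-rank matrix decomposition: approximate a dense
# weight matrix W (out x in) with two rank-r factors, trading a small
# approximation error for fewer parameters and FLOPs.
import torch

def low_rank_factorize(linear: torch.nn.Linear, rank: int) -> torch.nn.Sequential:
    W = linear.weight.data                      # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                # (out, r), singular values folded in
    V_r = Vh[:rank, :]                          # (r, in)

    first = torch.nn.Linear(linear.in_features, rank, bias=False)
    second = torch.nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if linear.bias is not None:
        second.bias.data = linear.bias.data
    return torch.nn.Sequential(first, second)

# Example: factorize one 768x3072 FFN projection down to rank 128.
ffn = torch.nn.Linear(768, 3072)
approx = low_rank_factorize(ffn, rank=128)
print(sum(p.numel() for p in ffn.parameters()),
      sum(p.numel() for p in approx.parameters()))
```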

Practical Implications and Future Research

Notably, while many compression techniques on their own deliver significant reductions in model size, practical runtime improvements on standard hardware usually require carefully combining several methods. For instance, pairing pruning with distillation, or tuning quantization settings, can strike a better balance between compression and accuracy.

The paper suggests directions for future research, emphasizing the exploration of layer-independent compression methods and of hybrid models that pair Transformers with alternative architectures to mitigate the inherent resource intensity of Transformer layers. It also sees promise in compound approaches that combine several compression techniques and in methods that adapt effectively across different deployment scenarios.

Overall, the paper comprehensively surveys existing strategies for BERT compression, providing insights that not only inform practical deployment but also enrich theoretical understanding of Transformer model optimization.

Authors (9)
  1. Prakhar Ganesh (15 papers)
  2. Yao Chen (187 papers)
  3. Xin Lou (16 papers)
  4. Mohammad Ali Khan (2 papers)
  5. Yin Yang (109 papers)
  6. Hassan Sajjad (64 papers)
  7. Preslav Nakov (253 papers)
  8. Deming Chen (62 papers)
  9. Marianne Winslett (15 papers)
Citations (183)