
TinyBERT: Distilling BERT for Natural Language Understanding (1909.10351v5)

Published 23 Sep 2019 in cs.CL, cs.AI, and cs.LG

Abstract: Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large teacher BERT can be effectively transferred to a small student Tiny-BERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% the performance of its teacher BERT_BASE on GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% parameters and about 31% inference time of them. Moreover, TinyBERT with 6 layers performs on-par with its teacher BERT_BASE.

TinyBERT: Distilling BERT for Natural Language Understanding

The paper "TinyBERT: Distilling BERT for Natural Language Understanding" presents a method to reduce the size and inference time of BERT models while retaining substantial performance. This paper introduces TinyBERT, a distilled version of BERT designed specifically for deployment in resource-constrained environments.

Key Contributions

The paper makes the following key contributions:

  1. Transformer Distillation Method: This novel method is tailored to Transformer-based models such as BERT. It comprises multiple distillation objectives (embedding-layer, attention-based, hidden-state, and prediction-layer losses) that together transfer crucial knowledge from the teacher BERT model to the compact TinyBERT student model; a minimal sketch of these objectives follows this list.
  2. Two-stage Learning Framework: The framework consists of:
    • General Distillation: Applied during the pre-training phase to capture general-domain knowledge. This phase ensures that TinyBERT inherits the linguistic generalization capability of the base BERT.
    • Task-specific Distillation: Conducted during the fine-tuning phase. It further refines TinyBERT by learning task-specific knowledge from the fine-tuned base BERT.
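
For concreteness, here is a minimal PyTorch-style sketch of the distillation objectives named in the first contribution. It is an illustrative assumption of how such losses can be written, not the authors' released code; the tensor arguments and the projection matrix `proj` are hypothetical.

```python
# Minimal sketch of Transformer-distillation objectives (illustrative only).
import torch
import torch.nn.functional as F

def attention_loss(student_attn: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
    # MSE between per-head attention matrices of a mapped student/teacher layer pair.
    return F.mse_loss(student_attn, teacher_attn)

def hidden_state_loss(student_hidden: torch.Tensor,
                      teacher_hidden: torch.Tensor,
                      proj: torch.Tensor) -> torch.Tensor:
    # Student hidden states (e.g. d'=312) are linearly projected to the
    # teacher's hidden size (e.g. d=768) before computing the MSE.
    return F.mse_loss(student_hidden @ proj, teacher_hidden)

def prediction_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    # Soft cross-entropy between teacher and student logits at a given temperature.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```

The embedding-layer loss takes the same form as `hidden_state_loss`, applied to the embedding outputs, and the per-layer losses are summed over the mapped layer pairs.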

Experimental Results

Quantitative Outcomes

  • Model Efficiency: The 4-layer TinyBERT$_4$ model achieves approximately 96.8% of the performance of BERT$_{\rm BASE}$ on the GLUE benchmark, while being 7.5 times smaller and 9.4 times faster at inference.
  • Comparative Performance: TinyBERT$_4$ surpasses other 4-layer distilled models such as BERT$_4$-PKD and DistilBERT$_4$, while using only about 28% of their parameters and 31% of their inference time.

Model Architecture and Settings

  • Student Model: TinyBERT$_4$ has 4 Transformer layers, a hidden size of 312, and a feed-forward size of 1200.
  • Teacher Model: BERT$_{\rm BASE}$, with 12 layers and a hidden size of 768, serves as the teacher.
  • Mapping Function: The layer mapping function is $g(m) = 3 \times m$, so each student layer m distills from teacher layer 3m (see the sketch below).
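
As an illustration of the mapping, the snippet below pairs the 4 student layers with teacher layers 3, 6, 9, and 12; the helper name is hypothetical.

```python
# Illustrative layer mapping g(m) = 3 * m for a 4-layer student distilled
# from a 12-layer teacher (the function name is a hypothetical helper).
def teacher_layer_for(student_layer: int) -> int:
    return 3 * student_layer

# Student layers 1..4 are paired with teacher layers 3, 6, 9, 12;
# index 0 (the embedding layer) maps to the teacher's embedding layer.
print([teacher_layer_for(m) for m in range(1, 5)])  # [3, 6, 9, 12]
```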

Analysis and Implications

Importance of Learning Procedures

The paper's ablation analysis demonstrates the necessity of both general and task-specific distillation procedures:

  • General Distillation (GD): Provides a stable initialization by transferring general-domain information.
  • Task-specific Distillation (TD): Further optimizes TinyBERT on specific tasks using augmented datasets (a control-flow sketch of how the two stages compose follows this list).
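
The following is a minimal control-flow sketch of how the two stages compose, assuming a `distill` routine that wraps a training loop over losses like those sketched earlier; all names are illustrative placeholders, and only the ordering of steps mirrors the paper.

```python
# Hedged outline of the two-stage learning framework (placeholders, not the
# authors' implementation).
from typing import Callable, Iterable

def two_stage_distillation(
    student,
    teacher_pretrained,              # general-domain BERT (not fine-tuned)
    teacher_finetuned,               # task-specific fine-tuned BERT
    general_corpus: Iterable,        # large-scale general-domain text
    augmented_task_data: Iterable,   # task data expanded via data augmentation
    distill: Callable,               # runs one distillation pass with the given losses
):
    # Stage 1: general distillation on the general-domain corpus, using only
    # the intermediate-layer losses (embedding, attention, hidden states).
    distill(student, teacher_pretrained, general_corpus,
            losses=("embedding", "attention", "hidden"))

    # Stage 2: task-specific distillation on the augmented task dataset,
    # first on the intermediate layers, then on the prediction layer.
    distill(student, teacher_finetuned, augmented_task_data,
            losses=("embedding", "attention", "hidden"))
    distill(student, teacher_finetuned, augmented_task_data,
            losses=("prediction",))
    return student
```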

Theoretical and Practical Implications

Theoretically, the proposed two-stage learning framework captures both generalized and specialized knowledge, yielding a small model that remains robust in performance. Practically, TinyBERT’s substantial reductions in size and inference time make it suitable for deployment on edge devices such as mobile phones.

Future Directions

The research opens several pathways for future developments:

  1. Distillation from Larger Models: Extending the distillation techniques to wider or deeper teacher models (e.g., BERT$_{\rm LARGE}$).
  2. Hybrid Compression Techniques: Combining knowledge distillation with other compression methods like quantization and pruning to achieve even more lightweight models suitable for diverse applications.

In conclusion, the paper meticulously addresses the challenge of compressing BERT models without significant performance degradation, thereby facilitating their usability in resource-constrained environments. It establishes a robust foundation upon which future model compression techniques can be built and refined.

Authors (8)
  1. Xiaoqi Jiao (8 papers)
  2. Yichun Yin (27 papers)
  3. Lifeng Shang (90 papers)
  4. Xin Jiang (242 papers)
  5. Xiao Chen (277 papers)
  6. Linlin Li (31 papers)
  7. Fang Wang (116 papers)
  8. Qun Liu (230 papers)
Citations (1,698)