
TinyBERT: Distilling BERT for Natural Language Understanding (1909.10351v5)

Published 23 Sep 2019 in cs.CL, cs.AI, and cs.LG

Abstract: Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large teacher BERT can be effectively transferred to a small student Tiny-BERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% the performance of its teacher BERT_BASE on GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% parameters and about 31% inference time of them. Moreover, TinyBERT with 6 layers performs on-par with its teacher BERT_BASE.

TinyBERT: Distilling BERT for Natural Language Understanding

The paper "TinyBERT: Distilling BERT for Natural Language Understanding" presents a method to reduce the size and inference time of BERT models while retaining substantial performance. This paper introduces TinyBERT, a distilled version of BERT designed specifically for deployment in resource-constrained environments.

Key Contributions

The paper makes the following key contributions:

  1. Transformer Distillation Method: This novel method is tailored to Transformer-based models such as BERT. It comprises multiple distillation objectives (embedding-layer, attention-based, hidden-state, and prediction-layer losses) that together transfer crucial knowledge from the teacher BERT model to the compact TinyBERT student model; a minimal sketch of these objectives follows this list.
  2. Two-stage Learning Framework: The framework consists of:
    • General Distillation: Applied during the pre-training phase to capture general-domain knowledge. This phase ensures that TinyBERT inherits the linguistic generalization capability of the base BERT.
    • Task-specific Distillation: Conducted during the fine-tuning phase. It further refines TinyBERT by learning task-specific knowledge from the fine-tuned base BERT.
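
For concreteness, here is a minimal PyTorch-style sketch of the distillation objectives named in the first contribution. It is an illustrative assumption of how such losses can be written, not the authors' released code; the tensor arguments and the projection matrix `proj` are hypothetical.

```python
# Minimal sketch of Transformer-distillation objectives (illustrative only).
import torch
import torch.nn.functional as F

def attention_loss(student_attn: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
    # MSE between per-head attention matrices of a mapped student/teacher layer pair.
    return F.mse_loss(student_attn, teacher_attn)

def hidden_state_loss(student_hidden: torch.Tensor,
                      teacher_hidden: torch.Tensor,
                      proj: torch.Tensor) -> torch.Tensor:
    # Student hidden states (e.g. d'=312) are linearly projected to the
    # teacher's hidden size (e.g. d=768) before computing the MSE.
    return F.mse_loss(student_hidden @ proj, teacher_hidden)

def prediction_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    # Soft cross-entropy between teacher and student logits at a given temperature.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```

The embedding-layer loss takes the same form as `hidden_state_loss`, applied to the embedding outputs, and the per-layer losses are summed over the mapped layer pairs.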

Experimental Results

Quantitative Outcomes

  • Model Efficiency: The 4-layer TinyBERT$_4$ model achieves approximately 96.8% of the performance of BERT$_{\rm BASE}$ on the GLUE benchmark, while being 7.5 times smaller and 9.4 times faster at inference.
  • Comparative Performance: TinyBERT$_4$ surpasses other 4-layer distilled models such as BERT$_4$-PKD and DistilBERT$_4$, while using only about 28% of their parameters and 31% of their inference time.

Model Architecture and Settings

  • Student Model: TinyBERT$_4$ has 4 Transformer layers, a hidden size of 312, and a feed-forward size of 1200.
  • Teacher Model: BERT$_{\rm BASE}$, with 12 layers and a hidden size of 768, serves as the teacher.
  • Mapping Function: The layer mapping function is $g(m) = 3 \times m$, so each student layer m distills from teacher layer 3m (see the sketch below).
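
As an illustration of the mapping, the snippet below pairs the 4 student layers with teacher layers 3, 6, 9, and 12; the helper name is hypothetical.

```python
# Illustrative layer mapping g(m) = 3 * m for a 4-layer student distilled
# from a 12-layer teacher (the function name is a hypothetical helper).
def teacher_layer_for(student_layer: int) -> int:
    return 3 * student_layer

# Student layers 1..4 are paired with teacher layers 3, 6, 9, 12;
# index 0 (the embedding layer) maps to the teacher's embedding layer.
print([teacher_layer_for(m) for m in range(1, 5)])  # [3, 6, 9, 12]
```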

Analysis and Implications

Importance of Learning Procedures

The paper's ablation analysis demonstrates the necessity of both general and task-specific distillation procedures:

  • General Distillation (GD): Provides a stable initialization by transferring general-domain information.
  • Task-specific Distillation (TD): Further optimizes TinyBERT on specific tasks using augmented datasets (a control-flow sketch of how the two stages compose follows this list).
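
The following is a minimal control-flow sketch of how the two stages compose, assuming a `distill` routine that wraps a training loop over losses like those sketched earlier; all names are illustrative placeholders, and only the ordering of steps mirrors the paper.

```python
# Hedged outline of the two-stage learning framework (placeholders, not the
# authors' implementation).
from typing import Callable, Iterable

def two_stage_distillation(
    student,
    teacher_pretrained,              # general-domain BERT (not fine-tuned)
    teacher_finetuned,               # task-specific fine-tuned BERT
    general_corpus: Iterable,        # large-scale general-domain text
    augmented_task_data: Iterable,   # task data expanded via data augmentation
    distill: Callable,               # runs one distillation pass with the given losses
):
    # Stage 1: general distillation on the general-domain corpus, using only
    # the intermediate-layer losses (embedding, attention, hidden states).
    distill(student, teacher_pretrained, general_corpus,
            losses=("embedding", "attention", "hidden"))

    # Stage 2: task-specific distillation on the augmented task dataset,
    # first on the intermediate layers, then on the prediction layer.
    distill(student, teacher_finetuned, augmented_task_data,
            losses=("embedding", "attention", "hidden"))
    distill(student, teacher_finetuned, augmented_task_data,
            losses=("prediction",))
    return student
```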

Theoretical and Practical Implications

Theoretically, the proposed two-stage learning framework captures both generalized and specialized knowledge, yielding a small model that remains robust in performance. Practically, TinyBERT’s substantial reductions in size and inference time make it suitable for deployment on edge devices such as mobile phones.

Future Directions

The research opens several pathways for future developments:

  1. Distillation from Larger Models: Extending the distillation techniques to wider or deeper teacher models (e.g., BERT$_{\rm LARGE}$).
  2. Hybrid Compression Techniques: Combining knowledge distillation with other compression methods like quantization and pruning to achieve even more lightweight models suitable for diverse applications.

In conclusion, the paper meticulously addresses the challenge of compressing BERT models without significant performance degradation, thereby facilitating their usability in resource-constrained environments. It establishes a robust foundation upon which future model compression techniques can be built and refined.

Authors (8)
  1. Xiaoqi Jiao (8 papers)
  2. Yichun Yin (27 papers)
  3. Lifeng Shang (90 papers)
  4. Xin Jiang (242 papers)
  5. Xiao Chen (277 papers)
  6. Linlin Li (31 papers)
  7. Fang Wang (116 papers)
  8. Qun Liu (230 papers)
Citations (1,698)