Patient Knowledge Distillation for BERT Model Compression (1908.09355v1)

Published 25 Aug 2019 in cs.CL and cs.LG

Abstract: Pre-trained language models such as BERT have proven to be highly effective for NLP tasks. However, the high demand for computing resources in training such models hinders their application in practice. In order to alleviate this resource hunger in large-scale model training, we propose a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student). Different from previous knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model patiently learns from multiple intermediate layers of the teacher model for incremental knowledge extraction, following two strategies: ($i$) PKD-Last: learning from the last $k$ layers; and ($ii$) PKD-Skip: learning from every $k$ layers. These two patient distillation schemes enable the exploitation of rich information in the teacher's hidden layers, and encourage the student model to patiently learn from and imitate the teacher through a multi-layer distillation process. Empirically, this translates into improved results on multiple NLP tasks with significant gain in training efficiency, without sacrificing model accuracy.

Patient Knowledge Distillation for BERT Model Compression

The paper "Patient Knowledge Distillation for BERT Model Compression" presents a novel methodology for compressing large pre-trained LLMs, specifically BERT, while maintaining their performance on various NLP tasks. The technique proposed, termed "Patient Knowledge Distillation" (Patient-KD), utilizes multiple intermediate layers of the teacher model to guide the training of a more compact student model. This is in contrast to traditional knowledge distillation methods that primarily leverage the output from the last layer of the teacher model.

Key Contributions

  1. Patient Knowledge Distillation Approach: The proposed Patient-KD method deviates from conventional knowledge distillation by enabling the student model to learn from multiple layers of the teacher model. Two strategies are introduced (see the sketch after this list):
     - PKD-Last: The student model learns from the last $k$ layers of the teacher model.
     - PKD-Skip: The student model learns from every $k$ layers of the teacher model.
  2. Model Compression: A successful demonstration of compressing BERT-Base (12 layers, 110M parameters) into 6-layer and 3-layer student models without significant loss of performance across various NLP benchmarks, achieving notable improvements in model efficiency and inference speed.
  3. Comprehensive Evaluation: The paper evaluates the Patient-KD approach on multiple NLP tasks, including sentiment classification, paraphrase similarity matching, natural language inference, and machine reading comprehension, using datasets such as SST-2, MRPC, QQP, MNLI, QNLI, RTE, and RACE.
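
To make the two strategies concrete, the following is a minimal sketch of how the matched teacher layers could be selected. The helper name, arguments, and defaults are illustrative assumptions rather than code from the paper; it also assumes the teacher's top layer is handled separately by the standard logit distillation, so only the student's intermediate layers are matched.

```python
def teacher_layers(strategy: str, n_teacher: int = 12, n_matched: int = 5) -> list[int]:
    """Return the teacher layer indices (1-based) that the student's
    intermediate layers imitate, under the two patient strategies."""
    if strategy == "pkd-last":
        # PKD-Last: the n_matched layers directly below the teacher's top layer
        return list(range(n_teacher - n_matched, n_teacher))      # e.g. [7, 8, 9, 10, 11]
    if strategy == "pkd-skip":
        # PKD-Skip: evenly spaced layers, one every n_teacher // (n_matched + 1) layers
        step = n_teacher // (n_matched + 1)
        return [step * (j + 1) for j in range(n_matched)]         # e.g. [2, 4, 6, 8, 10]
    raise ValueError(f"unknown strategy: {strategy}")
```

For a 6-layer student distilled from the 12-layer BERT-Base teacher, this yields teacher layers [7, 8, 9, 10, 11] under PKD-Last and [2, 4, 6, 8, 10] under PKD-Skip.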

Methodology

Distillation Objective:

The model compression process uses a distillation loss that combines the cross-entropy between the student's predictions and the ground-truth labels with the KL-divergence between the student's and teacher's output distributions. The final objective adds a Patient-KD loss term, which measures the discrepancy between the normalized hidden states of the student and teacher models at selected intermediate layers.
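
Putting these pieces into one formula, a sketch of the combined objective (notation assumed here rather than copied from the paper): $\mathcal{L}_{\mathrm{CE}}$ is the student's cross-entropy with the ground truth, $\mathcal{L}_{\mathrm{DS}}$ is the soft-target distillation term, $\alpha$ and $\beta$ are balancing weights, and $\mathbf{h}_i^{s,j}$ and $\mathbf{h}_i^{t,I(j)}$ are the hidden states of student layer $j$ and its matched teacher layer $I(j)$ for example $i$:

$$
\mathcal{L}_{\mathrm{PKD}} = (1-\alpha)\,\mathcal{L}_{\mathrm{CE}} + \alpha\,\mathcal{L}_{\mathrm{DS}} + \beta \sum_{i}\sum_{j} \left\lVert \frac{\mathbf{h}_i^{s,j}}{\lVert \mathbf{h}_i^{s,j} \rVert_2} - \frac{\mathbf{h}_i^{t,I(j)}}{\lVert \mathbf{h}_i^{t,I(j)} \rVert_2} \right\rVert_2^2
$$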

Implementation Details:

Initial experiments involve directly fine-tuning 3-layer and 6-layer student models on specific tasks. Subsequently, vanilla knowledge distillation is applied, followed by Patient-KD, systematically comparing the performance improvements. Hyper-parameters such as the temperature $T$, the balancing factor $\alpha$, and the distillation coefficient $\beta$ are tuned to optimize performance.
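
To illustrate where these hyper-parameters enter, below is a minimal PyTorch sketch of the combined loss, assuming the student and teacher expose logits and the matched per-layer [CLS] hidden states. The function name, signature, and default values are assumptions for illustration, not the authors' implementation.

```python
import torch.nn.functional as F

def patient_kd_loss(student_logits, teacher_logits, labels,
                    student_hidden, teacher_hidden,
                    T=2.0, alpha=0.5, beta=10.0):
    """Sketch of the Patient-KD objective: hard-label CE + soft-label KD + patient term.
    `student_hidden` / `teacher_hidden` are matched lists of [batch, hidden] tensors;
    teacher tensors are assumed to be detached (the teacher is frozen)."""
    # Hard-label cross-entropy against the ground truth
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between temperature-scaled output distributions
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)

    # Patient term: MSE between L2-normalized hidden states at the matched layers
    pt = sum(F.mse_loss(F.normalize(h_s, dim=-1), F.normalize(h_t, dim=-1))
             for h_s, h_t in zip(student_hidden, teacher_hidden))

    return (1 - alpha) * ce + alpha * kd + beta * pt
```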

Experimental Results

Performance:

The results indicate that Patient-KD consistently outperforms the baseline and vanilla KD methods. Specifically, on the GLUE benchmark tasks, the 6-layer student model trained with Patient-KD achieves similar performance to the teacher model on tasks with substantial training data, like SST-2 and QQP, while reducing model size and inference time significantly.

Inference Efficiency:

Across the experiments, the compressed models achieve a 1.64x to 2.4x reduction in parameter count and a 1.94x to 3.73x speedup in inference compared to the original 12-layer BERT-Base model.

Impact of Teacher Quality:

The paper also explores the effect of using larger teacher models such as BERT-Large (24 layers, 330M parameters) on student performance. Interestingly, results indicate that while a better teacher model (BERT-Large) generally provides more robust guidance, the most significant performance gains are realized when sufficient training data is available for the target tasks.

Implications and Future Work

Practical Implications:

The proposed Patient-KD approach offers a feasible solution for deploying BERT models in resource-constrained environments, balancing the trade-off between model accuracy and computational efficiency.

Theoretical Implications:

From a theoretical standpoint, this work underscores the value of leveraging rich internal layer representations in neural networks for effective model distillation, broadening the scope beyond the final layer’s logits typically used in traditional KD.

Future Developments:

Future research directions include pre-training BERT from scratch to address potential initialization mismatches, exploring more sophisticated distance metrics for the loss functions, and extending the approach to multi-task learning and meta-learning frameworks.

In conclusion, "Patient Knowledge Distillation for BERT Model Compression" represents a significant advancement in model efficiency techniques, achieving competitive performance while substantially reducing computational demands. This work sets a precedent for future research in compressing and optimizing large pre-trained language models without compromising their intrinsic capabilities on diverse NLP tasks.

Authors (4)
  1. Siqi Sun (46 papers)
  2. Yu Cheng (354 papers)
  3. Zhe Gan (135 papers)
  4. Jingjing Liu (139 papers)
Citations (787)