MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers (2002.10957v2)

Published 25 Feb 2020 in cs.CL

Abstract: Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed as deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines in different parameter size of student models. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks using 50% of the Transformer parameters and computations of the teacher model. We also obtain competitive results in applying deep self-attention distillation to multilingual pre-trained models.

Overview of "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers"

The paper "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers" introduces an innovative method aimed at compressing large Transformer-based pre-trained models. Given the significant impact and computational demands of models like BERT, the authors propose a distillation technique termed "deep self-attention distillation". This approach effectively reduces model size and inference time while maintaining high performance on various NLP tasks.

Key Contributions

The authors present several key contributions:

  1. Deep Self-Attention Distillation: The core of the methodology involves the student model mimicking the self-attention behaviors of the teacher model's last Transformer layer (see the code sketch after this list). This includes:
    • Self-Attention Distribution Transfer: Utilizing the attention distributions (scaled dot-products of queries and keys).
    • Self-Attention Value-Relation Transfer: Introducing the relation between values in the self-attention module as an additional distillation target (scaled dot-products of values).
  2. Flexibility in Student Model Architecture: Unlike previous approaches that require strict layer-to-layer mapping, this method allows for flexibility in the number of layers and hidden sizes in the student model.
  3. Teacher Assistant: For scenarios where the student model is significantly smaller than the teacher model, the paper proposes introducing a teacher assistant. This intermediate model bridges the gap and facilitates more effective distillation.
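
To make the first contribution concrete, below is a minimal PyTorch-style sketch of the two distillation targets: the last-layer attention distributions (queries against keys) and the value relations (values against values), each matched between teacher and student with KL divergence. The tensor shapes, function names, and the `batchmean` reduction are illustrative assumptions rather than the authors' implementation, which averages the KL terms over attention heads and positions.

```python
import torch
import torch.nn.functional as F

def self_attention_relations(q, k, v):
    """q, k, v: [batch, heads, seq_len, head_dim] from a model's last Transformer layer."""
    d = q.size(-1)
    # Attention distributions: scaled dot-product of queries and keys, then softmax.
    attn = F.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
    # Value relations: scaled dot-product between values, then softmax (the new target).
    val_rel = F.softmax(v @ v.transpose(-1, -2) / d ** 0.5, dim=-1)
    return attn, val_rel

def deep_self_attention_distillation_loss(teacher_qkv, student_qkv):
    """Sum of KL divergences between teacher and student last-layer relations."""
    t_attn, t_val = self_attention_relations(*teacher_qkv)
    s_attn, s_val = self_attention_relations(*student_qkv)
    # F.kl_div expects log-probabilities on the input (student) side.
    l_at = F.kl_div(torch.log(s_attn + 1e-12), t_attn, reduction="batchmean")
    l_vr = F.kl_div(torch.log(s_val + 1e-12), t_val, reduction="batchmean")
    return l_at + l_vr

if __name__ == "__main__":
    B, H, L = 2, 12, 16
    teacher = tuple(torch.randn(B, H, L, 64) for _ in range(3))  # e.g., hidden size 768
    student = tuple(torch.randn(B, H, L, 32) for _ in range(3))  # e.g., hidden size 384
    print(deep_self_attention_distillation_loss(teacher, student))
```

Because both relation matrices have shape [batch, heads, seq_len, seq_len], the student's hidden size need not match the teacher's, which is what gives the method its architectural flexibility. When a teacher assistant is used, the same loss is applied twice: first to distill the assistant from the teacher, then to distill the small student from the assistant.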

Experimental Results

The experimental results underscore the efficacy of the proposed method. The key findings include:

  • Performance and Compression:
    • On the SQuAD 2.0 dataset, the MiniLM model retains over 99% of the teacher model's performance while being twice as fast.
    • For the GLUE benchmark, the MiniLM model exhibits strong performance across various tasks with substantial reduction in parameters and computational requirements.
  • Comparison with Baselines:
    • The 6-layer MiniLM model outperforms previous state-of-the-art models such as DistilBERT and TinyBERT in nearly all evaluated tasks.
    • Specifically, the MiniLM model achieves a 76.4% F1 score on SQuAD 2.0, surpassing DistilBERT’s 70.7% and TinyBERT’s 73.1%.

Implications and Future Directions

The implications of this research are manifold:

  • Practical Deployment: The reduced model size and lower latency make MiniLM highly suitable for deployment in real-world applications where computational resources and response times are critical factors.
  • Theoretical Insights: Introducing value-relation transfer offers a new perspective on how intermediate representations and their relationships can be leveraged in model compression (formalized in the sketch below).
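
In the notation of the abstract, for a single attention head with value matrix $V$ and head dimension $d_k$, the value relation is the softmax of the scaled dot-product between values, matched to the teacher's via KL divergence. This is a simplified sketch; the paper sums this term with the attention-distribution loss and averages over heads and positions.

```latex
\mathrm{VR} = \operatorname{softmax}\!\left(\frac{V V^{\top}}{\sqrt{d_k}}\right), \qquad
\mathcal{L}_{\mathrm{VR}} = \mathrm{KL}\!\left(\mathrm{VR}^{\mathrm{teacher}} \,\big\|\, \mathrm{VR}^{\mathrm{student}}\right)
```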

Looking ahead, this approach could be extended to larger pre-trained models and more sophisticated architectures. Furthermore, the framework adapts to multilingual scenarios, as indicated by initial experiments with multilingual MiniLM on the XNLI and MLQA benchmarks, where strong performance was observed despite the substantial reduction in model size.

Tables and Performance Metrics

The paper includes detailed tables showcasing the performance across different model configurations. For instance:

  • Table 1 compares different architectures and demonstrates the effectiveness of self-attention value-relation transfer.
  • Table 2 illustrates the speedup and parameter reduction, emphasizing the practical benefits of MiniLM.

Conclusion

In conclusion, the MiniLM approach represents a significant advancement in the efficient compression of large pre-trained language models. By focusing on deep self-attention mechanisms, allowing flexibility in the student model architecture, and optionally introducing a teacher assistant, the method achieves a commendable balance between performance and computational efficiency. The insights from this work pave the way for further exploration of distillation techniques and their applications across AI domains.

Authors (6)
  1. Wenhui Wang (47 papers)
  2. Furu Wei (291 papers)
  3. Li Dong (154 papers)
  4. Hangbo Bao (17 papers)
  5. Nan Yang (182 papers)
  6. Ming Zhou (182 papers)
Citations (1,058)