Towards Zero-Shot Knowledge Distillation for Natural Language Processing (2012.15495v1)

Published 31 Dec 2020 in cs.CL and cs.LG

Abstract: Knowledge Distillation (KD) is a common knowledge transfer algorithm used for model compression across a variety of deep-learning-based NLP solutions. In its regular manifestations, KD requires access to the teacher's training data for knowledge transfer to the student network. However, privacy concerns, data regulations and proprietary reasons may prevent access to such data. We present, to the best of our knowledge, the first work on Zero-Shot Knowledge Distillation for NLP, where the student learns from the much larger teacher without any task-specific data. Our solution combines out-of-domain data and adversarial training to learn the teacher's output distribution. We investigate six tasks from the GLUE benchmark and demonstrate that we can achieve between 75% and 92% of the teacher's classification score (accuracy or F1) while compressing the model 30 times.
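The core idea in the abstract is that the student matches the teacher's output distribution on data that is not the teacher's (inaccessible) task-specific training set. The sketch below illustrates the standard response-based distillation objective on unlabeled out-of-domain text; it is a minimal illustration of that objective, not the paper's full adversarial training procedure, and the `teacher`, `student`, and `batch` names are assumptions for this example.

```python
# Minimal sketch: response-based knowledge distillation on unlabeled
# out-of-domain text, assuming a larger teacher classifier and a smaller
# student classifier (e.g. Hugging Face-style models whose forward pass
# returns an object with a .logits field).
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradients have a magnitude comparable to hard-label training.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2


def distillation_step(student, teacher, batch, optimizer, temperature=2.0):
    """One training step on a batch of tokenized out-of-domain text (no task labels)."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits  # teacher's output distribution
    student_logits = student(**batch).logits
    loss = distillation_loss(student_logits, teacher_logits, temperature)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper's setting, the batches would come from out-of-domain corpora and from an adversarial data-generation process rather than from the teacher's original task data; the loss above is the distribution-matching component that both share.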

Authors (4)
  1. Ahmad Rashid (24 papers)
  2. Vasileios Lioutas (16 papers)
  3. Abbas Ghaddar (18 papers)
  4. Mehdi Rezagholizadeh (78 papers)
Citations (25)