Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection (2302.00444v1)

Published 1 Feb 2023 in cs.CL

Abstract: Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, multiple types of knowledge are typically extracted from the teacher model, and the challenge is to make full use of them when training the student model. Our preliminary study shows that (1) not all of the knowledge is necessary for learning a good student model, and (2) knowledge distillation benefits from different knowledge at different training steps. In response to these findings, we propose an actor-critic approach that selects the appropriate knowledge to transfer during knowledge distillation. In addition, we refine the training algorithm to ease the computational burden. Experimental results on the GLUE datasets show that our method significantly outperforms several strong knowledge distillation baselines.
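
To make the idea of knowledge selection concrete, the sketch below shows one way multiple distillation losses (here, soft-label KD and hidden-state matching) can be gated by a small selector network before being combined. This is a minimal illustration of the general setup, not the authors' implementation: the module names, the soft (softmax) gating, the feature vector describing the training state, and all dimensions are assumptions; the paper itself trains the selector with an actor-critic objective, which is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeSelector(nn.Module):
    """Actor-style network mapping a training-state feature vector to
    per-knowledge-type weights (hypothetical design for illustration)."""

    def __init__(self, state_dim: int, num_knowledge: int):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_knowledge),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Soft selection over knowledge types; a sampled (hard) selection
        # trained with actor-critic rewards is what the paper describes.
        return torch.softmax(self.policy(state), dim=-1)


def distillation_losses(student_logits, teacher_logits,
                        student_hidden, teacher_hidden, temperature=2.0):
    """Two example knowledge types: soft-label KD and hidden-state matching."""
    kd_logits = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    kd_hidden = F.mse_loss(student_hidden, teacher_hidden)
    return torch.stack([kd_logits, kd_hidden])


# Toy usage: weight each loss by the selector's output and backprop the sum.
selector = KnowledgeSelector(state_dim=3, num_knowledge=2)
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
student_hidden = torch.randn(8, 128, requires_grad=True)
teacher_hidden = torch.randn(8, 128)

# Hypothetical "state" summarizing training progress (step, recent losses, ...).
state = torch.tensor([[0.1, 0.5, 0.3]])
weights = selector(state).squeeze(0)            # shape: (num_knowledge,)
losses = distillation_losses(student_logits, teacher_logits,
                             student_hidden, teacher_hidden)
total = (weights * losses).sum()
total.backward()
```

In this toy setup the selection is differentiable and learned jointly with the student; the paper instead frames selection as a sequential decision problem and uses a critic to estimate the value of each choice, along with a refinement of the training algorithm to reduce the computational cost.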

Authors (6)
  1. Chenglong Wang (80 papers)
  2. Yi Lu (145 papers)
  3. Yongyu Mu (15 papers)
  4. Yimin Hu (6 papers)
  5. Tong Xiao (119 papers)
  6. Jingbo Zhu (79 papers)
Citations (6)