Weight Distillation: Transferring the Knowledge in Neural Network Parameters (2009.09152v3)

Published 19 Sep 2020 in cs.CL

Abstract: Knowledge distillation has been proven to be effective in model acceleration and compression. It allows a small network to learn to generalize in the same way as a large network. Recent successes in pre-training suggest the effectiveness of transferring model parameters. Inspired by this, we investigate methods of model acceleration and compression in another line of research. We propose Weight Distillation to transfer the knowledge in the large network's parameters through a parameter generator. Our experiments on the WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks show that weight distillation can train a small network that is 1.88x to 2.94x faster than the large network while retaining competitive performance. With a small network of the same size, weight distillation outperforms knowledge distillation by 0.51 to 1.82 BLEU points.
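To make the abstract's core idea concrete, below is a minimal PyTorch sketch of the general mechanism it describes: a parameter generator that maps a large (teacher) network's weights onto a smaller (student) network's weights, combined with a standard output-level distillation loss. The generator architecture, layer shapes, and training procedure shown here are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of weight distillation. A parameter generator maps a
# teacher weight matrix to a smaller student weight matrix; the student is
# then trained with the task loss plus a distillation term on the teacher's
# outputs. All names, shapes, and hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParameterGenerator(nn.Module):
    """Projects a teacher weight matrix (d_t x d_t) to a student weight
    matrix (d_s x d_s) via two learnable projection matrices (an assumed,
    simple linear generator)."""

    def __init__(self, d_teacher: int, d_student: int):
        super().__init__()
        self.proj_in = nn.Parameter(torch.randn(d_student, d_teacher) * 0.02)
        self.proj_out = nn.Parameter(torch.randn(d_teacher, d_student) * 0.02)

    def forward(self, teacher_weight: torch.Tensor) -> torch.Tensor:
        # (d_s x d_t) @ (d_t x d_t) @ (d_t x d_s) -> (d_s x d_s)
        return self.proj_in @ teacher_weight @ self.proj_out


# Example: generate one student layer's weights from a teacher layer.
d_teacher, d_student = 512, 256
teacher_layer = nn.Linear(d_teacher, d_teacher)
generator = ParameterGenerator(d_teacher, d_student)

student_weight = generator(teacher_layer.weight.detach())
student_layer = nn.Linear(d_student, d_student)
with torch.no_grad():
    student_layer.weight.copy_(student_weight)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard KL-based knowledge distillation loss on output distributions,
    used here alongside the generated weights; the paper's full objective
    may combine additional terms."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
```

In this sketch, the generator's projections would be learned jointly with the task and distillation losses, after which the generated student weights can be fine-tuned directly; the exact two-stage training schedule and generator design in the paper may differ.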

Authors (7)
  1. Ye Lin (20 papers)
  2. Yanyang Li (22 papers)
  3. Ziyang Wang (59 papers)
  4. Bei Li (51 papers)
  5. Quan Du (8 papers)
  6. Tong Xiao (119 papers)
  7. Jingbo Zhu (79 papers)
Citations (20)
