
Learning Light-Weight Translation Models from Deep Transformer (2012.13866v1)

Published 27 Dec 2020 in cs.CL

Abstract: Recently, deep models have shown tremendous improvements in neural machine translation (NMT). However, systems of this kind are computationally expensive and memory intensive. In this paper, we take a natural step towards learning strong but light-weight NMT systems. We propose a novel group-permutation based knowledge distillation approach to compress the deep Transformer model into a shallow model. The experimental results on several benchmarks validate the effectiveness of our method. Our compressed model is 8X shallower than the deep model, with almost no loss in BLEU. To further enhance the teacher model, we present a Skipping Sub-Layer method that randomly omits sub-layers to introduce perturbation into training, which achieves a BLEU score of 30.63 on English-German newstest2014. The code is publicly available at https://github.com/libeineu/GPKD.
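
The Skipping Sub-Layer idea described in the abstract (randomly omitting sub-layers to perturb teacher training) can be pictured as stochastic sub-layer dropping inside each encoder layer. Below is a minimal PyTorch sketch under that reading; the class name, pre-norm layout, and `p_skip` value are illustrative assumptions, not details taken from the paper or the GPKD repository.

```python
import torch
import torch.nn as nn


class SkippingSubLayerEncoderLayer(nn.Module):
    """Illustrative pre-norm Transformer encoder layer that randomly omits
    sub-layers during training (sketch of the Skipping Sub-Layer idea).
    `p_skip` is a placeholder hyper-parameter, not a value from the paper."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, p_skip=0.2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.p_skip = p_skip

    def forward(self, x):
        # Self-attention sub-layer: skipped with probability p_skip while
        # training, always executed at inference time (residual path kept).
        if not (self.training and torch.rand(1).item() < self.p_skip):
            h = self.norm1(x)
            attn_out, _ = self.self_attn(h, h, h)
            x = x + attn_out
        # Feed-forward sub-layer, skipped independently the same way.
        if not (self.training and torch.rand(1).item() < self.p_skip):
            x = x + self.ffn(self.norm2(x))
        return x
```

Because skipped sub-layers reduce to the residual identity, the perturbation only affects training; at inference the full layer is used. The group-permutation distillation step that compresses the deep teacher into the 8X shallower student is not reproduced here.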

Authors (7)
  1. Bei Li (51 papers)
  2. Ziyang Wang (59 papers)
  3. Hui Liu (481 papers)
  4. Quan Du (8 papers)
  5. Tong Xiao (119 papers)
  6. Chunliang Zhang (12 papers)
  7. Jingbo Zhu (79 papers)
Citations (38)
