GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model (2306.06629v1)

Published 11 Jun 2023 in cs.CL and cs.AI

Abstract: Reducing the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their deployment on a wide range of devices. However, deploying knowledge distillation systems in real-world, industrial-strength applications remains challenging: such applications require complex distillation methods applied to even larger-scale PLMs (over 10B parameters) and are constrained by GPU memory and the cost of switching between methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation of larger-scale PLMs with a variety of distillation methods. With GKD, developers can distill larger models on memory-limited GPUs and easily switch between, or combine, different distillation methods within a single framework. Experimental results show that GKD supports the distillation of PLMs of at least 100B parameters with 25 mainstream methods on 8 NVIDIA A100 (40GB) GPUs.
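
For readers unfamiliar with the basic mechanics, the sketch below shows a vanilla logit-distillation loss in the style of Hinton et al. (2015), one of the mainstream methods a framework like GKD is built to support. This is an illustrative PyTorch sketch, not the GKD API; the function name and default hyperparameters are hypothetical.

```python
# Minimal sketch of vanilla logit distillation (soft teacher targets + hard labels).
# Illustrative only; not the GKD implementation or API.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a temperature-smoothed KL term (teacher -> student) with cross-entropy."""
    # Soft targets: the student matches the teacher's softened output distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients back to the original magnitude
    # Hard targets: ordinary supervised loss on the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Example shapes: logits are [batch, vocab], labels are [batch].
s = torch.randn(4, 32000)
t = torch.randn(4, 32000)
y = torch.randint(0, 32000, (4,))
loss = distillation_loss(s, t, y)
```

Methods beyond logit matching (intermediate-layer, attention-based, or multi-teacher distillation) add further loss terms over hidden states, which is where GPU memory on 10B+ teachers becomes the binding constraint that GKD targets.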

Authors (12)
  1. Shicheng Tan (5 papers)
  2. Weng Lam Tam (8 papers)
  3. Yuanchun Wang (6 papers)
  4. Wenwen Gong (4 papers)
  5. Yang Yang (884 papers)
  6. Hongyin Tang (9 papers)
  7. Keqing He (47 papers)
  8. Jiahao Liu (72 papers)
  9. Jingang Wang (71 papers)
  10. Shu Zhao (31 papers)
  11. Peng Zhang (642 papers)
  12. Jie Tang (302 papers)