GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model (2306.06629v1)
Abstract: Reducing the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their deployment on a wide range of devices. However, deploying knowledge distillation systems in real-world, industrial-strength applications remains challenging: it requires applying complex distillation methods to even larger-scale PLMs (over 10B parameters) and is constrained by GPU memory and the difficulty of switching between methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation of larger-scale PLMs with a variety of distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch between and combine different distillation methods within a single framework. Experimental results show that GKD supports the distillation of at least 100B-scale PLMs and 25 mainstream methods on 8 NVIDIA A100 (40GB) GPUs.
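The GKD framework itself is not reproduced on this page. As a minimal sketch of the basic logit-distillation objective (Hinton et al., 2015) that such frameworks wrap and extend, the snippet below assumes PyTorch and hypothetical `teacher`/`student` classifier modules; it is an illustration of the standard technique, not the GKD API.

```python
# Minimal sketch of temperature-scaled logit distillation (Hinton et al., 2015).
# `teacher` and `student` are hypothetical nn.Module classifiers; this is not
# the GKD API, only an illustration of the basic objective such frameworks wrap.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine the soft-target KL term (scaled by T^2) with hard-label cross entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Illustrative usage, with logits of shape [batch, num_classes]:
# with torch.no_grad():
#     teacher_logits = teacher(input_ids)
# loss = distillation_loss(student(input_ids), teacher_logits, labels)
```

Intermediate-layer, attention-relation, and multi-teacher variants cited in the paper replace or augment this soft-target term; the memory-limitation and method-switching issues the abstract describes arise when such objectives are applied to teachers beyond the 10B-parameter scale.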
Authors: Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, Peng Zhang, Jie Tang