GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model (2306.06629v1)
Abstract: Reducing the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their deployment on a wide range of devices. However, deploying knowledge distillation systems in real-world, industrial-strength applications remains challenging: it requires applying complex distillation methods to even larger-scale PLMs (over 10B parameters) and is constrained by GPU memory and the difficulty of switching between methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation of larger-scale PLMs with a variety of distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch between and combine different distillation methods within a single framework. Experimental results show that GKD supports the distillation of at least 100B-scale PLMs and 25 mainstream methods on 8 NVIDIA A100 (40GB) GPUs.
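The GKD framework itself is not reproduced on this page. As a minimal sketch of the basic logit-distillation objective (Hinton et al., 2015) that such frameworks wrap and extend, the snippet below assumes PyTorch and hypothetical `teacher`/`student` classifier modules; it is an illustration of the standard technique, not the GKD API.

```python
# Minimal sketch of temperature-scaled logit distillation (Hinton et al., 2015).
# `teacher` and `student` are hypothetical nn.Module classifiers; this is not
# the GKD API, only an illustration of the basic objective such frameworks wrap.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine the soft-target KL term (scaled by T^2) with hard-label cross entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Illustrative usage, with logits of shape [batch, num_classes]:
# with torch.no_grad():
#     teacher_logits = teacher(input_ids)
# loss = distillation_loss(student(input_ids), teacher_logits, labels)
```

Intermediate-layer, attention-relation, and multi-teacher variants cited in the paper replace or augment this soft-target term; the memory-limitation and method-switching issues the abstract describes arise when such objectives are applied to teachers beyond the 10B-parameter scale.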
Authors: Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, Peng Zhang, Jie Tang