Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method (2306.06625v1)
Abstract: The large scale of pre-trained language models poses a challenge for deploying them on various devices, prompting growing interest in model compression methods, particularly knowledge distillation. However, current knowledge distillation methods rely on the model's intermediate-layer features and on golden labels (also called hard labels), which respectively require an aligned model architecture and sufficient labeled data. Moreover, existing methods usually neglect the vocabulary-related parameters. To address these problems, we propose a general language model distillation (GLMD) method that performs two-stage word-prediction distillation and vocabulary compression; it is simple yet shows surprisingly strong performance. Specifically, by dispensing with intermediate layers and golden labels, GLMD removes the dimension and structure constraints between teacher and student models as well as the need for labeled datasets, supporting more general application scenarios. Meanwhile, exploiting the long-tailed distribution of word frequencies in the data, GLMD compresses the vocabulary by reducing its size rather than its dimensionality. Experimental results show that our method outperforms 25 state-of-the-art methods on the SuperGLUE benchmark, surpassing the best method's average score by 3%.
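The abstract describes two ideas that lend themselves to a brief illustration: a word-prediction (output-logits) distillation objective that needs neither intermediate-layer features nor golden labels, and a vocabulary compression step that keeps only the most frequent tokens rather than shrinking the embedding dimensionality. The PyTorch sketch below is not the paper's implementation; the function names, the temperature, and the `keep_ratio` are illustrative assumptions, and GLMD's actual two-stage schedule and compression details are specified in the paper.

```python
# Minimal sketch (not the authors' implementation): logits-only distillation
# plus frequency-based vocabulary truncation. Names and hyperparameters here
# are placeholders, not GLMD's actual settings.
from collections import Counter

import torch
import torch.nn.functional as F


def word_prediction_distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student word-prediction distributions.

    Uses only the output (vocabulary) logits: no intermediate-layer features
    and no golden labels, so teacher and student may differ in depth and width.
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)


def compress_vocab(token_ids, keep_ratio=0.25):
    """Keep the most frequent token ids from a corpus (long-tailed counts).

    Returns the kept ids (old vocabulary indices) and a remapping
    old_id -> new_id for re-indexing the training data.
    """
    freq = Counter(token_ids)
    keep = max(1, int(len(freq) * keep_ratio))
    kept_ids = [tok for tok, _ in freq.most_common(keep)]
    remap = {old: new for new, old in enumerate(kept_ids)}
    return kept_ids, remap


# Usage sketch: shrink an embedding/output table to the rows of kept tokens.
if __name__ == "__main__":
    corpus_ids = [3, 3, 7, 3, 7, 11, 42, 3, 7, 11]   # toy token-id stream
    kept_ids, remap = compress_vocab(corpus_ids, keep_ratio=0.5)
    emb = torch.randn(64, 16)                        # [vocab, hidden] toy table
    small_emb = emb[torch.tensor(kept_ids)]          # rows for kept tokens only
    print(len(kept_ids), small_emb.shape)
```

Because only output-layer distributions are matched, the student's hidden size and layer count are unconstrained by the teacher, which is the generality the abstract emphasizes; truncating the vocabulary shrinks the embedding and output matrices without touching the hidden dimensionality.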