
MiniLLM: Knowledge Distillation of Large Language Models (2306.08543v4)

Published 14 Jun 2023 in cs.CL and cs.AI

Abstract: Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of LLMs. However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller LLMs. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative LLMs, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/miniLLM.

Analyzing "MiniLLM: Knowledge Distillation of LLMs"

The paper "MiniLLM: Knowledge Distillation of LLMs" explores the under-explored field of knowledge distillation (KD) applied to LLMs, presenting a method to distill LLMs' knowledge into smaller, computationally efficient models. This process aims to maintain the generative prowess of the original models while easing resource demands, a necessity with the proliferation of open-source LLMs.

Key Contributions and Methodology

The authors propose a novel approach that substitutes reverse Kullback-Leibler divergence (KLD) for the standard forward KLD objective in KD. The switch matters for generative LLMs: minimizing forward KLD forces the student to cover the entire teacher distribution, so the student ends up assigning inflated probability to the teacher's low-probability regions. Reverse KLD is mode-seeking, letting the student concentrate its limited capacity on the teacher's major modes, which is important when the complexity of the teacher distribution exceeds the expressive capacity of the smaller student.
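To make the contrast concrete, here is a minimal sketch (our illustration, not the authors' implementation; the toy distributions are invented) comparing forward and reverse KLD on a toy next-token distribution:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), with 0 * log(0) taken as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy next-token distributions over a 4-token vocabulary.
teacher = np.array([0.60, 0.30, 0.05, 0.05])  # teacher keeps some mass on a tail
student = np.array([0.70, 0.28, 0.01, 0.01])  # small student ignores the tail

forward_kld = kl(teacher, student)  # standard KD objective: KL(teacher || student)
reverse_kld = kl(student, teacher)  # MiniLLM objective:     KL(student || teacher)

# Forward KLD strongly penalizes the student for under-covering the teacher's
# tail, pushing mass into low-probability regions; reverse KLD mainly penalizes
# mass the student places where the teacher has little, a mode-seeking fit.
print(f"forward KLD = {forward_kld:.4f}, reverse KLD = {reverse_kld:.4f}")
```

In the paper's formulation, the student parameters are trained as θ = arg min_θ KL[q_θ ∥ p], where p is the teacher's distribution and q_θ is the student's.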

The paper then derives a tractable optimization strategy, leveraging policy gradient techniques to minimize this reverse KLD. The method introduces three enhancements:

  1. Single-Step Decomposition: Reduces training variance by isolating single-step generation quality.
  2. Teacher-Mixed Sampling: Mitigates reward hacking by mixing the teacher's distribution into the student's during sampling (see the sketch after this list).
  3. Length Normalization: Addresses sequence length bias, promoting optimal response length during generation.
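As a rough illustration of teacher-mixed sampling (a minimal sketch under assumed names; the mixing coefficient `alpha` and the toy distributions are ours, not taken from the paper), tokens are drawn from a convex combination of the teacher's and student's next-token distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_mixed_sample(p_teacher, p_student, alpha=0.2):
    """Sample one token id from the mixture alpha * teacher + (1 - alpha) * student.

    Pulling samples toward the teacher keeps rollouts in regions where the
    teacher assigns reasonable probability, so the student cannot exploit
    degenerate sequences for spuriously good rewards (reward hacking).
    """
    p_mix = alpha * p_teacher + (1.0 - alpha) * p_student
    p_mix /= p_mix.sum()  # guard against floating-point drift
    return int(rng.choice(len(p_mix), p=p_mix))

# Toy next-token distributions over a 5-token vocabulary.
p_teacher = np.array([0.50, 0.25, 0.15, 0.07, 0.03])
p_student = np.array([0.10, 0.10, 0.10, 0.10, 0.60])  # student favors a bad mode

token = teacher_mixed_sample(p_teacher, p_student, alpha=0.2)
print("sampled token id:", token)
```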

Together, these strategies form an effective KD recipe for generative LLMs; the resulting student models are termed MiniLLM.

Experimental Validation

Extensive experiments substantiate the MiniLLM framework's advantages:

  • MiniLLMs exhibit superior performance across various instruction-following evaluations, spanning models with parameters ranging from 120M to 13B.
  • Analyses show reduced exposure bias and better calibration. Notably, in many cases the distilled models exceeded the teacher's performance as measured by ROUGE-L and GPT-4 feedback (a scoring sketch follows this list).
  • Further tests reveal consistent student model performance enhancements correlated with increasing teacher model sizes, indicating scalability.
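For reference, ROUGE-L can be computed with the open-source `rouge-score` package (this snippet is our illustration of the metric, not the paper's evaluation code; the example strings are invented):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# ROUGE-L scores the longest common subsequence between a reference and a
# candidate, rewarding responses that preserve the reference's content order.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Knowledge distillation compresses a large model into a smaller one."
candidate = "Distillation compresses a large language model into a small one."
score = scorer.score(reference, candidate)["rougeL"]
print(f"P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```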

Implications and Future Directions

The research underscores the potential of reverse KLD for distilling LLMs, opening practical routes to deploying efficient, small-scale models. This advancement could broaden access to LLM capabilities at reduced computational cost, with significance for model efficiency in both academic and industrial settings.

Looking forward, this work lays a foundation for exploring divergence measures beyond reverse KLD and their effect on KD efficacy. Continuing this line of inquiry could yield new KD methods suited to increasingly complex applications, sharpening both our understanding and our deployment of scalable LLM technologies.

In summary, this paper articulates a significant refinement to traditional KD strategies, paving the way for delivering LLM-caliber capabilities more broadly and efficiently. The contribution is likely to influence both the theoretical underpinnings and the practical engineering of AI-driven language systems.

Authors (4)
  1. Yuxian Gu (21 papers)
  2. Li Dong (154 papers)
  3. Furu Wei (291 papers)
  4. Minlie Huang (225 papers)
Citations (65)