
DistiLLM: Towards Streamlined Distillation for Large Language Models (2402.03898v2)

Published 6 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing its inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., LLMs) suffer from missing a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatches has significantly escalated computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive LLMs. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency in utilizing student-generated outputs. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3$\times$ speedup compared to recent KD methods.

Overview of DistiLLM Framework

DistiLLM is a knowledge distillation (KD) framework designed to efficiently transfer knowledge from large teacher LLMs to smaller student models. The framework addresses two significant challenges: the absence of a standardized objective function and the high computational cost of using student-generated outputs (SGOs) during training.

Introduction to Knowledge Distillation Challenges

The primary goal of KD is to condense the knowledge of a cumbersome teacher model into a more agile student model, preserving performance while reducing computational load. Despite its potential, KD for LLMs has faced two hurdles: the lack of a standardized loss function and the mismatch between the data distributions seen during training and inference, known as exposure bias. These challenges have led to suboptimal results, particularly for generative tasks, where student models fail to adequately capture the complexity of the teacher's output distribution and end up either overly concentrated on a few modes or over-smoothed across the vocabulary.
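As a toy illustration of this trade-off (not taken from the paper), the snippet below compares forward and reverse KL divergences between a bimodal teacher distribution over three tokens and two candidate students: a smooth one that spreads mass everywhere and a peaked one that commits to a single mode. Forward KL strongly favors the smooth student, while reverse KL favors the peaked one, mirroring the over-smoothing versus over-concentration behavior described above.

```python
# Toy example (not from the paper): forward vs. reverse KL between a bimodal
# teacher distribution and two candidate student distributions over 3 tokens.
import torch

teacher = torch.tensor([0.495, 0.495, 0.010])   # two strong modes, one rare token
smooth  = torch.tensor([0.340, 0.330, 0.330])   # covers everything, over-smoothed
peaked  = torch.tensor([0.980, 0.010, 0.010])   # collapses onto one mode

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return torch.sum(p * (p.log() - q.log())).item()

for name, q in [("smooth", smooth), ("peaked", peaked)]:
    # Forward KL penalizes the student for missing teacher mass;
    # reverse KL penalizes it for putting mass where the teacher has little.
    print(f"{name}: forward KL = {kl(teacher, q):.3f}, reverse KL = {kl(q, teacher):.3f}")
```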

Innovations in DistiLLM

The DistiLLM framework introduces two innovations: a skew Kullback-Leibler divergence (skew KLD) loss and an adaptive off-policy approach. The skew KLD introduces a coefficient that mixes the teacher and student distributions before the divergence is computed, which the authors show improves optimization stability and convergence. Empirical results indicate faster convergence and superior performance compared to conventional KLD objectives.
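Below is a minimal sketch of such a skew divergence loss, assuming the formulation in which the student distribution is mixed with the teacher distribution before computing the divergence; the exact mixing convention and the default value of the skew coefficient should be taken from the paper rather than from this sketch.

```python
# Minimal sketch (with the assumptions noted above) of an alpha-skew KL
# distillation loss: instead of KL(teacher || student), compute
# KL(teacher || alpha * teacher + (1 - alpha) * student), which keeps the
# mixture bounded away from zero and stabilizes gradients.
import torch
import torch.nn.functional as F

def skew_kl_loss(teacher_logits: torch.Tensor,
                 student_logits: torch.Tensor,
                 alpha: float = 0.1) -> torch.Tensor:
    """Skew forward KL over the vocabulary dimension.

    teacher_logits, student_logits: [batch, seq_len, vocab_size]
    alpha: skew coefficient; alpha = 0 recovers the standard forward KL.
    """
    p = F.softmax(teacher_logits.detach(), dim=-1)   # teacher distribution (no gradient)
    q = F.softmax(student_logits, dim=-1)            # student distribution
    mix = alpha * p + (1.0 - alpha) * q              # skewed mixture
    # KL(p || mix), summed over the vocabulary, averaged over batch and positions
    kl = (p * (torch.log(p + 1e-10) - torch.log(mix + 1e-10))).sum(dim=-1)
    return kl.mean()
```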

The adaptive off-policy approach leverages SGOs efficiently while managing the risk of noisy feedback and reducing the computational burden of repeated generation. By adaptively adjusting how heavily training relies on SGOs based on the student's measured performance, DistiLLM achieves training speed improvements of up to 4.3 times over recent KD methods without compromising the student model's capabilities.
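The sketch below illustrates one way such an adaptive off-policy loop could look; it is a simplified stand-in, not the paper's exact scheduler. The helpers passed in (`generate_sgo`, `distill_step`, `validation_loss`) are hypothetical placeholders for the real generation, optimization, and evaluation routines, and the rule for adjusting the SGO probability is an assumption made for illustration.

```python
# Simplified, hypothetical sketch of an adaptive off-policy distillation loop.
# Student-generated outputs (SGOs) are cached in a replay buffer and reused so
# that expensive generation is not run at every step, and the probability of
# training on SGOs is adjusted from a periodic validation signal.
import random
from collections import deque

def adaptive_offpolicy_distill(student, teacher, dataloader, val_batch,
                               generate_sgo, distill_step, validation_loss,
                               num_steps=10_000, p_sgo=0.0, reuse_prob=0.5,
                               step_size=0.05, val_every=100):
    buffer = deque(maxlen=1_000)          # replay buffer of cached SGO batches
    prev_val = float("inf")
    for step, batch in zip(range(num_steps), dataloader):
        if random.random() < p_sgo:                          # train on an SGO batch
            if buffer and random.random() < reuse_prob:
                train_batch = random.choice(buffer)           # off-policy reuse: no generation cost
            else:
                train_batch = generate_sgo(student, batch)    # fresh (expensive) generation
                buffer.append(train_batch)
        else:                                                 # train on fixed ground-truth data
            train_batch = batch
        distill_step(student, teacher, train_batch)           # e.g. minimize the skew KL loss above
        if step % val_every == 0:
            val = validation_loss(student, teacher, val_batch)
            # Assumed adjustment rule: lean more on SGOs only while validation loss keeps improving.
            p_sgo = min(1.0, p_sgo + step_size) if val <= prev_val else max(0.0, p_sgo - step_size)
            prev_val = val
    return student
```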

Empirical Validation and Performance

Extensive experiments on tasks such as instruction-following, text summarization, and machine translation validate the efficacy of DistiLLM. It achieves state-of-the-art performance for student LLMs across a variety of generative tasks while substantially reducing training time, and it consistently outperforms existing KD methods when operating within constrained computational budgets.

Conclusion

The DistiLLM framework advances the efficient distillation of LLMs by addressing both the objective-function and efficiency challenges of prior KD methods, producing capable and efficient smaller models. Its dual focus on effective knowledge transfer and training efficiency makes it well suited to the broader adoption and deployment of LLMs in resource-limited environments.

Authors (4)
  1. Jongwoo Ko (20 papers)
  2. Sungnyun Kim (19 papers)
  3. Tianyi Chen (139 papers)
  4. Se-Young Yun (114 papers)