Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models (2404.02657v4)
Abstract: Kullback-Leibler (KL) divergence is widely used in Knowledge Distillation (KD) to compress large language models (LLMs). Contrary to prior assertions that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus preferable to the mean-seeking forward Kullback-Leibler (FKL) divergence, this study demonstrates empirically and theoretically that neither the mode-seeking nor the mean-seeking property manifests in KD for LLMs. Instead, RKL and FKL share the same optimization objective, and both converge after a sufficient number of epochs. In practice, however, LLMs are seldom trained for that many epochs. We further find that, in the early epochs, RKL focuses on the tail of the distributions while FKL focuses on the head. Consequently, we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence, which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based evaluations show that AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses. Code is available at https://github.com/wutaiqiang/LLM_KD_AKL.
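To make the FKL/RKL distinction and the adaptive combination concrete, below is a minimal PyTorch sketch of the three losses computed over per-token logits. The head/tail-gap weighting rule and the top-k "head" definition here are illustrative assumptions, not necessarily the paper's exact formulation; see the linked repository for the authors' implementation.

```python
# Minimal sketch: forward KL (FKL), reverse KL (RKL), and an adaptive
# weighted combination (AKL-style) for logit-level distillation.
# The weighting rule below is a simplified placeholder, not the paper's exact rule.
import torch
import torch.nn.functional as F


def fkl(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL: KL(p_teacher || q_student), averaged over token positions."""
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (log_p - log_q)).sum(dim=-1).mean()


def rkl(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL: KL(q_student || p_teacher), averaged over token positions."""
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(dim=-1).mean()


def akl(teacher_logits: torch.Tensor, student_logits: torch.Tensor, k: int = 100) -> torch.Tensor:
    """Adaptive combination: weight FKL vs. RKL by whether the teacher-student
    gap is larger on the head or the tail of the teacher distribution.
    (Illustrative rule; k is a hyperparameter of this sketch, not from the paper.)"""
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    # "Head" = teacher's top-k most probable tokens; the rest is the "tail".
    topk = p.topk(k, dim=-1).indices
    head_mask = torch.zeros_like(p).scatter_(-1, topk, 1.0)
    gap = (p - q).abs()
    head_gap = (gap * head_mask).sum(dim=-1)
    tail_gap = (gap * (1.0 - head_mask)).sum(dim=-1)
    # Larger head error -> larger weight on FKL (which emphasizes the head).
    w = (head_gap / (head_gap + tail_gap + 1e-8)).mean()
    return w * fkl(teacher_logits, student_logits) + (1.0 - w) * rkl(teacher_logits, student_logits)


# Usage example: 4 token positions over a 32,000-token vocabulary.
teacher = torch.randn(4, 32000)
student = torch.randn(4, 32000)
loss = akl(teacher, student)
```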