
DistiLLM: Towards Streamlined Distillation for Large Language Models (2402.03898v2)

Published 6 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing its inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., LLMs) suffer from missing a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatches has significantly escalated computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive LLMs. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency in utilizing student-generated outputs. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3$\times$ speedup compared to recent KD methods.

Overview of DistiLLM Framework

DistiLLM is a knowledge distillation (KD) framework designed to efficiently transfer knowledge from large teacher LLMs to smaller student models. The framework addresses two significant challenges: the absence of a standardized objective function and the high computational cost of using student-generated outputs (SGOs) during training.

Introduction to Knowledge Distillation Challenges

The primary goal of KD is to condense the knowledge of a cumbersome teacher model into a more agile student model, preserving performance while reducing computational load. Despite its potential, KD for LLMs has faced two hurdles: the lack of a standardized loss function and the mismatch between the data distributions seen during training and inference, known as exposure bias. These challenges have led to suboptimal results, particularly for generative tasks, where student models fail to adequately capture the complexity of the teacher's output distribution and end up either overly concentrated on a few modes or over-smoothed across the vocabulary.
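As a toy illustration of this trade-off (not taken from the paper), the snippet below compares forward and reverse KL divergences between a bimodal teacher distribution over three tokens and two candidate students: a smooth one that spreads mass everywhere and a peaked one that commits to a single mode. Forward KL strongly favors the smooth student, while reverse KL favors the peaked one, mirroring the over-smoothing versus over-concentration behavior described above.

```python
# Toy example (not from the paper): forward vs. reverse KL between a bimodal
# teacher distribution and two candidate student distributions over 3 tokens.
import torch

teacher = torch.tensor([0.495, 0.495, 0.010])   # two strong modes, one rare token
smooth  = torch.tensor([0.340, 0.330, 0.330])   # covers everything, over-smoothed
peaked  = torch.tensor([0.980, 0.010, 0.010])   # collapses onto one mode

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return torch.sum(p * (p.log() - q.log())).item()

for name, q in [("smooth", smooth), ("peaked", peaked)]:
    # Forward KL penalizes the student for missing teacher mass;
    # reverse KL penalizes it for putting mass where the teacher has little.
    print(f"{name}: forward KL = {kl(teacher, q):.3f}, reverse KL = {kl(q, teacher):.3f}")
```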

Innovations in DistiLLM

The DistiLLM framework introduces two innovations: a skew Kullback-Leibler divergence (skew KLD) loss and an adaptive off-policy approach. The skew KLD introduces a coefficient that mixes the teacher and student distributions before the divergence is computed, which the authors show improves optimization stability and convergence. Empirical results indicate faster convergence and superior performance compared to conventional KLD objectives.
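Below is a minimal sketch of such a skew divergence loss, assuming the formulation in which the student distribution is mixed with the teacher distribution before computing the divergence; the exact mixing convention and the default value of the skew coefficient should be taken from the paper rather than from this sketch.

```python
# Minimal sketch (with the assumptions noted above) of an alpha-skew KL
# distillation loss: instead of KL(teacher || student), compute
# KL(teacher || alpha * teacher + (1 - alpha) * student), which keeps the
# mixture bounded away from zero and stabilizes gradients.
import torch
import torch.nn.functional as F

def skew_kl_loss(teacher_logits: torch.Tensor,
                 student_logits: torch.Tensor,
                 alpha: float = 0.1) -> torch.Tensor:
    """Skew forward KL over the vocabulary dimension.

    teacher_logits, student_logits: [batch, seq_len, vocab_size]
    alpha: skew coefficient; alpha = 0 recovers the standard forward KL.
    """
    p = F.softmax(teacher_logits.detach(), dim=-1)   # teacher distribution (no gradient)
    q = F.softmax(student_logits, dim=-1)            # student distribution
    mix = alpha * p + (1.0 - alpha) * q              # skewed mixture
    # KL(p || mix), summed over the vocabulary, averaged over batch and positions
    kl = (p * (torch.log(p + 1e-10) - torch.log(mix + 1e-10))).sum(dim=-1)
    return kl.mean()
```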

The adaptive off-policy approach leverages SGOs efficiently while managing the risk of noisy feedback and reducing the computational burden of repeated generation. By adaptively adjusting how heavily training relies on SGOs based on the student's measured performance, DistiLLM achieves training speed improvements of up to 4.3 times over recent KD methods without compromising the student model's capabilities.
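The sketch below illustrates one way such an adaptive off-policy loop could look; it is a simplified stand-in, not the paper's exact scheduler. The helpers passed in (`generate_sgo`, `distill_step`, `validation_loss`) are hypothetical placeholders for the real generation, optimization, and evaluation routines, and the rule for adjusting the SGO probability is an assumption made for illustration.

```python
# Simplified, hypothetical sketch of an adaptive off-policy distillation loop.
# Student-generated outputs (SGOs) are cached in a replay buffer and reused so
# that expensive generation is not run at every step, and the probability of
# training on SGOs is adjusted from a periodic validation signal.
import random
from collections import deque

def adaptive_offpolicy_distill(student, teacher, dataloader, val_batch,
                               generate_sgo, distill_step, validation_loss,
                               num_steps=10_000, p_sgo=0.0, reuse_prob=0.5,
                               step_size=0.05, val_every=100):
    buffer = deque(maxlen=1_000)          # replay buffer of cached SGO batches
    prev_val = float("inf")
    for step, batch in zip(range(num_steps), dataloader):
        if random.random() < p_sgo:                          # train on an SGO batch
            if buffer and random.random() < reuse_prob:
                train_batch = random.choice(buffer)           # off-policy reuse: no generation cost
            else:
                train_batch = generate_sgo(student, batch)    # fresh (expensive) generation
                buffer.append(train_batch)
        else:                                                 # train on fixed ground-truth data
            train_batch = batch
        distill_step(student, teacher, train_batch)           # e.g. minimize the skew KL loss above
        if step % val_every == 0:
            val = validation_loss(student, teacher, val_batch)
            # Assumed adjustment rule: lean more on SGOs only while validation loss keeps improving.
            p_sgo = min(1.0, p_sgo + step_size) if val <= prev_val else max(0.0, p_sgo - step_size)
            prev_val = val
    return student
```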

Empirical Validation and Performance

Extensive experiments on tasks such as instruction-following, text summarization, and machine translation validate the efficacy of DistiLLM. It achieves state-of-the-art performance for student LLMs across a variety of generative tasks while substantially reducing training time, and it consistently outperforms existing KD methods when operating within constrained computational budgets.

Conclusion

The DistiLLM framework advances the efficient distillation of LLMs by addressing both the objective-function and efficiency challenges of prior KD methods, producing capable and efficient smaller models. Its dual focus on effective knowledge transfer and training efficiency makes it well suited to the broader adoption and deployment of LLMs in resource-limited environments.

Authors (4)
  1. Jongwoo Ko (20 papers)
  2. Sungnyun Kim (19 papers)
  3. Tianyi Chen (139 papers)
  4. Se-Young Yun (114 papers)