Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation (2505.13111v1)
Abstract: Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly LLMs. While its empirical benefits are well documented--enabling smaller student models to emulate the performance of much larger teachers--the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage--a behavior modulated by a single entropy-controlling parameter. We then validate this effect in a large-scale language modeling setup using the SmolLM2 family of models. Empirical results reveal the same precision-recall dynamics observed in simulation, where precision corresponds to sample quality and recall to distributional coverage. This precision-recall trade-off proves especially beneficial in scenarios where sample quality outweighs diversity, such as instruction tuning or downstream generation. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.
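The mixture-of-Gaussians simulation described in the abstract can be illustrated in a few lines of code. The sketch below is our own construction, not the paper's setup: the temperature `tau` (playing the role of the entropy-controlling parameter), the component means and weights, the deliberately smaller 4-component student, and the precision/recall proxies are all illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's code): a 1-D Gaussian-mixture
# teacher whose component weights are sharpened by a temperature `tau`, and a
# smaller Gaussian-mixture student fit to samples drawn from that teacher.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Ground-truth distribution: 8 Gaussian components with mildly unequal weights.
means = np.linspace(-10, 10, 8)
raw_w = np.linspace(1.0, 3.0, 8)

def teacher_sample(tau, n=20_000):
    """Sample the teacher: weights sharpened as w_k^(1/tau), then renormalized.
    Smaller tau -> lower entropy -> a more selective teacher."""
    w = raw_w ** (1.0 / tau)
    w /= w.sum()
    comps = rng.choice(len(means), size=n, p=w)
    return means[comps] + rng.normal(0.0, 0.5, size=n)

def precision_recall(student, n=20_000):
    """Crude proxies (ours, not the paper's metrics):
    precision = fraction of student samples landing near *some* true mode;
    recall    = fraction of true modes the student puts visible mass near."""
    s = student.sample(n)[0].ravel()
    dist = np.abs(s[:, None] - means[None, :])
    precision = (dist.min(axis=1) < 1.0).mean()
    recall = np.mean([(np.abs(s - m) < 1.0).mean() > 0.01 for m in means])
    return precision, recall

for tau in [2.0, 1.0, 0.5, 0.25]:
    x = teacher_sample(tau).reshape(-1, 1)
    # A deliberately under-capacity student: 4 components vs. the teacher's 8.
    student = GaussianMixture(n_components=4, random_state=0).fit(x)
    p, r = precision_recall(student)
    print(f"tau={tau:4.2f}  precision~{p:.2f}  recall~{r:.2f}")
```

Under these assumptions, one would expect lowering `tau` to concentrate the teacher's mass on a few high-weight modes; the under-capacity student then tracks those modes tightly (higher precision proxy) while covering fewer of the eight true modes (lower recall proxy), mirroring the trade-off the abstract describes.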