Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation

Published 31 Mar 2026 in cs.LG and cs.AI | (2604.00223v1)

Abstract: Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for LLM distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.


Authors (3)

Summary

The paper presents DRKL, a modified reverse KL objective that decouples target and non-target gradients to improve calibration and output diversity.
It demonstrates that DRKL outperforms traditional RKL and FKL objectives, yielding higher fidelity scores and enhanced diversity in distilled large language models.
Empirical evaluations show DRKL achieves over a 2-point improvement in ROUGE-L scores along with reduced overconfidence across various instruction-following benchmarks.

Diversity-Aware Reverse Kullback-Leibler Divergence for LLM Distillation

Introduction and Problem Formulation

The paper "Diversity-Aware Reverse Kullback-Leibler Divergence for LLM Distillation" (2604.00223) addresses the loss landscape and learning dynamics of knowledge distillation (KD) for LLMs, focusing on the choice between forward KL (FKL) and reverse KL (RKL) divergence as the distillation objective. RKL has empirically surpassed FKL, especially with large vocabularies and a significant teacher-student capacity mismatch. The authors nevertheless provide a theoretical and empirical analysis revealing structural shortcomings in RKL: a persistent drive toward overconfident, low-diversity predictions and poor long-tail alignment.

Structural Differences Between FKL and RKL in LLM Distillation

The fundamental difference is that FKL enforces dense matching between student and teacher across the entire output space, which is infeasible for small students in the LLM regime because the output distributions are heavy-tailed. RKL, by contrast, focuses mainly on the dominant teacher modes and largely ignores the non-target, low-probability tokens. This is demonstrated via gradient analysis: FKL supervises every token in the vocabulary, while RKL's update is suppressed for non-targets, so it aligns only the portion of the distribution the student already believes is important. This mode-seeking property enables faster convergence and better alignment in limited-capacity students, as illustrated in Figure 1.

Figure 1: FKL versus RKL learning dynamics in differing output space sizes, showing faster RKL convergence in the large-vocabulary setting.
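The contrast between the two gradients can be checked on a toy example. The sketch below uses the standard closed-form gradients of KL(p || q) and KL(q || p) with respect to the student logits, where p is the teacher distribution and q = softmax(z) is the student. The 5-token vocabulary and all logit values are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(a, b):
    """KL(a || b) for discrete distributions with strictly positive entries."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def fkl_grad(p, q):
    """d KL(p || q) / d z_i = q_i - p_i, with q = softmax(z)."""
    return [qi - pi for pi, qi in zip(p, q)]

def rkl_grad(p, q):
    """d KL(q || p) / d z_i = q_i * (log(q_i / p_i) - KL(q || p))."""
    d = kl(q, p)
    return [qi * (math.log(qi / pi) - d) for pi, qi in zip(p, q)]

# Toy 5-token vocabulary (illustrative values only).
teacher = softmax([4.0, 2.0, 1.0, 0.5, 0.0])
# The student matches the top logit but assigns almost no mass to the tail.
student = softmax([4.0, 0.0, -6.0, -6.0, -6.0])

g_fkl = fkl_grad(teacher, student)
g_rkl = rkl_grad(teacher, student)
for i, (gf, gr) in enumerate(zip(g_fkl, g_rkl)):
    print(f"token {i}: FKL grad {gf:+.5f}   RKL grad {gr:+.5f}")
```

Because every RKL gradient entry is scaled by the student probability q_i, tokens the student has already discounted receive almost no update, whereas the FKL gradient q_i - p_i stays close to -p_i on those tokens.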

Gradient Decomposition: Target and Non-Target Effects

The paper presents a gradient decomposition of RKL for large-vocabulary LLMs. The decomposition reveals that non-target gradients (from low-probability classes) systematically push the target logit upward, even when the student already matches the teacher on the target class. This mechanism amplifies output confidence beyond the teacher's, reducing response diversity and introducing calibration errors. Furthermore, RKL provides only weak supervision for aligning non-target (tail) classes.
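The upward push can be reproduced numerically. The sketch below assumes one standard way of splitting RKL, which may differ from the paper's exact decomposition: RKL = binary KL on the target-versus-rest split + (1 - q_t) × RKL between the renormalised non-target distributions. The student is constructed to match the teacher on the target token (q_t = p_t) while disagreeing on the tail; all probabilities are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def split_rkl(student_logits, teacher_probs, t):
    """Return (target_term, nontarget_term); their sum equals KL(q || p)."""
    q = softmax(student_logits)
    q_t, p_t = q[t], teacher_probs[t]
    # Binary KL between (q_t, 1 - q_t) and (p_t, 1 - p_t).
    target = q_t * math.log(q_t / p_t) + (1 - q_t) * math.log((1 - q_t) / (1 - p_t))
    # Reverse KL between the renormalised non-target distributions.
    tail = sum(
        (q[j] / (1 - q_t)) * math.log((q[j] / (1 - q_t)) / (teacher_probs[j] / (1 - p_t)))
        for j in range(len(q)) if j != t
    )
    # The tail term is weighted by (1 - q_t), which depends on the target logit.
    return target, (1 - q_t) * tail

def grad_target_logit(term_index, logits, teacher_probs, t, eps=1e-5):
    """Central finite difference of one term w.r.t. the target logit z_t."""
    up, down = list(logits), list(logits)
    up[t] += eps
    down[t] -= eps
    hi = split_rkl(up, teacher_probs, t)[term_index]
    lo = split_rkl(down, teacher_probs, t)[term_index]
    return (hi - lo) / (2 * eps)

# Illustrative distributions: same target probability, swapped tail.
teacher_probs = [0.6, 0.25, 0.10, 0.05]
student_probs = [0.6, 0.05, 0.10, 0.25]
student_logits = [math.log(x) for x in student_probs]
t = 0  # index of the target token

g_target = grad_target_logit(0, student_logits, teacher_probs, t)
g_nontarget = grad_target_logit(1, student_logits, teacher_probs, t)
print(f"target-term gradient on z_t:     {g_target:+.6f}")
print(f"non-target-term gradient on z_t: {g_nontarget:+.6f}")
```

The target term contributes no gradient because q_t already equals p_t, yet the non-target term has a negative gradient with respect to z_t, so a gradient-descent step raises the target logit. Under this split the effect comes entirely from the (1 - q_t) weight on the tail mismatch.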

Diversity-Aware RKL (DRKL): Objective and Theoretical Properties

To mitigate these issues, the authors introduce DRKL, a modified reverse KL objective. DRKL removes the gradient pathway from non-target losses to the target logit by reweighting the non-target term with a fixed hyperparameter $\gamma$, thereby decoupling target and non-target gradients. This regularizes the "tail" (non-target) classes without allowing non-target mismatch to propagate overconfidence into the target logit. The theoretical results indicate that DRKL preserves the efficient optimization structure of RKL while providing strictly better alignment and calibration.
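As a rough illustration of how a fixed weight can cut this gradient pathway, the sketch below implements a decoupled reverse KL in the style of decoupled KD. It is an assumption made for illustration and not the paper's DRKL formula: the loss is a binary reverse KL on the target-versus-rest split plus `gamma` times the reverse KL between renormalised non-target distributions. The function name `decoupled_rkl`, the value `gamma = 1.0`, and the toy probabilities are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rkl(student_logits, teacher_probs):
    """Plain reverse KL, KL(q || p), with q = softmax(student_logits)."""
    q = softmax(student_logits)
    return sum(qj * math.log(qj / pj) for qj, pj in zip(q, teacher_probs))

def decoupled_rkl(student_logits, teacher_probs, t, gamma):
    """Hypothetical decoupled variant: fixed gamma replaces the (1 - q_t) tail weight."""
    q = softmax(student_logits)
    q_t, p_t = q[t], teacher_probs[t]
    target = q_t * math.log(q_t / p_t) + (1 - q_t) * math.log((1 - q_t) / (1 - p_t))
    # Renormalised tails do not depend on the target logit z_t.
    q_hat = [q[j] / (1 - q_t) for j in range(len(q)) if j != t]
    p_hat = [teacher_probs[j] / (1 - p_t) for j in range(len(q)) if j != t]
    tail = sum(a * math.log(a / b) for a, b in zip(q_hat, p_hat))
    return target + gamma * tail

def grad_target_logit(loss_fn, logits, t, eps=1e-5):
    """Central finite difference of loss_fn w.r.t. the target logit z_t."""
    up, down = list(logits), list(logits)
    up[t] += eps
    down[t] -= eps
    return (loss_fn(up) - loss_fn(down)) / (2 * eps)

# Illustrative setup: student matches the teacher on the target, not on the tail.
teacher_probs = [0.6, 0.25, 0.10, 0.05]
student_logits = [math.log(x) for x in [0.6, 0.05, 0.10, 0.25]]
t, gamma = 0, 1.0  # gamma chosen arbitrarily for the example

g_rkl = grad_target_logit(lambda z: rkl(z, teacher_probs), student_logits, t)
g_dec = grad_target_logit(
    lambda z: decoupled_rkl(z, teacher_probs, t, gamma), student_logits, t
)
print(f"plain RKL gradient on z_t:     {g_rkl:+.6f}")
print(f"decoupled RKL gradient on z_t: {g_dec:+.6f}")
```

In this toy setting the plain RKL gradient on the target logit is negative (descent raises the logit), while the decoupled variant leaves the target logit untouched once q_t = p_t; the tail mismatch is still penalised through the non-target logits.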

Experimental Evaluation

Extensive experiments are conducted on LLM instruction-following tasks, spanning multiple teacher-student pairs across the GPT-2 and OPT families (up to a 6.7B-to-1.3B distillation). Students are trained with DRKL and a range of strong baselines, including FKL, RKL, various bidirectional and symmetric KL designs, Jensen-Shannon divergence, decoupled KD variants, and recently proposed skewed/interpolated KL divergences. Evaluation uses both a fidelity metric (ROUGE-L) and output diversity metrics (Distinct-2, Negative Self-BLEU).
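For readers unfamiliar with Distinct-2, a minimal sketch follows. It assumes whitespace tokenization and a corpus-level ratio of unique to total bigrams; the paper's exact tokenization and aggregation are not stated here and may differ. The helper name `distinct_n` and the sample sentences are illustrative.

```python
def distinct_n(texts, n=2):
    """Corpus-level Distinct-n: unique n-grams divided by total n-grams."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()  # assumption: simple whitespace tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

samples = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
]
print(distinct_n(samples, n=2))      # 6 unique bigrams out of 10
print(distinct_n(["a a a a"], n=2))  # 1 unique bigram out of 3
```

Higher values indicate less repetition across generations, which is why a collapse toward overconfident outputs tends to lower the score.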

Empirically, DRKL consistently outperforms all baselines across a range of benchmarks and student architectures, achieving the top ROUGE-L scores on all main evaluation sets. In detailed ablations, the method shows consistently better calibration, significantly reduced overconfidence, and improved diversity as measured by Distinct-2 and Negative Self-BLEU. Notably, it surpasses RKL by over 2 points on the Super-NI and UnNI benchmarks with a comparable computational budget.

Figure 2: Trade-off between fidelity and output diversity: DRKL balances high ROUGE-L score with improved diversity compared to RKL and FKL.

These improvements are achieved without significant sensitivity to the hyperparameter $\gamma$ and with a minimal increase in per-step training cost. Qualitative case studies on instruction-following show that DRKL-trained models produce more instruction-faithful and semantically correct responses than the overfit or degenerate outputs commonly produced by vanilla RKL or FKL.

Practical and Theoretical Implications

The results indicate that RKL's presumed dominance in LLM distillation is only partial, and its overconfidence and diversity collapse effects are ubiquitous under realistic capacity constraints. DRKL establishes that it is possible to decouple the benefits of rapid mode alignment (due to RKL) from the drawbacks of under-regularized output tails.

From an application perspective, the improved fidelity-diversity trade-off enhances instruction-following, robustness, and the generalization characteristics of distilled LLMs. Architecturally, the simplicity of DRKL makes it composable with other RKL-based strategies, suggesting that further improvements may be obtained by integrating DRKL into broader LLM distillation toolkits and curriculum designs.

Future Directions

Subsequent work could focus on extending DRKL to online distillation, RL fine-tuning, or hybrid objectives that directly measure user-preferred behavior. There is also scope for formalizing the impact of diversity regularization on downstream tasks such as question answering, summarization, and dialog generation, as well as applying the DRKL mechanism to other structured prediction domains with large output spaces (e.g., vision, combinatorial generation).

Conclusion

This paper provides a rigorous theoretical and empirical foundation for the limitations of RKL as a distillation objective in LLMs, and presents DRKL as an effective, simple, and scalable alternative that achieves superior performance and calibration, balancing fidelity and diversity. The findings refine the understanding of loss design in LLM distillation and will inform future development of compact, robust, and diverse LLMs.
