ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $α$-$β$-Divergence (2505.04560v3)

Published 7 May 2025 in cs.LG

Abstract: Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the Hardness-Concentration effect, which refers to focusing on modes with large errors, and the Confidence-Concentration effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with $\alpha$-$\beta$-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving an effective trade-off between these effects. Extensive experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy. The code is available at https://github.com/ghwang-s/abkd.

An Evaluation of ABKD: Optimal Probability Mass Allocation in Knowledge Distillation

The paper "ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via α\alpha-β\beta-Divergence" presents a novel framework for improving the efficiency and effectiveness of knowledge distillation (KD) processes. The authors tackle the longstanding issue of finding the optimal balance between two mode-concentration effects, namely Hardness-Concentration and Confidence-Concentration, in KD methods. Traditional approaches often rely on the forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD) to minimize divergence between teacher and student models, but these methods may lead to suboptimal performance due to extreme concentration effects.

Core Concepts

Knowledge distillation is a method to transfer knowledge from a larger, more complex teacher model to a smaller student model. Traditionally, FKLD and RKLD have been used to match the output distributions of the teacher and student models. However, FKLD tends to produce overly smooth student distributions, while RKLD can focus too narrowly on the target class, potentially ignoring broader distributional insights provided by the teacher model.
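To make these two baselines concrete, the sketch below implements FKLD and RKLD as PyTorch losses over student and teacher logits. It is a minimal illustration for intuition only (no temperature scaling or loss weighting), not the paper's implementation.

```python
# Minimal sketch of the two standard KD objectives discussed above.
import torch
import torch.nn.functional as F


def fkld(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL: D_KL(teacher || student), the classic KD loss."""
    log_q_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_q_student, p_teacher, reduction="batchmean")


def rkld(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL: D_KL(student || teacher)."""
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    q_student = F.softmax(student_logits, dim=-1)
    return F.kl_div(log_p_teacher, q_student, reduction="batchmean")
```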

This paper proposes a new approach, labeled ABKD, which introduces a generalized divergence function, the $\alpha$-$\beta$-divergence, to better balance these effects. The $\alpha$ and $\beta$ parameters allow for a smooth interpolation between FKLD and RKLD, offering a more flexible trade-off between concentration effects. This framework aims to address the issues inherent in FKLD and RKLD, providing a more nuanced allocation of probability mass during the KD process.
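The $\alpha$-$\beta$-divergence is a classical family (the Cichocki-Amari form). Assuming ABKD uses this standard form, a minimal loss sketch over teacher distribution P and student distribution Q is shown below; the paper's released code may differ in details such as temperature scaling, numerical safeguards, or loss weighting.

```python
# Sketch of an alpha-beta-divergence KD loss. The functional form is the
# standard Cichocki-Amari alpha-beta-divergence; it is an assumption that
# ABKD's implementation matches this exact expression.
import torch
import torch.nn.functional as F


def ab_divergence_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       alpha: float,
                       beta: float,
                       eps: float = 1e-8) -> torch.Tensor:
    """D_AB(teacher || student), valid for alpha, beta, alpha + beta != 0."""
    p = F.softmax(teacher_logits, dim=-1).clamp_min(eps)  # teacher distribution P
    q = F.softmax(student_logits, dim=-1).clamp_min(eps)  # student distribution Q
    s = alpha + beta
    term = (p ** alpha) * (q ** beta) \
        - (alpha / s) * (p ** s) \
        - (beta / s) * (q ** s)
    # Sum over classes, average over the batch.
    return (-1.0 / (alpha * beta)) * term.sum(dim=-1).mean()
```

In the limits of this family, (α, β) → (1, 0) recovers KL(P‖Q) (FKLD) and (α, β) → (0, 1) recovers KL(Q‖P) (RKLD), which is the interpolation behavior described above.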

Theoretical Insights

ABKD is built upon theoretical insights regarding mode-concentration effects. The paper demonstrates that carefully tuning $\alpha$ and $\beta$ allows the $\alpha$-$\beta$-divergence to harmonize these effects, effectively reallocating probability mass during training. By analyzing the log mass ratio, the authors illustrate how different divergence functions influence gradient update mechanisms. This deep understanding leads to an algorithmic solution that can adjust concentration effects dynamically based on the distributions of teacher and student models.
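The interpolation property can be sanity-checked numerically with the sketches above (under the assumed Cichocki-Amari form): pushing (α, β) toward (1, 0) should approach the FKLD value, and toward (0, 1) the RKLD value. The logits below are random placeholders.

```python
# Numerical sanity check using the fkld/rkld and ab_divergence_loss sketches above.
import torch

torch.manual_seed(0)
student = torch.randn(4, 10)  # dummy logits: batch of 4, 10 classes
teacher = torch.randn(4, 10)

# Near (1, 0): should be close to the forward KL loss.
print(ab_divergence_loss(student, teacher, alpha=1.0, beta=1e-3).item(),
      fkld(student, teacher).item())

# Near (0, 1): should be close to the reverse KL loss.
print(ab_divergence_loss(student, teacher, alpha=1e-3, beta=1.0).item(),
      rkld(student, teacher).item())
```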

Experimental Validation

The authors present extensive empirical evidence across 17 language and vision datasets with 12 teacher-student configurations, ranging from small to large models. By modifying only the loss function, ABKD achieves significant improvements over traditional FKLD and RKLD methods. For instance, distilling GPT-2 XL into GPT-2 with ABKD yielded gains of 0.81 to 3.31 ROUGE-L points over prior methods.

Practical Implications

The proposed ABKD framework is versatile, extending beyond FKLD and RKLD to encompass other divergence measures and offering greater adaptability. Its plug-and-play nature allows the loss function of existing distillation methods to be swapped out without modifying architectures or training pipelines (a sketch follows below). This opens avenues for more efficient model training, whether in resource-constrained environments or for large-scale neural networks.
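As a rough illustration of the plug-and-play claim, the training step below keeps a standard distillation pipeline and only swaps the divergence term for the $\alpha$-$\beta$ loss sketched earlier. The models, batch format, loss weighting, and the particular (α, β) values are illustrative placeholders, not the paper's settings.

```python
# Hedged sketch: an existing distillation step keeps its pipeline and only
# replaces the divergence term. All hyperparameters below are placeholders.
import torch
import torch.nn.functional as F


def distillation_step(student, teacher, batch, optimizer,
                      alpha=0.9, beta=0.1, ce_weight=0.5):
    inputs, labels = batch
    with torch.no_grad():
        teacher_logits = teacher(inputs)      # frozen teacher
    student_logits = student(inputs)

    # Swap FKLD/RKLD for the alpha-beta-divergence loss sketched earlier.
    kd_loss = ab_divergence_loss(student_logits, teacher_logits, alpha, beta)
    ce_loss = F.cross_entropy(student_logits, labels)
    loss = ce_weight * ce_loss + (1.0 - ce_weight) * kd_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```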

Future Directions

The introduction of the $\alpha$-$\beta$-divergence and the ABKD framework marks a significant step forward in the KD domain. Future work could explore the impact of this framework on other fields requiring model compression or adaptation, such as autonomous driving or healthcare AI systems. Further research could investigate optimizing $\alpha$ and $\beta$ dynamically during training, potentially leading to even more responsive and effective distillation techniques.

In summary, the paper provides substantial theoretical and empirical advancements in the field of knowledge distillation. By addressing fundamental challenges inherent in existing approaches, ABKD offers a promising direction for achieving more accurate and reliable model distillation across a wide range of applications.

Authors (6)
  1. Guanghui Wang (179 papers)
  2. Zhiyong Yang (43 papers)
  3. Zitai Wang (15 papers)
  4. Shi Wang (47 papers)
  5. Qianqian Xu (74 papers)
  6. Qingming Huang (168 papers)