Decoupled Knowledge Distillation (2203.08679v2)

Published 16 Mar 2022 in cs.CV and cs.AI

Abstract: State-of-the-art distillation methods are mainly based on distilling deep features from intermediate layers, while the significance of logit distillation is greatly overlooked. To provide a novel viewpoint to study logit distillation, we reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). We empirically investigate and prove the effects of the two parts: TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works. More importantly, we reveal that the classical KD loss is a coupled formulation, which (1) suppresses the effectiveness of NCKD and (2) limits the flexibility to balance these two parts. To address these issues, we present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly. Compared with complex feature-based methods, our DKD achieves comparable or even better results and has better training efficiency on CIFAR-100, ImageNet, and MS-COCO datasets for image classification and object detection tasks. This paper proves the great potential of logit distillation, and we hope it will be helpful for future research. The code is available at https://github.com/megvii-research/mdistiller.

Decoupled Knowledge Distillation

The paper "Decoupled Knowledge Distillation" addresses limitations in traditional knowledge distillation (KD) methods, focusing on the distillation of logits rather than deep features. It proposes a novel reformulation that decouples the KD loss into two components: Target Class Knowledge Distillation (TCKD) and Non-Target Class Knowledge Distillation (NCKD). This approach reveals and rectifies the inefficiencies inherent in classical logit-based distillation.

Key Contributions

  1. Reformulation of KD Loss: The authors decompose the KD loss into TCKD and NCKD components to independently assess their contributions. This separation allows for a clearer understanding of each component's role in the distillation process.
  2. Insights into TCKD and NCKD:
    • TCKD: Transfers knowledge about the "difficulty" of training samples, proving particularly beneficial in challenging settings such as noisy labels or strong data augmentation.
    • NCKD: Identified as the main reason logit distillation works, contributing most of the performance gain by transferring knowledge among the non-target classes. In the classical KD loss, however, its weight is coupled to the teacher's confidence on the target class, so NCKD is suppressed precisely on the well-predicted samples where the teacher's knowledge is most reliable.
  3. Decoupled Knowledge Distillation (DKD): The paper introduces DKD to weight TCKD and NCKD independently, removing the coupling limitation and allowing a more flexible and effective knowledge transfer process (a minimal code sketch follows this list).
  4. Practical Impact and Theoretical Insights: On image classification (CIFAR-100, ImageNet) and object detection (MS-COCO), DKD achieves results comparable to or better than complex feature-based methods while training more efficiently.
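To make the decoupling concrete, below is a minimal PyTorch sketch of a DKD-style loss. It is not the authors' reference implementation (see the mdistiller repository for that), and the defaults for `alpha`, `beta`, and `temperature` are illustrative only.

```python
import torch
import torch.nn.functional as F

def dkd_loss(student_logits, teacher_logits, target,
             alpha=1.0, beta=8.0, temperature=4.0):
    """Sketch of a decoupled KD loss: alpha weights TCKD, beta weights NCKD."""
    num_classes = student_logits.size(1)
    gt_mask = F.one_hot(target, num_classes).float()      # 1 at the target class

    # Softened class probabilities.
    p_s = F.softmax(student_logits / temperature, dim=1)
    p_t = F.softmax(teacher_logits / temperature, dim=1)

    # TCKD: binary KL over the (target, non-target) probability split.
    def binary_probs(p):
        pt = (p * gt_mask).sum(dim=1, keepdim=True)       # mass on the target class
        return torch.cat([pt, 1.0 - pt], dim=1).clamp_min(1e-8)

    b_s, b_t = binary_probs(p_s), binary_probs(p_t)
    tckd = F.kl_div(b_s.log(), b_t, reduction="batchmean")

    # NCKD: KL over the non-target classes, renormalised to sum to 1.
    # A large negative offset removes the target logit from the softmax.
    offset = 1000.0 * gt_mask
    log_q_s = F.log_softmax(student_logits / temperature - offset, dim=1)
    q_t = F.softmax(teacher_logits / temperature - offset, dim=1)
    nckd = F.kl_div(log_q_s, q_t, reduction="batchmean")

    # Independent weights on the two terms; T^2 keeps gradients on the usual scale.
    return (alpha * tckd + beta * nckd) * temperature ** 2
```

Setting beta to the sample-wise coupled weight (1 − teacher target confidence) would recover the classical KD behaviour; DKD's point is precisely that this weight can be chosen freely.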

Implications and Future Directions

  • Training Efficiency: DKD significantly reduces the computational and storage burden compared to feature-based methods, making it a practical choice for industrial applications with resource constraints.
  • Interpretation of Teacher-Student Dynamics: By revisiting the effect of teacher confidence on distillation effectiveness, the work offers fresh insights into why larger models might not always serve as better teachers—a phenomenon the DKD approach helps to alleviate.
  • Potential for Broader Application: Although primarily focused on image-related tasks, the techniques outlined have potential applications in other domains utilizing neural networks, such as natural language processing or reinforcement learning.
  • Future Research on Hyperparameter Optimization: The paper provides preliminary guidelines for setting decoupling weights; however, detailed exploration into automated tuning mechanisms could further enhance the robustness and applicability of DKD.
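For concreteness, the decoupling weights discussed above enter the training objective alongside the standard cross-entropy term. The snippet below is a toy, purely illustrative usage of the `dkd_loss` sketch from earlier; the tensors stand in for a real batch and real student/teacher models.

```python
import torch
import torch.nn.functional as F

# Toy tensors standing in for a real batch and real model outputs.
batch, num_classes = 8, 100
student_logits = torch.randn(batch, num_classes, requires_grad=True)
teacher_logits = torch.randn(batch, num_classes)   # teacher is frozen in practice
labels = torch.randint(num_classes, (batch,))

# Total objective: cross-entropy plus the decoupled distillation term,
# with alpha and beta as the tunable knobs (dkd_loss as sketched above).
loss = F.cross_entropy(student_logits, labels) + dkd_loss(
    student_logits, teacher_logits, labels, alpha=1.0, beta=8.0, temperature=4.0
)
loss.backward()
```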

In conclusion, the paper contributes a nuanced perspective on logit-based knowledge distillation, challenging and expanding upon traditional paradigms by effectively leveraging both target and non-target class information. It lays a solid foundation for future advances in efficient AI model training strategies.

Authors (5)
  1. Borui Zhao (13 papers)
  2. Quan Cui (10 papers)
  3. Renjie Song (12 papers)
  4. Yiyu Qiu (1 paper)
  5. Jiajun Liang (37 papers)
Citations (429)