Scale Decoupled Distillation (2403.13512v1)

Published 20 Mar 2024 in cs.CV and cs.AI

Abstract: Logit knowledge distillation has attracted increasing attention in recent studies due to its practicality. However, it often performs worse than feature knowledge distillation. In this paper, we argue that existing logit-based methods may be sub-optimal because they only leverage the global logit output, which couples multiple semantic concepts; this can transfer ambiguous knowledge to the student and mislead its learning. To address this, we propose a simple but effective method, Scale Decoupled Distillation (SDD), for logit knowledge distillation. SDD decouples the global logit output into multiple local logit outputs and establishes a distillation pipeline for each, helping the student mine and inherit fine-grained, unambiguous logit knowledge. Moreover, the decoupled knowledge can be further divided into consistent and complementary logit knowledge, which transfer semantic information and sample ambiguity, respectively. By increasing the weight of the complementary parts, SDD guides the student to focus more on ambiguous samples, improving its discrimination ability. Extensive experiments on several benchmark datasets demonstrate the effectiveness of SDD across a wide range of teacher-student pairs, especially on fine-grained classification tasks. Code is available at: https://github.com/shicaiwei123/SDD-CVPR2024
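
To make the idea concrete, below is a minimal sketch of a scale-decoupled logit distillation loss as described in the abstract. It is not the official SDD implementation: the helper names (local_logits, sdd_loss), the multi-scale pooling via F.adaptive_avg_pool2d, and the hyperparameters scales, temperature, and beta (the extra weight on complementary regions) are illustrative assumptions, and the teacher and student are assumed to expose their final feature maps and linear classifiers.

```python
# Sketch only: scale-decoupled logit distillation under the assumptions stated above.
import torch
import torch.nn.functional as F


def local_logits(feat, fc, scales=(1, 2, 4)):
    """Pool a feature map (N, C, H, W) at several scales and classify each local region."""
    logits = []
    for s in scales:
        pooled = F.adaptive_avg_pool2d(feat, s)        # (N, C, s, s)
        pooled = pooled.flatten(2).transpose(1, 2)     # (N, s*s, C)
        logits.append(fc(pooled))                      # (N, s*s, num_classes)
    return torch.cat(logits, dim=1)                    # (N, num_regions, num_classes)


def sdd_loss(t_feat, s_feat, t_fc, s_fc, temperature=4.0, beta=2.0):
    """KL-based distillation over decoupled local logits, up-weighting complementary regions."""
    t_loc = local_logits(t_feat, t_fc)                 # teacher local logits
    s_loc = local_logits(s_feat, s_fc)                 # student local logits

    # Per-region KL divergence between softened teacher and student distributions.
    t_prob = F.softmax(t_loc / temperature, dim=-1)
    s_logp = F.log_softmax(s_loc / temperature, dim=-1)
    kl = F.kl_div(s_logp, t_prob, reduction="none").sum(-1) * temperature ** 2

    # The scale-1 region (index 0) plays the role of the conventional global logit.
    # Regions whose local class agrees with it are "consistent"; the rest are
    # "complementary" and carry sample ambiguity, so they receive a larger weight.
    global_cls = t_loc[:, 0].argmax(-1, keepdim=True)  # (N, 1)
    consistent = (t_loc.argmax(-1) == global_cls).float()
    weights = consistent + beta * (1.0 - consistent)
    return (weights * kl).mean()
```

In this reading of the method, the usual global KD loss is a special case (scales=(1,) and beta=1), while adding finer scales exposes region-level logits whose disagreement with the global prediction signals ambiguity; refer to the linked repository for the authors' actual formulation and hyperparameters.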
