Dynamic Temperature Knowledge Distillation (2404.12711v1)
Abstract: Temperature plays a pivotal role in moderating label softness in knowledge distillation (KD). Traditional approaches employ a static temperature throughout the KD process, which fails to address the nuanced complexities of samples with varying levels of difficulty and overlooks the distinct capabilities of different teacher-student pairings. This leads to suboptimal knowledge transfer. To improve the process of knowledge propagation, we propose Dynamic Temperature Knowledge Distillation (DTKD), which introduces dynamic, cooperative temperature control for both the teacher and the student model simultaneously within each training iteration. In particular, we propose "sharpness" as a metric to quantify the smoothness of a model's output distribution. By minimizing the sharpness difference between the teacher and the student, we can derive sample-specific temperatures for each of them. Extensive experiments on CIFAR-100 and ImageNet-2012 demonstrate that DTKD performs comparably to leading KD techniques, with added robustness in Target Class KD and Non-target Class KD scenarios. The code is available at https://github.com/JinYu1998/DTKD.
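To make the idea concrete, below is a minimal PyTorch sketch of a dynamic-temperature KD loss in the spirit of the abstract. The sharpness proxy used here (max logit minus mean logit), the function name `dtkd_style_loss`, and the way the temperature budget `2 * base_T` is split between teacher and student are illustrative assumptions rather than the paper's exact formulation; the usual T^2 gradient rescaling of standard KD is also omitted for brevity.

```python
import torch
import torch.nn.functional as F

def dtkd_style_loss(student_logits, teacher_logits, base_T=4.0):
    """Sketch of a dynamic-temperature KD loss (not the paper's exact method).

    Sharpness is approximated per sample as (max logit - mean logit).
    Per-sample temperatures are chosen so that teacher and student end up
    with the same scaled sharpness, while their temperatures sum to 2 * base_T.
    """
    # Per-sample sharpness proxies, shape [batch]
    sharp_t = teacher_logits.max(dim=1).values - teacher_logits.mean(dim=1)
    sharp_s = student_logits.max(dim=1).values - student_logits.mean(dim=1)

    # Split the 2 * base_T budget in proportion to sharpness, so the sharper
    # distribution receives the larger (more smoothing) temperature.
    eps = 1e-8
    T_teacher = 2.0 * base_T * sharp_t / (sharp_t + sharp_s + eps)
    T_student = 2.0 * base_T * sharp_s / (sharp_t + sharp_s + eps)
    T_teacher = T_teacher.unsqueeze(1)  # [batch, 1] for broadcasting over classes
    T_student = T_student.unsqueeze(1)

    # Standard KL-based distillation, but with per-sample temperatures.
    p_teacher = F.softmax(teacher_logits / T_teacher, dim=1)
    log_p_student = F.log_softmax(student_logits / T_student, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")


# Example usage with random logits (teacher sharper than student):
if __name__ == "__main__":
    student = torch.randn(8, 100)
    teacher = torch.randn(8, 100) * 3.0
    print(dtkd_style_loss(student, teacher).item())
```

Splitting the temperature budget in proportion to each model's sharpness makes the scaled sharpness (sharpness divided by temperature) identical for teacher and student on every sample, which is one simple way to realize the "minimize the sharpness difference" objective described in the abstract.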