Logit Standardization in Knowledge Distillation (2403.01427v1)
Abstract: Knowledge distillation involves transferring soft labels from a teacher to a student using a shared temperature-based softmax function. However, the assumption of a shared temperature between teacher and student implies a mandatory exact match between their logits in terms of range and variance. This side effect limits the student's performance, given the capacity discrepancy between the two models and the finding that the teacher's innate logit relations are sufficient for the student to learn. To address this issue, we propose setting the temperature as the weighted standard deviation of the logits and performing a plug-and-play Z-score pre-process of logit standardization before applying softmax and the Kullback-Leibler divergence. Our pre-process enables the student to focus on the essential logit relations of the teacher rather than requiring a magnitude match, and it can improve the performance of existing logit-based distillation methods. We also show a typical case in which the conventional setting of sharing the temperature between teacher and student cannot reliably yield an authentic distillation evaluation; this challenge is successfully alleviated by our Z-score pre-process. We extensively evaluate our method with various student and teacher models on CIFAR-100 and ImageNet, demonstrating its significant superiority. Vanilla knowledge distillation powered by our pre-process achieves favorable performance against state-of-the-art methods, and other distillation variants obtain considerable gains with the assistance of our pre-process.
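The following is a minimal sketch of the idea described in the abstract: standardize each sample's logits by subtracting their mean and dividing by their (scaled) standard deviation before the softmax and KL divergence. It assumes PyTorch; the function names, the base temperature value, and the epsilon are illustrative assumptions, not the paper's exact hyper-parameters or implementation.

```python
import torch
import torch.nn.functional as F


def zscore_standardize(logits: torch.Tensor,
                       base_temp: float = 2.0,
                       eps: float = 1e-7) -> torch.Tensor:
    """Z-score pre-process of logits (hypothetical helper).

    Subtracts the per-sample mean and divides by the per-sample standard
    deviation scaled by a base temperature, so the effective temperature
    is the weighted standard deviation of the logits.
    """
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (base_temp * (std + eps))


def kd_loss_with_standardization(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 base_temp: float = 2.0) -> torch.Tensor:
    """Vanilla KD loss using standardized logits instead of a shared fixed temperature."""
    s = zscore_standardize(student_logits, base_temp)
    t = zscore_standardize(teacher_logits, base_temp)
    log_p_student = F.log_softmax(s, dim=-1)
    p_teacher = F.softmax(t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

Because the standardization is applied to the logits only, it can in principle be dropped in front of any logit-based distillation loss as a plug-and-play step; how the resulting loss is weighted against the cross-entropy term is left out here and would follow the chosen distillation recipe.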
Authors: Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang, Xiaochun Cao