Knowledge From the Dark Side: Entropy-Reweighted Knowledge Distillation for Balanced Knowledge Transfer (2311.13621v1)
Abstract: Knowledge Distillation (KD) transfers knowledge from a larger "teacher" model to a compact "student" model, guiding the student with "dark knowledge": the implicit insights present in the teacher's soft predictions. Although existing KD methods have demonstrated the potential of such knowledge transfer, a gap between teacher and student remains. Through a series of investigations, we argue that this gap results from the student's overconfident predictions, which signal an imbalanced focus on pronounced features while overlooking the subtle yet crucial dark knowledge. To overcome this, we introduce Entropy-Reweighted Knowledge Distillation (ER-KD), a novel approach that leverages the entropy of the teacher's predictions to reweight the KD loss on a sample-wise basis. ER-KD refocuses the student on challenging instances rich in the teacher's nuanced insights while reducing the emphasis on simpler cases, enabling a more balanced knowledge transfer. Consequently, ER-KD is not only compatible with various state-of-the-art KD methods but also further enhances their performance at negligible cost. This approach offers a streamlined and effective strategy for refining the knowledge transfer process in KD, setting a new paradigm for the careful handling of dark knowledge. Our code is available at https://github.com/cpsu00/ER-KD.
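To make the sample-wise reweighting concrete, the sketch below illustrates one way the idea described in the abstract can be expressed in PyTorch: the Shannon entropy of each teacher prediction scales that sample's KD term, so uncertain (information-rich) samples receive more weight. This is an illustrative reconstruction under assumed defaults (the function name `er_kd_loss`, the temperature value, and the absence of any weight normalization are assumptions); the official repository linked above is the authoritative implementation.

```python
# Minimal sketch of entropy-reweighted KD, assuming a standard KL-based KD loss.
import torch
import torch.nn.functional as F

def er_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Per-sample KD loss reweighted by the entropy of the teacher's soft predictions."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=1)          # teacher soft targets
    log_p_student = F.log_softmax(student_logits / t, dim=1)  # student log-probabilities

    # Shannon entropy of each teacher prediction: higher entropy marks harder,
    # more nuanced samples that carry more dark knowledge.
    entropy = -(p_teacher * torch.log(p_teacher.clamp_min(1e-12))).sum(dim=1)

    # Standard per-sample KD term (KL divergence), left un-reduced so it can be reweighted.
    kd_per_sample = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)

    # Reweight: emphasize samples on which the teacher itself is uncertain,
    # then apply the usual T^2 scaling used in temperature-based distillation.
    return (entropy * kd_per_sample).mean() * (t ** 2)
```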
Authors: Chi-Ping Su, Ching-Hsun Tseng, Shin-Jye Lee