Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge Distillation (2312.15112v3)
Abstract: Knowledge distillation aims to train a compact student network using soft supervision from a larger teacher network and hard supervision from ground truths. However, determining an optimal knowledge fusion ratio that balances these supervisory signals remains challenging. Prior methods generally resort to a constant or heuristic-based fusion ratio, which often falls short of a proper balance. In this study, we introduce a novel adaptive method for learning a sample-wise knowledge fusion ratio, exploiting both the correctness of the teacher and the student and how well the student mimics the teacher on each sample. Our method naturally leads to the intra-sample trilateral geometric relations among the student prediction ($S$), teacher prediction ($T$), and ground truth ($G$). To counterbalance the impact of outliers, we further extend these to inter-sample relations by incorporating the teacher's global average prediction, $\bar{T}$, for samples within the same class. A simple neural network then learns the implicit mapping from the intra- and inter-sample relations to an adaptive, sample-wise knowledge fusion ratio in a bilevel-optimization manner. Our approach provides a simple, practical, and adaptable solution for knowledge distillation that can be employed across various architectures and model sizes. Extensive experiments demonstrate consistent improvements over other loss re-weighting methods on image classification, attack detection, and click-through rate prediction.
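To make the loss construction concrete, below is a minimal PyTorch sketch of how a sample-wise fusion ratio could weight the hard-label cross-entropy loss against the soft distillation loss. It assumes the ratio network is a small MLP over pairwise distances among $S$, $T$, $G$, and the class-wise teacher mean $\bar{T}$; the names `FusionRatioNet`, `trilateral_features`, `distillation_loss`, and the temperature `tau` are illustrative placeholders, not the authors' released implementation.

```python
# Hypothetical sketch: a per-sample fusion ratio alpha_i weights the hard-label
# CE loss against the soft KD loss. alpha_i is predicted by a small MLP from
# distance features among student prediction S, teacher prediction T, ground
# truth G, and the teacher's class-wise mean prediction T_bar.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionRatioNet(nn.Module):
    """Tiny MLP mapping geometric relation features to a fusion ratio in (0, 1)."""
    def __init__(self, in_dim=6, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, feats):
        return self.net(feats).squeeze(-1)  # shape: (batch,)

def trilateral_features(s_logits, t_logits, targets, t_bar):
    """Pairwise distances among S, T, G (intra-sample) and to T_bar (inter-sample)."""
    s = F.softmax(s_logits, dim=-1)
    t = F.softmax(t_logits, dim=-1)
    g = F.one_hot(targets, num_classes=s.size(-1)).float()
    tb = t_bar[targets]  # teacher's average prediction for each sample's class
    dist = lambda a, b: (a - b).norm(dim=-1, keepdim=True)
    return torch.cat(
        [dist(s, g), dist(t, g), dist(s, t), dist(s, tb), dist(t, tb), dist(g, tb)],
        dim=-1,
    )

def distillation_loss(s_logits, t_logits, targets, t_bar, ratio_net, tau=4.0):
    feats = trilateral_features(s_logits, t_logits, targets, t_bar)
    alpha = ratio_net(feats)  # sample-wise fusion ratio in (0, 1)
    ce = F.cross_entropy(s_logits, targets, reduction="none")
    kd = F.kl_div(
        F.log_softmax(s_logits / tau, dim=-1),
        F.softmax(t_logits / tau, dim=-1),
        reduction="none",
    ).sum(-1) * tau * tau
    return (alpha * ce + (1.0 - alpha) * kd).mean()
```

In the bilevel setting described above, the student would be trained with this loss in the inner loop, while the ratio network's parameters would be updated on a held-out set in the outer loop; `t_bar` would be a (num_classes, num_classes) tensor of the teacher's averaged per-class predictions, refreshed periodically.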