Cosine Similarity Knowledge Distillation for Individual Class Information Transfer (2311.14307v1)
Abstract: Previous logits-based Knowledge Distillation (KD) methods have utilized predictions over multiple categories within each sample (i.e., class predictions) and have employed Kullback-Leibler (KL) divergence to reduce the discrepancy between the student and teacher predictions. Despite the proliferation of KD techniques, the student model continues to fall short of the teacher's level of performance. In response, we introduce a novel and effective KD method capable of achieving results on par with or superior to the teacher model's performance. We utilize teacher and student predictions over multiple samples for each category (i.e., batch predictions) and apply cosine similarity, a technique commonly used in NLP for measuring the resemblance between text embeddings. The scale-invariance of this metric, which depends solely on vector direction and not magnitude, allows the student to learn dynamically from the teacher's knowledge rather than being bound to a fixed distribution of that knowledge. Furthermore, we propose cosine similarity weighted temperature (CSWT) to further improve performance. CSWT reduces the temperature scaling in KD when the cosine similarity between the student and teacher predictions is high, and conversely increases it when the cosine similarity is low; this adjustment optimizes the transfer of information from the teacher to the student model. Extensive experimental results show that our proposed method serves as a viable alternative to existing methods. We anticipate that this approach will offer valuable insights for future research on model compression.
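The abstract describes two components: a cosine-similarity loss computed over per-class "batch predictions" (each class's probabilities across the samples in a batch) and a cosine similarity weighted temperature (CSWT) that softens targets more when student and teacher disagree. The sketch below shows one way these ideas could fit together in PyTorch; the function name `cosine_kd_loss`, the temperature bounds `t_min`/`t_max`, and the linear similarity-to-temperature mapping are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def cosine_kd_loss(logits_s: torch.Tensor, logits_t: torch.Tensor,
                   t_min: float = 2.0, t_max: float = 6.0) -> torch.Tensor:
    """Cosine-similarity KD over per-class batch predictions with a
    cosine-similarity weighted temperature (CSWT). Hyperparameters and the
    linear mapping are assumptions for illustration."""
    # Class predictions for each sample (rows of the softmax output).
    p_s = F.softmax(logits_s, dim=1)                       # [B, C]
    p_t = F.softmax(logits_t, dim=1)                       # [B, C]

    # Per-sample agreement between student and teacher class predictions.
    sim = F.cosine_similarity(p_s, p_t, dim=1)             # [B], values in [0, 1]

    # CSWT (assumed linear mapping): high similarity -> lower temperature,
    # low similarity -> higher temperature. Detached so the temperature acts
    # as a schedule rather than a gradient path.
    temp = (t_max - (t_max - t_min) * sim).detach()        # [B]

    # Re-soften logits with the per-sample temperature (broadcast over classes).
    q_s = F.softmax(logits_s / temp.unsqueeze(1), dim=1)   # [B, C]
    q_t = F.softmax(logits_t / temp.unsqueeze(1), dim=1)   # [B, C]

    # Batch predictions: each column holds one class's probabilities across the
    # batch. The loss aligns the student's columns with the teacher's in
    # direction only (scale-invariant), averaged over classes.
    return (1.0 - F.cosine_similarity(q_s, q_t, dim=0)).mean()


if __name__ == "__main__":
    # Illustrative usage with random logits: batch of 128, 100 classes.
    student_logits = torch.randn(128, 100)
    teacher_logits = torch.randn(128, 100)
    print(cosine_kd_loss(student_logits, teacher_logits))
```

Because cosine similarity over the per-class columns ignores vector magnitude, the student only needs to match the relative pattern of the teacher's batch predictions, which is what the abstract means by not being bound to a fixed distribution of the teacher's knowledge.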
- Gyeongdo Ham
- Seonghak Kim
- Suin Lee
- Jae-Hyeok Lee
- Daeshik Kim