CKD: Contrastive Knowledge Distillation from A Sample-wise Perspective (2404.14109v1)
Abstract: In this paper, we present a simple yet effective contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints. Unlike traditional knowledge distillation methods that concentrate on maximizing feature similarities or preserving class-wise semantic correlations between teacher and student features, our method attempts to recover the "dark knowledge" by aligning sample-wise teacher and student logits. Specifically, our method first minimizes logit differences within the same sample by considering their numerical values, thus preserving intra-sample similarities. Next, we bridge semantic disparities by leveraging dissimilarities across different samples. Note that constraints on intra-sample similarities and inter-sample dissimilarities can be efficiently and effectively reformulated into a contrastive learning framework with newly designed positive and negative pairs. The positive pair consists of the teacher's and student's logits derived from an identical sample, while the negative pairs are formed from logits of different samples. With this formulation, our method benefits from the simplicity and efficiency of contrastive learning through the optimization of InfoNCE, yielding a run-time complexity far below $O(n^2)$, where $n$ denotes the total number of training samples. Furthermore, our method eliminates the need for hyperparameter tuning, particularly of temperature parameters and large batch sizes. We conduct comprehensive experiments on three datasets: CIFAR-100, ImageNet-1K, and MS COCO. Experimental results clearly confirm the effectiveness of the proposed method on both image classification and object detection tasks. Our source code will be publicly available at https://github.com/wencheng-zhu/CKD.
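To make the pairing concrete, below is a minimal PyTorch-style sketch of an InfoNCE objective over sample-wise logits, where the teacher and student logits of the same sample form the positive pair and the other samples in the batch serve as negatives. The function name `ckd_infonce_loss`, the cosine normalization, and the use of in-batch negatives are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def ckd_infonce_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    """Sample-wise contrastive distillation loss (illustrative sketch).

    The i-th student/teacher logit pair is the positive; logits from the
    other samples in the batch act as negatives.
    """
    # Cosine-normalize the logit vectors (an assumption of this sketch).
    s = F.normalize(student_logits, dim=1)      # [B, C]
    t = F.normalize(teacher_logits, dim=1)      # [B, C]

    # Similarity between every student logit and every teacher logit.
    # The abstract reports no temperature tuning, so it is fixed to 1.0 here.
    sim = s @ t.t() / temperature               # [B, B]

    # Positives sit on the diagonal, so InfoNCE reduces to cross-entropy
    # with the sample index as the target class.
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(sim, targets)

# Example: distilling 100-class logits for a batch of 8 samples.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
loss = ckd_infonce_loss(student_logits, teacher_logits)
loss.backward()
```

Under this in-batch formulation, a batch of $B$ samples yields $B$ positive pairs and $B(B-1)$ negatives per step, so the cost grows with the batch size rather than with the $O(n^2)$ pairwise comparisons over the full training set.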
Authors: Wencheng Zhu, Xin Zhou, Pengfei Zhu, Yu Wang, Qinghua Hu