Improve Knowledge Distillation via Label Revision and Data Selection (2404.03693v1)
Abstract: Knowledge distillation (KD) has become a widely used technique in model compression; it aims to transfer knowledge from a large teacher model to a lightweight student model for efficient network development. In addition to supervision from the ground truth, the vanilla KD method treats the predictions of the teacher as soft labels that supervise the training of the student model. Building on vanilla KD, various approaches have been developed to further improve the performance of the student model. However, few of these previous methods consider the reliability of the supervision provided by the teacher: supervision derived from erroneous predictions may mislead the training of the student. This paper therefore tackles the problem from two aspects: Label Revision, which rectifies incorrect supervision, and Data Selection, which selects appropriate samples for distillation to reduce the impact of erroneous supervision. In the former, we rectify the teacher's inaccurate predictions using the ground truth. In the latter, we introduce a data selection technique that chooses suitable training samples to be supervised by the teacher, thereby reducing the impact of incorrect predictions to some extent. Experimental results demonstrate the effectiveness of the proposed method and show that it can be combined with other distillation approaches to improve their performance.
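To make the two ideas concrete, below is a minimal PyTorch sketch of one way Label Revision and Data Selection could be wired into a distillation loss. The blending rule, the agreement-based selection criterion, and the hyperparameters (`temperature`, `alpha`, `beta`) are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def distillation_step(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5, beta=0.5):
    """Sketch of a KD loss with label revision and data selection (assumed forms)."""
    # Standard cross-entropy against the ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, targets)

    with torch.no_grad():
        teacher_prob = F.softmax(teacher_logits / temperature, dim=1)
        one_hot = F.one_hot(targets, num_classes=teacher_logits.size(1)).float()

        # Label Revision (assumed form): blend the teacher's soft prediction with
        # the ground-truth one-hot vector so that wrong teacher outputs are pulled
        # back toward the correct class.
        revised_soft = beta * teacher_prob + (1.0 - beta) * one_hot

        # Data Selection (assumed criterion): keep only samples on which the
        # teacher already predicts the correct class for teacher supervision.
        selected = teacher_logits.argmax(dim=1).eq(targets).float()

    log_student = F.log_softmax(student_logits / temperature, dim=1)
    # Per-sample KL divergence between the student and the revised soft labels.
    kd_per_sample = F.kl_div(log_student, revised_soft, reduction="none").sum(dim=1)
    kd_loss = (kd_per_sample * selected).sum() / selected.sum().clamp(min=1.0)

    # Conventional KD weighting; the temperature**2 factor keeps the gradient
    # scale of the soft-label term comparable to the cross-entropy term.
    return ce_loss + alpha * (temperature ** 2) * kd_loss


if __name__ == "__main__":
    # Toy usage: batch of 8 samples, 10 classes.
    student_logits = torch.randn(8, 10, requires_grad=True)
    teacher_logits = torch.randn(8, 10)
    targets = torch.randint(0, 10, (8,))
    loss = distillation_step(student_logits, teacher_logits, targets)
    loss.backward()
```

Since the paper reports that its method can be combined with other distillation approaches, a sketch like this would in practice sit on top of whichever base distillation objective is being improved.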
Authors: Weichao Lan, Qing Xu, Buhua Liu, Zhikai Hu, Mengke Li, Zhenghua Chen, Yiu-Ming Cheung