Heterogeneous Generative Knowledge Distillation with Masked Image Modeling (2309.09571v2)
Abstract: Small CNN-based models usually require transferring knowledge from a large model before they are deployed on computationally resource-limited edge devices. Masked image modeling (MIM) methods have achieved great success in various visual tasks but remain largely unexplored for knowledge distillation between heterogeneous deep models, mainly because of the significant architectural discrepancy between large Transformer-based models and small CNN-based networks. In this paper, we develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) method based on MIM, which efficiently transfers knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion. Our method bridges Transformer-based models and CNNs by training a UNet-style student with sparse convolution, which effectively mimics the visual representations the teacher infers under masked modeling. H-GKD is a simple yet effective paradigm for learning the visual representations and data distribution of heterogeneous teacher models that have been pre-trained with advanced generative methods. Extensive experiments show that it adapts well to various model architectures and sizes, consistently achieving state-of-the-art performance on image classification, object detection, and semantic segmentation tasks. For example, on ImageNet-1K, H-GKD improves the accuracy of ResNet-50 (sparse) from 76.98% to 80.01%.
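To make the distillation setup described above more concrete, the following is a minimal sketch of a masked-modeling distillation step in plain PyTorch. It is not the authors' implementation: the `random_patch_mask` helper, the masking ratio, the choice of an MSE feature-mimicking loss, and the assumption that teacher and student emit same-shaped features are all illustrative assumptions (the paper's student additionally uses sparse convolution, which is omitted here).

```python
# Hedged sketch of a masked-image-modeling distillation step.
# Assumptions (not from the paper): random per-patch masking, masked pixels
# zeroed out, MSE between teacher and student features, same output shapes.
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_patch_mask(batch, num_patches, mask_ratio=0.6):
    """Boolean mask per image: True = patch is masked out."""
    scores = torch.rand(batch, num_patches)
    k = int(num_patches * mask_ratio)
    idx = scores.topk(k, dim=1).indices
    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask


def distillation_step(teacher, student, images, patch_size=16):
    """One H-GKD-style step: both networks see the masked image and the
    CNN student is trained to reproduce the frozen teacher's features."""
    b, _, h, w = images.shape
    num_patches = (h // patch_size) * (w // patch_size)
    mask = random_patch_mask(b, num_patches)

    # Up-sample the patch-level mask to pixel resolution and zero masked pixels.
    pixel_mask = mask.view(b, 1, h // patch_size, w // patch_size).float()
    pixel_mask = F.interpolate(pixel_mask, scale_factor=patch_size, mode="nearest")
    masked_images = images * (1.0 - pixel_mask)

    with torch.no_grad():                 # generative teacher is frozen
        target = teacher(masked_images)   # teacher's latent representation
    pred = student(masked_images)         # UNet-style CNN student's prediction

    # Feature-mimicking loss (assumes matching feature shapes for illustration).
    return F.mse_loss(pred, target)
```

In practice the student in the paper operates on the visible patches with submanifold sparse convolution rather than on a zero-filled dense image; the dense masking used here is only a stand-in to keep the sketch self-contained.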