Heterogeneous Generative Knowledge Distillation with Masked Image Modeling (2309.09571v2)

Published 18 Sep 2023 in cs.CV and cs.AI

Abstract: Small CNN-based models usually require transferring knowledge from a large model before they are deployed on computationally resource-limited edge devices. Masked image modeling (MIM) methods have achieved great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models. This is mainly due to the significant discrepancy between the Transformer-based large model and the CNN-based small network. In this paper, we develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) method based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion. Our method builds a bridge between Transformer-based models and CNNs by training a UNet-style student with sparse convolution, which can effectively mimic the visual representation inferred by a teacher over masked modeling. Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models, which can be pre-trained using advanced generative methods. Extensive experiments show that it adapts well to various models and sizes, consistently achieving state-of-the-art performance in image classification, object detection, and semantic segmentation tasks. For example, on the ImageNet-1K dataset, H-GKD improves the accuracy of ResNet-50 (sparse) from 76.98% to 80.01%.
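
The core recipe described in the abstract can be sketched in a few lines of PyTorch: a frozen Transformer teacher encodes a masked image, and a CNN student is trained to reproduce the teacher's token-level representation of that same masked view. The sketch below is only an illustration under assumed names and shapes, not the paper's implementation: ToyTransformerTeacher, ToyConvStudent, and random_patch_mask are hypothetical stand-ins; the teacher here is randomly initialized rather than pre-trained with a generative method; the student uses ordinary convolutions over zeroed-out pixels instead of the paper's sparse convolutions; and both sides are given matching 256-dimensional features so no projection head is needed.

```python
# Illustrative sketch only: a frozen Transformer "teacher" encodes a masked image,
# and a CNN "student" is trained to reproduce the teacher's token-level features.
# Sparse convolution, pre-trained weights, and the paper's exact losses are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTransformerTeacher(nn.Module):
    """Hypothetical stand-in for a pre-trained Transformer teacher (randomly initialized here)."""

    def __init__(self, patch: int = 16, dim: int = 256):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens)


class ToyConvStudent(nn.Module):
    """Hypothetical stand-in for the UNet-style CNN student; plain convs approximate sparse convs."""

    def __init__(self, dim: int = 256, grid: int = 14):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(grid)  # align with the teacher's 14x14 token grid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.pool(self.backbone(x))       # (B, dim, grid, grid)
        return feat.flatten(2).transpose(1, 2)   # (B, N, dim)


def random_patch_mask(images: torch.Tensor, patch: int = 16, ratio: float = 0.6) -> torch.Tensor:
    """Zero out a random subset of patches, mimicking MIM-style masking."""
    b, _, h, w = images.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=images.device) > ratio).float()
    return images * F.interpolate(keep, scale_factor=patch, mode="nearest")


teacher, student = ToyTransformerTeacher(), ToyConvStudent()
teacher.eval()                                   # the teacher stays frozen
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

images = torch.randn(4, 3, 224, 224)             # dummy batch in place of ImageNet data
masked = random_patch_mask(images)

with torch.no_grad():
    target = teacher(masked)                     # teacher's representation of the masked view
pred = student(masked)                           # student predicts the same representation

opt.zero_grad()
loss = F.mse_loss(pred, target)                  # simple feature-mimicking loss
loss.backward()
opt.step()
print(f"distillation loss: {loss.item():.4f}")
```

In the actual method, per the abstract, the teacher would be a large model pre-trained with an advanced generative (MIM-style) objective and the student a sparse-convolution UNet; the plain MSE over all tokens above is a simplification of that feature-mimicking objective.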

