
Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners (2306.15876v1)

Published 28 Jun 2023 in cs.CV

Abstract: Representation learning has been evolving from traditional supervised training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous works have demonstrated their respective pros and cons: CL and supervised pre-training excel at capturing longer-range global patterns and enabling better feature discrimination, while MIM introduces more local and diverse attention across all transformer layers. In this paper, we explore how to obtain a model that combines their strengths. We start by examining previous feature distillation and masked feature reconstruction methods and identifying their limitations. We find that the increased diversity they offer derives mainly from asymmetric designs, which may in turn compromise discrimination ability. To obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy, which uses both a supervised/CL teacher and an MIM teacher to jointly guide the student model. Hybrid Distill imitates the token relations of the MIM teacher to alleviate attention collapse, and distills the feature maps of the supervised/CL teacher to enable discrimination. Furthermore, a progressive redundant token masking strategy is used to reduce distillation costs and avoid falling into local optima. Experimental results show that Hybrid Distill achieves superior performance on different benchmarks.
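
The abstract describes a dual-teacher objective: relation distillation from the MIM teacher plus feature distillation from the supervised/CL teacher. The sketch below illustrates what such a combined loss could look like; it is a minimal, hedged approximation, not the paper's released code, and the function names, relation definition, and loss weights (alpha, beta) are assumptions for illustration.

```python
# Minimal sketch of a dual-teacher (hybrid) distillation loss.
# Illustrative only: names, the token-relation definition, and the
# loss weights are assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def token_relation(tokens):
    """Pairwise token-relation matrix from normalized token features."""
    tokens = F.normalize(tokens, dim=-1)                     # (B, N, C)
    return (tokens @ tokens.transpose(-2, -1)).softmax(-1)   # (B, N, N)

def hybrid_distill_loss(student_tokens, mim_tokens, cl_tokens,
                        alpha=1.0, beta=1.0):
    # Relation distillation from the MIM teacher: match token-to-token
    # relations so the student keeps diverse attention patterns
    # (intended to alleviate attention collapse).
    rel_loss = F.mse_loss(token_relation(student_tokens),
                          token_relation(mim_tokens).detach())
    # Feature distillation from the supervised/CL teacher: match feature
    # maps so the student retains discriminative representations.
    feat_loss = F.smooth_l1_loss(student_tokens, cl_tokens.detach())
    return alpha * rel_loss + beta * feat_loss
```

In this reading, the progressive redundant token masking mentioned in the abstract would further restrict which student tokens enter the loss, trading a small amount of supervision for lower distillation cost.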
