Understanding Self-Supervised Pretraining with Part-Aware Representation Learning (2301.11915v2)
Abstract: In this paper, we seek to understand self-supervised pretraining by studying the capability of self-supervised representation pretraining methods to learn part-aware representations. The study is mainly motivated by the observation that the random views used in contrastive learning and the random masked (visible) patches used in masked image modeling often cover only object parts. We explain that contrastive learning is a part-to-whole task: the projection layer hallucinates the whole-object representation from the object-part representation produced by the encoder; and that masked image modeling is a part-to-part task: the masked patches of the object are hallucinated from the visible patches. This explanation suggests that a self-supervised pretrained encoder must understand object parts. We empirically compare off-the-shelf encoders pretrained with several representative methods on object-level recognition and part-level recognition. The results show that the fully-supervised model outperforms the self-supervised models on object-level recognition, while most self-supervised contrastive learning and masked image modeling methods outperform the fully-supervised method on part-level recognition. We also observe that combining contrastive learning and masked image modeling further improves performance.
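To make the part-to-whole vs. part-to-part framing concrete, below is a minimal, hedged PyTorch sketch. It is not the paper's code: `ToyEncoder`, the tiny MLP projection head, and the single-layer decoder are illustrative stand-ins for the ViT encoder, projection layer, and MIM decoder discussed in the abstract, and the loss functions are generic (InfoNCE-style for contrastive learning, pixel regression for masked image modeling).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for the pretrained encoder (a ViT in the paper)."""
    def __init__(self, patch_dim=48, embed_dim=64):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)

    def forward(self, patches):          # patches: (B, N, patch_dim)
        return self.proj(patches)        # per-patch representations


# Part-to-whole view of contrastive learning: two random crops ("parts") are
# encoded, a projection head maps each pooled part representation toward a
# crop-invariant ("whole object") space, and an InfoNCE-style loss pulls the
# two views of the same image together.
def contrastive_part_to_whole(encoder, head, view1, view2, temperature=0.2):
    z1 = F.normalize(head(encoder(view1).mean(dim=1)), dim=-1)   # (B, D)
    z2 = F.normalize(head(encoder(view2).mean(dim=1)), dim=-1)   # (B, D)
    logits = z1 @ z2.t() / temperature                           # (B, B)
    targets = torch.arange(z1.size(0))   # matching views sit on the diagonal
    return F.cross_entropy(logits, targets)


# Part-to-part view of masked image modeling: only the visible patches pass
# through the encoder, and a light decoder regresses the hidden patches from
# them, i.e. one set of parts predicts another.
def mim_part_to_part(encoder, decoder, patches, mask_ratio=0.6):
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    perm = torch.randperm(N)
    keep, hide = perm[:n_keep], perm[n_keep:]
    latent = encoder(patches[:, keep])                 # visible parts only
    pred = decoder(latent.mean(dim=1))                 # (B, n_hide * D)
    target = patches[:, hide].reshape(B, -1)
    return F.mse_loss(pred, target)


if __name__ == "__main__":
    B, N, patch_dim, embed_dim = 4, 16, 48, 64
    encoder = ToyEncoder(patch_dim, embed_dim)
    head = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                         nn.Linear(embed_dim, embed_dim))
    n_hide = N - int(N * (1 - 0.6))                    # matches mask_ratio=0.6
    decoder = nn.Linear(embed_dim, n_hide * patch_dim)

    view1 = torch.randn(B, N, patch_dim)
    view2 = torch.randn(B, N, patch_dim)
    patches = torch.randn(B, N, patch_dim)
    print("contrastive loss:", contrastive_part_to_whole(encoder, head, view1, view2).item())
    print("mim loss:", mim_part_to_part(encoder, decoder, patches).item())
```

In both sketches the encoder only ever sees object parts (random crops or visible patches); the "whole" or the missing parts are recovered by layers that are discarded after pretraining, which is the intuition behind why the pretrained encoder itself ends up part-aware.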