Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning (2407.02014v1)
Abstract: Existing contrastive learning methods mainly focus on single-grained representation learning, e.g., part-level, object-level or scene-level representations, and thus inevitably neglect how well the representations transfer to other granularity levels. In this paper, we aim to learn multi-grained representations, which can effectively describe the image at various granularity levels, thus improving generalization on a wide range of downstream tasks. To this end, we propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning. Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast based on these correspondences to learn more general unsupervised representations. Without pretraining on a large-scale dataset, our method significantly outperforms the existing state-of-the-art methods on a wide range of downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation and keypoint detection. Moreover, experimental results support the data efficiency and excellent representation transferability of our method. The source code and trained weights are available at https://github.com/visresearch/mgc.
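The abstract gives no implementation details, so the following is only a minimal sketch of what contrasting at several granularities could look like, not the authors' MGC implementation. It assumes the two views are already spatially aligned (cell (r, c) in one view corresponds to cell (r, c) in the other), builds coarser levels by average pooling, and applies a standard InfoNCE loss at each level. All names (`pool_grid`, `info_nce`, `multi_grained_loss`), the window sizes and the temperature are hypothetical choices made for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def pool_grid(grid, window):
    """Average-pool an H x W grid of feature vectors over non-overlapping windows."""
    height, width, dim = len(grid), len(grid[0]), len(grid[0][0])
    regions = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            cells = [grid[r][c]
                     for r in range(top, min(top + window, height))
                     for c in range(left, min(left + window, width))]
            mean = [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]
            regions.append(l2_normalize(mean))
    return regions


def info_nce(queries, keys, temperature=0.2):
    """Mean InfoNCE loss: queries[i] is positive with keys[i]; other keys are negatives."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [sum(a * b for a, b in zip(query, key)) / temperature for key in keys]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


def multi_grained_loss(batch_a, batch_b, windows=(1, 2, 4)):
    """Average the InfoNCE loss over several pooling windows (fine to coarse).

    Assumption: batch_a[n] and batch_b[n] are aligned views of the same image.
    Regions from every image in the batch serve as negatives for each other.
    """
    losses = []
    for window in windows:
        queries = [reg for grid in batch_a for reg in pool_grid(grid, window)]
        keys = [reg for grid in batch_b for reg in pool_grid(grid, window)]
        losses.append(info_nce(queries, keys))
    return sum(losses) / len(losses)


def random_grid(size=4, dim=8):
    """Toy stand-in for a backbone's patch features."""
    return [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(size)]
            for _ in range(size)]


def jitter(grid, scale=0.01):
    """Toy stand-in for a second augmented view: the same features plus small noise."""
    return [[[x + random.gauss(0, scale) for x in cell] for cell in row] for row in grid]


random.seed(0)
batch_a = [random_grid() for _ in range(4)]
batch_b = [jitter(grid) for grid in batch_a]

aligned = multi_grained_loss(batch_a, batch_b)
# Rotate the second batch so every "positive" pair comes from different images.
mismatched = multi_grained_loss(batch_a, batch_b[1:] + batch_b[:1])
print(f"aligned: {aligned:.3f}  mismatched: {mismatched:.3f}")
```

With correct correspondences the loss should come out clearly lower than with mismatched pairs, which is the signal such an objective would train on. In this toy setup the 1x1, 2x2 and 4x4 windows loosely stand in for part-level, object-level and scene-level granularity. The paper's actual correspondence construction between differently cropped views is not described in the abstract and is not modeled here.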
Authors: Chengchao Shen, Jianzhong Chen, Jianxin Wang