Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning (2407.02014v1)

Published 2 Jul 2024 in cs.CV

Abstract: Existing contrastive learning methods mainly focus on single-grained representation learning, e.g., part-level, object-level or scene-level, and thus inevitably neglect the transferability of representations to other granularity levels. In this paper, we aim to learn multi-grained representations, which can effectively describe the image at various granularity levels and thus improve generalization on a wide range of downstream tasks. To this end, we propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning. Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast over these correspondences to learn more general unsupervised representations. Without pretraining on a large-scale dataset, our method significantly outperforms the existing state-of-the-art methods on extensive downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation and keypoint detection. Moreover, experimental results support the data-efficient property and excellent representation transferability of our method. The source code and trained weights are available at https://github.com/visresearch/mgc.
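
As a rough illustration of what contrasting at several granularity levels can look like, the sketch below pools two grids of patch features at multiple window sizes and applies an InfoNCE loss at each level. It is not the authors' implementation: the abstract does not specify how MGC builds its correspondences, so the sketch assumes the two views are already spatially aligned (region i in one view matches region i in the other). The names `avg_pool`, `info_nce`, `multi_grained_loss`, the window sizes and the temperature are all hypothetical choices made for this example.

```python
import math

def avg_pool(grid, size):
    """Average-pool an H x W grid of feature vectors with a size x size window.

    Returns a flat list of region vectors in row-major order.
    """
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for r in range(0, h, size):
        for c in range(0, w, size):
            cells = [grid[i][j]
                     for i in range(r, min(r + size, h))
                     for j in range(c, min(c + size, w))]
            pooled.append([sum(v[k] for v in cells) / len(cells) for k in range(d)])
    return pooled

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def info_nce(queries, keys, temperature=0.2):
    """Mean InfoNCE loss; queries[i] and keys[i] form the positive pair,
    and every other key acts as a negative for queries[i]."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)

def multi_grained_loss(view_a, view_b, sizes=(1, 2)):
    """Average the InfoNCE loss over several pooling window sizes.

    ASSUMPTION: the views are spatially aligned, so pooled region i in
    view_a corresponds to pooled region i in view_b.
    """
    losses = [info_nce(avg_pool(view_a, s), avg_pool(view_b, s)) for s in sizes]
    return sum(losses) / len(losses)
```

With a 4 x 4 grid of distinct patch features, `multi_grained_loss(grid, grid)` is lower than the loss against a vertically flipped copy of the grid, because flipping breaks the assumed region-to-region alignment at both window sizes.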

Authors (3)
  1. Chengchao Shen
  2. Jianzhong Chen
  3. Jianxin Wang
