CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping (2310.07855v2)
Abstract: Nearest neighbor retrieval has proven beneficial for self-supervised representation learning on object-centric images. The approach is limited on scene-centric datasets, however, where the multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method (CrIBo) tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout training, CrIBo emerges as a notably strong candidate for in-context learning, which leverages nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on in-context learning while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models are publicly available at https://github.com/tileb1/CrIBo.
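To make the idea of object-level nearest neighbor retrieval concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each image has already been split into object clusters (the `assignments` array), that an object representation is the mean of its patch features, and that candidate objects from other images sit in a flat memory bank. The function names `pool_objects` and `nearest_objects`, the bank layout, and the use of cosine similarity are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def pool_objects(patch_feats, assignments, num_objects):
    """Average the patch features assigned to each object (hypothetical pooling).

    patch_feats: (num_patches, dim) array of dense features for one image.
    assignments: (num_patches,) integer array, object index of each patch.
    Objects with no patches keep a zero vector.
    """
    pooled = np.zeros((num_objects, patch_feats.shape[1]))
    for k in range(num_objects):
        mask = assignments == k
        if mask.any():
            pooled[k] = patch_feats[mask].mean(axis=0)
    return pooled


def nearest_objects(queries, bank, bank_image_ids, query_image_id):
    """Return, for each query object, the index of its nearest bank object
    taken from a different image (the cross-image part of the retrieval).

    If every bank entry comes from the query image, index 0 is returned
    for all queries; a real pipeline would need to handle that case.
    """
    sims = l2_normalize(queries) @ l2_normalize(bank).T
    # Exclude objects that come from the same image as the queries.
    sims[:, bank_image_ids == query_image_id] = -np.inf
    return sims.argmax(axis=1)
```

With this layout, the retrieved bank entries could serve as targets for the query objects in place of a single global image-level neighbor, which is the distinction the abstract draws between object-level and global bootstrapping.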