CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping (2310.07855v2)

Published 11 Oct 2023 in cs.CV and cs.LG

Abstract: Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method (CrIBo) tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout training, CrIBo emerges as a notably strong and well-suited candidate for in-context learning, which relies on nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models are publicly available at https://github.com/tileb1/CrIBo.
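
To make the idea of object-level nearest neighbor retrieval across images concrete, the following is a minimal sketch and not the authors' implementation. It assumes each image has already been split into objects with one embedding per object, and that matching uses cosine similarity; the abstract specifies neither. The names `memory_bank`, `nearest_object`, and `exclude_image` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_object(query, memory_bank, exclude_image=None):
    """Return (image_id, object_id, similarity) of the most similar stored object.

    Objects belonging to `exclude_image` are skipped, so the match always
    comes from a different image (the "cross-image" part of the idea).
    """
    best = None
    for (image_id, object_id), embedding in memory_bank.items():
        if image_id == exclude_image:
            continue
        sim = cosine(query, embedding)
        if best is None or sim > best[2]:
            best = (image_id, object_id, sim)
    return best


# Hypothetical object-level embeddings, keyed by (image id, object index).
memory_bank = {
    ("img_a", 0): [1.0, 0.0, 0.0],
    ("img_a", 1): [0.0, 1.0, 0.0],
    ("img_b", 0): [0.9, 0.1, 0.0],
    ("img_c", 0): [0.0, 0.2, 1.0],
}

# Query with object 0 of img_a and look for its nearest neighbor in other images.
match = nearest_object(memory_bank[("img_a", 0)], memory_bank, exclude_image="img_a")
```

Here the query object from `img_a` is matched to object 0 of `img_b`. In a bootstrapping setup, a retrieved object of this kind would serve as an additional positive for the query object, in contrast to retrieving a neighbor for the whole image at once, which is the global bootstrapping the abstract argues can entangle object representations in scene-centric data.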
