
HEAP: Unsupervised Object Discovery and Localization with Contrastive Grouping (2312.17492v2)

Published 29 Dec 2023 in cs.CV

Abstract: Unsupervised object discovery and localization aims to detect or segment objects in an image without any supervision. Recent efforts have demonstrated notable potential to identify salient foreground objects by utilizing self-supervised transformer features. However, these methods build only upon patch-level features within an image, neglecting region/image-level and cross-image relationships at a broader scale. Moreover, they cannot differentiate the semantics of multiple instances. To address these problems, we introduce Hierarchical mErging framework via contrAstive grouPing (HEAP). Specifically, a novel lightweight head with a cross-attention mechanism is designed to adaptively group intra-image patches into semantically coherent regions based on correlation among self-supervised features. Further, to ensure distinguishability among regions, we introduce a region-level contrastive clustering loss that pulls similar regions closer across images. In addition, an image-level contrastive loss is introduced to push foreground and background representations apart, through which foreground objects and background are discovered. HEAP facilitates efficient hierarchical image decomposition, which contributes to more accurate object discovery while also enabling differentiation among objects of various classes. Extensive experimental results on semantic segmentation retrieval, unsupervised object discovery, and saliency detection tasks demonstrate that HEAP achieves state-of-the-art performance.
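
The two ideas named in the abstract, grouping patches into regions and contrasting representations, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the head or the losses in detail, so the sketch assumes a plain dot-product soft assignment of patches to a set of group tokens (a stand-in for the cross-attention head) and a generic InfoNCE-style loss (a stand-in for the region-level and image-level contrastive losses). The names `group_patches`, `info_nce`, the group tokens, and the temperature values are all hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def group_patches(patches, group_tokens, temperature=1.0):
    """Softly assign each patch to a group token, then pool patches per group.

    Assumption: dot-product similarity with a softmax over groups, used here
    as a simplified stand-in for a cross-attention grouping head.
    """
    # assign[i][k] is the weight of patch i for group k (each row sums to 1).
    assign = [softmax([dot(p, g) / temperature for g in group_tokens])
              for p in patches]
    dim = len(patches[0])
    regions = []
    for k in range(len(group_tokens)):
        weight = sum(a[k] for a in assign)
        # Region embedding: assignment-weighted mean of the patch features.
        regions.append([sum(a[k] * p[d] for a, p in zip(assign, patches)) / weight
                        for d in range(dim)])
    return assign, regions

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: low when the anchor is closer to the positive
    than to the negatives (cosine similarity)."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])

# Toy usage: four 2-D "patch features" and two hypothetical group tokens.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
assign, regions = group_patches(patches, tokens)
# Treat the two pooled regions as a positive pair against one negative vector.
loss = info_nce(regions[0], regions[0], [regions[1]])
```

In this toy setup the first two patches lean toward the first group and the last two toward the second. The loss shrinks as an anchor aligns with its positive and moves away from its negatives, which is the pull-closer and push-apart behavior the abstract describes at the region and image levels.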

