
The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models (2404.11957v1)

Published 18 Apr 2024 in cs.CV

Abstract: Foundation models, pre-trained on large amounts of data, have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks that rely heavily on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, i.e., these foundation models fail to discern boundaries between individual objects. For the first time, we show that CLIP, which has never accessed any instance-level annotations, can provide a strong and highly beneficial instance-level boundary prior in the clustering results of a particular intermediate layer. Following this surprising observation, we propose Zip, which Zips up CLIP and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on the COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves performance comparable to the best-performing open-vocabulary object detectors that use base annotations. Code is released at https://github.com/ChengShiest/Zip-Your-CLIP
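To make the abstract's pipeline concrete, below is a minimal sketch of the general idea, not the authors' released implementation: cluster patch tokens from one intermediate CLIP layer to obtain rough instance regions, then prompt SAM once per region. The layer index, cluster count, checkpoint path, and centroid-based point-prompt strategy are all illustrative assumptions; see the linked repository for the actual method.

```python
# Sketch only: CLIP intermediate-layer clustering as an instance prior for SAM.
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from sklearn.cluster import KMeans
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Capture token activations from an intermediate transformer block via a hook.
features = {}
layer_idx = 8  # assumed; the paper probes for the most boundary-aware layer


def hook(module, inputs, output):
    features["tokens"] = output  # shape: (seq_len, batch, dim) in OpenAI CLIP


model.visual.transformer.resblocks[layer_idx].register_forward_hook(hook)

image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    model.encode_image(preprocess(image).unsqueeze(0).to(device))

tokens = features["tokens"].permute(1, 0, 2)[0, 1:]  # drop CLS -> (patches, dim)
grid = int(tokens.shape[0] ** 0.5)  # 14x14 patch grid for ViT-B/16 at 224 px
patches = tokens.float().cpu().numpy()

# K-means over patch embeddings; clusters tend to respect object boundaries.
k = 6  # assumed cluster count
labels = KMeans(n_clusters=k, n_init=10).fit_predict(patches).reshape(grid, grid)

# Prompt SAM once per cluster, using the cluster centroid as a point prompt.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))
h, w = np.array(image).shape[:2]
for c in range(k):
    ys, xs = np.nonzero(labels == c)
    if len(xs) == 0:
        continue
    # Map patch-grid centroid back to image coordinates (crop offset ignored).
    point = np.array([[xs.mean() * w / grid, ys.mean() * h / grid]])
    masks, scores, _ = predictor.predict(point_coords=point,
                                         point_labels=np.array([1]))
```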
