Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision (2402.08960v2)

Published 14 Feb 2024 in cs.CV and cs.AI

Abstract: Current state-of-the-art open-vocabulary segmentation methods typically rely on image-mask-text triplet annotations for supervision. However, acquiring such detailed annotations is labour-intensive and poses scalability challenges in complex real-world scenarios. While existing weakly-supervised approaches leverage image-text pairs to reduce the expensive annotation cost, the lack of mask supervision makes it difficult for the model to locate multiple instances and accurately group pixels with similar semantics, significantly hampering versatility and performance. In this paper, we introduce Unpair-Seg, a novel weakly-supervised open-vocabulary segmentation framework that learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected. Unpair-Seg initially predicts a set of binary masks and generates pseudo labels by identifying confident pairs of masks and text entities. We then train a feature adapter to align region embeddings with text embeddings based on these pseudo labels, achieving open-vocabulary segmentation. However, the inherent noise in the mask-entity correspondence poses a challenge to obtaining reliable pairs. To address this, we employ a vision-language large model to re-caption the input images and extract precise entities, and we design a multi-scale matching strategy to reduce noisy mask-entity pairs. Our Unpair-Seg framework demonstrates impressive performance, achieving 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets, significantly narrowing the gap between fully-supervised and weakly-supervised methods.

Authors (7)
  1. Zhaoqing Wang (15 papers)
  2. Xiaobo Xia (43 papers)
  3. Ziye Chen (5 papers)
  4. Xiao He (54 papers)
  5. Yandong Guo (78 papers)
  6. Mingming Gong (135 papers)
  7. Tongliang Liu (251 papers)
Citations (6)

Summary

Uni-OVSeg: Enhancing Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision

Introduction to Open-Vocabulary Segmentation

Open-vocabulary segmentation has attracted intense research interest because it can make computer vision systems considerably more flexible and widely applicable. Unlike traditional segmentation methods, which rely on a limited, predefined vocabulary, open-vocabulary segmentation aims to identify and categorize objects across an unrestricted range of categories, whether or not those categories were seen during training. This capability could benefit a range of domains, from autonomous vehicle navigation to medical diagnostics.

The Limitation of Existing Methods

Current state-of-the-art methods predominantly supervise their models with image-mask-text triplets. While effective, such detailed annotations are labor-intensive to acquire, which limits scalability in complex real-world scenarios. Weakly-supervised approaches reduce annotation cost by relying on image-text pairs alone, but without mask supervision they struggle to locate multiple instances and to group pixels with similar semantics accurately, which hampers performance.

Uni-OVSeg: A Novel Framework

This paper introduces Uni-OVSeg (called Unpair-Seg in the abstract above), a weakly-supervised framework for open-vocabulary segmentation that removes the need for image-mask-text triplet annotations. Instead, it learns from unpaired image-mask and image-text pairs, which can be collected independently and efficiently. This substantially reduces annotation cost while aiming to preserve segmentation quality.

Technical Innovations of Uni-OVSeg

  • Mask Generation: The model uses independent image-mask pairs to learn to predict a set of binary masks. These masks are then assigned to entities extracted from the text descriptions of unpaired image-text pairs.
  • Mask-Text Alignment: To establish reliable correspondences between masks and text entities, Uni-OVSeg matches them in the CLIP embedding space and uses a multi-scale ensemble to stabilize mask-text matching despite the inherent noise in the correspondence.
  • Open-Vocabulary Segmentation: The category names of the target dataset are embedded as text, and each predicted mask is assigned a category in a zero-shot manner, so the vocabulary is not restricted to a fixed set.
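
The matching and zero-shot steps in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the greedy one-to-one matching rule, the 0.5 confidence threshold, and the toy 2-D vectors standing in for CLIP embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ensemble_similarity(mask_embs_per_scale, text_embs):
    """Average the mask-text similarity matrix over scales.

    mask_embs_per_scale[s][i] is the embedding of mask i at scale s.
    """
    n_scales = len(mask_embs_per_scale)
    n_masks = len(mask_embs_per_scale[0])
    sim = [[0.0] * len(text_embs) for _ in range(n_masks)]
    for scale in mask_embs_per_scale:
        for i, mask_emb in enumerate(scale):
            for j, text_emb in enumerate(text_embs):
                sim[i][j] += cosine(mask_emb, text_emb) / n_scales
    return sim

def confident_pairs(sim, threshold=0.5):
    """Greedy one-to-one matching (an assumed stand-in for the paper's matcher).

    Pairs are visited from highest to lowest similarity; a pair is kept only
    if it clears the threshold and neither its mask nor its entity is taken.
    """
    candidates = sorted(
        ((s, i, j) for i, row in enumerate(sim) for j, s in enumerate(row)),
        reverse=True,
    )
    used_masks, used_texts, pairs = set(), set(), []
    for s, i, j in candidates:
        if s < threshold:
            break
        if i not in used_masks and j not in used_texts:
            pairs.append((i, j))
            used_masks.add(i)
            used_texts.add(j)
    return pairs

def zero_shot_labels(mask_embs, class_embs, class_names):
    """Assign each mask the class whose text embedding is most similar."""
    labels = []
    for mask_emb in mask_embs:
        best = max(range(len(class_embs)),
                   key=lambda j: cosine(mask_emb, class_embs[j]))
        labels.append(class_names[best])
    return labels

# Toy data: three masks embedded at two scales, two text entities.
scales = [
    [[1.0, 0.1], [0.1, 1.0], [0.7, 0.7]],
    [[0.9, 0.2], [0.0, 1.0], [0.6, 0.8]],
]
entity_names = ["dog", "tree"]
entity_embs = [[1.0, 0.0], [0.0, 1.0]]

sim = ensemble_similarity(scales, entity_embs)
pairs = confident_pairs(sim)  # (mask index, entity index) pseudo labels
labels = zero_shot_labels([[1.0, 0.1], [0.6, 0.8]], entity_embs, entity_names)
```

In this toy run the third mask is similar to both entities but stays unmatched, because each entity is already claimed by a more confident mask. That is the intended effect of keeping only confident pairs as pseudo labels.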

Performance and Contributions

Uni-OVSeg outperforms previous weakly-supervised methods across several benchmark datasets, with a reported improvement of 15.5% mIoU on ADE20K. On the more challenging ADE-847 and PASCAL Context-459 datasets, the abstract reports 14.6% and 19.5% mIoU respectively, which significantly narrows the gap between weakly-supervised and fully-supervised methods. These gains are attributed to effective alignment of mask-wise embeddings with entity embeddings and to the handling of noise in mask-text correspondences.
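
For readers unfamiliar with the metric, mIoU (mean intersection over union) can be sketched as follows. This is a simplified illustration, not the benchmark evaluation code. The function name and the flat-list label format are assumptions, and benchmark protocols usually accumulate intersections and unions over a whole dataset rather than over a single label map.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two flat lists of per-pixel class ids."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is 0.5, which would be reported as 50% mIoU.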

Broader Implications

Uni-OVSeg is a step towards efficient and scalable open-vocabulary segmentation. By reducing the dependency on labor-intensive annotations while improving segmentation performance, it makes vision perception systems more accessible. Potential applications include autonomous driving, content filtering, and assistive technologies, and the results highlight the potential of weakly-supervised learning paradigms.

Looking Forward

The work encourages further research into reducing the annotation burden and improving the robustness and adaptability of segmentation models to unseen categories. The methods and insights presented may inform future vision systems that must handle the complexity of real-world scenes.
