Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision (2402.08960v2)
Abstract: Current state-of-the-art open-vocabulary segmentation methods typically rely on image-mask-text triplet annotations for supervision. However, acquiring such detailed annotations is labour-intensive and poses scalability challenges in complex real-world scenarios. While existing weakly-supervised approaches leverage image-text pairs to reduce the expensive annotation cost, the lack of mask supervision makes it difficult for the model to locate multiple instances and accurately group pixels with similar semantics, significantly hampering versatility and performance. In this paper, we introduce Unpair-Seg, a novel weakly-supervised open-vocabulary segmentation framework that learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected. Unpair-Seg first predicts a set of binary masks and generates pseudo labels by identifying confident pairs of masks and text entities. We then train a feature adapter to align region embeddings with text embeddings based on these pseudo labels, achieving open-vocabulary segmentation. However, the inherent noise in the mask-entity correspondence makes it challenging to obtain reliable pairs. To address this, we employ a large vision-language model to re-caption the input images and extract precise entities, and we design a multi-scale matching strategy to reduce noisy mask-entity pairs. Our Unpair-Seg framework achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets, respectively, significantly narrowing the gap between fully-supervised and weakly-supervised methods.
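The pseudo-labelling step described in the abstract (pairing predicted masks with text entities and keeping only confident pairs) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes region and text-entity embeddings are already available as plain vectors, and it swaps in a simple mutual-best-match rule with a cosine-similarity threshold in place of the paper's multi-scale matching strategy. The function name `confident_pairs`, the toy embeddings, and the `threshold` value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def confident_pairs(mask_embs, entity_embs, threshold=0.5):
    """Return (mask_idx, entity_idx, score) for mutually best matches.

    A pair is kept only if the entity is the mask's most similar entity,
    the mask is that entity's most similar mask, and the similarity
    reaches `threshold` (an assumed confidence rule for illustration).
    """
    if not mask_embs or not entity_embs:
        return []
    # Similarity matrix: rows are masks, columns are text entities.
    sim = [[cosine(m, e) for e in entity_embs] for m in mask_embs]
    pairs = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=row.__getitem__)            # best entity for mask i
        best_i = max(range(len(sim)), key=lambda k: sim[k][j])   # best mask for entity j
        if best_i == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy example: three region embeddings, two text-entity embeddings.
mask_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
entity_embs = [[0.9, 0.1], [0.1, 0.9]]
pairs = confident_pairs(mask_embs, entity_embs, threshold=0.8)
```

In this toy input the third mask is ambiguous between the two entities, so it receives no pseudo label. The retained pairs would then serve as targets for training the feature adapter that aligns region embeddings with text embeddings.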
Authors: Zhaoqing Wang, Xiaobo Xia, Ziye Chen, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu