Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation (2309.13505v4)
Abstract: Vision-Language Pre-training has demonstrated its remarkable zero-shot recognition ability and potential to learn generalizable visual representations from language supervision. Taking a step ahead, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, the state-of-the-art suffers from clear semantic gaps between visual and textual modality: plenty of visual concepts appeared in images are missing in their paired captions. Such semantic misalignment circulates in pre-training, leading to inferior zero-shot performance in dense predictions due to insufficient visual concepts captured in textual representations. To close such semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually-matched concepts with our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can thus be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and greatly boosts language-supervised segmentation baseline by a large margin, suggesting the value of bridging semantic gap in pre-training data.
- Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
- Single-stage semantic segmentation from image labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4253–4262, 2020.
- Romain Beaumont. Clip retrieval: Easily compute clip embeddings and build a clip retrieval system with them. GitHub, 2022.
- Zero-shot semantic segmentation. Advances in Neural Information Processing Systems, 32, 2019.
- Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018.
- Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
- Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. arXiv preprint arXiv:2212.00785, 2022.
- Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
- Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
- Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
- The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–308, 2009.
- From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1473–1482, 2015.
- Large-scale unsupervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- Open-vocabulary image segmentation. arXiv preprint arXiv:2112.12143, 2021.
- Scaling open-vocabulary image segmentation with image-level labels. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, pages 540–557. Springer, 2022.
- Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022.
- Consistent structural relation learning for zero-shot segmentation. Advances in Neural Information Processing Systems, 33:10317–10327, 2020.
- Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
- Comprehending and ordering semantics for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17990–17999, 2022.
- Open-vocabulary semantic segmentation with mask-adapted clip. arXiv preprint arXiv:2210.04150, 2022.
- Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Open-world semantic segmentation via contrasting and clustering vision-language embedding. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XX, pages 275–292. Springer, 2022.
- Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 891–898, 2014.
- Slip: Self-supervision meets language-image pre-training. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI, pages 529–544. Springer, 2022.
- Open vocabulary semantic segmentation with patch aligned contrastive learning. arXiv preprint arXiv:2212.04994, 2022.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Viewco: Discovering text-supervised segmentation masks via multi-view semantic consistency. arXiv preprint arXiv:2302.10307, 2023.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- Namedmask: Distilling segmenters from complementary foundation models. arXiv preprint arXiv:2209.11228, 2022.
- Reco: Retrieve and co-segment for zero-shot transfer. arXiv preprint arXiv:2206.07045, 2022.
- Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
- Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
- Data efficient language-supervised zero-shot recognition with optimal transport distillation. In International Conference on Learning Representations, 2022.
- Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
- Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18134–18144, 2022.
- Learning open-vocabulary semantic segmentation models from natural language supervision. arXiv preprint arXiv:2301.09121, 2023.
- Side adapter network for open-vocabulary semantic segmentation. arXiv preprint arXiv:2302.12242, 2023.
- A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIX, pages 736–753. Springer, 2022.
- Filip: fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021.
- A simple framework for open-vocabulary segmentation and detection. arXiv preprint arXiv:2303.08131, 2023.
- Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
- Extract free dense labels from clip. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 696–712. Springer, 2022.
- Generalized decoding for pixel, image, and language. arXiv preprint arXiv:2212.11270, 2022.
- Yun Xing (14 papers)
- Jian Kang (142 papers)
- Aoran Xiao (24 papers)
- Jiahao Nie (17 papers)
- Ling Shao (244 papers)
- Shijian Lu (151 papers)