Auto-Vocabulary Semantic Segmentation (2312.04539v2)
Abstract: Open-ended image understanding tasks have gained significant attention from the research community, particularly with the emergence of Vision-LLMs. Open-Vocabulary Segmentation (OVS) methods can perform semantic segmentation without relying on a fixed vocabulary, and in some cases operate without any training or fine-tuning. However, OVS methods typically require users to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the need to predefine object categories for segmentation. Our approach presents a framework that autonomously identifies relevant class names using enhanced BLIP embeddings, which are then used for segmentation. Because open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop an LLM-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated class names and their corresponding segments. Our method sets new benchmarks for AVS on datasets such as PASCAL VOC and Context, ADE20K, and Cityscapes, and shows competitive performance compared with OVS methods that require specified class names.
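To make the evaluation problem concrete, here is a minimal sketch of how automatically generated class names might be scored against a fixed ground-truth vocabulary. It is not the authors' LAVE implementation: the hand-written `name_to_target` dictionary merely stands in for whatever mapping an LLM would produce, and the function names `remap_to_fixed_vocab` and `mean_iou`, the toy masks, and the `ignore_index` value of 255 are all assumptions made for illustration.

```python
def remap_to_fixed_vocab(pred_mask, pred_names, name_to_target,
                         target_classes, ignore_index=255):
    """Relabel a mask indexed by `pred_names` into ids indexing `target_classes`.

    `name_to_target` is a hypothetical stand-in for an LLM-produced mapping
    from free-form names to dataset class names. Unmapped names become
    `ignore_index`.
    """
    lut = []
    for name in pred_names:
        target = name_to_target.get(name)
        if target in target_classes:
            lut.append(target_classes.index(target))
        else:
            lut.append(ignore_index)
    return [[lut[i] for i in row] for row in pred_mask]


def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes present in pred or gt."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g == ignore_index:
                continue  # unlabeled ground-truth pixels are skipped
            union[g] += 1
            if p == g:
                inter[g] += 1
            elif p != ignore_index:
                union[p] += 1  # false positive for the predicted class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: three auto-generated names, one of which has no match.
pred_names = ["puppy", "lawn", "sofa"]
target_classes = ["dog", "grass", "cat"]
name_to_target = {"puppy": "dog", "lawn": "grass"}  # assumed mapping
pred_mask = [[0, 0], [1, 2]]
gt_mask = [[0, 0], [1, 1]]

remapped = remap_to_fixed_vocab(pred_mask, pred_names, name_to_target,
                                target_classes)
score = mean_iou(remapped, gt_mask, num_classes=len(target_classes))
```

In this toy case the unmapped name "sofa" counts as a miss for the ground-truth class it covers, which shows why the quality of the name mapping directly affects the reported score.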
Authors: Osman Ülger, Maksymilian Kulicki, Yuki Asano, Martin R. Oswald