Auto-Vocabulary Semantic Segmentation (2312.04539v2)

Published 7 Dec 2023 in cs.CV

Abstract: Open-ended image understanding tasks gained significant attention from the research community, particularly with the emergence of Vision-LLMs. Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, they operate without the need for training or fine-tuning. However, OVS methods typically require users to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach presents a framework that autonomously identifies relevant class names using enhanced BLIP embeddings, which are utilized for segmentation afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop an LLM-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated class names and their corresponding segments. Our method sets new benchmarks on datasets such as PASCAL VOC, PASCAL Context, ADE20K, and Cityscapes for AVS and showcases competitive performance to OVS methods that require specified class names.

Introduction to Semantic Segmentation

Semantic segmentation is a computer vision task that delineates and labels the parts of an image: the goal is to group pixels into meaningful regions corresponding to real-world categories. Traditional models for this task are trained on specific datasets with predefined categories, which limits their ability to recognize new or unexpected object types. Vision-LLMs (VLMs) help overcome this limitation: trained on image-text pairs, they acquire a broad understanding of diverse objects, but integrating them into pixel-level segmentation is challenging because they are not inherently suited to fine-grained, dense prediction.
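To make the pixel-grouping idea concrete, the toy sketch below (illustrative only, not from the paper) shows what a conventional closed-vocabulary segmentation output looks like: a per-pixel map of indices into a fixed label set.

```python
import numpy as np

# Toy illustration of conventional (closed-vocabulary) semantic segmentation:
# every pixel receives an index into a fixed, predefined label set.
VOCAB = ["background", "person", "car"]        # fixed vocabulary, chosen here for illustration

h, w = 4, 6
label_map = np.zeros((h, w), dtype=np.int64)   # all pixels start as "background"
label_map[1:3, 1:3] = VOCAB.index("person")    # a small "person" region
label_map[2:4, 4:6] = VOCAB.index("car")       # a small "car" region

for row in label_map:
    print(" ".join(VOCAB[i][:6].ljust(6) for i in row))
```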

Bridging the Gap with Self-Guided Semantic Segmentation

The paper presents a framework called Self-Guided Semantic Segmentation (Self-Seg) that performs semantic segmentation without any textual input. Traditional segmentation relies on predefined categories or on textual prompts supplied at test time to guide the process. Self-Seg moves beyond these constraints by generating relevant class names automatically from the image itself and then using them for segmentation. It does so by grouping BLIP embeddings into meaningful clusters and detecting class names from those clusters.
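A minimal sketch of the class-name discovery step under the assumptions above: dense image embeddings are clustered into candidate regions, and each cluster is later captioned to obtain nouns. The BLIP-specific pieces are stubbed out with random placeholder features, so this shows only the shape of the procedure, not the paper's exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_patch_embeddings(patch_embeddings: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Group dense image embeddings into regions; each region is later captioned to obtain nouns."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return kmeans.fit_predict(patch_embeddings)      # one cluster id per patch

# Placeholder for dense BLIP features (one vector per image patch).
# In the actual method these would be (enhanced) BLIP embeddings, not random noise.
patches = np.random.randn(196, 768)                  # e.g. 14x14 patches, 768-dim, purely illustrative

cluster_ids = cluster_patch_embeddings(patches, n_clusters=5)
print("patches per cluster:", np.bincount(cluster_ids))
# Each cluster would then be captioned, and nouns extracted from the captions
# form the image-specific vocabulary used for segmentation.
```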

Methodology: From Clustering to Caption Generation

Self-Seg extracts BLIP (Bootstrapping Language-Image Pre-training) embeddings at multiple scales and groups them into clusters. An image-captioning step then describes each cluster, and descriptive nouns are extracted from the resulting captions. These nouns serve as class labels for a pre-trained segmentation model, steering it without any additional training. This clustering-and-captioning sub-procedure is called BLIP-Cluster-Caption (BCC). Finally, the framework includes an evaluation method called LOVE, an LLM-based Open-Vocabulary Evaluator that maps open-vocabulary predictions to dataset-specific class names so they can be scored against fixed ground truth.
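The fragment below sketches the hand-off from cluster captions to the segmenter with toy captions and a naive noun filter; the real pipeline would use a caption model plus a proper part-of-speech or noun-phrase parser, and the `segmenter(...)` call mentioned in the final comment is a placeholder, not the paper's API.

```python
# Toy captions, standing in for BLIP captions generated per embedding cluster.
cluster_captions = [
    "a brown dog lying on the grass",
    "a red car parked near a tree",
    "grass and a tree in the background",
]

# Naive noun filter used only for illustration; the actual method would rely on
# a caption model and a linguistic parser rather than a hand-written stopword list.
STOPWORDS = {"a", "the", "on", "near", "and", "in", "brown", "red", "lying", "parked", "background"}

def extract_nouns(caption: str) -> list[str]:
    return [w for w in caption.lower().split() if w not in STOPWORDS]

# Deduplicated, image-specific vocabulary built from all cluster captions.
auto_vocab = sorted({noun for cap in cluster_captions for noun in extract_nouns(cap)})
print(auto_vocab)  # ['car', 'dog', 'grass', 'tree']

# These class names would then be passed as the text prompt of a pre-trained
# open-vocabulary segmentation model, e.g. segmenter(image, class_names=auto_vocab),
# steering it without any additional training.
```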

Exemplary Results and Contributions

Self-Seg demonstrates its effectiveness on several benchmarks, including PASCAL VOC, ADE20K, and Cityscapes, setting new benchmarks for self-guided open-vocabulary segmentation and performing competitively with methods that rely on predefined textual inputs. Its contributions are threefold: it automates the identification and segmentation of relevant objects, introduces BCC for generating contextually rich captions and class names, and proposes the LOVE evaluator for scoring open-vocabulary segments against fixed ground truth. The release of the code supports transparency, further research, and the development of future self-guided segmentation models.
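As a rough stand-in for the paper's LLM-based LOVE evaluator, the sketch below maps freely generated class names to their semantically closest benchmark classes using sentence-embedding similarity instead of an LLM; the class lists and the embedding model name are illustrative assumptions, not taken from the paper.

```python
from sentence_transformers import SentenceTransformer

# Simplified evaluation mapping: align each auto-generated class name with the
# closest dataset class so standard segmentation metrics (e.g. mIoU) can be computed.
dataset_classes = ["person", "bicycle", "car", "dog", "sky", "road"]   # hypothetical benchmark vocabulary
predicted_names = ["puppy", "automobile", "pedestrian"]                # hypothetical auto-generated names

model = SentenceTransformer("all-MiniLM-L6-v2")        # model choice is illustrative, not from the paper
pred_emb = model.encode(predicted_names, normalize_embeddings=True)
gt_emb = model.encode(dataset_classes, normalize_embeddings=True)

similarity = pred_emb @ gt_emb.T                       # cosine similarity (embeddings are normalized)
mapping = {p: dataset_classes[int(i)] for p, i in zip(predicted_names, similarity.argmax(axis=1))}
print(mapping)   # expected along the lines of {'puppy': 'dog', 'automobile': 'car', 'pedestrian': 'person'}
```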

In summary, Self-Seg is a step toward more autonomous image understanding, with promising results in identifying and delineating the contents of an image without a predefined vocabulary. The approach points to the potential of combining vision and LLMs in new ways to handle tasks that previously required extensive human supervision.

Authors (4)
  1. Osman Ülger
  2. Maksymilian Kulicki
  3. Yuki Asano
  4. Martin R. Oswald