Auto-Vocabulary Semantic Segmentation

Published 7 Dec 2023 in cs.CV | (2312.04539v3)

Abstract: Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, without training or fine-tuning. However, OVS methods typically require a human in the loop to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach, AutoSeg, presents a framework that autonomously identifies relevant class names using semantically enhanced BLIP embeddings and segments them afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a LLM-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated classes and their corresponding segments. With AVS, our method sets new benchmarks on datasets PASCAL VOC, Context, ADE20K, and Cityscapes, while showing competitive performance to OVS methods that require specified class names.

Summary

The paper presents Self-Seg, a framework that auto-generates class names from BLIP embeddings for effective open-vocabulary semantic segmentation.
It employs a novel BLIP-Cluster-Caption method to convert image clusters into descriptive nouns that guide pre-trained segmentation models.
Benchmarked on PASCAL VOC, ADE20K, and Cityscapes, Self-Seg performs competitively with methods that rely on predefined textual inputs.

Introduction to Semantic Segmentation

Semantic segmentation is a computer vision task that assigns every pixel in an image to a category, grouping pixels into regions that correspond to real-world object classes. Traditional models for this task are trained on specific datasets with predefined categories, which limits their ability to recognize new or unexpected object types. Vision-Language Models (VLMs) help overcome this limitation. Because VLMs are trained on image-text pairs, they acquire a broad understanding of many object types. Integrating them into pixel-level segmentation remains challenging, however, because image-level training leaves them poorly suited to fine-grained spatial detail.

Bridging the Gap with Self-Guided Semantic Segmentation

The paper presents a framework called Self-Guided Semantic Segmentation (Self-Seg), which segments images without any direct textual input. Conventional open-vocabulary segmentation relies on predefined categories or on textual instructions supplied at test time to guide the segmentation process. Self-Seg removes this requirement by generating relevant class names automatically from the image itself. It does so by grouping BLIP embeddings of the image into meaningful clusters and deriving class names from those clusters.

Methodology: From Clustering to Caption Generation

Self-Seg computes BLIP (Bootstrapping Language-Image Pre-training) embeddings at different scales and groups them into clusters. An image-captioning step is then applied to the clusters to generate captions, from which descriptive nouns are extracted. These nouns serve as the class labels passed to a pre-trained segmentation model, steering it without any additional training. The paper calls this sub-method BLIP-Cluster-Caption (BCC). Finally, the framework includes an evaluation method called LOVE, an LLM-based Open-Vocabulary Evaluator that maps open-vocabulary predictions to dataset-specific class names.
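The cluster-then-caption idea can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are synthetic stand-ins for BLIP features, `toy_captioner` is a hypothetical stub in place of a real captioning decoder, noun extraction uses a small hand-written lexicon instead of a part-of-speech tagger, and the choice of plain k-means with the cluster counts in `ks` is an assumption made for illustration.

```python
import numpy as np


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of `features`."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points (fancy indexing copies).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    labels = np.zeros(len(features), dtype=np.int64)
    for _ in range(iters):
        # Squared distance from every feature to every center: shape (N, k).
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


def extract_nouns(caption, known_nouns):
    """Stand-in for a POS tagger: keep only words found in a noun lexicon."""
    return [w for w in caption.lower().split() if w in known_nouns]


def auto_vocabulary(embeddings, caption_fn, known_nouns, ks=(2,)):
    """Cluster embeddings, caption each cluster, and collect unique nouns."""
    vocabulary = []
    for k in ks:  # hypothetical: one clustering per cluster count
        labels = kmeans(embeddings, k)
        for j in range(k):
            mask = labels == j
            if not mask.any():
                continue
            caption = caption_fn(embeddings[mask])
            for noun in extract_nouns(caption, known_nouns):
                if noun not in vocabulary:
                    vocabulary.append(noun)
    return vocabulary


def toy_captioner(cluster_embeddings):
    """Hypothetical stub; a real system would decode text from the embeddings."""
    return "a dog on the grass" if cluster_embeddings.mean() > 0 else "a blue sky"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two well-separated synthetic "regions" of 20 patch embeddings each.
    emb = np.concatenate([rng.normal(-5.0, 0.1, (20, 4)),
                          rng.normal(5.0, 0.1, (20, 4))])
    print(auto_vocabulary(emb, toy_captioner, {"dog", "grass", "sky"}))
```

The resulting noun list would then be handed to a pre-trained open-vocabulary segmenter as its class names. That step is omitted here because it depends on the specific segmentation model.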

Exemplary Results and Contributions

Self-Seg is evaluated on several benchmarks (PASCAL VOC, ADE20K, and Cityscapes), where it sets new standards for self-guided open-vocabulary segmentation. It also performs competitively with methods that are given predefined textual inputs. The framework makes three contributions: it automates the identification and segmentation of relevant objects, it introduces BCC for generating contextually rich captions, and it proposes the LOVE evaluator for scoring open-vocabulary segments. The released code supports reproducibility and further research on self-guided semantic segmentation models.
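The role of the evaluator can be sketched as follows: automatically generated nouns are first mapped to the dataset's class names, and only then is the prediction scored against the ground truth. In this sketch the mapping `noun_to_class` is a hard-coded dictionary standing in for the LLM's output, and mean intersection-over-union is assumed as the metric. The function names, class lists, and toy masks are all hypothetical.

```python
import numpy as np


def remap_to_dataset_classes(pred_mask, auto_vocab, noun_to_class,
                             dataset_classes, ignore_index=255):
    """Turn indices into `auto_vocab` into indices into `dataset_classes`."""
    lookup = np.full(len(auto_vocab), ignore_index, dtype=np.int64)
    for i, noun in enumerate(auto_vocab):
        target = noun_to_class.get(noun)  # in the paper, an LLM supplies this
        if target in dataset_classes:
            lookup[i] = dataset_classes.index(target)
    # Nouns without a valid mapping stay at ignore_index.
    return lookup[pred_mask]


def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        union = np.logical_or(pred == c, gt == c).sum()
        if union == 0:
            continue  # class absent from both masks
        inter = np.logical_and(pred == c, gt == c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0


if __name__ == "__main__":
    dataset_classes = ["background", "dog", "grass"]
    auto_vocab = ["puppy", "lawn", "sky"]  # nouns the model came up with
    noun_to_class = {"puppy": "dog", "lawn": "grass", "sky": "background"}

    pred_mask = np.array([[0, 0, 2],
                          [1, 1, 2]])      # indices into auto_vocab
    gt_mask = np.array([[1, 1, 0],
                        [2, 1, 0]])        # indices into dataset_classes

    remapped = remap_to_dataset_classes(pred_mask, auto_vocab,
                                        noun_to_class, dataset_classes)
    print(mean_iou(remapped, gt_mask, len(dataset_classes)))
```

Remapping before scoring means a prediction such as "puppy" is not penalized merely for using a different word than the dataset's "dog".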

In summary, Self-Seg is a step toward more autonomous image understanding, with promising results in segmenting images without a predefined vocabulary. The approach illustrates how vision and language models can be combined to perform tasks that previously required a human to specify the categories of interest.
