An Academic Overview of Open-Vocabulary Semantic Segmentation with Mask-Adapted CLIP
This paper explores open-vocabulary semantic segmentation through a novel adaptation of the Contrastive Language–Image Pre-training (CLIP) model. The central challenge it addresses is segmenting an image into semantic regions described by arbitrary text queries, including categories that never appeared in the training set.
The authors begin by identifying the performance bottleneck in existing two-stage approaches to this task. These methods first generate class-agnostic mask proposals and then use a pre-trained CLIP model to classify each masked region. However, the paper demonstrates that naively applying pre-trained CLIP to masked images performs poorly because of a domain gap: CLIP is trained mainly on natural images, whereas the masked images fed to the classifier contain large blanked-out areas that CLIP has never encountered.
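The two-stage pipeline can be sketched as follows. The `embed_image` and `embed_text` callables are hypothetical stand-ins for CLIP's image and text encoders, and the masking logic is illustrative, not the paper's actual implementation:

```python
def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def classify_masked_regions(image, masks, class_names, embed_image, embed_text):
    """Stage 2 of a two-stage pipeline: classify each class-agnostic
    mask proposal by encoding the masked image and comparing it with
    text embeddings of the candidate class names.

    image: H x W x C nested lists; masks: list of H x W binary maps.
    embed_image / embed_text: hypothetical CLIP-encoder stand-ins."""
    text_embs = [embed_text(name) for name in class_names]
    labels = []
    for mask in masks:
        # Zero out pixels outside the proposal. These blanked regions
        # are exactly the input distribution vanilla CLIP never saw.
        masked = [[[ch * m for ch in px] for px, m in zip(px_row, m_row)]
                  for px_row, m_row in zip(image, mask)]
        feat = embed_image(masked)
        scores = [cosine(feat, t) for t in text_embs]
        labels.append(class_names[scores.index(max(scores))])
    return labels
```

Each proposal is assigned the class name whose text embedding scores highest, which is where the domain gap bites: the quality of `embed_image` on heavily masked inputs bounds the whole system.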
To overcome this bottleneck, the paper proposes several key innovations:
- CLIP Model Finetuning: The authors adapt CLIP to masked images by fine-tuning it on masked image regions paired with text drawn from existing image-caption datasets. This aligns the model with its new input distribution while preserving its open-vocabulary capabilities.
- Dataset Collection Strategy: A major contribution is the method of dataset creation: existing datasets such as COCO Captions are mined to pair masked image regions with text, yielding region-text training pairs for CLIP without manual annotation. This offers greater class diversity than human-labeled segmentation data and helps maintain generalization capacity.
- Mask Prompt Tuning: The paper also introduces mask prompt tuning, which replaces the zero tokens that masked-out patches would otherwise produce with learnable prompt tokens. This markedly improves CLIP's ability to process masked images without modifying any CLIP weights, and it yields additional gains even when the model is fully fine-tuned.
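The token substitution behind mask prompt tuning can be sketched minimally. Using one learnable prompt vector per patch position is an illustrative choice here; the paper's exact parameterization may differ:

```python
def apply_mask_prompts(patch_tokens, visible, prompt_tokens):
    """Mask prompt tuning (sketch): a ViT tokenizer turns fully masked
    patches into zero tokens; here those zeros are swapped for learnable
    prompt tokens so the transformer receives informative inputs.

    patch_tokens: per-patch embedding vectors.
    visible[i]: True if patch i contains foreground pixels.
    prompt_tokens: learnable vectors, one per patch position
    (an assumption for this sketch)."""
    return [tok if vis else prompt
            for tok, vis, prompt in zip(patch_tokens, visible, prompt_tokens)]
```

Because only the prompt tokens need gradients, this can improve masked-image classification while leaving every CLIP weight frozen, and, as the paper notes, it still composes with full fine-tuning.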
The experimental results demonstrate the effectiveness of the proposed approach. Trained on COCO and evaluated on ADE20K-150, the best-performing model attains a mean Intersection over Union (mIoU) of 29.6%, an improvement of 8.5 mIoU points over prior leading methods. This shows that open-vocabulary generalist models can match the results of supervised specialist models from 2017, without dataset-specific adaptations.
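The evaluation metric itself can be computed as follows over flattened label maps. This is a standard mIoU definition, not code from the paper:

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union: per-class IoU between predicted
    and ground-truth label maps (flattened into sequences of class
    ids), averaged over classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Averaging per class rather than per pixel is what makes the metric sensitive to rare categories, which matters for open-vocabulary benchmarks such as ADE20K-150.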
The research carries implications in both theoretical and practical domains. Theoretically, the paper offers insights into the adaptability and versatility of vision-language models when extended to tasks beyond their conventional scope. Practically, stronger open-vocabulary segmentation opens the door to more adaptable and robust applications across diverse fields such as autonomous driving, smart surveillance systems, and advanced human-computer interaction interfaces.
Moreover, the authors suggest that this progression in model adaptability could catalyze further innovations in multi-task scenarios where model weights are shared across different tasks. In a broader scope, this research highlights a promising path forward for improving AI systems’ ability to perform complex, multimodal reasoning across tasks.
In summary, this paper makes a significant empirical and methodological contribution to computer vision, providing concrete strategies for the prevalent challenges in open-vocabulary semantic segmentation. The results underscore the value of adaptive techniques such as targeted fine-tuning and prompt tuning for equipping large pre-trained models to handle domain-specific inputs, and they set a strong baseline for subsequent research at the convergence of vision and language.