Open-Vocabulary Semantic Segmentation via a Two-Stage Approach
The paper "A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-LLM" presents a novel approach to addressing open-vocabulary semantic segmentation. This approach utilizes a vision-LLM called CLIP and adopts a two-stage architecture to enhance generality and adaptability in semantic segmentation tasks, particularly those requiring classification of unseen categories. The authors propose a framework that departs from the widely used fully convolutional network (FCN) paradigm and advocates a two-phase segmentation process: mask proposal extraction followed by semantic classification of these proposals.
Methodology and Approach
The paper addresses the granularity mismatch between the image-level understanding provided by models like CLIP and the pixel-level predictions required for semantic segmentation. The two-stage design uses a mask-based segmentation model such as MaskFormer to generate class-agnostic mask proposals, which are then categorized with CLIP, whose vision-language alignment comes from pre-training on a large corpus of image-text pairs.
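A minimal sketch of this two-stage pipeline is shown below, assuming class-agnostic binary masks are already available from a proposal network such as MaskFormer. The crop_and_mask helper and the "a photo of a {}" prompt template are illustrative choices, not taken verbatim from the paper.

```python
# Sketch of the two-stage pipeline: class-agnostic masks -> CLIP classification.
# Not the authors' code; assumes OpenAI's open-source CLIP package and
# pre-computed boolean mask proposals from a model such as MaskFormer.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def crop_and_mask(image: Image.Image, mask: np.ndarray) -> Image.Image:
    """Crop the image to the mask's bounding box and zero out background pixels."""
    arr = np.array(image)
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    region = arr[y0:y1, x0:x1].copy()
    region[~mask[y0:y1, x0:x1]] = 0
    return Image.fromarray(region)

def classify_proposals(image, masks, class_names):
    """Assign an open-vocabulary label to each class-agnostic mask proposal."""
    # Encode the candidate class names once with CLIP's text encoder.
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(prompts)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    labels = []
    for mask in masks:  # each mask is an HxW boolean array
        region = preprocess(crop_and_mask(image, mask)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(region)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity between the region embedding and each class embedding.
        sims = (img_feat @ text_feat.T).squeeze(0)
        labels.append(class_names[int(sims.argmax())])
    return labels
```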
Several methods for generating mask proposals are compared, including GPB-UCM, Selective Search, and MaskFormer; MaskFormer is chosen as the default because it performs best. CLIP is then used in two ways to classify the masks: directly applying the pre-trained model, or training a new vision encoder against CLIP's frozen text encoder. Ensembling the predictions of the two strategies yields the best performance.
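One plausible way to fuse the two sets of class scores is a weighted geometric mean, sketched below; the fusion rule and the coefficient lam are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def ensemble_scores(probs_pretrained: torch.Tensor,
                    probs_retrained: torch.Tensor,
                    lam: float = 0.5) -> torch.Tensor:
    """Fuse per-mask class probabilities from the two CLIP-based classifiers.

    Both inputs have shape (num_masks, num_classes) and sum to 1 per row.
    lam weights the pre-trained CLIP branch (an assumed value, not from the paper).
    """
    fused = probs_pretrained.clamp_min(1e-8) ** lam \
          * probs_retrained.clamp_min(1e-8) ** (1.0 - lam)
    return fused / fused.sum(dim=-1, keepdim=True)  # renormalize per mask
```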
Results and Numerical Highlights
The proposed method is evaluated in both zero-shot and cross-dataset settings. In the zero-shot setting, where models are trained only on seen classes and tested on unseen categories, the framework reaches 37.8 hIoU (the harmonic mean of mIoU over seen and unseen classes) on the challenging COCO Stuff dataset. It outperforms previous state-of-the-art zero-shot methods by margins of 29.5 hIoU on Pascal VOC 2012 and 8.9 hIoU on COCO Stuff, demonstrating strong generalization.
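For reference, hIoU is simply the harmonic mean of the mIoU computed separately over seen and unseen classes; a minimal helper:

```python
def harmonic_iou(miou_seen: float, miou_unseen: float) -> float:
    """hIoU: harmonic mean of seen-class and unseen-class mIoU,
    the standard aggregate metric in zero-shot segmentation benchmarks."""
    if miou_seen + miou_unseen == 0:
        return 0.0
    return 2 * miou_seen * miou_unseen / (miou_seen + miou_unseen)
```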
In cross-dataset tests, such as training on COCO Stuff and evaluating on Cityscapes, Pascal Context, and ADE20k, the two-stage approach clearly outperforms the FCN baseline. On Cityscapes, for instance, the model attains an mIoU of 34.5, compared to 21.4 for the FCN approach.
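mIoU itself can be computed from a per-dataset confusion matrix; the sketch below uses the standard definition and is not specific to the paper.

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    """mIoU from a (num_classes x num_classes) confusion matrix
    with ground-truth classes on rows and predicted classes on columns."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as class c, but wrong
    fn = conf.sum(axis=1) - tp          # class-c pixels predicted as something else
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))       # ignore classes absent from both GT and predictions
```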
Implications and Future Directions
The framework's simplicity and effectiveness make it a solid baseline for extending large-scale pre-trained vision-language models like CLIP to tasks beyond image-level classification. The work offers insight into how such models can be adapted to open-vocabulary scenarios, where exhaustive labeling is impractical. By showing how CLIP's image-level knowledge can be transferred to pixel-level predictions, the approach invites further exploration of semantic understanding across domains, which is crucial for advances in autonomous systems and interactive AI.
Future work could focus on more efficient mask proposal generation, richer text prompts for classification, and better balancing of seen and unseen classes in new datasets. Broader granularity-adaptation techniques could also extend vision-language models to more detailed image analysis tasks.