Decoupling Zero-Shot Semantic Segmentation (2112.07910v2)

Published 15 Dec 2021 in cs.CV

Abstract: Zero-shot semantic segmentation (ZS3) aims to segment the novel categories that have not been seen in the training. Existing works formulate ZS3 as a pixel-level zero-shot classification problem, and transfer semantic knowledge from seen classes to unseen ones with the help of language models pre-trained only with texts. While simple, the pixel-level ZS3 formulation shows the limited capability to integrate vision-language models that are often pre-trained with image-text pairs and currently demonstrate great potential for vision tasks. Inspired by the observation that humans often perform segment-level semantic labeling, we propose to decouple the ZS3 into two sub-tasks: 1) a class-agnostic grouping task to group the pixels into segments. 2) a zero-shot classification task on segments. The former task does not involve category information and can be directly transferred to group pixels for unseen classes. The latter task performs at segment-level and provides a natural way to leverage large-scale vision-language models pre-trained with image-text pairs (e.g. CLIP) for ZS3. Based on the decoupling formulation, we propose a simple and effective zero-shot semantic segmentation model, called ZegFormer, which outperforms the previous methods on ZS3 standard benchmarks by large margins, e.g., 22 points on the PASCAL VOC and 3 points on the COCO-Stuff in terms of mIoU for unseen classes. Code will be released at https://github.com/dingjiansw101/ZegFormer.

Decoupling Zero-Shot Semantic Segmentation: ZegFormer

The paper introduces ZegFormer, a model designed to address zero-shot semantic segmentation (ZS3) through a novel decoupling approach. ZS3 is the task of segmenting object categories that the model has not encountered during training. Traditional approaches treat ZS3 as a pixel-level classification problem, transferring semantic knowledge from seen to unseen classes via language models pre-trained only on text. However, these methods struggle to integrate large-scale vision-language models such as CLIP, which are pre-trained with image-text pairs.

ZegFormer proposes a paradigm shift by decoupling the ZS3 task into two sub-tasks: a class-agnostic grouping task and a segment-level zero-shot classification task. This decoupling makes it possible to exploit pre-trained large-scale vision-language models more effectively and aligns more closely with human semantic labeling, which is generally performed at the segment level rather than the pixel level.

Decoupling Strategy

  1. Class-Agnostic Grouping: This sub-task clusters pixels into segments without involving any category-specific information. Techniques rooted in classical image grouping can be applied, and because no category labels are needed, the grouping transfers directly to unseen classes.
  2. Segment-Level Zero-Shot Classification: Once pixels are grouped into segments, this task assigns a semantic label to each segment. It leverages vision-language models like CLIP, which are pre-trained on image-text pairs, to align segment-level visual features with textual class embeddings (see the sketch after this list).
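
To make the second sub-task concrete, the snippet below is a minimal sketch of segment-level zero-shot classification with CLIP. It is not ZegFormer's exact pipeline; it simply crops each class-agnostic segment, encodes the crop with CLIP's image encoder, and compares it against text embeddings of candidate class names. The class names, the prompt template, and the `classify_segment` helper are illustrative assumptions; the open-source `clip` package from OpenAI is assumed to be installed.

```python
# Hedged sketch of segment-level zero-shot classification with CLIP.
# Not ZegFormer's exact architecture: it only illustrates the second
# sub-task, scoring class-agnostic segment crops against text embeddings.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate classes may include categories never seen during training.
class_names = ["cat", "sofa", "potted plant"]          # hypothetical example
prompts = [f"a photo of a {c}" for c in class_names]   # assumed prompt template
text_tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    text_feats = model.encode_text(text_tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def classify_segment(crop):
    """Assign the most similar class name to one segment crop (a PIL image)."""
    image = preprocess(crop).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ text_feats.T).squeeze(0)  # cosine similarities
    return class_names[sims.argmax().item()]
```

Because the text embeddings cover arbitrary class names, the same scoring step works for unseen categories without any retraining of the grouping model.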

By adopting this strategy, ZegFormer outperforms existing methods on the standard ZS3 benchmarks by large margins: it improves the mean Intersection-over-Union (mIoU) for unseen classes by 22 points on PASCAL VOC and by 3 points on COCO-Stuff, a notable gain in the zero-shot learning domain.
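
For reference, the unseen-class mIoU quoted above is the per-class Intersection-over-Union averaged over the held-out categories. The function below is a hedged sketch of that computation; `pred` and `gt` are integer label maps of equal shape, and `unseen_ids` is a hypothetical list of the class indices excluded from training.

```python
# Hedged sketch of the unseen-class mIoU metric reported in the paper.
import numpy as np

def unseen_miou(pred: np.ndarray, gt: np.ndarray, unseen_ids) -> float:
    """Mean IoU over the unseen classes only."""
    ious = []
    for c in unseen_ids:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```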

Implications and Future Directions

The implications of this research are multifaceted. Practically, ZegFormer's method allows for more robust semantic segmentation models that can classify unseen classes without retraining, broadening the applicability of such models in real-time, dynamic environments. Theoretically, this work opens the door for further exploration of segmentation models that use similar decoupling methodologies. Moreover, the successful integration of vision-language models hints at future developments where cross-modal pre-training is leveraged across various computer vision tasks, not just segmentation.

ZegFormer also sets a precedent for exploring segmentation models that prioritize segment-level over pixel-level analysis, which may lead to more natural and interpretable models aligning with human perception. As vision-language pre-training continues to advance, ZegFormer-style approaches may extend the state-of-the-art across several domains, potentially converging tasks like object detection and semantic segmentation into a unified framework.

In conclusion, ZegFormer introduces a compelling approach to ZS3, showcasing the power of segment-level analysis and setting a new benchmark for future research in zero-shot learning and vision-language model integration in computer vision.

Authors (4)
  1. Jian Ding (132 papers)
  2. Nan Xue (61 papers)
  3. Gui-Song Xia (139 papers)
  4. Dengxin Dai (99 papers)
Citations (171)