SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation (2211.14813v2)

Published 27 Nov 2022 in cs.CV and cs.AI

Abstract: Recently, the contrastive language-image pre-training, e.g., CLIP, has demonstrated promising results on various downstream tasks. The pre-trained model can capture enriched visual concepts for images by learning from a large scale of text-image data. However, transferring the learned visual knowledge to open-vocabulary semantic segmentation is still under-explored. In this paper, we propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation in an annotation-free manner. The SegCLIP achieves segmentation based on ViT and the main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs. The gathering operation can dynamically capture the semantic groups, which can be used to generate the final segmentation results. We further propose a reconstruction loss on masked patches and a superpixel-based KL loss with pseudo-labels to enhance the visual representation. Experimental results show that our model achieves comparable or superior segmentation accuracy on the PASCAL VOC 2012 (+0.3% mIoU), PASCAL Context (+2.3% mIoU), and COCO (+2.2% mIoU) compared with baselines. We release the code at https://github.com/ArrowLuo/SegCLIP.

Authors (5)
  1. Huaishao Luo (12 papers)
  2. Junwei Bao (34 papers)
  3. Youzheng Wu (32 papers)
  4. Xiaodong He (162 papers)
  5. Tianrui Li (86 papers)
Citations (120)

Summary

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

The paper presents SegCLIP, a model designed to tackle open-vocabulary semantic segmentation by leveraging the CLIP (Contrastive Language-Image Pre-training) framework. SegCLIP combines a Vision Transformer (ViT) backbone with a semantic group module that gathers image patches into semantic regions around learnable centers. The significance of this design lies in bypassing traditional annotation-heavy pipelines: segmentation is learned from image-text pairs alone, without any pixel-level annotations.

Model Architecture and Training

SegCLIP builds on a dual-encoder design comprising a text encoder and an image encoder. Unlike previous models that rely on segmentation decoders or mask-proposal frameworks, SegCLIP introduces a semantic group module within the image encoder. This module uses a series of cross-attention layers in which learnable centers attend to and aggregate patch tokens into broader semantic regions, transforming patch-level representations into structured segments. The approach preserves the CLIP training paradigm while extending it to dense, pixel-level prediction.
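
The following is a minimal PyTorch sketch of such a cross-attention grouping step; the class name, number of centers, and dimensions are illustrative assumptions rather than the released implementation.

    import torch
    import torch.nn as nn

    class SemanticGroupSketch(nn.Module):
        """Sketch: learnable centers gather ViT patch tokens into regions
        via cross-attention (all hyperparameters here are assumptions)."""

        def __init__(self, dim: int = 768, num_centers: int = 8, num_heads: int = 8):
            super().__init__()
            # Learnable center embeddings act as queries over the patch tokens.
            self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)

        def forward(self, patch_tokens: torch.Tensor):
            # patch_tokens: (B, N, dim) from the ViT image encoder.
            b = patch_tokens.size(0)
            queries = self.norm_q(self.centers).unsqueeze(0).expand(b, -1, -1)
            kv = self.norm_kv(patch_tokens)
            # Centers attend to patches; the attention weights form a soft
            # patch-to-center mapping that can be hardened into masks.
            centers, mapping = self.cross_attn(queries, kv, kv)
            return centers, mapping  # (B, C, dim), (B, C, N)

    # Usage: group 196 patch tokens (a 14x14 grid) into 8 candidate regions.
    regions, mapping = SemanticGroupSketch()(torch.randn(2, 196, 768))

Treating the centers as queries fixes the number of candidate regions in advance while letting their content adapt to each image.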

The training of SegCLIP combines multiple loss functions that strengthen the visual representation. A reconstruction loss, akin to that used in Masked Autoencoders (MAE), recovers masked patches and reinforces the contextual integrity of visual features. A superpixel-based KL divergence loss encourages the patch-to-center mapping to be consistent within superpixel regions obtained from unsupervised segmentation, which serve as pseudo-labels. These auxiliary losses are combined with the standard contrastive loss, and the model is trained on image-text datasets such as Conceptual Captions and COCO, yielding strong segmentation capability without any direct segmentation labels.
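
A hedged schematic of how these three objectives might be combined in PyTorch; the function name, loss weights, and the exact form of the pseudo-label targets are assumptions, not values taken from the paper.

    import torch
    import torch.nn.functional as F

    def combined_loss(image_emb, text_emb, recon_pred, recon_target,
                      mapping_logits, superpixel_targets,
                      w_recon=1.0, w_kl=1.0, temperature=0.07):
        """Schematic sum of contrastive, reconstruction, and KL objectives."""
        # 1) Symmetric image-text contrastive loss, as in CLIP.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        loss_con = (F.cross_entropy(logits, labels) +
                    F.cross_entropy(logits.t(), labels)) / 2

        # 2) MAE-style reconstruction loss on masked patches.
        loss_recon = F.mse_loss(recon_pred, recon_target)

        # 3) KL loss pulling the patch-to-center mapping toward distributions
        #    derived from unsupervised superpixels (assumed to be given as
        #    per-patch probability vectors over the centers).
        log_probs = F.log_softmax(mapping_logits, dim=-1)
        loss_kl = F.kl_div(log_probs, superpixel_targets, reduction="batchmean")

        return loss_con + w_recon * loss_recon + w_kl * loss_kl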

Experimental Validation

Experimental evaluations demonstrate SegCLIP's competitiveness on standard semantic segmentation benchmarks: PASCAL VOC 2012, PASCAL Context, and COCO. SegCLIP improves over baseline methods by +0.3% mIoU on VOC, +2.3% on Context, and +2.2% on COCO. Because segments are labeled by matching region features against text embeddings of category names, the model adapts to arbitrary category sets, offering a versatile framework for segmentation tasks beyond curated datasets.
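
For concreteness, a minimal sketch of such open-vocabulary region labeling with CLIP-style embeddings; the function, dimensions, and random stand-in tensors are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def label_regions(region_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        """Assign each pooled region the class of its most similar text embedding."""
        region_features = F.normalize(region_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        # Cosine similarity between regions and class-name embeddings; the label
        # set can change freely at inference time without retraining.
        return (region_features @ text_features.t()).argmax(dim=-1)

    # Toy usage with random stand-ins for the encoders' outputs.
    regions = torch.randn(8, 512)   # e.g., 8 regions from the grouping module
    classes = torch.randn(4, 512)   # e.g., embeddings of 4 arbitrary class prompts
    print(label_regions(regions, classes))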

Implications and Future Directions

Practically, SegCLIP's approach removes the dependency on large labeled datasets, moving toward scalable, label-efficient semantic segmentation powered by text-image pre-training. This efficiency opens pathways for deploying semantic segmentation in domains where labeled data are hard to obtain. Theoretically, SegCLIP points to a trajectory in which closed-set vision tasks become open-vocabulary problems by coupling visual understanding with language models.

Future work could improve SegCLIP by reducing the patch size, which would yield more precise segment boundaries. End-to-end training mechanisms and post-pretraining on larger datasets could further strengthen SegCLIP's capabilities. This work contributes to ongoing research on vision-language models, offering a scalable architecture that balances theoretical innovation with practical applicability.