- The paper introduces CLIP-ES, which repurposes the CLIP model to perform weakly supervised semantic segmentation, reducing annotation costs.
- It combines softmax-enhanced GradCAM, sharpness-based prompt selection, synonym fusion, a class-aware attention-based affinity module, and a confidence-guided loss to generate and exploit higher-quality pseudo masks.
- The framework reaches state-of-the-art mIoU on benchmarks like PASCAL VOC and MS COCO while cutting computational cost by more than 10x relative to multi-stage baselines, underscoring its practical impact.
Review of "CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation"
The paper "CLIP is Also an Efficient Segmenter" introduces a novel approach to Weakly Supervised Semantic Segmentation (WSSS) by leveraging the Contrastive Language-Image Pre-training (CLIP) model. The significance of this work lies in its framework, termed CLIP-ES, which harnesses the unique capabilities of CLIP, originally designed for zero-shot image classification, to localize and segment objects in images using only image-level labels. This provides an opportunity to sidestep the extensive annotation costs associated with fully supervised semantic segmentation.
Key Contributions
The paper highlights several innovative strategies developed to optimize the application of CLIP in the WSSS context:
- Softmax-Enhanced GradCAM:
- The authors integrate a softmax over the class scores into GradCAM so that categories compete with one another. Backpropagating softmax probabilities instead of raw logits is shown to significantly reduce class confusion in the resulting CAMs, particularly in separating target objects from co-occurring categories and background (a minimal sketch appears after this list).
- Text-Driven Strategies:
- Sharpness-Based Prompt Selection: candidate prompts are ranked by a sharpness measure over the class-score distributions they induce, which the authors use as a proxy for CAM quality in the multi-label WSSS setting.
- Synonym Fusion: each class name is expanded with a set of synonyms whose text embeddings are fused, enriching the semantic description of the class and leading to more accurate CAMs (see the embedding-fusion sketch below).
- Class-Aware Attention-Based Affinity Module (CAA):
- The CAA module refines CAMs on the fly, without additional training, by reusing the multi-head self-attention (MHSA) maps already computed by CLIP's ViT encoder. It bridges the gap between class-agnostic attention affinities and class-specific CAMs (a simplified propagation step is sketched after this list).
- Confidence-Guided Loss (CGL):
- This strategy ignores low-confidence pixels of the pseudo-labels when training the final segmentation model, mitigating pseudo-label noise and improving segmentation accuracy (a loss sketch follows the list).
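To make the softmax-enhanced GradCAM idea concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation: the function name, tensor shapes, and the assumption that the class scores remain differentiable with respect to the visual feature maps are illustrative only.

```python
import torch
import torch.nn.functional as F

def softmax_gradcam(feature_maps, logits, target_class):
    """Sketch of GradCAM where gradients flow through a softmax over classes.

    feature_maps: (C, H, W) activations from the visual encoder (part of the
                  computation graph that produced `logits`).
    logits:       (K,) image-text similarity scores for the K candidate classes.
    """
    # Backpropagating the softmax probability (rather than the raw logit)
    # makes classes compete: gradients of rival classes suppress each other.
    probs = F.softmax(logits, dim=0)
    grads = torch.autograd.grad(probs[target_class], feature_maps,
                                retain_graph=True)[0]
    weights = grads.mean(dim=(1, 2))                      # GAP over spatial dims
    cam = torch.relu((weights[:, None, None] * feature_maps).sum(dim=0))
    return cam / (cam.max() + 1e-8)                       # normalize to [0, 1]
```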
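Synonym fusion can be illustrated with the public OpenAI `clip` package by averaging the normalized text embeddings of a class's synonyms into a single classifier weight. The synonym lists and prompt template below are hypothetical placeholders, not the paper's actual vocabulary.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Illustrative synonym lists; the paper defines its own per-class vocabulary.
synonyms = {
    "person": ["person", "human", "people"],
    "sofa":   ["sofa", "couch"],
}

def fused_text_embedding(class_name, template="a photo of a {}."):
    # Encode every synonym with the same prompt template, normalize, and
    # average the text features into one embedding per class.
    prompts = clip.tokenize([template.format(s) for s in synonyms[class_name]]).to(device)
    with torch.no_grad():
        feats = model.encode_text(prompts)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    fused = feats.mean(dim=0)
    return fused / fused.norm()
```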
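The core idea of attention-based affinity refinement can be sketched as propagating CAM values along (symmetrized, normalized) patch-to-patch attention weights. This is a simplified illustration under assumed shapes; the paper's full CAA module additionally makes the affinity class-aware by restricting propagation to regions indicated by the CAM, which is not shown here.

```python
import torch

def caa_refine(cam, attn, iterations=2):
    """Sketch of refining a CAM with ViT self-attention affinities.

    cam:  (H*W,) flattened class activation map for one class.
    attn: (H*W, H*W) MHSA weights averaged over heads, class token removed.
    """
    # Symmetrize so affinity is mutual, then row-normalize so each step is a
    # weighted average over patches the ViT already deems similar.
    affinity = (attn + attn.t()) / 2
    affinity = affinity / (affinity.sum(dim=-1, keepdim=True) + 1e-8)
    refined = cam
    for _ in range(iterations):
        refined = affinity @ refined   # propagate activations along affinities
    return refined
```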
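Finally, a confidence-guided loss can be approximated as a cross-entropy that simply masks out low-confidence pseudo-labeled pixels. The threshold value and the source of the confidence map below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = 255  # conventional "ignore" label in segmentation datasets

def confidence_guided_loss(seg_logits, pseudo_labels, confidence, threshold=0.7):
    """Sketch of a confidence-guided cross-entropy (threshold is illustrative).

    seg_logits:    (B, K, H, W) predictions of the segmentation network.
    pseudo_labels: (B, H, W)    hard labels derived from the refined CAMs.
    confidence:    (B, H, W)    per-pixel confidence of the pseudo labels,
                                e.g. the max CAM score at each pixel.
    """
    # Low-confidence pixels get the ignore index so they contribute no
    # gradient, limiting the noise injected by imperfect pseudo-masks.
    targets = pseudo_labels.clone()
    targets[confidence < threshold] = IGNORE_INDEX
    return F.cross_entropy(seg_logits, targets, ignore_index=IGNORE_INDEX)
```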
Numerical Results and Implications
The proposed framework's efficacy is evidenced by state-of-the-art performance on established benchmarks, achieving 73.8% and 73.9% mIoU on the PASCAL VOC 2012 validation and test sets respectively, and outperforming prior works on the MS COCO 2014 dataset with 45.4% mIoU. Importantly, CLIP-ES operates with a significant reduction in computational cost, achieving over 10x efficiency gains compared to traditional multi-stage approaches.
The introduction of CLIP-ES demonstrates a remarkable reduction in the number of training stages typically required for WSSS. By transferring a frozen CLIP model to semantic segmentation without fine-tuning it on the target datasets, this work not only highlights the flexibility of contrastively pre-trained models but also sets a precedent for future work on leveraging large-scale pre-trained models in novel ways.
Future Directions
The adaptability and efficiency of CLIP-ES suggest new pathways for developing WSSS models that generalize across diverse datasets and object classes. Future research can further explore:
- Extending the framework to other pre-trained models with similar or potentially greater alignment capabilities.
- Investigating the integration of more complex natural language inputs to refine segmentation outputs further.
- Developing techniques to incorporate dynamic background suppression tailored to specific domain applications.
In conclusion, the integration of language-driven strategies with WSSS, as championed by this paper, underscores the growing intersection of language understanding and visual recognition, paving the way for more sophisticated and resource-efficient computer vision solutions.