
ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation (2212.03588v3)

Published 7 Dec 2022 in cs.CV

Abstract: Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler-and-efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level. Our investigation starts with a straightforward extension as our baseline that generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm could heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple-but-effective designs and figure out that they can significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP achieves a speedup of about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.

An Analysis of ZegCLIP: Adapting CLIP for Zero-shot Semantic Segmentation

The paper, "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation," introduces ZegCLIP, an adaptation of the CLIP (Contrastive Language–Image Pre-training) model for pixel-level zero-shot semantic segmentation. The work is motivated by the cost of semantic segmentation, a fundamental computer-vision task that assigns a category to every pixel in an image and typically relies on substantial annotated data.

Methodology Overview

The paper addresses the limitations of applying CLIP to pixel-level tasks with a one-stage design, in contrast to previously adopted two-stage methods such as zsseg and ZegFormer. Those strategies first generate class-agnostic region proposals and then perform zero-shot classification on each cropped proposal with CLIP, which, though effective, incurs high computational cost and a complicated pipeline.

ZegCLIP bypasses this complexity by directly extending CLIP's inherent zero-shot prediction capability from the image level to the pixel level in a single stage: semantic masks are generated by matching CLIP's text embeddings against its patch embeddings, without a separate proposal-generation phase. The key challenge is overfitting to the seen classes, which the paper resolves through three designs (a minimal sketch of how they fit together follows the list):

  1. Deep Prompt Tuning (DPT): Instead of fully fine-tuning CLIP's image encoder, the encoder is kept frozen and learnable prompt tokens are inserted into its transformer layers, retaining the model's zero-shot capability and mitigating overfitting.
  2. Non-mutually Exclusive Loss (NEL): Rather than applying a softmax over the seen classes, predictions are scored with per-class sigmoid-style objectives so that classes are treated independently, which facilitates generalization to unseen classes.
  3. Relationship Descriptor (RD): Each class's text embedding is combined with the image-level embedding from CLIP before being matched against patch embeddings, aiding robust generalization across classes.
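
The following PyTorch-style sketch shows how these pieces could plug together in a one-stage decoder: per-class relationship descriptors are matched against CLIP patch embeddings to produce per-pixel logits, which are then trained with a per-class (non-mutually exclusive) objective. The fusion used in `relationship_descriptors`, the `proj` layer, and the plain binary cross-entropy stand-in are illustrative assumptions rather than the released implementation; the authors' repository contains the actual code.

```python
import torch
import torch.nn.functional as F


def relationship_descriptors(text_emb, cls_emb):
    """Fuse per-class text embeddings with the image-level embedding.

    text_emb: (C, D) CLIP text embeddings, one per class prompt.
    cls_emb:  (D,)   CLIP image [CLS] embedding for the current image.
    Returns:  (C, 2D) descriptors. Concatenating the element-wise product
    with the raw text embedding is one plausible fusion, used here only
    for illustration.
    """
    fused = text_emb * cls_emb.unsqueeze(0)          # (C, D)
    return torch.cat([fused, text_emb], dim=-1)      # (C, 2D)


def patch_text_masks(patch_emb, descriptors, proj):
    """Score every patch against every class descriptor.

    patch_emb:   (N, D)  patch embeddings from the frozen, prompt-tuned CLIP ViT.
    descriptors: (C, 2D) relationship descriptors.
    proj:        nn.Linear(2*D, D), an assumed lightweight projection back
                 to the patch-embedding space.
    Returns:     (C, N) per-class, per-patch logits; reshaped to the patch grid
                 and upsampled, these become the semantic masks.
    """
    q = F.normalize(proj(descriptors), dim=-1)       # (C, D)
    k = F.normalize(patch_emb, dim=-1)               # (N, D)
    return q @ k.t()                                 # (C, N)


def non_mutually_exclusive_loss(logits, target_onehot):
    """Per-class sigmoid objective instead of a softmax over seen classes,
    so unseen classes are not implicitly suppressed during training.
    A plain binary cross-entropy stands in for the paper's combined loss.
    target_onehot: float tensor with the same shape as logits."""
    return F.binary_cross_entropy_with_logits(logits, target_onehot)
```

In ZegCLIP itself the CLIP backbone stays frozen; only the per-layer prompt tokens and a lightweight decoder are trained on the seen classes, which is what lets the text-patch matching transfer to unseen class names at test time.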

Empirical Evaluation

Comprehensive experiments on three public benchmarks—PASCAL VOC 2012, COCO-Stuff 164K, and PASCAL Context—show that ZegCLIP outperforms existing methods in both the "inductive" and "transductive" zero-shot settings, with particularly strong gains on unseen classes. Quantitatively, ZegCLIP improves mIoU by a large margin over prior state of the art and runs roughly five times faster at inference than its two-stage counterparts, underscoring its efficiency.

Implications and Future Directions

Practically, ZegCLIP's efficient one-stage design promises significant computational savings, making it a favorable option for real-time applications where computing resources are limited. Theoretically, the paper's insights into preserving zero-shot capability through prompt tuning and a non-mutually exclusive loss suggest pathways for integrating pre-trained vision-language models into other dense prediction tasks.

Looking forward, this work opens avenues for further improving the generalization of vision-language pre-trained models such as CLIP across diverse computer vision applications. Its successful matching of text and image embeddings at the pixel level could inform future efforts to leverage large pre-trained models for complex vision tasks without extensive data annotation.

In conclusion, ZegCLIP exemplifies a methodological pivot that intelligently harnesses CLIP's pre-trained knowledge, marking a step forward in zero-shot semantic segmentation research. The design principles it introduces could serve as a blueprint for future enhancements within the broader field of zero-shot learning in AI.

References (58)
  1. Exploiting a joint embedding space for generalized zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9536–9545, 2021.
  2. Language models are few-shot learners. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020.
  3. Zero-shot semantic segmentation. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 32, 2019.
  4. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014.
  5. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834–848, 2017.
  6. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  7. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2613–2622, 2021.
  8. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1290–1299, 2022.
  9. Per-pixel classification is not all you need for semantic segmentation. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 34:17864–17875, 2021.
  10. SIGN: Spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9556–9566, 2021.
  11. MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
  12. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11583–11592, 2022.
  13. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  14. Zero-shot out-of-distribution detection based on the pretrained model clip. In Proceedings of the AAAI conference on artificial intelligence (AAAI), 2022.
  15. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9716–9725, 2021.
  16. Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12094–12103, 2022.
  17. Context-aware feature generation for zero-shot semantic segmentation. In Proceedings of the ACM International Conference on Multimedia (ACMMM), pages 1921–1929, 2020.
  18. On pre-trained image features and synthetic images for deep learning. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
  19. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4904–4916, 2021.
  20. Visual prompt tuning. arXiv preprint arXiv:2203.12119, 2022.
  21. Finetuning pretrained vision-language models with correlation information bottleneck for robust visual question answering. arXiv preprint arXiv:2209.06954, 2022.
  22. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  23. Dice loss for data-imbalanced nlp tasks. arXiv preprint arXiv:1911.02855, 2019.
  24. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
  25. Focal loss for dense object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
  26. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 82–92, 2019.
  27. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the Association for Computational Linguistics (ACL), pages 61–68, 2022.
  28. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2125–2134, 2021.
  29. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
  30. Video object segmentation with episodic graph memory networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 661–679. Springer, 2020.
  31. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8364–8375, 2022.
  32. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.
  33. Segmentation in style: Unsupervised semantic image segmentation with stylegan and clip. arXiv preprint arXiv:2107.12518, 2021.
  34. A closer look at self-training for zero-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2693–2702, 2021.
  35. Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599, 2021.
  36. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763, 2021.
  37. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18082–18091, 2022.
  38. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 234–241, 2015.
  39. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7262–7272, 2021.
  40. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
  41. Unsupervised semantic segmentation by contrasting object mask proposals. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10052–10062, 2021.
  42. Bridging pre-trained models and downstream tasks for source code understanding. In Proceedings of the International Conference on Software Engineering (ICSE), pages 287–298, 2022.
  43. Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11686–11695, 2022.
  44. Semantic projection network for zero-and few-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8256–8265, 2019.
  45. Segformer: Simple and efficient design for semantic segmentation with transformers. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 34:12077–12090, 2021.
  46. Scale-aware graph neural network for few-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5475–5484, 2021.
  47. Class-aware visual prompt tuning for vision-language pre-trained model. arXiv preprint arXiv:2208.08340, 2022.
  48. Leveraging auxiliary tasks with affinity learning for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6984–6993, 2021.
  49. A simple baseline for zero-shot semantic segmentation with pre-trained vision-language model. arXiv preprint arXiv:2112.14757, 2021.
  50. CPT: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797, 2021.
  51. Segvit: Semantic segmentation with plain vision transformers. arXiv preprint arXiv:2210.05844, 2022.
  52. Context encoding for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7151–7160, 2018.
  53. Topformer: Token pyramid transformer for mobile semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12083–12093, 2022.
  54. Rethinking dice loss for medical image segmentation. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 851–860. IEEE, 2020.
  55. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6881–6890, 2021.
  56. Extract free dense labels from CLIP. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  57. Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 130(9):2337–2348, 2022.
  58. A unified efficient pyramid transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2667–2677, 2021.
Authors (5)
  1. Ziqin Zhou (12 papers)
  2. Bowen Zhang (161 papers)
  3. Yinjie Lei (30 papers)
  4. Lingqiao Liu (113 papers)
  5. Yifan Liu (134 papers)
Citations (134)