SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation (2211.14813v2)

Published 27 Nov 2022 in cs.CV and cs.AI

Abstract: Recently, the contrastive language-image pre-training, e.g., CLIP, has demonstrated promising results on various downstream tasks. The pre-trained model can capture enriched visual concepts for images by learning from a large scale of text-image data. However, transferring the learned visual knowledge to open-vocabulary semantic segmentation is still under-explored. In this paper, we propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation in an annotation-free manner. The SegCLIP achieves segmentation based on ViT and the main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs. The gathering operation can dynamically capture the semantic groups, which can be used to generate the final segmentation results. We further propose a reconstruction loss on masked patches and a superpixel-based KL loss with pseudo-labels to enhance the visual representation. Experimental results show that our model achieves comparable or superior segmentation accuracy on the PASCAL VOC 2012 (+0.3% mIoU), PASCAL Context (+2.3% mIoU), and COCO (+2.2% mIoU) compared with baselines. We release the code at https://github.com/ArrowLuo/SegCLIP.

Authors (5)
  1. Huaishao Luo (12 papers)
  2. Junwei Bao (34 papers)
  3. Youzheng Wu (32 papers)
  4. Xiaodong He (162 papers)
  5. Tianrui Li (86 papers)
Citations (120)

Summary

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

The paper presents SegCLIP, a model designed to tackle open-vocabulary semantic segmentation by leveraging the CLIP (Contrastive Language-Image Pre-training) framework. SegCLIP combines a Vision Transformer (ViT) backbone with a semantic group module that gathers image patches into semantic regions around learnable centers. The significance of this design lies in bypassing traditional annotation-heavy pipelines: segmentation is learned from image-text pairs alone, without any pixel-level annotations.

Model Architecture and Training

SegCLIP builds on a dual-encoder design comprising a text encoder and an image encoder. Unlike previous models that rely on segmentation decoders or mask-proposal frameworks, SegCLIP introduces a semantic group module within the image encoder. This module uses a series of cross-attention layers in which learnable centers attend to and aggregate patch tokens into broader semantic regions, transforming patch-level representations into structured segments. The approach preserves the CLIP training paradigm while extending it to dense, pixel-level prediction.
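
The following is a minimal PyTorch sketch of such a cross-attention grouping step; the class name, number of centers, and dimensions are illustrative assumptions rather than the released implementation.

    import torch
    import torch.nn as nn

    class SemanticGroupSketch(nn.Module):
        """Sketch: learnable centers gather ViT patch tokens into regions
        via cross-attention (all hyperparameters here are assumptions)."""

        def __init__(self, dim: int = 768, num_centers: int = 8, num_heads: int = 8):
            super().__init__()
            # Learnable center embeddings act as queries over the patch tokens.
            self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)

        def forward(self, patch_tokens: torch.Tensor):
            # patch_tokens: (B, N, dim) from the ViT image encoder.
            b = patch_tokens.size(0)
            queries = self.norm_q(self.centers).unsqueeze(0).expand(b, -1, -1)
            kv = self.norm_kv(patch_tokens)
            # Centers attend to patches; the attention weights form a soft
            # patch-to-center mapping that can be hardened into masks.
            centers, mapping = self.cross_attn(queries, kv, kv)
            return centers, mapping  # (B, C, dim), (B, C, N)

    # Usage: group 196 patch tokens (a 14x14 grid) into 8 candidate regions.
    regions, mapping = SemanticGroupSketch()(torch.randn(2, 196, 768))

Treating the centers as queries fixes the number of candidate regions in advance while letting their content adapt to each image.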

The training of SegCLIP combines multiple loss functions that strengthen the visual representation. A reconstruction loss, akin to that used in Masked Autoencoders (MAE), recovers masked patches and reinforces the contextual integrity of visual features. A superpixel-based KL divergence loss encourages the patch-to-center mapping to be consistent within superpixel regions obtained from unsupervised segmentation, which serve as pseudo-labels. These auxiliary losses are combined with the standard contrastive loss, and the model is trained on image-text datasets such as Conceptual Captions and COCO, yielding strong segmentation capability without any direct segmentation labels.
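
A hedged schematic of how these three objectives might be combined in PyTorch; the function name, loss weights, and the exact form of the pseudo-label targets are assumptions, not values taken from the paper.

    import torch
    import torch.nn.functional as F

    def combined_loss(image_emb, text_emb, recon_pred, recon_target,
                      mapping_logits, superpixel_targets,
                      w_recon=1.0, w_kl=1.0, temperature=0.07):
        """Schematic sum of contrastive, reconstruction, and KL objectives."""
        # 1) Symmetric image-text contrastive loss, as in CLIP.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature
        labels = torch.arange(logits.size(0), device=logits.device)
        loss_con = (F.cross_entropy(logits, labels) +
                    F.cross_entropy(logits.t(), labels)) / 2

        # 2) MAE-style reconstruction loss on masked patches.
        loss_recon = F.mse_loss(recon_pred, recon_target)

        # 3) KL loss pulling the patch-to-center mapping toward distributions
        #    derived from unsupervised superpixels (assumed to be given as
        #    per-patch probability vectors over the centers).
        log_probs = F.log_softmax(mapping_logits, dim=-1)
        loss_kl = F.kl_div(log_probs, superpixel_targets, reduction="batchmean")

        return loss_con + w_recon * loss_recon + w_kl * loss_kl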

Experimental Validation

Experimental evaluations demonstrate SegCLIP's competitiveness on standard semantic segmentation benchmarks: PASCAL VOC 2012, PASCAL Context, and COCO. SegCLIP improves over baseline methods by +0.3% mIoU on VOC, +2.3% on Context, and +2.2% on COCO. Because segments are labeled by matching region features against text embeddings of category names, the model adapts to arbitrary category sets, offering a versatile framework for segmentation tasks beyond curated datasets.
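
For concreteness, a minimal sketch of such open-vocabulary region labeling with CLIP-style embeddings; the function, dimensions, and random stand-in tensors are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def label_regions(region_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        """Assign each pooled region the class of its most similar text embedding."""
        region_features = F.normalize(region_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        # Cosine similarity between regions and class-name embeddings; the label
        # set can change freely at inference time without retraining.
        return (region_features @ text_features.t()).argmax(dim=-1)

    # Toy usage with random stand-ins for the encoders' outputs.
    regions = torch.randn(8, 512)   # e.g., 8 regions from the grouping module
    classes = torch.randn(4, 512)   # e.g., embeddings of 4 arbitrary class prompts
    print(label_regions(regions, classes))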

Implications and Future Directions

Practically, SegCLIP's approach removes the dependency on large labeled datasets, moving toward scalable, label-efficient semantic segmentation powered by text-image pre-training. This efficiency opens pathways for deploying semantic segmentation in domains where labeled data are hard to obtain. Theoretically, SegCLIP points to a trajectory in which closed-set vision tasks become open-vocabulary problems by coupling visual understanding with language models.

Future work could improve SegCLIP by reducing the patch size, which would yield more precise segment boundaries. End-to-end training mechanisms and post-pretraining on larger datasets could further strengthen SegCLIP's capabilities. This work contributes to ongoing research on vision-language models, offering a scalable architecture that balances theoretical innovation with practical applicability.