SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation (2211.14813v2)
Abstract: Recently, contrastive language-image pre-training, e.g., CLIP, has demonstrated promising results on various downstream tasks. By learning from large-scale text-image data, the pre-trained model can capture rich visual concepts. However, transferring this learned visual knowledge to open-vocabulary semantic segmentation remains under-explored. In this paper, we propose a CLIP-based model named SegCLIP for open-vocabulary segmentation in an annotation-free manner. SegCLIP builds on ViT, and its main idea is to gather patches into semantic regions through learnable centers, trained on text-image pairs. This gathering operation dynamically captures semantic groups, which are used to generate the final segmentation results. We further propose a reconstruction loss on masked patches and a superpixel-based KL loss with pseudo-labels to enhance the visual representation. Experimental results show that our model achieves comparable or superior segmentation accuracy to baselines on PASCAL VOC 2012 (+0.3% mIoU), PASCAL Context (+2.3% mIoU), and COCO (+2.2% mIoU). We release the code at https://github.com/ArrowLuo/SegCLIP.
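The core gathering step is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration of the idea: learnable center embeddings attend over ViT patch tokens, and a hard Gumbel-softmax assignment (a differentiable discretization trick; Jang et al., 2017, cited in the references below) maps each patch to a center, producing region-level features. All names here (`GatherCenters`, `num_centers`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatherCenters(nn.Module):
    """Sketch: gather ViT patch tokens into regions via learnable centers."""

    def __init__(self, dim: int = 512, num_centers: int = 8):
        super().__init__()
        # Learnable center embeddings that aggregate patches into regions.
        self.centers = nn.Parameter(torch.randn(num_centers, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, patches: torch.Tensor):
        # patches: (B, N, D) patch embeddings from the ViT backbone.
        B = patches.size(0)
        q = self.to_q(self.centers).unsqueeze(0).expand(B, -1, -1)  # (B, C, D)
        k, v = self.to_k(patches), self.to_v(patches)               # (B, N, D)
        logits = torch.einsum("bcd,bnd->bcn", q, k) * self.scale    # (B, C, N)
        # Hard assignment of each patch to one center; Gumbel-softmax keeps
        # the discrete choice differentiable during training.
        assign = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=1)
        # Region features: assignment-weighted mean of the patch values.
        regions = torch.einsum("bcn,bnd->bcd", assign, v)
        regions = regions / assign.sum(-1, keepdim=True).clamp(min=1e-6)
        return regions, assign  # (B, C, D) regions, (B, C, N) assignment map

# Example: 196 patch tokens from a ViT-B/16 on a 224x224 image.
feats = torch.randn(2, 196, 512)
regions, assign = GatherCenters()(feats)
print(regions.shape, assign.shape)  # (2, 8, 512), (2, 8, 196)
```

In the paper's setup, the gathered regions are matched against text embeddings to produce open-vocabulary labels, and the patch-to-center assignment map then acts as the segmentation mask over the patch grid; the sketch above covers only the gathering step.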
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pp. 1708–1718, 2021.
- Zero-shot semantic segmentation. In NeurIPS, 2019.
- Emerging properties in self-supervised vision transformers. In ICCV, pp. 9630–9640, 2021.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pp. 3558–3568, 2021.
- Transformer interpretability beyond attention visualization. In CVPR, pp. 782–791, 2021.
- Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
- Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
- An empirical study of training self-supervised vision transformers. In ICCV, pp. 9620–9629, 2021.
- UNITER: universal image-text representation learning. In ECCV, volume 12375, pp. 104–120, 2020.
- Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, pp. 17864–17875, 2021.
- Masked-attention mask transformer for universal image segmentation. In CVPR, pp. 1290–1299, 2022.
- The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223, 2016.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186, 2019.
- Decoupling zero-shot semantic segmentation. In CVPR, pp. 11573–11582, 2022a.
- Open-vocabulary panoptic segmentation with MaskCLIP. arXiv preprint arXiv:2208.08984, 2022b.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis., 88(2):303–338, 2010.
- Efficient graph-based image segmentation. Int. J. Comput. Vis., 59(2):167–181, 2004.
- Vision-language pre-training: Basics, recent advances, and future trends. arXiv preprint arXiv:2210.09263, 2022.
- Scaling open-vocabulary image segmentation with image-level labels. arXiv preprint arXiv:2112.12143, 2021.
- Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
- Masked autoencoders are scalable vision learners. In CVPR, pp. 15979–15988, 2022.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020.
- OneFormer: One transformer to rule universal image segmentation. arXiv preprint arXiv:2211.06220, 2022.
- Categorical reparameterization with Gumbel-Softmax. In ICLR, 2017.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, volume 139, pp. 4904–4916, 2021.
- ViLT: Vision-and-language transformer without convolution or region supervision. In ICML, volume 139, pp. 5583–5594, 2021.
- Language-driven semantic segmentation. In ICLR, 2022a.
- BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, volume 162, pp. 12888–12900, 2022b.
- LAVENDER: unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022c.
- Grounded language-image pre-training. In CVPR, pp. 10955–10965, 2022d.
- UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In ACL/IJCNLP, pp. 2592–2607, 2021.
- Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In ICLR, 2022e.
- Open-vocabulary semantic segmentation with mask-adapted CLIP. arXiv preprint arXiv:2210.04150, 2022.
- Microsoft COCO: common objects in context. In ECCV, volume 8693, pp. 740–755, 2014.
- Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440, 2015.
- Image segmentation using text and image prompts. In CVPR, pp. 7076–7086, 2022.
- UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
- Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:2210.15138, 2022.
- HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, pp. 2630–2640, 2019.
- The role of context for object detection and semantic segmentation in the wild. In CVPR, pp. 891–898, 2014.
- Learning transferable visual models from natural language supervision. In ICML, volume 139, pp. 8748–8763, 2021.
- DenseCLIP: Language-guided dense prediction with context-aware prompting. In CVPR, pp. 18061–18070, 2022.
- U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241, 2015.
- LAION-400M: open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pp. 618–626, 2017.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. 2556–2565, 2018.
- VideoBERT: A joint model for video and language representation learning. In ICCV, pp. 7463–7472, 2019.
- LXMERT: learning cross-modality encoder representations from transformers. In EMNLP, 2019.
- YFCC100M: the new data in multimedia research. Commun. ACM, 59(2):64–73, 2016.
- Training data-efficient image transformers & distillation through attention. In ICML, volume 139, pp. 10347–10357, 2021.
- Attention is all you need. In NeurIPS, pp. 5998–6008, 2017.
- BEVT: BERT pretraining of video transformers. In CVPR, pp. 14733–14743, 2022a.
- SimVLM: Simple visual language model pretraining with weak supervision. In ICLR, 2022b.
- Self-supervised visual representation learning with semantic grouping. In NeurIPS, 2022.
- PhraseCut: Language-based image segmentation in the wild. In CVPR, pp. 10213–10222, 2020.
- Semantic projection network for zero- and few-label semantic segmentation. In CVPR, pp. 8256–8265, 2019.
- SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, pp. 12077–12090, 2021.
- GroupViT: Semantic segmentation emerges from text supervision. In CVPR, pp. 18134–18144, 2022a.
- A simple baseline for open vocabulary semantic segmentation with pre-trained vision-language model. In ECCV, 2022b.
- FILIP: fine-grained interactive language-image pre-training. In ICLR, 2022.
- Semantic segmentation in-the-wild without seeing any segmentation examples. arXiv preprint arXiv:2112.03185, 2021.
- Multi-grained vision language pre-training: Aligning texts with visual concepts. In ICML, volume 162, pp. 25994–26009, 2022.
- Pyramid scene parsing network. In CVPR, pp. 6230–6239, 2017.
- Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
- RegionCLIP: Region-based language-image pretraining. In CVPR, pp. 16772–16782, 2022.
- Scene parsing through ADE20K dataset. In CVPR, pp. 633–641, 2017.
- Extract free dense labels from CLIP. In ECCV, 2022a.
- iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2022b.
Authors: Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, Tianrui Li