SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance (2311.16241v1)
Abstract: In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confusing classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs) are able to learn diverse semantic knowledge from image-caption datasets but produce noisy segmentation due to the image-level training. In SemiVL, we propose to integrate rich priors from VLM pre-training into semi-supervised semantic segmentation to learn better semantic decision boundaries. To adapt the VLM from global to local reasoning, we introduce a spatial fine-tuning strategy for label-efficient learning. Further, we design a language-guided decoder to jointly reason over vision and language. Finally, we propose to handle inherent ambiguities in class labels by providing the model with language guidance in the form of class definitions. We evaluate SemiVL on four semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods. For instance, SemiVL improves the state of the art by +13.5 mIoU on COCO with 232 annotated images and by +6.1 mIoU on Pascal VOC with 92 labels. Project page: https://github.com/google-research/semivl
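The gains above are reported in mIoU (mean intersection over union, in percentage points). As a reading aid, the metric can be sketched as follows. This is a minimal illustration and not the paper's evaluation code: the function name `mean_iou`, the flat label sequences, and the `ignore_index=255` convention are assumptions made for the example.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: int = 255,  # assumed "void" label; not taken from the paper
) -> float:
    """Return mIoU in percent over flat sequences of per-pixel class ids."""
    intersection = [0] * num_classes
    union = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count toward any class
        if p == t:
            # Pixel is in both the predicted and the true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Pixel is a miss for the true class t ...
            union[t] += 1
            # ... and a false positive for the predicted class p.
            if 0 <= p < num_classes:
                union[p] += 1

    # Average IoU over classes that occur in the prediction or ground truth.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return 100.0 * sum(ious) / len(ious) if ious else float("nan")


# Four pixels, two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Passing the concatenated pixels of a whole validation set accumulates intersections and unions across images before averaging over classes, which is how benchmark mIoU is usually computed.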