SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance (2311.16241v1)

Published 27 Nov 2023 in cs.CV

Abstract: In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confuse classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs) are able to learn diverse semantic knowledge from image-caption datasets but produce noisy segmentation due to the image-level training. In SemiVL, we propose to integrate rich priors from VLM pre-training into semi-supervised semantic segmentation to learn better semantic decision boundaries. To adapt the VLM from global to local reasoning, we introduce a spatial fine-tuning strategy for label-efficient learning. Further, we design a language-guided decoder to jointly reason over vision and language. Finally, we propose to handle inherent ambiguities in class labels by providing the model with language guidance in the form of class definitions. We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods. For instance, SemiVL improves the state-of-the-art by +13.5 mIoU on COCO with 232 annotated images and by +6.1 mIoU on Pascal VOC with 92 labels. Project page: https://github.com/google-research/semivl


Summary

  • The paper introduces SemiVL, a framework that combines VLM pretraining with spatial fine-tuning to improve segmentation with limited labels.
  • It employs a language-guided decoder and dense CLIP predictions to enhance spatial reasoning and semantic accuracy.
  • Empirical results show mIoU gains of up to +13.5 on COCO and +9.7 on ADE20K, indicating its potential in resource-constrained settings.

An Academic Review of "SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance"

The paper, "SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance," discusses an innovative approach to enhancing semi-supervised semantic segmentation by integrating the rich semantic priors from Vision-LLMs (VLMs). The fundamental challenge addressed in this research is the annotation scarcity in semantic segmentation, which traditionally requires extensive and costly manual labeling. The SemiVL framework is posited as a solution by leveraging both a limited set of labeled images and an extensive collection of unlabeled images, supplemented by vision-language guidance to improve segmentation quality.

Methodological Innovations

SemiVL introduces several methodological contributions:

  1. Vision-Language Model Pre-Training: The paper proposes initializing the semantic segmentation model from a pre-trained VLM, specifically CLIP. This departs from traditional ImageNet pre-training by capitalizing on the broad semantic understanding the VLM gains from training on web-scale image-text datasets.
  2. Spatial Fine-Tuning: Recognizing that VLMs are trained with image-level objectives, the authors selectively update only the parameters of the network layers responsible for spatial reasoning. This enhances the feature localization needed for dense semantic segmentation while retaining the rich semantic representations from pre-training (a minimal sketch of this selective freezing, together with the decoder's similarity computation, follows this list).
  3. Language-Guided Decoder: This novel decoder architecture utilizes similarity maps between vision and text embeddings to engage in both spatial and semantic reasoning. Such a design effectively exploits text-guided semantics for segmentation, potentially improving the model's ability to distinguish between visually similar classes and contexts.
  4. Dense CLIP Guidance: Training on unlabeled images is anchored by dense predictions from a frozen CLIP model, mitigating the prediction drift caused by the confirmation bias inherent in self-training (a simplified pseudo-labeling sketch also follows this list).
  5. Class Definition Guidance: The framework utilizes language-based class definitions, such as those found in dataset-specific annotation guidelines, to guide model interpretations, offering an innovative way to embed domain-specific knowledge into the training process.
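
To make the spatial fine-tuning and language-guided decoding ideas concrete, here is a minimal PyTorch sketch: it freezes every backbone parameter except the attention layers and computes the vision-text similarity maps a language-guided decoder would reason over. The "attn" name match, tensor shapes, and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: spatial fine-tuning + language-guided similarity maps.
# Assumes a CLIP-style ViT backbone whose attention submodules contain
# "attn" in their parameter names (common in ViT implementations, but an
# assumption here), and pre-computed patch/text embeddings.
import torch
import torch.nn.functional as F

def freeze_for_spatial_finetuning(backbone: torch.nn.Module) -> None:
    """Update only the attention (spatial-reasoning) layers; keep the
    semantically rich MLP/embedding weights frozen at their VLM values."""
    for name, param in backbone.named_parameters():
        param.requires_grad = "attn" in name

def language_guided_similarity(patch_feats: torch.Tensor,
                               text_feats: torch.Tensor) -> torch.Tensor:
    """patch_feats: (B, N, D) per-patch image embeddings.
    text_feats: (C, D) one embedding per class, e.g. averaged over several
    prompts or class definitions. Returns (B, N, C) cosine-similarity maps
    for the decoder to jointly reason over vision and language."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return patch_feats @ text_feats.t()
```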

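Dense CLIP guidance can be sketched in the same spirit: below, a frozen CLIP model's dense predictions are averaged with the student's predictions before confidence thresholding, so pseudo-labels on unlabeled images stay anchored to the VLM prior. The equal-weight averaging, threshold, and ignore index are assumptions for illustration, not the paper's exact formulation.

```python
import torch

@torch.no_grad()
def clip_anchored_pseudo_labels(student_logits: torch.Tensor,
                                clip_logits: torch.Tensor,
                                thresh: float = 0.9,
                                ignore_index: int = 255) -> torch.Tensor:
    """student_logits, clip_logits: (B, C, H, W) dense class logits on an
    unlabeled image; clip_logits come from a frozen CLIP model, so the
    pseudo-labels cannot drift freely with the student (mitigating
    confirmation bias)."""
    probs = 0.5 * (student_logits.softmax(dim=1) + clip_logits.softmax(dim=1))
    conf, labels = probs.max(dim=1)        # (B, H, W) confidence and class
    labels[conf < thresh] = ignore_index   # drop low-confidence pixels
    return labels
```
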
Empirical Results

The research demonstrates significant improvements over existing methods across several datasets: Pascal VOC, COCO, ADE20K, and Cityscapes. Notably, SemiVL achieves an improvement of up to +13.5 mIoU on COCO and +9.7 mIoU on ADE20K, highlighting its robust performance under varying levels of annotation availability. These results underscore the advantages of integrating vision-language guidance, especially in scenarios with severely limited labeled data.

Implications and Future Directions

The implications of this research are manifold. Practically, it suggests a pathway toward reducing the dependency on annotated datasets, which could facilitate deploying semantic segmentation models in resource-constrained environments. Theoretically, the work bridges the gap between language-driven and vision-driven approaches, suggesting new avenues for the hybrid integration of multi-modal models in computer vision tasks.

Looking to the future, further refinement of vision-language models could enhance the capabilities of semi-supervised frameworks like SemiVL. For instance, integrating more sophisticated language understanding or moving toward end-to-end trainable mixed-modal architectures might open new directions in semantic segmentation and related vision tasks.

Conclusion

The SemiVL framework showcases a well-conceived and experimentally validated approach to semi-supervised semantic segmentation, making a case for wider adoption of vision-language models in this domain. By leveraging the rich semantic priors of VLM pre-training, SemiVL advances model performance where labeled data is sparse, contributing substantively to the field's ongoing work on resource-efficient machine learning.