
Collaborating Foundation Models for Domain Generalized Semantic Segmentation (2312.09788v2)

Published 15 Dec 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Domain Generalized Semantic Segmentation (DGSS) deals with training a model on a labeled source domain with the aim of generalizing to unseen domains during inference. Existing DGSS methods typically effectuate robust features by means of Domain Randomization (DR). Such an approach is often limited as it can only account for style diversification and not content. In this work, we take an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In detail, CLOUDS is a framework that integrates FMs of various kinds: (i) CLIP backbone for its robust feature representation, (ii) generative models to diversify the content, thereby covering various modes of the possible target distribution, and (iii) Segment Anything Model (SAM) for iteratively refining the predictions of the segmentation model. Extensive experiments show that our CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under varying weather conditions, notably outperforming prior methods by 5.6% and 6.7% on average mIoU, respectively. The code is available at: https://github.com/yasserben/CLOUDS


Summary

  • The paper introduces the CLOUDS framework that collaboratively leverages CLIP, Diffusion Models, and SAM to address domain generalized semantic segmentation challenges.
  • It employs Diffusion Models for content diversification and SAM for iterative mask refinement, offering a comprehensive alternative to style-only augmentation.
  • Experiments show up to 6.7% mIoU improvement over traditional methods, underscoring the framework's effectiveness for real-world applications.

An Analysis of CLOUDS: Robust Domain Generalized Semantic Segmentation

This essay critically evaluates the paper "Collaborating Foundation Models for Domain Generalized Semantic Segmentation," which introduces the CLOUDS framework. The framework addresses Domain Generalized Semantic Segmentation (DGSS): training a model on a labeled source domain so that it generalizes to unseen target domains at inference time. DGSS matters because it mitigates the performance drops caused by domain shift, a common failure mode of deep neural networks in semantic segmentation.

The CLOUDS framework diverges from traditional DGSS approaches, which predominantly apply Domain Randomization (DR) focused on style variations. Instead of limiting augmentation to stylistic changes, CLOUDS leverages a coalition of Foundation Models (FMs) for more holistic generalization: CLIP for feature extraction, Diffusion Models for content diversification, and the Segment Anything Model (SAM) for iterative refinement of segmentation outputs.
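The collaboration between the three foundation models can be sketched as a single training step. The sketch below is purely schematic: every function is a hypothetical stand-in (none of these names come from the paper's released code), and strings stand in for real tensors.

```python
# Schematic of one CLOUDS-style training step. All functions are
# hypothetical placeholders, not the paper's actual API.

def llm_prompt(classes):
    """Stand-in for an LLM proposing a diverse scene description."""
    return f"a street scene with a {classes[0]} at night"

def diffusion_generate(prompt):
    """Stand-in for a text-to-image diffusion model."""
    return {"image": f"synthetic[{prompt}]"}

def segmenter_predict(image):
    """Stand-in for the CLIP-backed segmentation model."""
    return {"mask": f"coarse-mask({image})"}

def sam_refine(image, mask):
    """Stand-in for SAM turning a coarse mask into a pseudo-label."""
    return f"refined({mask})"

def training_step(classes):
    prompt = llm_prompt(classes)                        # (ii) content diversification
    sample = diffusion_generate(prompt)
    pred = segmenter_predict(sample["image"])           # (i) CLIP-based segmenter
    pseudo = sam_refine(sample["image"], pred["mask"])  # (iii) SAM refinement
    return pseudo  # used as a training target for the segmenter

pseudo_label = training_step(["car", "bus"])
```

The key design point this illustrates is the division of labor: the generative models expand the training distribution on the input side, while SAM improves supervision on the label side, so the segmenter benefits from both.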

Core Components and Methodology

  1. CLIP Backbone: The framework uses CLIP, a contrastively trained vision-language model, as a backbone to capitalize on its robust feature representations. This choice exploits CLIP's strength in capturing generalized features that transfer to unseen domains.
  2. Diffusion Model for Content Diversification: Recognizing the limitations of style-based diversification alone, CLOUDS employs Diffusion Models to generate variations in image content. Textual prompts generated by LLMs guide the diffusion process, enhancing the diversity of synthetic datasets.
  3. SAM for Refinement: SAM is used to refine segmentation predictions iteratively. The segmentation model's outputs serve as geometric prompts to SAM, which returns class-agnostic masks that are used to correct and sharpen the predictions.
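The refinement step above can be illustrated with a much-simplified, runnable analogue: given a noisy per-pixel prediction and a set of class-agnostic region masks (the role SAM's masks play), assign each region its majority-vote class. This is a didactic stand-in, not the paper's exact refinement rule.

```python
import numpy as np

def refine_with_regions(coarse_pred, region_masks):
    """Assign each class-agnostic region the majority class from the
    coarse prediction -- a simplified analogue of SAM-based refinement,
    where region_masks play the role of SAM's prompted output masks."""
    refined = coarse_pred.copy()
    for mask in region_masks:
        labels, counts = np.unique(coarse_pred[mask], return_counts=True)
        refined[mask] = labels[np.argmax(counts)]  # majority vote per region
    return refined

# Toy example: a noisy 4x4 prediction and one region covering the left half.
coarse = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 1],
                   [0, 0, 1, 1],
                   [1, 0, 1, 1]])
region = np.zeros((4, 4), dtype=bool)
region[:, :2] = True  # one class-agnostic mask, e.g. from SAM
refined = refine_with_regions(coarse, [region])
```

After refinement the two stray labels inside the left-half region are overwritten by the region's majority class, while pixels outside the region are untouched; this captures the intuition that SAM's masks impose spatial coherence on the segmenter's noisy predictions.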

Empirical Results

The effectiveness of CLOUDS is illustrated through extensive experiments on DGSS benchmarks, covering both synthetic-to-real transfer and varying weather conditions. CLOUDS outperforms existing methods by 5.6% and 6.7% in average mean Intersection over Union (mIoU) on these two settings, respectively.

The results suggest that CLOUDS effectively utilizes synthetic source data for robust training, showcasing marked improvements over both traditional DGSS methods and open-vocabulary segmentation models like FC-CLIP. This advancement points to the efficacy of collaborative Foundation Models in enhancing semantic segmentation across domains.

Implications and Future Directions

CLOUDS exemplifies a significant step forward in domain generalization for semantic segmentation by integrating and exploiting the strengths of multiple FMs. This composition not only indicates a methodological advancement in DGSS but also suggests a paradigm where hybrid models may address broader challenges posed by varying domain characteristics.

Practically, the CLOUDS framework holds potential applications in scenarios requiring domain-agnostic semantic segmentation, such as autonomous driving, where models must adapt to diverse real-world landscapes and conditions without prior exposure. Furthermore, the integration of multiple FMs, as demonstrated, could inspire broader applications and implementations in AI domains requiring robustness against domain shifts.

Future developments could explore enhancing model efficiency and scalability, especially considering the computational resources demanded by integrating multiple Foundation Models. Additionally, expanding the scope of training datasets or leveraging real-time data could further strengthen the model's applicability and precision in dynamically changing environments.

In summary, the CLOUDS framework advances the field of semantic segmentation by effectively leveraging multiple Foundation Models to achieve domain generalization. This approach broadens the horizon for future research and potential applications, addressing core challenges associated with domain shifts in practical scenarios.
