
Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation (2312.04265v5)

Published 7 Dec 2023 in cs.CV

Abstract: In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Driven by the motivation of Leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines and forwards the feature maps from each layer to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves an mIoU of 78.4% on Cityscapes, without accessing any real urban-scene datasets. Code is available at https://github.com/w1oves/Rein.git.


Summary

  • The paper introduces Rein, a parameter-efficient fine-tuning strategy that adapts vision foundation models to enhance segmentation across unseen domains.
  • The study benchmarks several VFMs, including CLIP, MAE, SAM, EVA02, and DINOv2, demonstrating superior generalization with fewer trainable parameters.
  • Experimental results on datasets like GTAV→Cityscapes, BDD100K, and Mapillary confirm notable mIoU gains and effective trade-offs between model complexity and performance.

Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

The paper presents a novel approach to tackling the challenges of Domain Generalized Semantic Segmentation (DGSS) by leveraging Vision Foundation Models (VFMs). Unlike prior models that often rely on outdated backbones such as ResNet or VGGNet, this work demonstrates the robust potential of VFMs in domain generalization tasks, showcasing notable improvements in model performance with fewer trainable parameters.

Overview and Methodological Contribution

Initially, the authors assess VFMs in DGSS settings to establish baselines. They evaluate several VFMs, including CLIP, MAE, SAM, EVA02, and DINOv2, showing that these models outperform existing DGSS methods. These VFMs were originally pretrained on diverse, large-scale datasets, which contributes to their effectiveness in generalizing to unseen domains, a characteristic that is especially beneficial for tasks like semantic segmentation in urban scenes.

Central to the proposed approach is the "Rein" method, an innovative parameter-efficient fine-tuning strategy built upon lightweight learnable tokens. Rein modifies feature maps at each backbone layer, refining and forwarding these features to subsequent layers with precision. By employing fewer trainable parameters, Rein surpasses the performance achieved by full parameter fine-tuning across various datasets, demonstrating a remarkable trade-off between model complexity and generalization ability.
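The token-to-feature refinement idea can be illustrated with a minimal numpy sketch. This is not the authors' implementation (their code is in the linked repository); the function name `rein_refine` and the scaled-softmax similarity are illustrative assumptions showing how learnable tokens can residually refine one layer's feature maps before they are forwarded to the next layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rein_refine(feats, tokens):
    """Sketch of a Rein-style per-layer refinement.

    feats:  (n, c) patch features from one frozen backbone layer
    tokens: (m, c) learnable tokens, each loosely linked to an instance
    Returns refined features of the same shape, ready to be
    forwarded to the next backbone layer.
    """
    # Similarity between every patch feature and every token,
    # so different image regions pick up different refinements.
    sim = softmax(feats @ tokens.T / np.sqrt(feats.shape[1]), axis=-1)  # (n, m)
    # Token-weighted update, added residually to the frozen features.
    delta = sim @ tokens                                                # (n, c)
    return feats + delta

rng = np.random.default_rng(0)
f = rng.standard_normal((196, 64))  # e.g. 14x14 patches, 64-dim toy features
t = rng.standard_normal((10, 64))   # 10 learnable tokens
out = rein_refine(f, t)
print(out.shape)  # (196, 64)
```

Only the tokens (and any projection layers around them) would be trained; the backbone features `feats` come from the frozen VFM, which is what keeps the trainable-parameter count so small.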

Experimental Results and Analysis

The paper provides extensive empirical validation across multiple datasets and settings. In the primary experimental setting (GTAV → Cityscapes + BDD100K + Mapillary), Rein achieves notable mIoU improvements, surpassing existing methods by significant margins at lower parameter overhead. These findings illustrate that VFMs, when adapted using the Rein strategy, not only generalize well but are also efficient to fine-tune.
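For reference, the mIoU metric reported above is the per-class intersection-over-union averaged over classes, computed from a confusion matrix. A small self-contained sketch (the toy confusion matrix is made up for illustration):

```python
import numpy as np

def mean_iou(conf):
    """mIoU from a (k, k) confusion matrix: rows = ground truth, cols = prediction."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as class but wrong
    fn = conf.sum(axis=1) - tp          # missed pixels of the class
    iou = tp / np.maximum(tp + fp + fn, 1e-9)
    return iou.mean()

conf = np.array([[50, 2],
                 [3, 45]])
print(round(mean_iou(conf), 4))  # 0.9045
```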

Additionally, the paper evaluates the impact of different token lengths and ranks in the Rein method, offering insights into the optimal settings for balancing parameter count and performance. Across various VFMs, Rein exhibits consistent superiority, reinforcing its adaptability and efficacy in DGSS tasks.
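The rank ablation mentioned above reflects a standard parameter-saving trick: instead of storing a full token matrix, generate it from two low-rank factors. The arithmetic below is a hedged sketch with illustrative dimensions, not the paper's exact configuration.

```python
def token_param_count(m, c, r=None):
    """Parameters needed for a token matrix T of shape (m, c).

    Full tokens store m*c values; a rank-r factorization T = A @ B,
    with A of shape (m, r) and B of shape (r, c), stores m*r + r*c.
    """
    return m * c if r is None else m * r + r * c

m, c = 100, 1024                       # illustrative token length and channel dim
full = token_param_count(m, c)         # 100 * 1024 = 102,400
low = token_param_count(m, c, r=16)    # 100*16 + 16*1024 = 17,984
print(full, low)  # 102400 17984
```

Sweeping the rank `r` (and the token length `m`) trades parameter count against refinement capacity, which is the balance the paper's ablations probe.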

Implications and Future Research

The implications of this paper are multifold. On a practical level, the paper demonstrates a promising direction for deploying VFMs in scenarios where data diversity is paramount, such as autonomous driving and real-time urban scene understanding. Theoretically, it contributes to the ongoing discourse on efficient model adaptation and generalization in machine learning.

Moving forward, the integration of Rein with other foundational tasks, like instance and panoptic segmentation, presents a rich avenue for future research. Further exploration could also investigate Rein's effectiveness under diverse and challenging conditions, such as adverse weather or nighttime scenarios. The paper sets a robust foundation for more versatile and scalable segmentation systems, potentially bridging gaps between synthetic training environments and real-world application domains.
