
Learning Content-enhanced Mask Transformer for Domain Generalized Urban-Scene Segmentation (2307.00371v5)

Published 1 Jul 2023 in cs.CV

Abstract: Domain-generalized urban-scene semantic segmentation (USSS) aims to learn generalized semantic predictions across diverse urban-scene styles. Unlike domain gap challenges, USSS is unique in that the semantic categories are often similar in different urban scenes, while the styles can vary significantly due to changes in urban landscapes, weather conditions, lighting, and other factors. Existing approaches typically rely on convolutional neural networks (CNNs) to learn the content of urban scenes. In this paper, we propose a Content-enhanced Mask TransFormer (CMFormer) for domain-generalized USSS. The main idea is to enhance the focus of the fundamental component, the mask attention mechanism, in Transformer segmentation models on content information. To achieve this, we introduce a novel content-enhanced mask attention mechanism. It learns mask queries from both the image feature and its down-sampled counterpart, as lower-resolution image features usually contain more robust content information and are less sensitive to style variations. These features are fused into a Transformer decoder and integrated into a multi-resolution content-enhanced mask attention learning scheme. Extensive experiments conducted on various domain-generalized urban-scene segmentation datasets demonstrate that the proposed CMFormer significantly outperforms existing CNN-based methods for domain-generalized semantic segmentation, achieving improvements of up to 14.00% in terms of mIoU (mean intersection over union). The source code is publicly available at https://github.com/BiQiWHU/CMFormer.
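For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how mIoU can be computed from predicted and ground-truth labels. It is an illustration only, not the authors' evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes that appear in neither prediction nor ground truth are all assumptions made here. Benchmark implementations typically accumulate intersections and unions over a whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for two equal-length label sequences.

    pred, target: flat sequences of integer class labels (one per pixel).
    Classes absent from both pred and target are skipped (an assumption
    of this sketch; other conventions exist).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. An improvement "in terms of mIoU" as reported in the abstract is a difference between two such scores, expressed in percentage points.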
