
Domain Adaptive and Generalizable Network Architectures and Training Strategies for Semantic Image Segmentation (2304.13615v2)

Published 26 Apr 2023 in cs.CV

Abstract: Unsupervised domain adaptation (UDA) and domain generalization (DG) enable machine learning models trained on a source domain to perform well on unlabeled or even unseen target domains. As previous UDA&DG semantic segmentation methods are mostly based on outdated networks, we benchmark more recent architectures, reveal the potential of Transformers, and design the DAFormer network tailored for UDA&DG. It is enabled by three training strategies to avoid overfitting to the source domain: While (1) Rare Class Sampling mitigates the bias toward common source domain classes, (2) a Thing-Class ImageNet Feature Distance and (3) a learning rate warmup promote feature transfer from ImageNet pretraining. As UDA&DG are usually GPU memory intensive, most previous methods downscale or crop images. However, low-resolution predictions often fail to preserve fine details while models trained with cropped images fall short in capturing long-range, domain-robust context information. Therefore, we propose HRDA, a multi-resolution framework for UDA&DG, that combines the strengths of small high-resolution crops to preserve fine segmentation details and large low-resolution crops to capture long-range context dependencies with a learned scale attention. DAFormer and HRDA significantly improve the state-of-the-art UDA&DG by more than 10 mIoU on 5 different benchmarks. The implementation is available at https://github.com/lhoyer/HRDA.


Summary

  • The paper presents DAFormer, a Transformer-based network with context-aware multi-level feature fusion, and HRDA, a multi-resolution framework, both designed for domain-adaptive and domain-generalizable semantic segmentation.
  • Together they significantly improve mIoU on five benchmarks, including a +16.3 mIoU gain on GTA→Cityscapes.
  • Training strategies such as Rare Class Sampling, a Thing-Class ImageNet Feature Distance, and learning rate warmup curb overfitting to the source domain and advance robust training for real-world applications.

Domain Adaptive and Generalizable Network Architectures for Semantic Image Segmentation

The paper, "Domain Adaptive and Generalizable Network Architectures and Training Strategies for Semantic Image Segmentation," presents a comprehensive approach to tackle the challenges of domain adaptation and generalization in semantic image segmentation. These tasks are pivotal in enabling models to effectively perform in environments or domains that are unseen or unlabeled, an area of immense interest given its potential applications in fields such as autonomous driving.

Key Contributions

The authors make two main contributions, DAFormer and HRDA: a network architecture and a multi-resolution framework, complemented by training strategies designed to enhance domain robustness.

  1. DAFormer Network Architecture: DAFormer builds on a Transformer encoder, reflecting the observation that Transformers are more robust and generalize across domains better than the outdated CNN backbones used by most prior UDA methods. Its decoder performs context-aware fusion of multi-level features, which matters for semantic segmentation, where both fine detail and broad contextual understanding of the image are required.
  2. HRDA Multi-Resolution Framework: HRDA addresses the GPU-memory-intensive nature of unsupervised domain adaptation (UDA) by combining small high-resolution crops, which preserve fine segmentation details, with large low-resolution crops, which capture long-range context dependencies; a learned scale attention adaptively fuses the two predictions (a minimal sketch of this fusion follows this list).
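
The fusion step can be illustrated with a short, self-contained PyTorch sketch. This is not the authors' implementation; the module and argument names (`MultiResolutionFusion`, `ctx_feats`, `feat_channels`) are hypothetical, and the attention head is simplified to a two-layer convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionFusion(nn.Module):
    """Illustrative HRDA-style fusion: a low-resolution 'context' prediction
    and a high-resolution 'detail' prediction are blended per pixel by a
    learned scale attention."""

    def __init__(self, num_classes: int, feat_channels: int = 256):
        super().__init__()
        # Lightweight attention head: predicts one weight per pixel.
        self.scale_attention = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, 1, 1),
        )

    def forward(self, ctx_logits, detail_logits, ctx_feats):
        # ctx_logits: (B, C, h, w) from the large low-resolution crop.
        # detail_logits: (B, C, H, W) from the small high-resolution crop.
        # ctx_feats: (B, F, h, w) decoder features of the context branch.
        H, W = detail_logits.shape[-2:]
        # Upsample the context prediction and attention map to the
        # detail resolution before blending.
        ctx_up = F.interpolate(ctx_logits, size=(H, W),
                               mode='bilinear', align_corners=False)
        attn = torch.sigmoid(F.interpolate(self.scale_attention(ctx_feats),
                                           size=(H, W), mode='bilinear',
                                           align_corners=False))
        # attn near 1: trust the fine high-resolution prediction (small
        # objects, boundaries); attn near 0: trust the context prediction
        # (large homogeneous regions).
        return attn * detail_logits + (1.0 - attn) * ctx_up
```

The sigmoid attention lets the network decide, per pixel, whether the detail branch or the context branch is more reliable, rather than averaging the two scales uniformly.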

Training Strategies and Regularization

The paper also discusses training strategies that mitigate overfitting to the source domain, a common challenge when models are trained only on source labels. These strategies include:

  • Rare Class Sampling (RCS): This approach addresses the long-tail class imbalance of the source domain by sampling images that contain rare classes more often, so that these classes are learned early and the model does not drift toward common-class predictions (a short sketch follows this list).
  • Thing-Class ImageNet Feature Distance (FD): This regularizer keeps the encoder's features close to those of a frozen ImageNet-pretrained model in image regions belonging to thing classes (e.g., vehicles and pedestrians), preserving expressive pretrained features that training on synthetic source data would otherwise degrade.
  • Learning Rate Warmup: Linearly increasing the learning rate at the start of training stabilizes optimization and protects the pretrained features from distortion by large early gradients.
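
As a concrete illustration of RCS, the sketch below computes a softmax-style sampling distribution over classes from their pixel frequencies and then draws a source image containing the sampled class. It is a minimal sketch under stated assumptions: `class_pixel_counts` and the `images_with_class` index are assumed to be precomputed offline, and the function names and temperature default are illustrative:

```python
import random
import numpy as np

def rcs_probabilities(class_pixel_counts, temperature=0.01):
    """Rare Class Sampling distribution: classes with low pixel
    frequency f_c receive a higher sampling probability, with the
    sharpness controlled by a temperature T."""
    counts = np.asarray(class_pixel_counts, dtype=np.float64)
    freq = counts / counts.sum()              # f_c, per-class pixel frequency
    logits = (1.0 - freq) / temperature       # rarer class -> larger logit
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    return weights / weights.sum()

def sample_training_image(class_pixel_counts, images_with_class, rng=random):
    """Pick a class c ~ P(c), then uniformly pick a source image that
    contains at least one pixel of c. images_with_class is a hypothetical
    precomputed index: class id -> list of image paths."""
    p = rcs_probabilities(class_pixel_counts)
    c = rng.choices(range(len(p)), weights=p.tolist(), k=1)[0]
    return rng.choice(images_with_class[c])
```

A large temperature flattens the distribution toward uniform class sampling, while a small temperature concentrates probability on the rarest classes.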

Results and Implications

The proposed methods, DAFormer and HRDA, were validated across several benchmarks, showing significant improvements in mean Intersection over Union (mIoU) and establishing new state-of-the-art results for synthetic-to-real and adverse-condition domain adaptation. For example, on the GTA→Cityscapes task, HRDA improves over previous state-of-the-art methods by a remarkable +16.3 mIoU.
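
For context, mIoU averages the per-class intersection over union between predicted and ground-truth segmentation masks:

```latex
\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}
```

where TP_c, FP_c, and FN_c are the pixel-level true positives, false positives, and false negatives for class c, so a +16.3 gain corresponds to 16.3 percentage points of this score.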

The implications of this research extend beyond improved performance metrics. The method's use of more generalizable network architectures like Transformers opens avenues for further exploration of domain-agnostic training regimes and potentially reduces the need for extensive domain-specific fine-tuning. The proposed methods, together with their open-source implementation, offer a robust framework for researchers aiming to develop domain-agnostic models.

Future Directions

Future research could explore more refined fusion strategies in multi-resolution settings and further investigate the interplay between resolution and scale in domain robustness. Additionally, integrating these domain adaptation techniques into unsupervised and self-supervised learning paradigms may further enhance generalization across a broader set of tasks and domains.

Overall, this work represents a significant stride in advancing the field of domain adaptive semantic segmentation, offering practical tools and insights for ongoing challenges in machine learning applications across diverse environments.