Domain Adaptive and Generalizable Network Architectures and Training Strategies for Semantic Image Segmentation (2304.13615v2)
Abstract: Unsupervised domain adaptation (UDA) and domain generalization (DG) enable machine learning models trained on a source domain to perform well on unlabeled or even unseen target domains. As previous UDA&DG semantic segmentation methods are mostly based on outdated networks, we benchmark more recent architectures, reveal the potential of Transformers, and design the DAFormer network tailored for UDA&DG. DAFormer is enabled by three training strategies to avoid overfitting to the source domain: while (1) Rare Class Sampling mitigates the bias toward common source-domain classes, (2) a Thing-Class ImageNet Feature Distance and (3) a learning rate warmup promote feature transfer from ImageNet pretraining. As UDA&DG are usually GPU memory intensive, most previous methods downscale or crop images. However, low-resolution predictions often fail to preserve fine details, while models trained with cropped images fall short in capturing long-range, domain-robust context information. Therefore, we propose HRDA, a multi-resolution framework for UDA&DG, which combines the strengths of small high-resolution crops to preserve fine segmentation details and large low-resolution crops to capture long-range context dependencies with a learned scale attention. DAFormer and HRDA significantly improve the state of the art in UDA&DG by more than 10 mIoU on five different benchmarks. The implementation is available at https://github.com/lhoyer/HRDA.
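To make Rare Class Sampling concrete, the following is a minimal Python/NumPy sketch of the class-balanced sampling step. It assumes the temperature-controlled softmax over class rarity used in the DAFormer formulation, P(c) ∝ exp((1 − f_c)/T) with f_c the source-domain pixel frequency of class c; the names `class_freq`, `images_with_class`, and `temperature` are illustrative and not part of the released implementation.

```python
import random
import numpy as np

def sample_rare_class_image(class_freq, images_with_class, temperature=0.01):
    """Sample a source image with a bias toward rare classes.

    class_freq: dict class_id -> pixel frequency f_c in [0, 1]
    images_with_class: dict class_id -> list of images containing that class
    temperature: lower values concentrate sampling on rarer classes
    """
    classes = list(class_freq.keys())
    rarity = np.array([1.0 - class_freq[c] for c in classes]) / temperature
    probs = np.exp(rarity - rarity.max())  # subtract max for numerical stability
    probs /= probs.sum()                   # softmax over class rarity
    sampled_class = classes[np.random.choice(len(classes), p=probs)]
    # Only images that actually contain the sampled class are candidates,
    # so rare classes appear in training batches more often.
    return random.choice(images_with_class[sampled_class])
```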
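The learned scale attention in HRDA can likewise be sketched as a per-pixel weighted fusion of the low-resolution context prediction and the high-resolution detail prediction. This PyTorch sketch only illustrates the fusion implied by the abstract; the tensor shapes and the sigmoid-normalized attention map are assumptions, and the released HRDA code additionally handles the placement of the detail crop within the context crop.

```python
import torch
import torch.nn.functional as F

def fuse_scales(ctx_logits, detail_logits, scale_attention):
    """Fuse context and detail predictions with learned scale attention.

    ctx_logits:      (B, C, h, w) logits from the downscaled context crop
    detail_logits:   (B, C, H, W) logits from the high-resolution detail crop
    scale_attention: (B, 1, h, w) sigmoid map predicted by an attention head;
                     values near 1 favor the high-resolution detail stream
    """
    size = detail_logits.shape[-2:]
    # Upsample the low-resolution streams to the detail resolution.
    ctx = F.interpolate(ctx_logits, size=size, mode='bilinear', align_corners=False)
    att = F.interpolate(scale_attention, size=size, mode='bilinear', align_corners=False)
    # Per-pixel convex combination of the two scales.
    return (1.0 - att) * ctx + att * detail_logits
```

In this form, the attention head can learn where fine high-resolution detail matters (e.g., thin structures) and where the broader low-resolution context is more domain-robust.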
- M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016, pp. 3213–3223.
- C. Sakaridis, D. Dai, and L. Van Gool, “ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding,” in ICCV, 2021, pp. 10765–10775.
- G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in CVPR, 2016, pp. 3234–3243.
- S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in ECCV, 2016, pp. 102–118.
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” PAMI, vol. 40, no. 4, pp. 834–848, 2017.
- Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker, “Learning to adapt structured output space for semantic segmentation,” in CVPR, 2018, pp. 7472–7481.
- Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations for semantic segmentation,” in ECCV, 2020, pp. 173–190.
- A. Tao, K. Sapra, and B. Catanzaro, “Hierarchical multi-scale attention for semantic segmentation,” arXiv preprint arXiv:2005.10821, 2020.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
- E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” in NeurIPS, 2021.
- S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, and A. Veit, “Understanding robustness of transformers for image classification,” in ICCV, 2021, pp. 10231–10241.
- P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
- W. Tranheden, V. Olsson, J. Pinto, and L. Svensson, “DACS: Domain adaptation via cross-domain mixed sampling,” in WACV, 2021, pp. 1379–1389.
- N. Araslanov and S. Roth, “Self-supervised augmentation consistency for adapting semantic segmentation,” in CVPR, 2021, pp. 15384–15394.
- Q. Wang, D. Dai, L. Hoyer, O. Fink, and L. Van Gool, “Domain adaptive semantic segmentation with self-supervised depth estimation,” in ICCV, 2021, pp. 8515–8525.
- J. Huang, S. Lu, D. Guan, and X. Zhang, “Contextual-relation consistent domain adaptation for semantic segmentation,” in ECCV, 2020, pp. 705–722.
- J. Yang, W. An, C. Yan, P. Zhao, and J. Huang, “Context-aware domain adaptation in semantic segmentation,” in WACV, 2021, pp. 514–524.
- L. Hoyer, D. Dai, and L. Van Gool, “DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation,” in CVPR, 2022.
- ——, “HRDA: Context-aware high-resolution domain-adaptive semantic segmentation,” in ECCV, 2022.
- P. Zhang, B. Zhang, T. Zhang, D. Chen, Y. Wang, and F. Wen, “Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation,” in CVPR, 2021, pp. 12414–12424.
- J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015, pp. 3431–3440.
- L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in ECCV, 2018, pp. 801–818.
- H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in CVPR, 2018, pp. 7151–7160.
- L. Hoyer, M. Munoz, P. Katiyar, A. Khoreva, and V. Fischer, “Grid saliency for context explanations of semantic segmentation,” in NeurIPS, 2019, pp. 6462–6473.
- X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in CVPR, 2018, pp. 7794–7803.
- J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in CVPR, 2019, pp. 3146–3154.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
- D. Lin, D. Shen, S. Shen, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang, “ZigZagNet: Fusing top-down and bottom-up context for object segmentation,” in CVPR, 2019, pp. 7490–7499.
- L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in CVPR, 2016, pp. 3640–3649.
- S. Yang and G. Peng, “Attention to refine through multi scales for semantic segmentation,” in Pacific Rim Conference on Multimedia, 2018, pp. 232–241.
- D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” in ICLR, 2019.
- D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in ICCV, 2021, pp. 8340–8349.
- M. Naseer, K. Ranasinghe, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Intriguing Properties of Vision Transformers,” in NeurIPS, 2021.
- R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness,” in ICLR, 2019.
- C. Kamann and C. Rother, “Benchmarking the robustness of semantic segmentation models with respect to common corruptions,” IJCV, vol. 129, no. 2, pp. 462–483, 2021.
- X. Pan, P. Luo, J. Shi, and X. Tang, “Two at once: Enhancing learning and generalization capacities via IBN-Net,” in ECCV, 2018, pp. 464–479.
- S. Choi, S. Jung, H. Yun, J. T. Kim, S. Kim, and J. Choo, “RobustNet: Improving domain generalization in urban-scene segmentation via instance selective whitening,” in CVPR, 2021, pp. 11580–11590.
- X. Yue, Y. Zhang, S. Zhao, A. Sangiovanni-Vincentelli, K. Keutzer, and B. Gong, “Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data,” in ICCV, 2019, pp. 2100–2110.
- D. Peng, Y. Lei, L. Liu, P. Zhang, and J. Liu, “Global and local texture randomization for synthetic-to-real semantic segmentation,” TIP, vol. 30, pp. 6594–6608, 2021.
- Y. Zhao, Z. Zhong, N. Zhao, N. Sebe, and G. H. Lee, “Style-hallucinated dual consistency learning for domain generalized semantic segmentation,” in ECCV, 2022.
- Z. Zhong, Y. Zhao, G. H. Lee, and N. Sebe, “Adversarial style augmentation for domain generalized urban-scene segmentation,” in NeurIPS, 2022.
- Y. Zhao, Z. Zhong, N. Zhao, N. Sebe, and G. H. Lee, “Style-hallucinated dual consistency learning: A unified framework for visual domain generalization,” arXiv preprint arXiv:2212.09068, 2022.
- J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “CyCADA: Cycle-consistent adversarial domain adaptation,” in ICML, 2018, pp. 1989–1998.
- R. Gong, W. Li, Y. Chen, D. Dai, and L. Van Gool, “DLOW: Domain flow and applications,” IJCV, vol. 129, no. 10, pp. 2865–2888, 2021.
- J. Hoffman, D. Wang, F. Yu, and T. Darrell, “FCNs in the wild: Pixel-level adversarial and constraint-based adaptation,” arXiv preprint arXiv:1612.02649, 2016.
- T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, “ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation,” in CVPR, 2019, pp. 2517–2526.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NeurIPS, 2014, pp. 2672–2680.
- Y. Zou, Z. Yu, B. Kumar, and J. Wang, “Unsupervised domain adaptation for semantic segmentation via class-balanced self-training,” in ECCV, 2018, pp. 289–305.
- D. Dai, C. Sakaridis, S. Hecker, and L. Van Gool, “Curriculum model adaptation with synthetic and real data for semantic foggy scene understanding,” IJCV, vol. 128, no. 5, pp. 1182–1204, 2020.
- D. Dai and L. Van Gool, “Dark model adaptation: Semantic image segmentation from daytime to nighttime,” in ITSC, 2018, pp. 3819–3824.
- L. Hoyer, D. Dai, H. Wang, and L. Van Gool, “MIC: Masked image consistency for context-enhanced domain adaptation,” in CVPR, 2023.
- Q. Zhou, Z. Feng, Q. Gu, J. Pang, G. Cheng, X. Lu, J. Shi, and L. Ma, “Context-aware mixup for domain adaptive semantic segmentation,” in WACV, 2023.
- L. Hoyer, D. Dai, Q. Wang, Y. Chen, and L. Van Gool, “Improving semi-supervised and domain-adaptive semantic segmentation with self-supervised depth estimation,” arXiv preprint arXiv:2108.12545, 2021.
- Y.-X. Wang, D. Ramanan, and M. Hebert, “Learning to model the tail,” in NeurIPS, 2017, pp. 7032–7042.
- V. Prabhu, S. Khare, D. Kartik, and J. Hoffman, “SENTRY: Selective entropy optimization via committee consistency for unsupervised domain adaptation,” in ICCV, 2021, pp. 8558–8567.
- Z. Li and D. Hoiem, “Learning without forgetting,” PAMI, vol. 40, no. 12, pp. 2935–2947, 2017.
- L. Hoyer, D. Dai, Y. Chen, A. Köring, S. Saha, and L. Van Gool, “Three ways to improve semantic segmentation with self-supervised depth estimation,” in CVPR, 2021, pp. 11130–11140.
- Y. Chen, W. Li, and L. Van Gool, “ROAD: Reality oriented adaptation for semantic segmentation of urban scenes,” in CVPR, 2018, pp. 7892–7901.
- H. Wang, T. Shen, W. Zhang, L.-Y. Duan, and T. Mei, “Classes matter: A fine-grained adversarial approach to cross-domain semantic segmentation,” in ECCV, 2020, pp. 642–659.
- M. N. Subhani and M. Ali, “Learning from scale-invariant examples for domain adaptation in semantic segmentation,” in ECCV, 2020, pp. 290–306.
- J. Iqbal and M. Ali, “MLSL: Multi-level self-supervised learning for domain adaptation with spatially independent and semantically consistent labeling,” in WACV, 2020, pp. 1864–1873.
- X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in ICCV, 2017, pp. 1501–1510.
- Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang, “Confidence regularized self-training,” in ICCV, 2019, pp. 5982–5991.
- A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in NeurIPS, 2017, pp. 1195–1204.
- V. Olsson, W. Tranheden, J. Pinto, and L. Svensson, “ClassMix: Segmentation-based data augmentation for semi-supervised learning,” in WACV, 2021, pp. 1369–1378.
- S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, and L. Zhang, “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in CVPR, 2021, pp. 6881–6890.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” in ICLR, 2020.
- X. Lai, Z. Tian, X. Xu, Y. Chen, S. Liu, H. Zhao, L. Wang, and J. Jia, “DecoupleNet: Decoupled network for domain adaptive semantic segmentation,” in ECCV, 2022, pp. 369–387.
- C. Sakaridis, D. Dai, and L. Van Gool, “Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation,” in ICCV, 2019, pp. 7374–7383.
- C. Sakaridis, D. Dai, and L. Van Gool, “Map-guided curriculum domain adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation,” PAMI, 2020.
- X. Wu, Z. Wu, H. Guo, L. Ju, and S. Wang, “DANNet: A one-stage domain adaptation network for unsupervised nighttime semantic segmentation,” in CVPR, 2021, pp. 15769–15778.
- F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, “BDD100K: A diverse driving dataset for heterogeneous multitask learning,” in CVPR, 2020, pp. 2636–2645.
- G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, “The mapillary vistas dataset for semantic understanding of street scenes,” in ICCV, 2017, pp. 4990–4999.
- MMSegmentation Contributors, “MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark,” https://github.com/open-mmlab/mmsegmentation, 2020.
- J. Huang, D. Guan, A. Xiao, and S. Lu, “FSDR: Frequency space domain randomization for domain generalization,” in CVPR, 2021, pp. 6891–6902.
- D. Peng, Y. Lei, M. Hayat, Y. Guo, and W. Li, “Semantic-aware domain generalized segmentation,” in CVPR, 2022, pp. 2594–2605.
- D. Bashkirova, S. Mishra, D. Lteif, P. Teterwak, D. Kim, F. Alladkani, J. Akl, B. Calli, S. A. Bargal, K. Saenko et al., “VisDA 2022 challenge: Domain adaptation for industrial waste sorting,” arXiv preprint arXiv:2303.14828, 2023.
- S. Saha, L. Hoyer, A. Obukhov, D. Dai, and L. Van Gool, “EDAPS: Enhanced domain-adaptive panoptic segmentation,” in ICCV, 2023.
- J. Xia, N. Yokoya, B. Adriano, and C. Broni-Bediako, “OpenEarthMap: A benchmark dataset for global high-resolution land cover mapping,” in WACV, 2023, pp. 6254–6264.
- P. P. Rao, F. Qiao, W. Zhang, Y. Xu, Y. Deng, G. Wu, and Q. Zhang, “QuadFormer: Quadruple transformer for unsupervised domain adaptation in power line segmentation of aerial images,” arXiv preprint arXiv:2211.16988, 2022.
- L. Huang, Y. Yuan, J. Guo, C. Zhang, X. Chen, and J. Wang, “Interlaced sparse self-attention for semantic segmentation,” arXiv preprint arXiv:1907.12273, 2019.
- H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha et al., “ResNeSt: Split-attention networks,” arXiv preprint arXiv:2004.08955, 2020.
- T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in ECCV, 2018, pp. 418–434.