Enhancements in Domain-Adaptive Semantic Segmentation through Advanced Network Architectures and Training Strategies
The paper "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation" explores innovations in the field of unsupervised domain adaptation (UDA) for semantic segmentation—the process of segmenting an image into its constituent parts without a significant amount of labeled data for the target domain. The authors propose DAFormer, a foundational shift in both network architecture and the training methodologies used for this task, aiming to bridge the gap between synthetic source domain data and real-world target domain data.
The motivation for this research is rooted in the high cost and labor of acquiring pixel-wise annotations for real-world images. It is often more practical to train on synthetic data and then adapt the resulting model to real-world images using UDA. However, previous methods primarily relied on network architectures such as DeepLabV2 and FCN8s, which are outdated compared to recent advances in network design.
DAFormer: Network Design and Adaptation Strategies
DAFormer, as presented in the paper, combines a Transformer-based encoder with a multi-level, context-aware feature fusion decoder. The choice of Transformers over conventional Convolutional Neural Networks (CNNs) is a significant design decision: the paper's experiments suggest that Transformers are more robust to the domain shift inherent in UDA, possibly because they capture more generalizable features across different input distributions.
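To make the decoder design concrete, the following is a minimal PyTorch sketch of a context-aware multi-level fusion decoder: features from all encoder stages are projected to a common width, upsampled to the highest feature resolution, concatenated, and fused by parallel dilated convolutions. The class name, channel widths, and dilation rates are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareFusionDecoder(nn.Module):
    """Sketch of a DAFormer-style decoder: project multi-level encoder
    features to a common width, upsample, concatenate, and fuse them with
    parallel dilated convolutions (an ASPP-like, context-aware fusion)."""

    def __init__(self, in_channels=(64, 128, 320, 512), embed_dim=256,
                 dilations=(1, 6, 12, 18), num_classes=19):
        super().__init__()
        # 1x1 projections that align each encoder stage to embed_dim channels
        self.projs = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels)
        # Parallel dilated 3x3 convolutions over the concatenated features,
        # each covering a different receptive field
        self.fusion = nn.ModuleList(
            nn.Conv2d(len(in_channels) * embed_dim, embed_dim, kernel_size=3,
                      padding=d, dilation=d) for d in dilations)
        self.classifier = nn.Conv2d(len(dilations) * embed_dim, num_classes, 1)

    def forward(self, feats):
        # feats: list of encoder outputs at strides 4, 8, 16, 32
        target_size = feats[0].shape[2:]
        aligned = [F.interpolate(proj(f), size=target_size, mode='bilinear',
                                 align_corners=False)
                   for proj, f in zip(self.projs, feats)]
        x = torch.cat(aligned, dim=1)
        fused = torch.cat([conv(x) for conv in self.fusion], dim=1)
        return self.classifier(fused)  # logits at 1/4 input resolution
```

Plain convolutions keep the sketch short; an efficiency-minded implementation would typically use depthwise separable convolutions in the fusion stage to reduce parameters.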
The authors demonstrate that using the Mix Transformer (MiT) as the backbone provides robustness and generalization benefits that CNNs do not fully exploit. Transformers are particularly well suited to synthetic-to-real transfer because they tend to focus on object shapes rather than textures, similar to human visual processing, which can translate into better segmentation on the target domain.
Training Techniques
The paper's exploration extends beyond architecture to training methodologies that mitigate overfitting to the source domain, a common pitfall in which the model learns source-specific patterns and noise that do not generalize to the target data. The three principal contributions, each illustrated with a short sketch after the list, are:
- Rare Class Sampling (RCS): Addresses data imbalance by increasing the sampling frequency of source images that contain rare classes, counteracting the dominating influence of common classes and improving performance on underrepresented categories.
- Thing-Class ImageNet Feature Distance (FD): Regularizes training with a distance penalty between the features of an ImageNet-pretrained model and those of the current encoder, computed only on "thing" classes, keeping the model focused on generic, transferable features rather than domain-specific artifacts of the synthetic source data.
- Learning Rate Warmup: Linearly ramps up the learning rate over the first training iterations, avoiding erratic early-stage updates that would otherwise distort the ImageNet-pretrained features.
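As a rough illustration of Rare Class Sampling, the sketch below builds a class sampling distribution that favors rare classes and then draws a source image containing the sampled class. This is a minimal NumPy sketch: the helper names and the temperature value are assumptions, and it illustrates the general idea of temperature-scaled sampling rather than the authors' exact implementation.

```python
import numpy as np

def rcs_class_probs(class_pixel_counts, temperature=0.01):
    """Rarer classes (lower pixel frequency) receive a higher sampling
    probability via a temperature-scaled softmax over (1 - frequency)."""
    freqs = class_pixel_counts / class_pixel_counts.sum()
    logits = (1.0 - freqs) / temperature
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_source_image(class_probs, images_with_class, rng=np.random):
    """Pick a class according to class_probs, then pick a source image
    that contains at least one pixel of that class."""
    c = rng.choice(len(class_probs), p=class_probs)
    return rng.choice(images_with_class[c]), c
```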
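The feature-distance regularizer can likewise be sketched as a masked L2 penalty between the current encoder's bottleneck features and those of a frozen ImageNet-pretrained encoder. The function name and the nearest-neighbor label downsampling below are simplifying assumptions (the paper describes a more careful downsampling of the label map); the sketch only illustrates the idea of restricting the penalty to thing-class pixels.

```python
import torch
import torch.nn.functional as F

def thing_class_feature_distance(student_feat, imagenet_feat, label, thing_ids):
    """Penalize the per-pixel L2 distance between current and frozen
    ImageNet features, only on pixels whose label is a 'thing' class."""
    # Per-pixel L2 distance between the two feature maps (B, C, h, w) -> (B, h, w)
    dist = torch.norm(student_feat - imagenet_feat, dim=1)
    # Downsample the label map to the feature resolution (simplified: nearest)
    small_label = F.interpolate(label.unsqueeze(1).float(), size=dist.shape[1:],
                                mode='nearest').squeeze(1).long()
    # Mask selecting thing-class pixels only
    mask = torch.zeros_like(dist, dtype=torch.bool)
    for c in thing_ids:
        mask |= (small_label == c)
    return dist[mask].mean() if mask.any() else dist.new_zeros(())
```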
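Finally, learning-rate warmup itself is simple to express: the rate is ramped up linearly from zero over the first iterations before the regular schedule takes over. The base learning rate and warmup length below are illustrative assumptions, not the paper's exact hyperparameters.

```python
def warmup_lr(step, base_lr=6e-5, warmup_steps=1500):
    """Linearly ramp the learning rate from 0 to base_lr over the first
    warmup_steps iterations, then return base_lr (a decay schedule would
    typically follow in practice)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Example usage with a PyTorch optimizer, updated once per iteration:
# for group in optimizer.param_groups:
#     group['lr'] = warmup_lr(iteration)
```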
Empirical Evidence and Implications
The empirical results of combining DAFormer with the proposed training strategies are compelling: mean Intersection over Union (mIoU) improves over the previous state of the art by 10.8 points on the GTA to Cityscapes benchmark and 5.4 points on the Synthia to Cityscapes benchmark. Such results substantiate the claim that modern network architectures, coupled with carefully designed training practices, can substantially advance UDA for semantic segmentation.
Future Directions
This work has several implications for the future of UDA and, potentially, for broader machine-vision tasks. As UDA increasingly relies on sophisticated architectures such as Transformers, further research could explore fine-tuning these architectures specifically for segmentation. Moreover, integrating UDA with semi-supervised learning, where a few labeled target-domain instances also inform training, looks promising for improving accuracy while keeping annotation costs low. Another intriguing direction is combining domain-invariant and domain-specific feature learning to improve performance in diverse and dynamically evolving real-world environments.
In closing, DAFormer exemplifies a sophisticated convergence of new architecture and strategic training, representing a significant step forward for practical, scalable, and efficient solutions in domain-adaptive semantic segmentation.