
Fully Convolutional Adaptation Networks for Semantic Segmentation (1804.08286v1)

Published 23 Apr 2018 in cs.CV

Abstract: The recent advances in deep neural networks have convincingly demonstrated high capability in learning vision models on large datasets. Nevertheless, collecting expert-labeled datasets, especially with pixel-level annotations, is an extremely expensive process. An appealing alternative is to render synthetic data (e.g., computer games) and generate ground truth automatically. However, simply applying the models learnt on synthetic images may lead to high generalization error on real images due to domain shift. In this paper, we address this issue from the perspectives of both visual appearance-level and representation-level domain adaptation. The former adapts source-domain images to appear as if drawn from the "style" in the target domain and the latter attempts to learn domain-invariant representations. Specifically, we present Fully Convolutional Adaptation Networks (FCAN), a novel deep architecture for semantic segmentation which combines Appearance Adaptation Networks (AAN) and Representation Adaptation Networks (RAN). AAN learns a transformation from one domain to the other in the pixel space and RAN is optimized in an adversarial learning manner to maximally fool the domain discriminator with the learnt source and target representations. Extensive experiments are conducted on the transfer from GTA5 (game videos) to Cityscapes (urban street scenes) on semantic segmentation and our proposal achieves superior results compared to state-of-the-art unsupervised adaptation techniques. More remarkably, we obtain a new record: mIoU of 47.5% on BDDS (drive-cam videos) in an unsupervised setting.

Authors (5)
  1. Yiheng Zhang (19 papers)
  2. Zhaofan Qiu (37 papers)
  3. Ting Yao (127 papers)
  4. Dong Liu (267 papers)
  5. Tao Mei (209 papers)
Citations (341)

Summary

Fully Convolutional Adaptation Networks for Semantic Segmentation

The paper "Fully Convolutional Adaptation Networks for Semantic Segmentation" addresses the challenges posed by domain shifts when applying deep learning models trained on synthetic data to real-world images. Collecting and annotating pixel-level datasets is a costly endeavor; thus, leveraging synthetic data with automatic ground truth generation presents a compelling alternative. However, the domain discrepancy leads to an increase in generalization error, a complication the authors confront through novel dual adaptation techniques.

The proposed methodology, Fully Convolutional Adaptation Networks (FCAN), incorporates both appearance-level and representation-level adaptation mechanisms. The architecture comprises two main components: Appearance Adaptation Networks (AAN) and Representation Adaptation Networks (RAN). AAN narrows the visual domain gap by adjusting the appearance of source-domain images to mimic the target domain, using style statistics extracted as feature correlations. RAN, in turn, employs an adversarial learning strategy akin to generative adversarial networks (GANs) to learn domain-invariant feature representations: a domain discriminator is trained to distinguish source features from target features, while the feature extractor learns to confuse the discriminator.
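To make the two objectives concrete, the sketch below shows, in PyTorch-style code that is not the authors' implementation, a Gram-matrix style loss in the spirit of AAN and an adversarial domain-confusion loss in the spirit of RAN. The function names, layer choices, and loss balancing are assumptions for illustration only.

```python
# Illustrative sketch (not the FCAN source code) of the two loss ideas:
# a Gram-matrix style loss for appearance adaptation and an adversarial
# domain-confusion loss for representation adaptation.
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Channel-wise feature correlations, a common 'style' statistic."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def appearance_losses(adapted_feats, source_feats, target_feats):
    """AAN-like objective: keep source content, match target style."""
    content = sum(F.mse_loss(a, s) for a, s in zip(adapted_feats, source_feats))
    style = sum(F.mse_loss(gram_matrix(a), gram_matrix(t))
                for a, t in zip(adapted_feats, target_feats))
    return content, style

def adversarial_losses(discriminator, source_repr, target_repr):
    """RAN-like objective: discriminator separates domains, extractor fools it."""
    d_src = discriminator(source_repr)   # logits, label 1 = source
    d_tgt = discriminator(target_repr)   # logits, label 0 = target
    d_loss = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
              + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    # The feature extractor is updated so that target features look like source ones.
    g_loss = F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt))
    return d_loss, g_loss
```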

The authors conduct extensive experiments on unsupervised domain adaptation from synthetic video game data (GTA5) to real-world urban street scenes (Cityscapes). FCAN outperforms state-of-the-art unsupervised adaptation approaches, achieving a mean Intersection over Union (mIoU) of 46.6%, an improvement over the competitive baseline FCNWild, which scored 42.04%. Adding multi-scale inference further raises FCAN's performance to 47.75% mIoU, highlighting the value of capturing features at several granularities.
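The summary above does not pin down the exact multi-scale recipe, but a common form of multi-scale test-time inference, averaging class probabilities over several input resolutions, can be sketched as follows; the scales and fusion rule here are assumptions.

```python
# Hedged sketch of multi-scale inference; scales and fusion are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_predict(model, image, scales=(0.75, 1.0, 1.25)):
    """Average per-pixel class probabilities over several input resolutions."""
    _, _, h, w = image.shape
    fused = 0.0
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode='bilinear',
                                align_corners=False)
        logits = model(resized)                               # (B, C, h', w')
        logits = F.interpolate(logits, size=(h, w), mode='bilinear',
                               align_corners=False)
        fused = fused + logits.softmax(dim=1)
    return (fused / len(scales)).argmax(dim=1)                # per-pixel labels
```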

Significantly, the paper also extends its evaluation to another dataset, BDDS, demonstrating the robustness of FCAN. The mIoU on BDDS improves markedly from 39.37% with FCNWild to 47.53% when an ensemble version of FCAN is employed, combining models with different backbones (e.g., ResNet-101, ResNet-152, SENet). These empirical results strongly validate the effectiveness of the proposed dual adaptation strategy in mitigating domain shift for semantic segmentation.
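A simple way to realize such an ensemble, again as an illustrative assumption rather than the authors' exact procedure, is to average the per-pixel class probabilities produced by each backbone:

```python
# Illustrative ensemble fusion: average softmax outputs of several
# independently trained segmentation networks (backbone names are placeholders).
import torch

@torch.no_grad()
def ensemble_predict(models, image):
    """Average per-pixel class probabilities across models, then take argmax."""
    probs = torch.stack([m(image).softmax(dim=1) for m in models], dim=0)
    return probs.mean(dim=0).argmax(dim=1)   # (B, H, W) label map
```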

The implications of this research for both theory and practice are considerable. Theoretically, the dual adaptation framework contributes to understanding how cross-domain invariances can be structured within a deep learning model, particularly for pixel-level tasks like semantic segmentation. Practically, this approach paves the way for adopting synthetic datasets more reliably across various real-world problems without the steep costs of exhaustive manual annotation.

Moving forward, further work could improve the visual fidelity and transfer quality of the adapted images produced by AAN. Broadening the FCAN framework to other segmentation settings, such as indoor scenes or object-specific segmentation like human portraits, also holds potential for wider applicability. The dual-pathway design of FCAN serves as a promising template for future research in domain adaptation for computer vision.