Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization
The paper presents a novel approach to semantic segmentation that leverages generative models to enable semi-supervised learning and improve out-of-domain generalization. The approach addresses the challenge of training deep networks with limited labeled data, reducing the significant human annotation costs endemic to pixel-level tasks like semantic segmentation.
Framework and Methodology
The proposed method builds a generative adversarial network (GAN) that models the joint distribution of images and labels, synthesizing both from a shared latent code. This setup allows training on a large set of unlabeled images supplemented with only a few labeled ones, yielding a semi-supervised training regime. The GAN architecture builds on StyleGAN2, augmented with a label synthesis branch. Notably, the model trains with purely adversarial objectives, without relying on pixel-wise supervision losses such as cross-entropy.
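The core idea can be sketched in miniature: one latent code drives both an image head and a label head, so every sample is a coherent (image, mask) pair and a discriminator can score the pair jointly. The sketch below is a toy numpy stand-in, not the paper's StyleGAN2-based architecture; names such as `JointGenerator` and all layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class JointGenerator:
    """Toy joint generator: a single latent code feeds a shared trunk,
    then splits into an image head and a label head, so images and
    segmentation masks are synthesized together. Illustrative stand-in
    for the paper's StyleGAN2-based model, not its actual architecture."""

    def __init__(self, latent_dim=8, hidden=16, img_pixels=4, n_classes=3):
        self.W_shared = rng.normal(size=(latent_dim, hidden))          # shared trunk
        self.W_img = rng.normal(size=(hidden, img_pixels))             # image head
        self.W_lbl = rng.normal(size=(hidden, img_pixels * n_classes)) # label head
        self.img_pixels, self.n_classes = img_pixels, n_classes

    def __call__(self, z):
        h = np.tanh(z @ self.W_shared)              # shared features
        image = np.tanh(h @ self.W_img)             # fake image in [-1, 1]
        logits = (h @ self.W_lbl).reshape(-1, self.img_pixels, self.n_classes)
        mask = logits.argmax(axis=-1)               # per-pixel class labels
        return image, mask

G = JointGenerator()
z = rng.normal(size=(2, 8))                         # batch of two latent codes
image, mask = G(z)
# In the paper's setup, a discriminator scores the (image, mask) pair,
# providing the only training signal -- no per-pixel cross-entropy is used.
```

Because the two heads share the trunk, the mask cannot drift independently of the image; the adversarial pair-level signal is what couples them.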
Practical Implementation and Evaluation
For labeling an image in practice, the paper proposes embedding the target image into the joint latent space using an encoder network followed by test-time optimization; the label is then generated from the inferred latent representation. Evaluations cover medical image segmentation and face part segmentation, demonstrating competitive in-domain results and notable out-of-domain generalization, such as transferring from CT to MRI in medical imaging, and from photographs of real faces to paintings and cartoons.
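The embed-then-optimize procedure can be illustrated with a toy linear generator: an initial latent guess (standing in for the encoder's output) is refined by gradient descent on an image reconstruction loss, and the label is then read out from the optimized code. The generator matrices, step size, and iteration count below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "pretrained" linear generator: image head A and label head B
# both read from the same latent code z (illustrative stand-in).
latent_dim, pixels = 6, 10
A = rng.normal(size=(latent_dim, pixels))   # image synthesis weights
B = rng.normal(size=(latent_dim, pixels))   # label synthesis weights

target = rng.normal(size=pixels)            # image to be segmented

# Step 1: a crude initial latent, standing in for the encoder's output.
z = 0.1 * rng.normal(size=latent_dim)
loss_before = np.sum((A.T @ z - target) ** 2)

# Step 2: test-time optimization -- refine z by gradient descent on
# the reconstruction loss ||A^T z - target||^2.
lr = 0.01
for _ in range(200):
    residual = A.T @ z - target
    z -= lr * 2 * A @ residual              # analytic gradient step

loss_after = np.sum((A.T @ z - target) ** 2)

# Step 3: the label prediction is read from the optimized latent code.
label_scores = B.T @ z
```

The design point is that only the image branch is needed to fit the latent code, so the same procedure applies to unlabeled or out-of-domain inputs; the label branch is queried only after the fit.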
Results and Comparative Analysis
The results indicate superior performance relative to existing baselines across various datasets. When trained on datasets with minimal labeled examples and abundant unlabeled examples, the proposed method achieved higher Dice scores and Jaccard (JC) indices than U-Net, DeepLab, and several semi-supervised methods, including Mean Teacher (MT), adversarial training for SSL (AdvSSL), and Guided Collaborative Training (GCT). Furthermore, the model generalizes strongly to out-of-domain datasets, surpassing the baseline models by a substantial margin.
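The Dice score and Jaccard (JC) index used in these comparisons are standard overlap metrics between a predicted mask and a reference mask; a minimal implementation for binary masks:

```python
import numpy as np

def dice_score(pred, ref):
    """Dice coefficient: 2|P intersect R| / (|P| + |R|)."""
    pred, ref = np.asarray(pred, bool), np.asarray(ref, bool)
    inter = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2.0 * inter / denom if denom else 1.0  # both empty -> perfect match

def jaccard_index(pred, ref):
    """Jaccard index: |P intersect R| / |P union R|."""
    pred, ref = np.asarray(pred, bool), np.asarray(ref, bool)
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return inter / union if union else 1.0

pred = [1, 1, 0, 0]
ref  = [1, 0, 1, 0]
print(dice_score(pred, ref))     # 2*1 / (2+2) = 0.5
print(jaccard_index(pred, ref))  # 1 / 3
```

The two metrics are monotonically related (J = D / (2 - D)), so they rank methods identically; papers typically report both for comparability across prior work.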
For the face part segmentation task, the generative model demonstrates not only competitive in-domain performance but also excels at segmenting faces in paintings and sculptures, underscoring its ability to capture semantics that transfer across visual representations.
Implications and Future Prospects
The paper challenges conventional discriminative models by proposing a generative framework that inherently facilitates semi-supervised learning, proving effective even when labeled data is scarce. Although GANs typically demand extensive training data, applying the generative model to semantic segmentation yields promising results, with potential implications for medical imaging and other fields requiring pixel-level precision.
Future work could focus on optimizing generative models for real-time segmentation, further reducing the test-time optimization cost. Augmentation strategies that improve GAN training could offer additional pathways to extending the model's robustness and applicability across a broader spectrum of datasets and tasks. Ultimately, this approach reinforces the evolving role of generative models in sophisticated image understanding and representation tasks.