- The paper introduces OASIS, a GAN model that removes the need for VGG perceptual loss by relying on adversarial supervision for semantic image synthesis.
- It leverages a segmentation-based discriminator with LabelMix regularization to provide pixel-level semantic feedback and improve structural accuracy.
- The approach yields significant improvements in FID and mIoU across benchmarks, simplifying training while enhancing image diversity through multi-modal synthesis.
Overview of "You Only Need Adversarial Supervision for Semantic Image Synthesis"
The paper "You Only Need Adversarial Supervision for Semantic Image Synthesis" by Sushko et al. addresses semantic image synthesis using Generative Adversarial Networks (GANs). The primary contribution of the work is the development of a simplified GAN model, known as OASIS, that relies solely on adversarial supervision to achieve high-quality image synthesis, eliminating the need for the VGG-based perceptual loss traditionally used to enhance image quality.
Key Contributions and Methodology
The authors propose several novel modifications to the conventional GAN architecture used for semantic image synthesis:
- Segmentation-based Discriminator: The paper redesigns the discriminator as a semantic segmentation network. Instead of the traditional global image-level real/fake classification, it is trained with an (N+1)-class pixel-wise cross-entropy loss, where N is the number of semantic classes and the extra class marks fake pixels. This provides semantically-aware, pixel-level feedback to the generator.
- LabelMix Regularization: To further strengthen the discriminator, the authors introduce LabelMix, which mixes a real and a generated image using a binary mask whose edges follow the true semantic boundaries of the label map. A consistency penalty on the discriminator's predictions for the mixed image encourages it to focus on semantic and structural content rather than low-level artifacts.
- 3D Noise Tensor for Multi-modal Synthesis: OASIS enables multi-modal image synthesis via a 3D noise tensor that can be sampled globally or locally, per semantic region. This allows controlled variation in the generated images, increasing diversity while preserving alignment with the input label maps.
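The (N+1)-class discriminator objective from the first bullet can be sketched as follows. This is a minimal numpy illustration, not the paper's PyTorch implementation; the function name and tensor layout are my own, and the paper additionally weights classes by inverse frequency, which is omitted here.

```python
import numpy as np

def n_plus_one_ce_loss(logits, label_map, n_classes, target_fake=False):
    """Sketch of an (N+1)-class pixel-wise cross-entropy loss.

    logits:    (N+1, H, W) per-pixel class scores; channel N is the extra
               "fake" class.
    label_map: (H, W) integer semantic labels in [0, N).
    For a real image each pixel's target is its semantic class; for a
    generated image every pixel's target is the fake class N.
    """
    n_plus_1, h, w = logits.shape
    assert n_plus_1 == n_classes + 1
    # numerically stable softmax over the class dimension
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)
    targets = np.full((h, w), n_classes) if target_fake else label_map
    # probability assigned to the target class at each pixel
    p_t = np.take_along_axis(probs, targets[None], axis=0)[0]
    return float(-np.log(p_t + 1e-12).mean())
```

The discriminator would minimize the sum of this loss on real images (semantic targets) and fake images (all-fake targets), while the generator maximizes the semantic-class probabilities on its own outputs.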
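The LabelMix mask construction can be sketched in a few lines. This is an illustrative numpy version under my own naming; it shows only the mask/mixing step, not the accompanying consistency penalty on the discriminator's outputs.

```python
import numpy as np

def labelmix(x_real, x_fake, label_map, rng=np.random.default_rng()):
    """Sketch of LabelMix: blend a real and a fake image with a binary
    mask that respects the semantic boundaries of the label map.

    x_real, x_fake: (C, H, W) images; label_map: (H, W) integer labels.
    Every pixel of a given semantic class comes entirely from either the
    real or the fake image, so mask edges coincide with class boundaries.
    """
    classes = np.unique(label_map)
    # randomly assign each class region to the real (1) or fake (0) image
    pick_real = {c: rng.integers(0, 2) for c in classes}
    mask = np.vectorize(pick_real.get)(label_map).astype(x_real.dtype)
    mixed = mask * x_real + (1 - mask) * x_fake
    return mixed, mask
```

Because the mask is constant inside each class region, any boundary the discriminator sees in the mixed image is a true semantic boundary, so distinguishing real from fake pixels within it requires attending to content rather than mask artifacts.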
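The global-versus-local noise sampling from the third bullet can be sketched as below. This is a hedged numpy illustration: the function name, argument names, and the 64-channel default are assumptions for the sketch; in the paper the noise tensor is concatenated channel-wise with the label map before being fed to the generator.

```python
import numpy as np

def sample_3d_noise(h, w, z_dim=64, mode="global", label_map=None,
                    region=None, rng=np.random.default_rng()):
    """Sketch of 3D noise sampling for multi-modal synthesis.

    Returns a (z_dim, h, w) noise tensor.
    "global": one noise vector z broadcast to every spatial position,
              varying the whole image between samples.
    "local":  start from a global tensor, then resample the noise only
              at pixels of one chosen semantic class, so only that
              region varies between samples.
    """
    z = rng.standard_normal(z_dim)
    noise = np.broadcast_to(z[:, None, None], (z_dim, h, w)).copy()
    if mode == "local":
        assert label_map is not None and region is not None
        z_new = rng.standard_normal(z_dim)
        sel = label_map == region          # pixels of the chosen class
        noise[:, sel] = z_new[:, None]     # resample only that region
    return noise
```

Resampling the tensor locally is what allows, say, regenerating only the "car" pixels of a scene while keeping the rest of the image fixed.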
Results and Findings
The empirical results highlight the efficacy of OASIS:
- The approach consistently improves both Fréchet Inception Distance (FID) and mean Intersection-over-Union (mIoU) on ADE20K, Cityscapes, and COCO-Stuff, surpassing the previous state of the art by roughly 6 FID and 5 mIoU points on average.
- By eliminating the computationally expensive VGG perceptual loss, OASIS simplifies the training process without sacrificing performance.
Implications and Future Work
The OASIS model makes a compelling case for revisiting the reliance on perceptual losses in GAN-based image synthesis, advocating simpler yet effective adversarial-only frameworks. By aligning generated images more precisely with semantic labels while enhancing diversity, it invites future work on applying such architectures in domains that demand high-fidelity generation under strict semantic constraints, such as simulation environments and virtual content creation.
In conclusion, this work proposes a GAN model for semantic image synthesis that simplifies the conventional framework while achieving superior results. The elimination of the VGG loss, combined with the architectural changes above, not only improves performance but also opens avenues for efficient and effective semantic synthesis in AI applications.