High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
This paper discusses an advanced method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (GANs). Traditional methods for photo-realistic image rendering are computationally expensive due to the modeling of geometry, materials, and light transport. The proposed approach aims to address these challenges by using data-driven model learning and inference, which has the potential to simplify the process of creating and editing virtual environments.
Synthesis Methodology
The core contribution of the paper is a generative framework capable of producing 2048×1024 resolution images from semantic label maps. The framework combines a new adversarial learning objective with multi-scale generator and discriminator architectures. Together, these allow high-resolution image generation without hand-crafted losses or pre-trained networks such as VGGNet (commonly used for perceptual losses).
High-Resolution Image Generation
- Coarse-to-Fine Generator Architecture:
- The generator consists of a global generator network and a local enhancer network.
- The global generator operates at 1024×512 resolution, and the local enhancer further refines the image to 2048×1024 resolution.
- The generator efficiently aggregates global and local information, producing high-quality images.
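As a rough sketch (not the authors' code), the coarse-to-fine pipeline can be illustrated with NumPy stubs. Here `global_generator` and `local_enhancer` are hypothetical placeholders for the paper's G1 and G2 networks; only the resolution flow (downsample, generate, upsample, refine) mirrors the described architecture:

```python
import numpy as np

def downsample2x(x):
    # Average-pool an (H, W, C) array by 2 in each spatial dimension.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample2x(x):
    # Nearest-neighbour upsampling by 2 in each spatial dimension.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def global_generator(label_map_lowres):
    # Placeholder for the global network G1 (identity-like stub).
    return label_map_lowres.astype(float)

def local_enhancer(label_map_fullres, g1_output_upsampled):
    # Placeholder for the enhancer G2: fuses the full-resolution input
    # with the upsampled output of G1.
    return label_map_fullres.astype(float) + g1_output_upsampled

def coarse_to_fine(label_map):
    # 1) downsample the label map, 2) run the global generator,
    # 3) upsample its output, 4) refine at full resolution.
    low = downsample2x(label_map)
    g1_out = global_generator(low)
    return local_enhancer(label_map, upsample2x(g1_out))
```

In the actual paper, G1 is trained first at the lower resolution and G2 is then appended and the two are fine-tuned jointly; the stubs above only capture the data flow.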
- Multi-Scale Discriminators:
- The framework incorporates three multi-scale discriminators that operate at different image scales.
- These discriminators help in distinguishing between real and synthesized images and guide the generator to produce globally consistent and detailed images.
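The multi-scale idea can be sketched as running the same discriminator design on an image pyramid. `toy_discriminator` below is a hypothetical stand-in for one patch discriminator; the paper's discriminators are convolutional networks, but the pyramid structure is as described:

```python
import numpy as np

def avg_pool2x(x):
    # Average-pool an (H, W) array by 2 in each dimension.
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def toy_discriminator(x):
    # Stand-in for one discriminator: returns a scalar "realness" score.
    return float(np.tanh(x.mean()))

def multiscale_scores(image, num_scales=3):
    # Evaluate a discriminator of the same design at each pyramid level:
    # the coarsest scale enforces global consistency, the finest, detail.
    scores = []
    x = image
    for _ in range(num_scales):
        scores.append(toy_discriminator(x))
        x = avg_pool2x(x)
    return scores
```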
- Improved Adversarial Loss:
- The paper introduces a feature matching loss based on the discriminator, stabilizing the training by ensuring natural image statistics at multiple scales.
- The objective function combines GAN loss and feature matching loss, which significantly enhances the quality of the generated images.
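The combined objective can be sketched as follows, with the feature matching term computed as an L1 distance between intermediate discriminator features on real and synthesized images (the weighting `lam=10.0` follows the paper's reported setting; the feature lists here are illustrative inputs):

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    # L1 distance between intermediate discriminator features extracted
    # from a real image and from a synthesized one, averaged over layers.
    total = 0.0
    for fr, ff in zip(real_feats, fake_feats):
        total += np.abs(fr - ff).mean()
    return total / len(real_feats)

def total_generator_loss(gan_losses, fm_losses, lam=10.0):
    # Full objective (sketch): sum over the k discriminators of
    # L_GAN(G, D_k) + lambda * L_FM(G, D_k).
    return sum(gan_losses) + lam * sum(fm_losses)
```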
Interactive Semantic Manipulation
The paper extends the framework to interactive visual manipulation by incorporating object instance segmentation information and proposing a method for generating diverse results:
- Instance-Level Object Segmentation:
- The inclusion of instance maps allows object manipulations such as adding/removing objects and changing object categories.
- An instance boundary map is used to capture critical object boundaries, improving the realism around object edges.
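The boundary map described above has a simple definition: a pixel is marked 1 if its instance ID differs from any of its 4-connected neighbours, and 0 otherwise. A minimal NumPy version:

```python
import numpy as np

def boundary_map(instance_ids):
    # A pixel is a boundary pixel (1) if its instance ID differs from any
    # of its 4 neighbours; all other pixels are 0.
    b = np.zeros_like(instance_ids, dtype=np.uint8)
    b[1:, :] |= (instance_ids[1:, :] != instance_ids[:-1, :]).astype(np.uint8)
    b[:-1, :] |= (instance_ids[:-1, :] != instance_ids[1:, :]).astype(np.uint8)
    b[:, 1:] |= (instance_ids[:, 1:] != instance_ids[:, :-1]).astype(np.uint8)
    b[:, :-1] |= (instance_ids[:, :-1] != instance_ids[:, 1:]).astype(np.uint8)
    return b
```

This map is concatenated with the semantic label map as extra generator/discriminator input, which is why edges between same-class objects (e.g. adjacent cars) stay sharp.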
- Instance-Level Feature Embedding:
- An encoder network is trained to derive low-dimensional feature vectors for individual instances, enabling diverse and controllable image synthesis.
- Users can interactively edit object appearances, such as changing colors and textures, providing a flexible tool for image manipulation.
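The instance-wise pooling behind this embedding can be sketched in NumPy: every pixel's feature is replaced by the mean feature of its instance, so each object is summarized by one low-dimensional vector (the encoder producing the per-pixel features is omitted here):

```python
import numpy as np

def instance_average_pool(features, instance_ids):
    # features: (H, W, C) per-pixel encoder output (illustrative input).
    # instance_ids: (H, W) integer map of object instances.
    # Replace each pixel's feature with the mean over its instance mask,
    # yielding one feature vector per object.
    pooled = np.zeros_like(features, dtype=float)
    for inst in np.unique(instance_ids):
        mask = instance_ids == inst
        pooled[mask] = features[mask].mean(axis=0)
    return pooled
```

At edit time, swapping the pooled vector of one instance for another (e.g. a different car's vector) changes that object's color or texture while leaving the rest of the image fixed.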
Quantitative and Qualitative Comparisons
Extensive evaluations demonstrate the superiority of the proposed method:
- Quantitative Analysis:
- Semantic segmentation accuracy is used as a metric: a segmentation network applied to the synthesized images recovers the input label maps nearly as accurately as it does on the original real images.
- The proposed method outperforms state-of-the-art methods in both pixel-wise accuracy and mean intersection-over-union (IoU).
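The two metrics above can both be derived from a confusion matrix; the following NumPy sketch computes pixel-wise accuracy and mean IoU over the classes present in either prediction or ground truth:

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes):
    # Build the confusion matrix: rows = ground-truth class, cols = prediction.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, t in zip(pred.ravel(), target.ravel()):
        cm[t, p] += 1
    # Pixel accuracy: fraction of correctly classified pixels.
    pixel_acc = np.diag(cm).sum() / cm.sum()
    # Per-class IoU = TP / (TP + FP + FN); mean over classes with nonzero union.
    ious = []
    for c in range(num_classes):
        union = cm[c, :].sum() + cm[:, c].sum() - cm[c, c]
        if union > 0:
            ious.append(cm[c, c] / union)
    return pixel_acc, float(np.mean(ious))
```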
- Human Perceptual Study:
- Pairwise A/B tests on Amazon Mechanical Turk reveal a substantial preference for images generated by the proposed method over those by previous methods.
- The method shows consistent improvements over competitors in producing realistic textures and details, even under limited time evaluations.
Practical and Theoretical Implications
The results indicate that conditional GANs can effectively synthesize high-resolution images suitable for various applications, including creating synthetic training data for visual recognition tasks and high-level image editing. The ability to render photo-realistic images using a data-driven approach simplifies the creation and manipulation of virtual environments.
Future Directions
Considering the promising results, future research could explore:
- Integration of domain-specific constraints to further enhance image realism.
- Expansion of the framework to other domains such as medical imaging and biological data synthesis, where high-resolution and realistic results are crucial.
- Development of interactive systems leveraging the proposed framework for real-time applications in graphics and virtual reality.
In conclusion, this work presents significant advancements in high-resolution image synthesis and semantic manipulation using conditional GANs. The proposed methodologies and results underline the potential for conditional GANs to revolutionize the process of graphics rendering and image editing, offering both practical tools and new research avenues in the domain of computer vision.