- The paper introduces a cascaded refinement network that generates photorealistic images from semantic layouts, bypassing adversarial training.
- The approach uses progressive refinement via a hierarchy of modules to ensure global consistency and high-resolution details.
- Perceptual experiments show that the CRN outperforms GAN-based baselines in rated realism and scales to higher output resolutions, setting a strong new baseline for synthesizing images from semantic layouts.
Photographic Image Synthesis with Cascaded Refinement Networks
The paper "Photographic Image Synthesis with Cascaded Refinement Networks" by Qifeng Chen and Vladlen Koltun introduces a sophisticated approach for synthesizing photographic images conditioned on semantic layouts. This method leverages a single feedforward convolutional network trained end-to-end using a direct regression loss, markedly diverging from the prevalent Generative Adversarial Network (GAN) paradigm.
Contributions
The paper presents several key contributions to the field of image synthesis:
- Approach Description: The authors propose a Cascaded Refinement Network (CRN) to generate images from pixelwise semantic layouts. A CRN progressively refines the synthesized image through a hierarchical cascade of refinement modules, ensuring both global structural consistency and high-resolution detail in the output (a code sketch of the cascade and training losses follows this list).
- Avoidance of Adversarial Training: Unlike recent works that rely predominantly on adversarial training, the CRN avoids the instability typically associated with GANs. The approach uses supervised training with a direct regression objective, a feature-matching (perceptual) loss computed on the activations of a pretrained visual perception network, and shows that this can achieve substantially more realistic results without the delicate balancing of adversarial losses.
- Scalability: The proposed CRN architecture scales smoothly to high resolutions, synthesizing 2-megapixel images (1024×2048), the full resolution of the training data.
- Diverse Image Synthesis: They extend the model to synthesize a diverse collection of images for a given semantic layout by modifying the network to output multiple images and introducing a loss function that encourages diversity within the synthesized collection.
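To make the architecture concrete, here is a minimal PyTorch-style sketch of the cascaded refinement idea. This is an illustration, not the authors' implementation: the class names (`RefinementModule`, `CRNSketch`), module widths, base resolution, and the use of `GroupNorm` as a stand-in for the paper's normalization are all assumptions chosen for exposition.

```python
# Minimal sketch of a cascaded refinement network (illustration only, not the
# authors' code). Widths, normalization, and base resolution are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefinementModule(nn.Module):
    """One refinement module: two 3x3 convolutions, each followed by
    normalization and a leaky ReLU."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        # GroupNorm(1, C) normalizes over each feature map, used here as a
        # stand-in for the normalization described in the paper.
        self.norm1 = nn.GroupNorm(1, out_channels)
        self.norm2 = nn.GroupNorm(1, out_channels)

    def forward(self, x):
        x = F.leaky_relu(self.norm1(self.conv1(x)), 0.2)
        return F.leaky_relu(self.norm2(self.conv2(x)), 0.2)


class CRNSketch(nn.Module):
    """Cascade of refinement modules; each stage doubles the working resolution
    and receives the downsampled layout plus the upsampled previous features."""

    def __init__(self, num_classes, base_res=(4, 8),
                 widths=(1024, 1024, 512, 256, 128, 64)):
        super().__init__()
        self.base_res = base_res
        stages = []
        prev = 0  # the first module sees only the downsampled layout
        for w in widths:
            stages.append(RefinementModule(num_classes + prev, w))
            prev = w
        self.stages = nn.ModuleList(stages)
        self.to_rgb = nn.Conv2d(widths[-1], 3, 1)  # 3*k channels for k outputs

    def forward(self, layout):
        # layout: (N, num_classes, H, W) one-hot semantic map at full resolution
        h, w = self.base_res
        feat = None
        for stage in self.stages:
            layout_i = F.interpolate(layout, size=(h, w), mode='nearest')
            if feat is None:
                x = layout_i
            else:
                feat = F.interpolate(feat, size=(h, w), mode='bilinear',
                                     align_corners=False)
                x = torch.cat([layout_i, feat], dim=1)
            feat = stage(x)
            h, w = h * 2, w * 2
        # Output resolution is base_res * 2**(len(widths) - 1); adding stages
        # grows the cascade toward 1024x2048 as in the paper.
        return self.to_rgb(feat)
```

The training objective can be sketched in the same spirit: an L1 feature-matching (perceptual) loss over activations of a pretrained network, plus a simplified best-of-k diversity term. The function names, layer weights, and the coarse min-over-k formulation are assumptions; the paper describes a finer per-class variant of the diversity loss.

```python
def feature_matching_loss(extract_features, fake, real, weights):
    """`extract_features` is any callable returning a list of feature maps
    (e.g. selected activations of a pretrained network) for an image batch."""
    loss = fake.new_zeros(())
    for wt, f_fake, f_real in zip(weights, extract_features(fake),
                                  extract_features(real)):
        loss = loss + wt * (f_fake - f_real).abs().mean()
    return loss


def diversity_loss(candidates, real, per_image_loss):
    """Best-of-k loss: only the closest of the k candidate images is penalized,
    which pushes the k outputs apart (coarse version of the paper's idea)."""
    losses = torch.stack([per_image_loss(c, real) for c in candidates])
    return losses.min()
```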
Experimental Evaluation
The authors conduct extensive perceptual experiments to compare their CRN approach against several baselines, including:
- GAN combined with semantic segmentation.
- A high-capacity full-resolution feedforward network.
- A U-Net encoder-decoder architecture.
- The image-to-image translation approach by Isola et al.
The evaluation employs pairwise A/B tests on Amazon Mechanical Turk, which compare the realism of synthesized images. The results are compelling. The CRN-generated images are consistently rated more realistic than those produced by the baselines, with statistical significance (p < 10⁻³). In pairwise comparisons on the Cityscapes dataset, the CRN images were rated more realistic than images from the approach by Isola et al. in 97% of cases. On the NYU indoor scene dataset, the results showed a similar trend.
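As an aside on the statistics, a preference rate this lopsided clears such a threshold under a simple binomial test against a 50/50 null. The snippet below is a hypothetical illustration of that check; the trial count is made up and is not the study's actual sample size.

```python
# Hypothetical check that a pairwise preference rate clears p < 1e-3 under a
# one-sided binomial test against 50/50 guessing. The number of comparisons
# is illustrative only.
from scipy.stats import binomtest

n_comparisons = 200                              # made-up trial count
n_prefer_crn = round(0.97 * n_comparisons)       # 97% preference rate
result = binomtest(n_prefer_crn, n_comparisons, p=0.5, alternative='greater')
print(f"p-value = {result.pvalue:.2e}")          # far below 1e-3
```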
Additionally, timed comparisons demonstrated that even with very brief exposure times (as low as 1/8 of a second), observers could distinguish real images from those synthesized by the GAN-based approaches, but not from those synthesized by the CRN. This suggests that the CRN-generated images contain convincingly realistic high-frequency detail even at a glance.
Implications
The practical implications of this research are substantial:
- Computer Graphics: This method could streamline the content creation process in computer graphics, potentially substituting complex 3D modeling and rendering pipelines with direct image synthesis from semantic layouts.
- AI Planning and Cognition: Given the role of visual simulation in human cognition, synthesis of high-quality imagery could enhance artificial intelligence systems involved in planning and decision-making.
The theoretical implications include validation that direct regression-based synthesis can surpass adversarial approaches in realism, challenging the current dominance of GANs in image synthesis tasks. Furthermore, the ability to synthesize diverse sets of images from a given semantic layout introduces new opportunities for modeling and simulation applications in AI research.
Future Directions
The authors acknowledge that while their approach significantly advances the realism of synthesized images, the results are not yet indistinguishable from real HD images. Future work could focus on:
- Improving Photorealism: Further refinement of the network architecture and loss functions to close the gap between synthesized and real images.
- Generalization to Other Domains: Extending the method to other types of visual content beyond urban and indoor scenes.
- Efficiency: Optimizing the model to reduce the computational resources required, thereby facilitating broader adoption and real-time applications.
In conclusion, the paper presents a robust and scalable approach to image synthesis that bypasses the pitfalls of adversarial training, offering a viable alternative within the domain of photorealistic image generation. This work not only sets a new benchmark in the field but also opens up exciting avenues for future research and practical applications in computer graphics and artificial intelligence.