Learning to Predict Layout-to-Image Conditional Convolutions for Semantic Image Synthesis
Semantic image synthesis aims to generate photorealistic images from semantic layouts, and existing approaches predominantly rely on conditional generative adversarial networks (GANs) to reach state-of-the-art results. This paper, by Xihui Liu et al., argues that a key limitation of these methods is their reliance on convolution kernels shared across all spatial locations, and it proposes kernels that adapt to the distinct semantic labels present at different positions within the image.
Key Contributions and Methodology
- Conditional Convolutional Kernel Prediction: Traditional approaches either feed the semantic label map directly into the generator or use it to modulate activations through affine transformations in normalization layers (as in SPADE). The paper argues that relying solely on translation-invariant convolutional kernels is inadequate, since it ignores the varying semantic content at different locations. It therefore predicts convolutional kernels conditioned on the semantic label map, letting the layout control image generation more directly and effectively (see the first sketch after this list).
- Feature Pyramid Semantics-Embedding Discriminator: To improve the fidelity of generated images, the authors introduce a discriminator that judges both fine detail and semantic alignment with the input layout. Unlike prior multi-scale patch-based discriminators, this feature pyramid semantics-embedding discriminator exploits multi-scale feature pyramids to promote sharper textures and spatially aligned semantics (a projection-style sketch follows the list).
- Depthwise Separable Convolutions: Naively predicting a full convolutional kernel for every spatial location would require a prohibitive number of predicted parameters, so the authors predict depthwise separable convolutions instead. This drastically reduces the prediction cost without sacrificing the layout's control over the generation process; the first sketch below reflects this design.
- Empirical Evaluation: The proposed CC-FPSE framework outperforms prior methods such as pix2pixHD and SPADE on Cityscapes, COCO-Stuff, and ADE20K, as measured by mean Intersection-over-Union (mIoU) between input layouts and segmentations of the generated images, and by Fréchet Inception Distance (FID) scores (a toy mIoU helper appears below).
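To make the kernel-prediction idea concrete, here is a minimal PyTorch sketch of a spatially varying depthwise separable convolution whose per-location kernels are predicted from the semantic label map. It is a simplified illustration of the technique under stated assumptions, not the authors' exact module; the class name `ConditionalDepthwiseConv` and the layer sizes (e.g., the 128-channel hidden layer) are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDepthwiseConv(nn.Module):
    """Spatially varying depthwise separable convolution whose kernels
    are predicted from the semantic label map. A sketch of the paper's
    conditional-convolution idea, not the authors' exact module."""

    def __init__(self, channels, num_classes, kernel_size=3):
        super().__init__()
        self.channels = channels
        self.k = kernel_size
        # Small network mapping the label map to one k*k depthwise
        # kernel per channel and per spatial location.
        self.weight_pred = nn.Sequential(
            nn.Conv2d(num_classes, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, channels * kernel_size * kernel_size, 3, padding=1),
        )
        # A shared pointwise (1x1) convolution completes the
        # depthwise-separable pair.
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x, label_map):
        n, c, h, w = x.shape
        # Predict kernels at the feature resolution: (N, C*k*k, H, W).
        kernels = self.weight_pred(
            F.interpolate(label_map, size=(h, w), mode="nearest"))
        kernels = kernels.view(n, c, self.k * self.k, h * w)
        # im2col: each column holds the k*k neighborhood of one location.
        patches = F.unfold(x, self.k, padding=self.k // 2)  # (N, C*k*k, H*W)
        patches = patches.view(n, c, self.k * self.k, h * w)
        # Per-location depthwise filtering, then the shared 1x1 mixing.
        out = (kernels * patches).sum(dim=2).view(n, c, h, w)
        return self.pointwise(out)

# Example: 64-channel features, a 35-class layout (e.g., Cityscapes).
x = torch.randn(2, 64, 32, 32)
layout = torch.randn(2, 35, 256, 256)  # stand-in for a one-hot label map
out = ConditionalDepthwiseConv(64, 35)(x, layout)  # -> (2, 64, 32, 32)
```

Predicting only a depthwise kernel plus a shared pointwise convolution is what keeps the prediction head small: the network outputs `C*k*k` values per location rather than `C*C*k*k`.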
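The discriminator side can be sketched similarly. The snippet below shows one pyramid level of a hypothetical semantics-embedding head: a patchwise realism score plus a projection-style inner product between discriminator features and an embedding of the label map, which rewards semantic alignment. This is an assumed simplification of the paper's design, not its exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticsEmbeddingHead(nn.Module):
    """One pyramid level of a semantics-embedding discriminator head.
    A sketch; the name and layer sizes are assumptions, not the paper's code."""

    def __init__(self, feat_channels, num_classes):
        super().__init__()
        self.patch_score = nn.Conv2d(feat_channels, 1, 1)  # patch realism
        self.label_embed = nn.Conv2d(num_classes, feat_channels, 1)

    def forward(self, feat, label_map):
        # Resize the layout to this pyramid level's resolution.
        lab = F.interpolate(label_map, size=feat.shape[2:], mode="nearest")
        # Unconditional per-patch realism score: (N, 1, H, W).
        score = self.patch_score(feat)
        # Projection term: per-location inner product between features
        # and the embedded layout, rewarding semantic alignment.
        proj = (self.label_embed(lab) * feat).sum(dim=1, keepdim=True)
        return score + proj

# Example: score a 256-channel feature map against an 8-class layout.
head = SemanticsEmbeddingHead(256, 8)
s = head(torch.randn(2, 256, 16, 16), torch.randn(2, 8, 128, 128))
```

Applying such a head at several pyramid levels lets coarse levels enforce layout consistency while fine levels penalize texture artifacts.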
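On evaluation: mIoU is obtained by segmenting the generated images with a pretrained network and comparing the result to the input layout, while FID compares Inception-feature statistics of real and generated image sets. A minimal, hypothetical helper for the IoU-averaging step might look like this:

```python
import torch

def mean_iou(pred, target, num_classes):
    """Mean IoU between predicted and ground-truth label maps of shape
    (N, H, W). A hypothetical helper for illustration; in practice `pred`
    comes from a pretrained segmentation network run on generated images."""
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum()
        union = ((pred == c) | (target == c)).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter.float() / union.float())
    return torch.stack(ious).mean()

# Example with random 8-class label maps.
pred = torch.randint(0, 8, (2, 64, 64))
target = torch.randint(0, 8, (2, 64, 64))
print(mean_iou(pred, target, 8))
```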
Implications and Future Directions
The combination of conditional convolutions and a feature pyramid discriminator is a meaningful step forward for semantic image synthesis. By letting the semantic layout explicitly control the generation process, the method produces more accurate and contextually consistent results, and its robustness on complex scenes with many object categories underscores its practical value for tasks that demand high-quality, semantically coherent image generation.
Future work could explore finer granularity in kernel prediction or integrate attention mechanisms more deeply to handle even more nuanced synthesis tasks. Extending these ideas to three-dimensional data or other modalities is another intriguing avenue for generative modeling research.
In conclusion, this research marks a significant advance in leveraging semantic layouts for image synthesis, providing a foundation for subsequent innovations and applications in AI-driven visual content generation.