- The paper presents SPADE, a novel normalization technique that modulates activations with spatially-varying transformations learned from the input semantic layout, improving image synthesis.
- Experimental results show a significant boost in mIoU and visual fidelity on datasets like COCO-Stuff, ADE20K, and Cityscapes.
- User studies confirm that images generated with SPADE surpass competitors in realism and adherence to input segmentation masks.
Semantic Image Synthesis with Spatially-Adaptive Normalization
The paper "Semantic Image Synthesis with Spatially-Adaptive Normalization" presents a novel approach to conditional image synthesis by introducing a new type of normalization layer, termed as SPatially-Adaptive (DE)normalization (SPADE). Traditional methods that feed input semantic layouts directly into deep networks often struggle because the normalization layers can wash away the semantic information. SPADE addresses this by modulating the activations in the normalization layers through transformations learned from the input semantic layouts, thereby preserving the vital information of the input throughout the network.
Key Contributions
The primary methodological contribution of this paper is the design and implementation of the SPADE layer, which improves both the fidelity of generated images and their alignment with the corresponding semantic layouts. This is achieved through:
- Spatially-Adaptive Normalization: Whereas conventional normalization layers apply the same learned affine parameters at every spatial location, SPADE predicts a location-dependent affine transformation from the input segmentation mask, preserving semantic information as activations flow through the network.
- Integration into Generative Networks: By replacing the normalization layers of the generator with SPADE-based blocks, the authors substantially improve the synthesis of photorealistic images from semantic segmentation masks, even on diverse and challenging datasets (a sketch of such a block follows this list).
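As a rough illustration of how SPADE is embedded in the generator, the sketch below wraps the layer defined earlier in a residual block that normalizes with SPADE before each convolution. The channel arithmetic, activation function, and learned skip connection are assumptions for readability, not the paper's exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADEResBlock(nn.Module):
    """Residual block using SPADE (defined above) in place of ordinary
    normalization; layer sizes here are illustrative."""
    def __init__(self, in_ch, out_ch, num_labels):
        super().__init__()
        mid = min(in_ch, out_ch)
        self.norm1 = SPADE(in_ch, num_labels)
        self.conv1 = nn.Conv2d(in_ch, mid, kernel_size=3, padding=1)
        self.norm2 = SPADE(mid, num_labels)
        self.conv2 = nn.Conv2d(mid, out_ch, kernel_size=3, padding=1)
        # Learn the skip path only when the channel count changes.
        self.learn_skip = in_ch != out_ch
        if self.learn_skip:
            self.norm_s = SPADE(in_ch, num_labels)
            self.conv_s = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x, segmap):
        skip = x
        if self.learn_skip:
            skip = self.conv_s(self.norm_s(x, segmap))
        # The segmentation map conditions every normalization step.
        h = self.conv1(F.leaky_relu(self.norm1(x, segmap), 0.2))
        h = self.conv2(F.leaky_relu(self.norm2(h, segmap), 0.2))
        return skip + h
```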
Experimental Results
The authors conducted extensive experiments on several complex datasets including COCO-Stuff, ADE20K, and Cityscapes. The model was benchmarked against leading semantic image synthesis methods like pix2pixHD, cascaded refinement networks, and semi-parametric methods.
- Quantitative Performance: SPADE exhibited superior performance across all evaluated datasets. For instance, on the COCO-Stuff dataset it achieved a mean Intersection-over-Union (mIoU) of 37.4, a large improvement over previous methods (the metric itself is sketched after this list).
- Visual Quality: The method produced images with considerably better visual fidelity and fewer artifacts. The synthesized images more accurately adhered to the input segmentation masks.
- User Studies: Human evaluations conducted via Amazon Mechanical Turk confirmed that synthesized images from SPADE were generally preferred over those from competitors, further underscoring its effectiveness.
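For reference, the mIoU scores above are obtained by running an off-the-shelf semantic segmentation network on each synthesized image and comparing its prediction against the input mask. The toy NumPy function below shows only the metric; the segmentation network that produces `pred_labels` is assumed.

```python
import numpy as np

def mean_iou(pred_labels, true_labels, num_classes):
    """Toy mIoU over integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        pred_c = pred_labels == c
        true_c = true_labels == c
        union = np.logical_or(pred_c, true_c).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        inter = np.logical_and(pred_c, true_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```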
Baseline Comparisons and Ablation Studies
Several ablation studies were conducted to isolate the impact of the SPADE layers:
- Baselines & Variants: The authors introduced a robust baseline, pix2pixHD++, which integrates several performance-enhancing techniques aside from SPADE. Comparisons showed that integrating SPADE led to significant performance enhancements even in strong baseline models.
- Normalization Layers: Variants of the conditioning and normalization layers were also tested; SPADE consistently outperformed them, indicating that the spatially-adaptive modulation itself, rather than extra conditioning alone, drives the improvement (a spatially-uniform counterpart is sketched below for contrast).
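To illustrate the kind of contrast such an ablation draws, the sketch below shows a spatially-uniform conditional normalization: the scale and bias are per-channel vectors predicted from a pooled encoding of the label map, so every location receives the same modulation. This is an illustrative stand-in, not a reimplementation of the paper's exact ablation baselines.

```python
import torch.nn as nn

class UniformConditionalNorm(nn.Module):
    """Spatially-uniform counterpart to SPADE: one (gamma, beta) pair per
    channel, shared across all pixels. Layer sizes are illustrative."""
    def __init__(self, num_features, num_labels, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(num_labels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * num_features),
        )

    def forward(self, x, segmap):
        # Pool the one-hot mask into a global label histogram (a simplification).
        cond = segmap.float().mean(dim=(2, 3))
        gamma, beta = self.mlp(cond).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # Same modulation at every spatial location, unlike SPADE.
        return self.norm(x) * (1 + gamma) + beta
```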
Practical and Theoretical Implications
Practical Applications:
- Content Generation & Editing: The technique is well suited to content creation, where precise control over the output is essential, and to image editing, where regions of an image must be regenerated from semantic labels.
- Semantic Manipulation: Users can guide the synthesis process to produce diverse outputs from the same semantic input, which is valuable when multiple stylistic variations are needed from a single segmentation mask (see the sampling sketch below).
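One way such diversity is typically obtained is by resampling a latent style code for each draw while keeping the segmentation map fixed, as in the sketch below. The generator interface `generator(z, segmap)` and the latent dimension are hypothetical placeholders for whatever style-conditioned SPADE generator is used.

```python
import torch

def sample_variations(generator, segmap, num_samples=4, z_dim=256):
    """Render one layout several times with freshly sampled style codes.

    `generator` is a hypothetical style-conditioned SPADE generator taking
    (z, segmap); `z_dim` is an illustrative choice.
    """
    images = []
    with torch.no_grad():
        for _ in range(num_samples):
            z = torch.randn(segmap.size(0), z_dim)  # new style code per draw
            images.append(generator(z, segmap))
    return images
```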
Theoretical Contributions:
- Normalization Techniques: SPADE introduces a shift in how normalization layers can be utilized, particularly in synthesis tasks that depend heavily on retaining semantic information. This can influence future designs of normalization layers in other conditional generation tasks.
- Semantic Information Propagation: It demonstrates the importance of propagating semantic information efficiently and provides a methodology to do so within deep network architectures.
Future Developments
The success of the SPADE layer opens several avenues for future research:
- Broader Applications: Exploring the application of SPADE in other conditional generation tasks and evaluating its effectiveness in domains beyond image synthesis.
- Enhanced Control: Developing more advanced control mechanisms that allow users to more intuitively manipulate the generation process, potentially through interactive interfaces.
- Integration with New Architectures: Combining SPADE with other emerging techniques in generative networks to further enhance image quality and fidelity.
In conclusion, spatially-adaptive normalization represents a significant advance in semantic image synthesis, providing a robust foundation for further research and for practical applications in AI-driven content creation and editing.