- The paper introduces LGGAN, a dual-generation strategy that integrates local class-specific and global image-level GANs for enhanced semantic scene generation.
- It employs specialized sub-generators tied to semantic masks alongside a pixel-level fusion module, improving the rendering of small objects and intricate details.
- The approach achieves state-of-the-art results on benchmarks such as Cityscapes and ADE20K, with clear improvements in mIoU, pixel accuracy, and FID.
An Examination of LGGAN: Local Class-Specific and Global Image-Level GANs for Semantic-Guided Scene Generation
The paper "Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation" presents an innovative approach for semantic-guided scene generation, which addresses challenges inherent in existing global image-level generation techniques. This research introduces a dual-focus GAN architecture, designated as LGGAN, integrating both local class-specific generation and global image-level synthesis to enhance the fidelity of generated scenes, particularly in handling small-scale objects and intricate details.
Overview of the Approach
The central innovation of this work lies in its dual generative network strategy. The LGGAN architecture consists of separate local class-specific generators, each capable of producing high-quality renderings of an individual semantic class, alongside a global image-level generator that captures the overarching structure of the scene. This design aims to mitigate the shortcomings of purely global image-level generation, particularly its difficulty in rendering fine details and small objects, by using the semantic map to guide class-specific feature learning.
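As a concrete illustration of how a semantic layout can drive class-specific processing, the sketch below converts an integer label map into per-class binary masks that can later gate class-specific branches. This is a minimal example, not the paper's code; the function name, tensor shapes, and class count are illustrative assumptions.

```python
import torch

def label_map_to_class_masks(label_map: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Convert an integer label map (B, H, W) into per-class binary masks (B, C, H, W).

    Channel c is 1 where a pixel belongs to semantic class c and 0 elsewhere,
    so it can later gate features or outputs of a class-specific branch.
    (Illustrative helper; not taken from the LGGAN codebase.)
    """
    b, h, w = label_map.shape
    masks = torch.zeros(b, num_classes, h, w, device=label_map.device)
    masks.scatter_(1, label_map.unsqueeze(1), 1.0)
    return masks

# Toy layout with three classes (e.g. road, car, sky).
toy_labels = torch.tensor([[[0, 0, 1, 2],
                            [0, 1, 1, 2]]])
print(label_map_to_class_masks(toy_labels, num_classes=3).shape)  # torch.Size([1, 3, 2, 4])
```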
The local generation component of the architecture employs multiple specialized sub-generators that each focus on distinct semantic classes, informed by semantic mask filtering. This is complemented by a global generator, which synthesizes a holistic scene layout. Notably, a classification module is introduced to enhance feature discrimination across classes, thereby improving the granularity of local textures.
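Building on such per-class masks, the sketch below shows one plausible way to organize mask-filtered, class-specific heads over a shared feature map. The channel widths, head design, and module names are assumptions made for illustration; the paper's exact architecture, including its classification-based feature learning module, is not reproduced here.

```python
import torch
import torch.nn as nn

class LocalClassGenerators(nn.Module):
    """Illustrative local branch: one small convolutional head per semantic class.

    Each head reads the shared features, but its RGB output is kept only inside
    the region of its own class (semantic mask filtering); the per-class outputs
    are then summed into a single local image. This is a sketch, not the paper's
    exact design.
    """
    def __init__(self, feat_channels: int, num_classes: int):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(feat_channels, 64, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, 3, kernel_size=3, padding=1),
                nn.Tanh(),
            )
            for _ in range(num_classes)
        ])

    def forward(self, shared_feats: torch.Tensor, class_masks: torch.Tensor) -> torch.Tensor:
        # shared_feats: (B, F, H, W); class_masks: (B, C, H, W) binary masks.
        local_image = torch.zeros(
            shared_feats.size(0), 3, shared_feats.size(2), shared_feats.size(3),
            device=shared_feats.device)
        for c, head in enumerate(self.heads):
            # Each class contributes only inside its own semantic region.
            local_image = local_image + head(shared_feats) * class_masks[:, c:c + 1]
        return local_image
```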
Joint Optimization and Architecture Synergy
Crucially, LGGAN incorporates a pixel-level fusion weight map generator to blend the local and global outputs, combining the detail-oriented local generation with the structuring capability of the global generator. A dual-discriminator design further supports training, enforcing fidelity in both the image and the semantic space.
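A minimal sketch of such pixel-level weighted fusion is given below. The two-way softmax over a predicted weight map captures the general idea of a learned convex combination at each pixel; the layer sizes and the choice of input to the weight predictor are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PixelFusion(nn.Module):
    """Blend a global and a local image with a learned per-pixel weight map.

    A small network predicts two weight channels; a softmax across them yields
    convex combination weights at every pixel, so each location can rely on
    whichever branch renders it better. (Illustrative sketch only.)
    """
    def __init__(self, feat_channels: int):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(feat_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, kernel_size=3, padding=1),
        )

    def forward(self, feats, global_img, local_img):
        weights = torch.softmax(self.weight_net(feats), dim=1)  # (B, 2, H, W)
        return weights[:, 0:1] * global_img + weights[:, 1:2] * local_img
```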
Comprehensive evaluations demonstrate that LGGAN achieves state-of-the-art results across multiple datasets, including Cityscapes and ADE20K. Notably, it improves markedly over prior methods in mean Intersection-over-Union (mIoU), pixel accuracy, and Fréchet Inception Distance (FID), underscoring its effectiveness in scene generation tasks.
Experimental Results and Implications
Experimental analysis highlights LGGAN's superior performance in generating detailed and plausible scenes. For instance, in cross-view image translation tasks, it outperforms several contemporary approaches, including SelectionGAN, by notable margins in top-1 and top-5 image retrieval accuracies, PSNR, and SSIM metrics.
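For reference, the PSNR and SSIM figures used in such comparisons can be reproduced with standard library implementations; the snippet below uses scikit-image and assumes uint8 RGB arrays of matching shape.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def psnr_ssim(real: np.ndarray, fake: np.ndarray):
    """Compute PSNR and SSIM between a real and a generated uint8 RGB image."""
    psnr = peak_signal_noise_ratio(real, fake, data_range=255)
    ssim = structural_similarity(real, fake, channel_axis=-1, data_range=255)
    return psnr, ssim

# Random images only to demonstrate the call signature.
real = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
fake = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(psnr_ssim(real, fake))
```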
In the field of semantic image synthesis, quantitative results confirm LGGAN's capability to generate more nuanced textures and structures. For example, in comparisons with GauGAN and other leading methods, LGGAN demonstrates improved mIoU and FID scores, indicating better congruence between generated outputs and real-world semantics.
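The mIoU scores reported in these comparisons are typically obtained by running a pretrained segmentation network on the generated images and comparing its predictions to the input layout. The confusion-matrix computation itself looks roughly like the following sketch (NumPy; the class count and toy labels are placeholders).

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union from flat integer label arrays.

    pred/target: 1-D arrays of class indices (e.g. flattened segmentation maps).
    Classes that never appear in either array are ignored in the mean.
    """
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - intersection
    ious = intersection[union > 0] / union[union > 0]
    return float(ious.mean())

# Toy example with three classes.
target = np.array([0, 0, 1, 1, 2, 2])
pred   = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, num_classes=3))  # 0.5
```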
Future Directions
The approach opens several avenues for future research. First, integrating more sophisticated attention mechanisms could further enhance the selective feature focus of the local class-specific generators. Second, exploring adaptive learning rates for the different class-specific generators may improve training efficiency, especially on datasets with imbalanced class distributions. Third, extending LGGAN to three-dimensional scene generation and exploring its use in real-time applications could broaden its practical impact.
Conclusion
In summary, the proposed LGGAN framework represents a significant advancement in semantic-guided scene generation, offering a robust method for generating high-quality, detailed scene images that are aligned with semantic input. The introduction of class-specific local generation alongside global image integration provides a valuable blueprint for future advances in complex scene synthesis tasks.