- The paper introduces a two-stage GAN framework (StackGAN-v1) and a multi-stage generalization (StackGAN-v2) that synthesize high-resolution, photo-realistic images in both conditional (text-to-image) and unconditional settings.
- It leverages a novel conditioning augmentation technique that improves training stability and boosts sample diversity by allowing small perturbations in the latent conditioning space.
- Quantitative evaluations on datasets like CUB, Oxford-102, and COCO show improved Inception Scores and Fréchet Inception Distances compared to previous methods.
Realistic Image Synthesis using Stacked Generative Adversarial Networks
The paper discusses advanced techniques for generating high-resolution, photo-realistic images using Stacked Generative Adversarial Networks (StackGANs). The research introduces two models: the initial StackGAN-v1 for text-to-image synthesis, and an enhanced StackGAN-v2 for more general tasks, including both conditional and unconditional image generation.
Overview of StackGAN-v1
StackGAN-v1 addresses challenges in generating high-quality images by decomposing the generation process into two stages:
- Stage-I GAN: The first stage generates low-resolution images (64x64 pixels) that capture the primitive shape and basic colors of the object according to the input text description. The inputs include a noise vector and conditioning variables derived from the text.
- Stage-II GAN: This stage takes the output from Stage-I, refines it, and generates a high-resolution image (256x256 pixels). It corrects defects in the low-resolution image and adds details from the text description not captured in the initial stage.
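The two-stage decomposition can be sketched as a simple data-flow pipeline. This is a minimal illustration, not the paper's implementation: the generators are stand-ins that only track output resolution, and the noise/embedding dimensions (100 and 128) are assumed for the example.

```python
import random

def stage1_generator(noise, text_cond):
    """Hypothetical Stage-I: noise + text conditioning -> a low-resolution
    64x64 sketch. Here an 'image' is represented only by its shape."""
    assert noise and text_cond
    return (64, 64)

def stage2_generator(low_res, text_cond):
    """Hypothetical Stage-II: refine the Stage-I result into a 256x256
    image, conditioned again on the same text embedding so that details
    missed in Stage-I can be added."""
    assert low_res == (64, 64)
    return (256, 256)

# Toy run with assumed dimensions: 100-dim noise, 128-dim text conditioning.
z = [random.gauss(0.0, 1.0) for _ in range(100)]
c = [random.gauss(0.0, 1.0) for _ in range(128)]
low = stage1_generator(z, c)
high = stage2_generator(low, c)
print(low, high)  # (64, 64) (256, 256)
```

Note that the text conditioning is fed to both stages: Stage-II does not merely upsample, it re-reads the description while refining.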
The conditioning augmentation (CA) technique is introduced to improve the GAN training stability and sample diversity. CA allows small perturbations in the latent conditioning space, encouraging smoother transitions and more robust training dynamics.
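The core of conditioning augmentation is to sample the conditioning vector from a Gaussian whose mean and variance are derived from the text embedding, regularized by a KL term toward the standard normal. The sketch below assumes `mu` and `log_sigma` are given directly; in the paper they come from a small network over the text embedding.

```python
import math
import random

def conditioning_augmentation(mu, log_sigma):
    """Sample c_hat ~ N(mu, sigma^2) with the reparameterization trick,
    so small perturbations of the text conditioning are explored."""
    return [m + math.exp(ls) * random.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, sigma^2) || N(0, I) ): the regularizer added to the
    generator loss to keep the conditioning manifold smooth."""
    return 0.5 * sum(m * m + math.exp(2.0 * ls) - 2.0 * ls - 1.0
                     for m, ls in zip(mu, log_sigma))

mu, log_sigma = [0.5, -0.2], [0.0, -1.0]
c_hat = conditioning_augmentation(mu, log_sigma)
print(len(c_hat), kl_to_standard_normal(mu, log_sigma))
```

When `mu = 0` and `log_sigma = 0` the KL term vanishes, which is why the regularizer pulls the conditioning distribution toward N(0, I) without forcing it there.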
Evaluation of StackGAN-v1
Experiments demonstrate that StackGAN-v1 significantly outperforms previous state-of-the-art methods (e.g., GAN-INT-CLS, GAWWN) across various datasets including CUB, Oxford-102, and COCO. The paper reports superior performance in terms of Inception Score (IS) and Fréchet Inception Distance (FID), as shown in Table~\ref{tab:cmp_previous}. The qualitative results also show clearer and more detailed images generated by StackGAN-v1 compared to existing methods (Figures~\ref{fig:cmp_previous} and \ref{fig:cmp_previous_flower}).
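For context on the metric, the Inception Score is IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ), where p(y|x) is an Inception network's class posterior for a generated image and p(y) the marginal over all samples. A minimal sketch, with the class probabilities supplied directly rather than produced by an Inception network:

```python
import math

def inception_score(pred_probs):
    """IS = exp of the mean KL divergence between each per-image class
    distribution and the marginal distribution over all images."""
    n = len(pred_probs)
    k = len(pred_probs[0])
    marginal = [sum(p[j] for p in pred_probs) / n for j in range(k)]
    kl_mean = sum(
        sum(pj * math.log(pj / mj) for pj, mj in zip(p, marginal) if pj > 0)
        for p in pred_probs
    ) / n
    return math.exp(kl_mean)

# Confident AND diverse predictions score high; uniform predictions score 1.
confident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
print(inception_score(confident), inception_score(uniform))
```

Higher IS rewards images that are individually recognizable yet collectively diverse, which is why it is paired with FID (a distance between real and generated feature statistics, where lower is better).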
Overview of StackGAN-v2
StackGAN-v2 extends the capabilities of StackGAN-v1 by:
- Using a multi-stage architecture where each stage approximates the data distribution at a different resolution. This includes a hierarchical structure with multiple generators and discriminators.
- Jointly approximating conditional and unconditional image distributions, which enhances robustness and training stability. Each discriminator is trained on both objectives, leveraging the complementary relationship between judging realism alone and judging image-text consistency.
- Introducing a color-consistency regularization term that keeps colors coherent across stages, which is particularly important for unconditional image generation.
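The color-consistency idea can be sketched as matching the mean and covariance of pixel colors between a generator's low- and high-resolution outputs. This is a simplified stand-in: images are flat lists of RGB tuples, and the covariance weight `lam` is an assumed hyperparameter of the sketch.

```python
def pixel_stats(pixels):
    """Per-channel mean and 3x3 covariance of a list of (r, g, b) pixels."""
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in pixels) / n
            for b in range(3)] for a in range(3)]
    return mean, cov

def color_consistency_loss(low_pixels, high_pixels, lam=5.0):
    """Penalize divergence between the color statistics of the low- and
    high-resolution samples generated from the same input."""
    m1, c1 = pixel_stats(low_pixels)
    m2, c2 = pixel_stats(high_pixels)
    mean_term = sum((a - b) ** 2 for a, b in zip(m1, m2))
    cov_term = sum((c1[i][j] - c2[i][j]) ** 2
                   for i in range(3) for j in range(3))
    return mean_term + lam * cov_term

img = [(0.2, 0.4, 0.6), (0.8, 0.4, 0.1), (0.5, 0.5, 0.5)]
print(color_consistency_loss(img, img))  # identical statistics -> 0.0
```

Because only summary statistics are matched, the high-resolution stage stays free to add structure and detail while keeping the overall color palette of the coarser stage.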
Evaluation and Comparison of StackGAN-v1 and StackGAN-v2
Quantitative comparisons between StackGAN-v1 and StackGAN-v2 reveal that the latter consistently yields better performance in terms of FID and IS on most datasets (Table~\ref{tab:cmp_v1_v2}). Visual comparisons through t-SNE embeddings further illustrate that StackGAN-v2 does not suffer from mode collapse issues and maintains better sample diversity and quality (Figure~\ref{fig:tsne}).
Component Analysis
The paper also conducts a thorough component analysis of StackGAN-v1 and StackGAN-v2:
- Conditioning Augmentation: Significantly improves training stability and sample diversity by introducing variability into the latent space (Figure~\ref{fig:Gaussian}).
- Stage-wise Improvements: The two-stage approach of StackGAN-v1 is critical, with Stage-II refining and adding details missing from Stage-I (Figure~\ref{fig:lr2hr}).
- Multi-scale Generators: StackGAN-v2's multi-scale architecture shows the importance of generating images at progressively higher resolutions to capture details incrementally (Table~\ref{tab:baseline}).
- Joint Conditional and Unconditional Distributions: Jointly approximating both distributions enhances the model's ability to generate more varied and realistic samples (Figure~\ref{fig:multiD}).
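The joint objective in the last point can be illustrated as a discriminator loss with two terms: an unconditional term (is the image real?) and a conditional term (does the image match the text?). A minimal sketch assuming the discriminator's sigmoid outputs are given directly:

```python
import math

def joint_discriminator_loss(d_uncond_real, d_uncond_fake,
                             d_cond_real, d_cond_fake):
    """Sum of an unconditional binary cross-entropy term (real vs. fake
    image) and a conditional one (real vs. fake image-text pair).
    All inputs are discriminator outputs in the open interval (0, 1)."""
    uncond = -(math.log(d_uncond_real) + math.log(1.0 - d_uncond_fake))
    cond = -(math.log(d_cond_real) + math.log(1.0 - d_cond_fake))
    return uncond + cond

# An undecided discriminator (all outputs 0.5) yields 4 * ln 2.
print(joint_discriminator_loss(0.5, 0.5, 0.5, 0.5))
```

Training each discriminator on both terms means the unconditional signal can stabilize learning even when the text conditioning is uninformative, which matches the robustness claim above.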
Implications and Future Directions
The implications of this research span both practical and theoretical domains. Practically, StackGANs demonstrate state-of-the-art performance in generating detailed, high-resolution images, which can be valuable in fields like medical imaging, virtual reality, and content creation. Theoretically, the multi-stage architecture and conditioning techniques offer new directions for stabilizing GAN training and improving sample diversity.
Future developments could explore more sophisticated forms of conditioning augmentations, integrate semantic-level understanding, and apply these methodologies to even higher resolutions and more complex tasks, potentially further bridging the gap between synthesized and real-world images.