Stacked Generative Adversarial Networks (StackGAN) for Text to Photo-realistic Image Synthesis
The paper "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks" addresses the challenge of generating high-resolution, photo-realistic images conditioned on text descriptions. Existing text-to-image synthesis methods often struggle to maintain high fidelity and detailed features when scaling to higher resolutions. The StackGAN framework proposes a novel two-stage Generative Adversarial Network (GAN) system designed to decompose this complex problem into more manageable sub-tasks, thereby achieving higher quality outputs.
Methodology
The key innovation in StackGAN lies in its two-stage process:
- Stage-I GAN: The first stage generates a low-resolution (64x64) image that captures the basic shape and dominant colors of the described object. It uses a conditional GAN in which the generator G_0 and discriminator D_0 are conditioned on the text description: the description is encoded into an embedding by a pre-trained text encoder, and Gaussian conditioning variables ĉ_0 are sampled from that embedding (via Conditioning Augmentation, described below) to introduce variation and improve robustness.
- Stage-II GAN: The second stage takes the Stage-I output and generates a high-resolution (256x256) image that refines and corrects it. An encoder-decoder network with residual blocks processes the low-resolution image together with the text embedding, adding finer, photo-realistic details. Because Stage-II conditions on both the Stage-I result and a fresh sample of the augmented conditioning variables, it can fix defects in the earlier image and recover details from the text that Stage-I omitted (a minimal architectural sketch of both stages follows this list).
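The two stages compose naturally in code. The PyTorch-style sketch below shows one plausible data flow; the module names, channel widths (`ngf`), and layer choices are illustrative assumptions rather than the authors' exact architecture, and the conditioning vector `c` is assumed to come from the Conditioning Augmentation step described next.

```python
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    """2x nearest-neighbour upsampling followed by a 3x3 convolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class StageIGenerator(nn.Module):
    """Sketch: (noise z, conditioning c) -> 64x64 low-resolution image."""
    def __init__(self, z_dim=100, c_dim=128, ngf=64):
        super().__init__()
        self.ngf = ngf
        self.fc = nn.Linear(z_dim + c_dim, ngf * 8 * 4 * 4)
        self.net = nn.Sequential(
            up_block(ngf * 8, ngf * 4),   # 4x4   -> 8x8
            up_block(ngf * 4, ngf * 2),   # 8x8   -> 16x16
            up_block(ngf * 2, ngf),       # 16x16 -> 32x32
            up_block(ngf, ngf),           # 32x32 -> 64x64
            nn.Conv2d(ngf, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, z, c):
        h = self.fc(torch.cat([z, c], dim=1)).view(-1, self.ngf * 8, 4, 4)
        return self.net(h)

class StageIIGenerator(nn.Module):
    """Sketch: (64x64 Stage-I image, conditioning c) -> 256x256 refined image."""
    def __init__(self, c_dim=128, ngf=64):
        super().__init__()
        self.encode = nn.Sequential(                       # 64x64 -> 16x16 features
            nn.Conv2d(3, ngf, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ngf, ngf * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ngf * 2, ngf * 4, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.joint = nn.Conv2d(ngf * 4 + c_dim, ngf * 4, 3, padding=1)
        self.residual = nn.Sequential(                     # stand-in for residual blocks
            nn.Conv2d(ngf * 4, ngf * 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ngf * 4, ngf * 4, 3, padding=1),
        )
        self.decode = nn.Sequential(                       # 16x16 -> 256x256
            up_block(ngf * 4, ngf * 2),
            up_block(ngf * 2, ngf),
            up_block(ngf, ngf),
            up_block(ngf, ngf),
            nn.Conv2d(ngf, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, low_res_img, c):
        h = self.encode(low_res_img)
        # Spatially replicate the conditioning vector and fuse it with image features.
        c_map = c.view(c.size(0), -1, 1, 1).expand(-1, -1, h.size(2), h.size(3))
        h = self.joint(torch.cat([h, c_map], dim=1))
        return self.decode(h + self.residual(h))           # residual connection
```

Splitting the upsampling across two generators like this is the core design choice: Stage-I only has to get the global layout right at 64x64, while Stage-II operates on an already-plausible image and the text a second time.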
An additional contribution, the Conditioning Augmentation (CA) technique, increases the diversity of synthesized images and stabilizes GAN training. Instead of feeding the fixed text embedding directly to the generator, CA samples latent conditioning variables from a Gaussian distribution whose mean and covariance are functions of the embedding, regularized by a KL-divergence term toward the standard normal distribution. This perturbation yields more conditioning pairs from the same caption and smooths the conditioning manifold, counteracting data sparsity and discontinuity in the latent space.
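A minimal sketch of CA under these assumptions (PyTorch, with assumed embedding and latent dimensions) looks like this:

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sketch of CA: map a text embedding to a Gaussian and sample from it.

    The embedding dimension (1024) and latent dimension (128) are assumed
    values for illustration, not necessarily those of the released code.
    """
    def __init__(self, embed_dim=1024, c_dim=128):
        super().__init__()
        self.fc = nn.Linear(embed_dim, c_dim * 2)  # predicts mean and log-variance

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        c_hat = mu + std * torch.randn_like(std)   # reparameterized sample
        # KL(N(mu, sigma^2) || N(0, I)) regularizer, added to the generator loss
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c_hat, kl
```

The returned `kl` term is added (with a small weight) to the generator loss, while the reparameterized sample `c_hat` is what conditions the generators in both stages.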
Experimental Results
Empirical evaluations on the CUB, Oxford-102, and MS COCO datasets underline the efficacy of the StackGAN architecture. Quantitative metrics such as the inception score, as well as qualitative human evaluations, indicate substantial improvements over prior methods.
- Inception Scores: StackGAN achieved the highest inception scores on all three datasets, with notable improvements over the GAN-INT-CLS baseline. On the CUB dataset, for example, the score improved from 2.88 to 3.70, reflecting enhanced image quality and realism (a sketch of how the score is computed follows this list).
- Human Evaluations: In human rankings, StackGAN consistently outperformed other methods (GAN-INT-CLS and GAWWN), indicating that the generated images better captured the essence of the provided descriptions.
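For reference, the inception score used in these comparisons is computed from the softmax outputs of a pre-trained Inception classifier. The sketch below follows the standard definition, exp(E_x[KL(p(y|x) || p(y))]); the paper additionally averages the score over random splits of the generated samples, which is omitted here for brevity.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from classifier softmax outputs.

    probs: array of shape (num_samples, num_classes), one row p(y|x) per
    generated image. Returns exp(mean_x KL(p(y|x) || p(y))).
    """
    marginal = probs.mean(axis=0, keepdims=True)                       # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```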
Visual Analysis:
The paper provides several visual comparisons demonstrating that StackGAN's two-stage approach significantly improves detail and realism. While Stage-I images depict basic shapes and colors, Stage-II images add finer details such as textures and specific object features, which are more consistent with real-world images.
Nearest-Neighbor Retrieval:
To ensure the model is not merely memorizing training data, the authors conduct nearest-neighbor retrievals using features extracted from the Stage-II discriminator. Generated images differ meaningfully from their closest training counterparts, confirming the model's capability to generalize and synthesize novel content.
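Concretely, this check reduces to a nearest-neighbour search in a feature space. The helper below is a generic sketch; extracting the features (i.e. running generated and training images through the Stage-II discriminator) is assumed to have happened already.

```python
import torch

def nearest_training_images(gen_feats, train_feats, k=1):
    """For each generated image, find its k closest training images.

    gen_feats:   (num_generated, feat_dim) tensor of discriminator features
    train_feats: (num_training,  feat_dim) tensor of discriminator features
    Returns indices of shape (num_generated, k).
    """
    dists = torch.cdist(gen_feats, train_feats)            # pairwise L2 distances
    return dists.topk(k, dim=1, largest=False).indices
```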
Implications and Future Directions
The StackGAN framework advances the state-of-the-art in text-to-image synthesis by addressing high-resolution generation challenges through a staged approach. Practically, this technique could be applied in fields like computer-aided design, personalized content generation, and virtual reality, where generating detailed and contextually relevant images from descriptive input is invaluable.
Theoretically, StackGAN opens new avenues in hierarchical learning and decomposed problem solving with GANs. Future work could explore deeper stacking architectures, more sophisticated conditioning techniques, and more complex datasets containing multiple objects and dynamic scenes. Further refinement could also address the limitations observed on COCO, where scene complexity still degrades image quality.
In summary, this paper presents a significant methodological advance in generating high-resolution images from textual descriptions, supported by strong experimental results and practical applicability across computer vision applications.