Attribute2Image: Conditional Image Generation from Visual Attributes
The paper "Attribute2Image: Conditional Image Generation from Visual Attributes" by Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee addresses the novel problem of generating images based on high-level attribute descriptions using a layered generative model with disentangled latent variables. This work leverages variational auto-encoders (VAEs) to achieve this objective, extending the generative capabilities in computer vision significantly.
Methodology
The proposed framework, Attribute2Image, uses Conditional Variational Auto-Encoders (CVAE) to generate images by conditioning on specific visual attributes. The model disentangles the latent space into foreground and background components, providing a structured approach to image generation.
Layered Generative Model
In contrast to traditional monolithic generative models, the paper introduces a layered model in which a foreground image and a background image are combined via a gating mechanism. This structured approach aligns with the compositional nature of many natural images:
- Foreground Generation: The model generates a foreground layer and its corresponding shape map that defines its visibility over the background.
- Background Generation: Independently generates the background using latent background variables.
- Composite Image: The final image is composed by overlaying the foreground on the background as dictated by the shape map (see the sketch after this list).
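The composition step itself reduces to a per-pixel convex combination. The snippet below is a minimal sketch of that overlay, assuming decoder outputs for the foreground layer, the background layer, and a gating (shape) map with values in [0, 1]; tensor shapes and names are illustrative, not taken from the authors' code.

```python
import torch

def compose_layers(x_fg: torch.Tensor, x_bg: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Overlay the foreground on the background according to the gating (shape) map.

    x_fg, x_bg: image tensors of shape (B, C, H, W)
    gate:       visibility map in [0, 1] of shape (B, 1, H, W), broadcast across channels
    """
    return gate * x_fg + (1.0 - gate) * x_bg

# Toy usage with random tensors standing in for decoder outputs.
x_fg = torch.rand(2, 3, 64, 64)
x_bg = torch.rand(2, 3, 64, 64)
gate = torch.rand(2, 1, 64, 64)
print(compose_layers(x_fg, x_bg, gate).shape)  # torch.Size([2, 3, 64, 64])
```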
CVAE and Disentangled Generation
The authors leverage a CVAE, specifically a disentangled variant (disCVAE), to learn the mapping from attributes and disentangled latent variables to the image space. The foreground-related and background-related latent factors are processed independently before being combined (a minimal sketch follows the list below):
- Foreground Network: Maps the attributes and the foreground latent variables to the foreground layer and its shape map, modeling attribute-driven appearance along with its residual uncertainty.
- Background Network: Generates the background from its own latent variables, independent of the foreground.
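To make this division of labor concrete, here is a minimal sketch of a disentangled conditional decoder at generation time. The layer widths, latent dimensionalities, and attribute size are placeholder choices, not the paper's architecture; only the overall wiring (attribute-conditioned foreground plus gate, attribute-free background, layered composition) follows the description above.

```python
import torch
import torch.nn as nn

class DisentangledDecoder(nn.Module):
    """Toy disentangled conditional decoder: attributes + two latent codes -> image."""

    def __init__(self, attr_dim=73, z_fg_dim=192, z_bg_dim=64, img_size=64):
        super().__init__()
        self.img_size = img_size
        # Foreground branch: conditioned on attributes and the foreground latent code;
        # outputs an RGB foreground layer plus a 1-channel gating (shape) map.
        self.fg_net = nn.Sequential(
            nn.Linear(attr_dim + z_fg_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 4 * img_size * img_size),
        )
        # Background branch: depends only on the background latent code.
        self.bg_net = nn.Sequential(
            nn.Linear(z_bg_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * img_size * img_size),
        )

    def forward(self, y, z_fg, z_bg):
        b = y.size(0)
        fg = self.fg_net(torch.cat([y, z_fg], dim=1)).view(b, 4, self.img_size, self.img_size)
        x_fg, gate = torch.sigmoid(fg[:, :3]), torch.sigmoid(fg[:, 3:4])
        x_bg = torch.sigmoid(self.bg_net(z_bg)).view(b, 3, self.img_size, self.img_size)
        return gate * x_fg + (1.0 - gate) * x_bg  # layered composition

# Sampling: draw latent codes from the prior and condition on an attribute vector.
dec = DisentangledDecoder()
y = torch.rand(1, 73)                                # attribute vector
z_fg, z_bg = torch.randn(1, 192), torch.randn(1, 64)
print(dec(y, z_fg, z_bg).shape)                      # torch.Size([1, 3, 64, 64])
```

In the full model, a recognition network would produce approximate posteriors over the two latent codes during training, and the system would be trained by maximizing a variational lower bound.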
Posterior Inference via Optimization
To infer the latent variables of a given image under the trained model, the authors introduce an optimization-based latent-space search: a reconstruction loss between the generated and observed image is back-propagated into the latent variables, which are iteratively refined while the model parameters are held fixed.
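A hedged sketch of that search is shown below: only the latent codes are updated, so gradients flow through the frozen decoder into the codes. The optimizer, step count, learning rate, and zero initialization are illustrative choices, not the paper's settings.

```python
import torch

def infer_latents(decoder, x_obs, y, z_fg_dim=192, z_bg_dim=64, steps=200, lr=0.1):
    """Estimate latent codes for an observed image by gradient descent on a reconstruction loss."""
    b = x_obs.size(0)
    z_fg = torch.zeros(b, z_fg_dim, requires_grad=True)
    z_bg = torch.zeros(b, z_bg_dim, requires_grad=True)
    opt = torch.optim.Adam([z_fg, z_bg], lr=lr)   # only the latent codes are optimized
    for _ in range(steps):
        opt.zero_grad()
        x_hat = decoder(y, z_fg, z_bg)
        loss = ((x_hat - x_obs) ** 2).mean()      # pixel-wise reconstruction error
        loss.backward()                           # back-propagate into z_fg and z_bg
        opt.step()
    return z_fg.detach(), z_bg.detach()
```

With a decoder like the one sketched earlier, `infer_latents(dec, x_obs, y)` would return codes whose decoding approximates `x_obs`; in practice a recognition network can provide a warmer start than a zero initialization.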
Experimental Results
The efficacy of the proposed models was validated on two datasets: Labeled Faces in the Wild (LFW) and Caltech-UCSD Birds 200 (CUB-200). The results of the experiments are summarized as follows:
Attribute-conditioned Image Generation
The model demonstrated the ability to generate realistic and diverse images from given attribute descriptions. In quantitative terms, the results showed:
- For LFW, a cosine similarity of 0.9057 and mean squared error of 16.71 in attribute space.
- For CUB-200, qualitative results highlighted the superiority of the proposed layered model over a vanilla, non-layered CVAE baseline.
Image Reconstruction and Completion
The model was tested on its capacity to:
- Reconstruct images from latent variables optimized from original images.
- Complete images where parts were synthetically occluded.
The disCVAE model showed superior performance, achieving a mean squared error of 10.0 ± 0.1 for reconstructing full face images and 12.9 ± 0.1 for bird images.
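The completion experiment can be pictured as the same latent-space search restricted to the visible pixels, followed by pasting the decoder's prediction into the occluded region. Below is a hedged, self-contained sketch in the spirit of the reconstruction example above; the mask convention, optimizer, and hyperparameters are again illustrative.

```python
import torch

def complete_image(decoder, x_occluded, mask, y, z_fg_dim=192, z_bg_dim=64, steps=200, lr=0.1):
    """Fill in occluded pixels by searching the latent space over the visible region only.

    x_occluded: image with missing pixels (e.g. zeroed out), shape (B, C, H, W)
    mask:       1 where pixels are observed, 0 where occluded, shape (B, 1, H, W)
    """
    b = x_occluded.size(0)
    z_fg = torch.zeros(b, z_fg_dim, requires_grad=True)
    z_bg = torch.zeros(b, z_bg_dim, requires_grad=True)
    opt = torch.optim.Adam([z_fg, z_bg], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x_hat = decoder(y, z_fg, z_bg)
        loss = (((x_hat - x_occluded) ** 2) * mask).mean()  # penalize only observed pixels
        loss.backward()
        opt.step()
    x_hat = decoder(y, z_fg.detach(), z_bg.detach())
    return mask * x_occluded + (1.0 - mask) * x_hat  # keep observed pixels, fill the rest
```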
Implications and Future Work
The ability to conditionally generate images from high-level attributes has significant implications. This can aid in forensic art, photo editing, and creative industries where controlled generation of images is valuable. Furthermore, learning a disentangled representation is beneficial for interpretability and manipulation of image-level features post-training.
Future directions could involve:
- Enhancing the granularity of attributes to generate finer details.
- Extending the layered model to dynamic images, such as videos.
- Integrating adversarial training to improve the visual fidelity of generated images.
Conclusion
"Attribute2Image" presents substantial advancements in the domain of conditional image generation, leveraging the compositional nature of images through layered generative models and variational auto-encoders. The structured model with disentangled latent spaces shows promising results in generating images conditioned on high-level attributes, outperforming traditional models in both quality and diversity. The implications of this work are broad, paving the way for practical applications in various fields involving image synthesis.