Learning What and Where to Draw (1610.02454v1)

Published 8 Oct 2016 in cs.CV and cs.NE

Abstract: Generative Adversarial Networks (GANs) have recently demonstrated the capability to synthesize compelling real-world images, such as room interiors, album covers, manga, faces, birds, and flowers. While existing models can synthesize images based on global constraints such as a class label or caption, they do not provide control over pose or object location. We propose a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location. We show high-quality 128 x 128 image synthesis on the Caltech-UCSD Birds dataset, conditioned on both informal text descriptions and also object location. Our system exposes control over both the bounding box around the bird and its constituent parts. By modeling the conditional distributions over part locations, our system also enables conditioning on arbitrary subsets of parts (e.g. only the beak and tail), yielding an efficient interface for picking part locations. We also show preliminary results on the more challenging domain of text- and location-controllable synthesis of images of human actions on the MPII Human Pose dataset.

Citations (605)

Summary

  • The paper introduces GAWWN, a dual-pathway GAN that disentangles content generation from spatial rendering.
  • The paper leverages text descriptions and spatial cues like bounding boxes and keypoints to condition and guide image synthesis.
  • The paper validates its approach on the Caltech-UCSD Birds (CUB) and MPII Human Pose (MHP) datasets, demonstrating enhanced control and realism in generated images.

Overview of "Learning What and Where to Draw"

This paper presents the Generative Adversarial What-Where Network (GAWWN), a GAN-based architecture that addresses a key limitation of earlier image synthesis models: their lack of control over pose and object location. The authors propose a system that not only generates images from text descriptions but also incorporates spatial constraints, offering fine-grained control over the resulting imagery. GAWWN is applied to the CUB and MHP datasets, where it generates high-quality 128x128 images with user-specified content and location.

Key Contributions

  1. Model Architecture: The paper introduces a dual-pathway architecture incorporating spatial transformations. This approach separates "what" to draw from "where," allowing systematic control over image content and its positioning.
  2. Text and Location Conditioning: The model synthesizes images at a resolution of 128x128 pixels, conditioned on detailed spatial information such as bounding boxes and keypoints. This framework enables precise placement of image components, surpassing prior methods limited to global constraints (a conditioning sketch follows this list).
  3. Keypoint-Conditional Synthesis: A novel mechanism is introduced to condition image generation on object part locations. This allows for intuitive user interfaces wherein certain object parts can be explicitly positioned, resulting in improved control and interpretability.
  4. Text-Conditional Keypoint Generation: To reduce user effort in specifying all keypoints, the model also learns the conditional distribution of unobserved keypoints given a subset of known ones and the text, which effectively automates part of the layout specification (see the keypoint-completion sketch after this list).
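
To make the "what vs. where" conditioning concrete, the sketch below shows one plausible way to build the spatial conditioning tensors described above: the text embedding ("what") is replicated inside a normalized bounding box and zeroed outside, and each annotated part becomes a one-channel spatial map ("where"). This is a minimal NumPy illustration of the idea, not the authors' Torch implementation; the function names and the 16x16 conditioning grid are assumptions for the example.

```python
import numpy as np

def replicate_text_in_bbox(text_emb, bbox, grid_size=16):
    """Tile a text embedding inside a normalized bounding box.

    text_emb : (D,) text embedding vector ("what").
    bbox     : (x, y, w, h) in [0, 1] coordinates ("where").
    Returns a (D, grid_size, grid_size) tensor that equals text_emb inside
    the box and zero outside, ready to be concatenated with image features.
    """
    D = text_emb.shape[0]
    cond = np.zeros((D, grid_size, grid_size), dtype=np.float32)
    x, y, w, h = bbox
    x0, y0 = int(x * grid_size), int(y * grid_size)
    x1 = min(grid_size, int(np.ceil((x + w) * grid_size)))
    y1 = min(grid_size, int(np.ceil((y + h) * grid_size)))
    cond[:, y0:y1, x0:x1] = text_emb[:, None, None]
    return cond

def keypoints_to_heatmaps(keypoints, grid_size=16):
    """Convert part keypoints into one binary spatial map per part.

    keypoints : list of (x, y, visible) tuples in [0, 1] coordinates,
                one entry per part (e.g. beak, belly, tail, ...).
    Returns a (num_parts, grid_size, grid_size) tensor with a 1 at each
    visible part location and zeros elsewhere.
    """
    maps = np.zeros((len(keypoints), grid_size, grid_size), dtype=np.float32)
    for k, (x, y, visible) in enumerate(keypoints):
        if visible:
            col = min(grid_size - 1, int(x * grid_size))
            row = min(grid_size - 1, int(y * grid_size))
            maps[k, row, col] = 1.0
    return maps

# Example: a 128-dim text embedding, a bird box in the lower-left quadrant,
# and two annotated parts (beak and tail).
text_emb = np.random.randn(128).astype(np.float32)
cond_bbox = replicate_text_in_bbox(text_emb, bbox=(0.1, 0.5, 0.4, 0.4))
cond_parts = keypoints_to_heatmaps([(0.2, 0.6, 1), (0.45, 0.8, 1)])
print(cond_bbox.shape, cond_parts.shape)  # (128, 16, 16) (2, 16, 16)
```

Because the text content and the spatial masks enter the generator as separate tensors, the same description can be redrawn at different locations simply by changing the box or the heatmaps, which is the disentanglement the dual-pathway architecture exploits.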
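Contribution 4 can likewise be sketched as a small conditional generator over keypoint coordinates: given the text embedding, the keypoints the user did specify, a binary mask marking which parts are observed, and a noise vector, it proposes locations for the remaining parts while passing observed parts through unchanged. The PyTorch module below is a hypothetical illustration of that interface (names such as `KeypointCompletionG` and the MLP layout are assumptions), not the paper's architecture.

```python
import torch
import torch.nn as nn

class KeypointCompletionG(nn.Module):
    """Hypothetical generator that fills in unobserved part locations.

    Inputs: a text embedding, (num_parts, 2) keypoint coordinates, a
    (num_parts, 1) mask marking which parts were specified, and a noise
    vector. Output keeps observed keypoints and samples the rest.
    """

    def __init__(self, num_parts=15, text_dim=128, noise_dim=32, hidden=256):
        super().__init__()
        in_dim = text_dim + noise_dim + num_parts * 3  # coords + mask
        self.num_parts = num_parts
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_parts * 2), nn.Sigmoid(),  # coords in [0, 1]
        )

    def forward(self, text_emb, keypoints, mask, noise):
        batch = text_emb.shape[0]
        x = torch.cat([text_emb, noise,
                       (keypoints * mask).flatten(1), mask.flatten(1)], dim=1)
        proposed = self.net(x).view(batch, self.num_parts, 2)
        # Keep user-specified parts, fill in the rest from the generator.
        return mask * keypoints + (1.0 - mask) * proposed

# Example: condition on 2 of 15 parts and let the model place the other 13.
g = KeypointCompletionG()
text_emb = torch.randn(1, 128)
keypoints = torch.zeros(1, 15, 2)
mask = torch.zeros(1, 15, 1)
keypoints[0, 0] = torch.tensor([0.2, 0.6])   # e.g. beak
keypoints[0, 1] = torch.tensor([0.45, 0.8])  # e.g. tail
mask[0, :2] = 1.0
full_keypoints = g(text_emb, keypoints, mask, torch.randn(1, 32))
print(full_keypoints.shape)  # torch.Size([1, 15, 2])
```

In the paper this completion network is itself trained adversarially; the sketch only shows the conditioning interface that makes "specify the beak and tail, infer the rest" possible.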

Experimental Evaluation

The authors provide extensive experiments on the CUB dataset, where they control bird location using either bounding boxes or keypoints while conditioning on informal text descriptions. The GAWWN surpasses previous text-to-image methods by producing more realistic 128x128 samples. The authors also explore the more challenging setting of generating human images conditioned on text and pose on the MHP dataset, where results are reported as preliminary.

Practical Implications: The proposed model aids applications that require high-level user control in image generation processes, such as content creation and virtual environment design. The explicit controllability over object positioning enhances the potential for interactive tools in these areas.

Theoretical Implications: This work contributes to the broader understanding of structured conditioning in adversarial models. The separation of "what" and "where" aspects emphasizes the benefit of modular representations in complex synthesis tasks.

Future Research Directions

While GAWWN marks a significant advancement in controllable image generation, several avenues remain for future research:

  • Scalability: Extending the model to handle more complex scenes involving multiple objects or interacting elements could open up new application domains.
  • Unsupervised Learning: Investigating methods for learning object or part locations without supervision would make this approach more adaptable to datasets lacking comprehensive annotations.
  • Higher Resolution and Realism: Further architectural innovations could push the boundaries in generating more detailed and lifelike images, particularly in domains like human imagery where current results are more variable.

In conclusion, the GAWWN framework introduced in this paper marks a substantial step forward in fine-grained, controllable image synthesis. The potential for practical applications and the avenues for further research underscore its value within the field of generative models and artificial intelligence.