Overview of "Image Generation from Scene Graphs"
Introduction
The paper "Image Generation from Scene Graphs" by Johnson et al. introduces a novel approach to image generation that leverages scene graphs to improve the fidelity and complexity of generated images. Traditional methods for text-to-image synthesis, such as StackGAN, struggle with generating images from complex sentences due to the linear structure of text. This paper proposes the use of scene graphs, which provide a structured representation of objects and their relationships, to overcome these limitations. The core contribution is a generative model that processes scene graphs using graph convolution networks (GCNs) and generates images through a cascaded refinement network (CRN). The method is validated using the Visual Genome and COCO-Stuff datasets.
Methodology
The proposed methodology involves several critical components; illustrative code sketches of each component follow the list:
- Graph Convolution Network (GCN): The GCN processes the scene graph, which comprises nodes (objects) and edges (relationships), to produce embedding vectors for each object. These embeddings are iteratively refined by passing information along the edges of the graph.
- Scene Layout Prediction: The object embeddings are used to predict bounding boxes and segmentation masks, forming a scene layout. This layout acts as an intermediate representation that bridges the symbolic graph and the 2D image domain.
- Cascaded Refinement Network (CRN): The scene layout is converted into an image using a CRN, which processes the layout at multiple spatial scales, progressively refining the image details.
- Adversarial Training: The entire model is trained adversarially against two discriminator networks: an image discriminator (D_img) that scores patches of the generated image and an object discriminator (D_obj) that scores individual objects cropped from it and also classifies their category. These discriminators encourage the generated images to look realistic and each object to be recognizable.
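The graph-convolution step can be made concrete with a short sketch. The PyTorch code below, assuming a single layer with equal input and output dimensions, forms candidate vectors for the subject, predicate, and object of every edge and then averages the candidates arriving at each object node; the class and function names (GraphConvLayer, g, h) and the layer sizes are illustrative stand-ins for the paper's design, not the authors' implementation.

```python
# A minimal single graph-convolution layer in the spirit of the paper.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # g maps the concatenated (subject, predicate, object) vectors to
        # candidate vectors for all three roles at once.
        self.g = nn.Sequential(nn.Linear(3 * dim, 3 * dim), nn.ReLU())
        # h turns the pooled candidate vectors into updated object embeddings.
        self.h = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.dim = dim

    def forward(self, obj_vecs, pred_vecs, edges):
        """obj_vecs: (O, D) object embeddings; pred_vecs: (T, D) predicate
        embeddings; edges: (T, 2) long tensor of (subject, object) indices."""
        s_idx, o_idx = edges[:, 0], edges[:, 1]
        triples = torch.cat([obj_vecs[s_idx], pred_vecs, obj_vecs[o_idx]], dim=1)
        cand_s, new_pred, cand_o = self.g(triples).split(self.dim, dim=1)

        # Average every candidate vector that arrives at each object node.
        pooled = obj_vecs.new_zeros(obj_vecs.size(0), self.dim)
        counts = obj_vecs.new_zeros(obj_vecs.size(0), 1)
        pooled.index_add_(0, s_idx, cand_s)
        pooled.index_add_(0, o_idx, cand_o)
        counts.index_add_(0, s_idx, torch.ones_like(counts[s_idx]))
        counts.index_add_(0, o_idx, torch.ones_like(counts[o_idx]))
        new_obj = self.h(pooled / counts.clamp(min=1))
        return new_obj, new_pred
```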
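The scene-layout composition can likewise be sketched. The paper composes the layout with a differentiable bilinear warp so that gradients also reach the box coordinates; the simplified version below instead resizes each soft mask to its predicted box and paints the masked object embedding into a shared canvas. The function name compose_layout and the default 64x64 resolution are assumptions for illustration.

```python
# Simplified layout composition: paint each object's embedding, weighted by
# its soft mask, into its predicted box and sum the per-object canvases.
import torch
import torch.nn.functional as F

def compose_layout(obj_vecs, boxes, masks, H=64, W=64):
    """obj_vecs: (O, D) tensor; boxes: (O, 4) tensor of (x0, y0, x1, y1) in
    [0, 1]; masks: (O, M, M) soft masks in [0, 1]. Returns a (D, H, W) layout."""
    O, D = obj_vecs.shape
    layout = obj_vecs.new_zeros(D, H, W)
    for i in range(O):
        x0, y0, x1, y1 = boxes[i].tolist()
        # Convert the normalized box to integer pixel bounds (at least 1 px).
        px0, px1 = int(x0 * W), max(int(x1 * W), int(x0 * W) + 1)
        py0, py1 = int(y0 * H), max(int(y1 * H), int(y0 * H) + 1)
        px0, px1 = max(px0, 0), min(px1, W)
        py0, py1 = max(py0, 0), min(py1, H)
        if px1 <= px0 or py1 <= py0:
            continue
        # Resize the object's mask to the box size and weight the embedding.
        m = F.interpolate(masks[i][None, None], size=(py1 - py0, px1 - px0),
                          mode='bilinear', align_corners=False)[0, 0]
        layout[:, py0:py1, px0:px1] += obj_vecs[i][:, None, None] * m
    return layout
```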
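A compact version of the cascaded refinement idea is sketched below: each module doubles the spatial resolution, concatenates the layout (resized to that scale) with the upsampled features from the previous module, and applies convolutions, starting from noise at a tiny resolution. The channel widths, normalization choices, and module count are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementModule(nn.Module):
    def __init__(self, layout_dim, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(layout_dim + in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2))

    def forward(self, layout, feats):
        # Double the resolution of the incoming features, resize the layout
        # to match, and fuse the two with convolutions.
        feats = F.interpolate(feats, scale_factor=2, mode='nearest')
        lay = F.interpolate(layout, size=feats.shape[-2:], mode='bilinear',
                            align_corners=False)
        return self.conv(torch.cat([lay, feats], dim=1))

class CascadedRefinementNet(nn.Module):
    def __init__(self, layout_dim, channels=(1024, 512, 256, 128, 64, 64)):
        super().__init__()
        self.refine = nn.ModuleList(
            [RefinementModule(layout_dim, c_in, c_out)
             for c_in, c_out in zip(channels[:-1], channels[1:])])
        self.first_ch = channels[0]
        self.to_img = nn.Conv2d(channels[-1], 3, kernel_size=1)

    def forward(self, layout):
        # Start from noise at a tiny resolution and refine upward; with five
        # modules a 2x2 start yields a 64x64 image.
        n = layout.size(0)
        feats = torch.randn(n, self.first_ch, 2, 2, device=layout.device)
        for module in self.refine:
            feats = module(layout, feats)
        return torch.tanh(self.to_img(feats))
```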
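The adversarial objective can be summarized schematically. The sketch below assumes callables d_img and d_obj for the two discriminators, uses a standard non-saturating binary cross-entropy loss, and weights all terms equally; the paper combines these adversarial terms with pixel, box, and mask losses under tuned weights, so this is a simplification rather than the exact training objective.

```python
import torch
import torch.nn.functional as F

def generator_adv_loss(d_img, d_obj, fake_img, fake_crops, obj_labels):
    """fake_img: (N, 3, H, W); fake_crops: (O, 3, h, w) objects cropped from
    the generated images; obj_labels: (O,) ground-truth object categories."""
    fool = lambda logits: F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))
    img_logits = d_img(fake_img)                 # patch realism scores
    obj_logits, cls_logits = d_obj(fake_crops)   # realism + class scores
    loss_img = fool(img_logits)                  # fool D_img
    loss_obj = fool(obj_logits)                  # fool D_obj
    loss_cls = F.cross_entropy(cls_logits, obj_labels)  # objects recognizable
    return loss_img + loss_obj + loss_cls

def discriminator_loss(logits_real, logits_fake):
    # Standard real-vs-fake objective shared by D_img and D_obj.
    return (F.binary_cross_entropy_with_logits(
                logits_real, torch.ones_like(logits_real)) +
            F.binary_cross_entropy_with_logits(
                logits_fake, torch.zeros_like(logits_fake)))
```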
Experimental Setup
The model's efficacy is evaluated on two datasets:
- Visual Genome (VG): This dataset provides human-annotated scene graphs, offering a diverse set of objects and relationships for testing the model's ability to generate complex images.
- COCO-Stuff: This dataset includes detailed annotations of both "things" and "stuff" but no relationship annotations, so scene graphs are synthesized from the ground-truth 2D object positions using simple geometric relationships (see the sketch after this list).
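Because the relationships are synthesized rather than annotated, the construction can be illustrated with a short sketch based on the six geometric relationships the paper uses (left of, right of, above, below, inside, surrounding). The containment test, angle thresholds, and the edge-sampling helper build_scene_graph below are simplifications for illustration, not the paper's preprocessing code.

```python
import math
import random

def geometric_relation(box_s, box_o):
    """Boxes are (x0, y0, x1, y1) with y increasing downward."""
    sx0, sy0, sx1, sy1 = box_s
    ox0, oy0, ox1, oy1 = box_o
    if sx0 >= ox0 and sy0 >= oy0 and sx1 <= ox1 and sy1 <= oy1:
        return 'inside'       # subject box lies entirely within object box
    if ox0 >= sx0 and oy0 >= sy0 and ox1 <= sx1 and oy1 <= sy1:
        return 'surrounding'  # subject box entirely contains object box
    # Otherwise bucket the angle between box centers into four directions.
    dx = (ox0 + ox1) / 2 - (sx0 + sx1) / 2
    dy = (oy0 + oy1) / 2 - (sy0 + sy1) / 2
    theta = math.atan2(dy, dx)
    if -math.pi / 4 <= theta < math.pi / 4:
        return 'left of'      # object center lies to the subject's right
    if math.pi / 4 <= theta < 3 * math.pi / 4:
        return 'above'        # object center lies below the subject
    if -3 * math.pi / 4 <= theta < -math.pi / 4:
        return 'below'
    return 'right of'

def build_scene_graph(objects, boxes, max_edges=5):
    """objects: list of category names; boxes: parallel list of boxes.
    Returns (subject_idx, predicate, object_idx) triples."""
    idx = list(range(len(objects)))
    pairs = [(s, o) for s in idx for o in idx if s != o]
    random.shuffle(pairs)
    return [(s, geometric_relation(boxes[s], boxes[o]), o)
            for s, o in pairs[:max_edges]]
```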
Results and Analysis
The results demonstrate that the proposed method can generate images with multiple, well-localized objects. Key findings include:
- Qualitative Results: The generated images respect the objects and relationships specified in the input graph and are semantically meaningful. Qualitative comparisons show that the method handles scenes with multiple instances of the same object category while maintaining plausible spatial relationships.
- Quantitative Metrics: Performance is quantified with the Inception score (sketched after this list) and object-localization metrics. The full model outperforms several ablated versions, underscoring the importance of graph convolution and adversarial training.
- User Studies: Two user studies conducted on Amazon Mechanical Turk quantify the perceptual quality and semantic interpretability of the generated images. The proposed method exhibits higher object recall and semantic matching accuracy compared to StackGAN, highlighting its superior ability to generate complex scenes.
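For reference, the Inception score follows the standard formulation IS = exp(E_x[KL(p(y|x) || p(y))]). The sketch below assumes the class probabilities of an Inception network have already been collected for the generated images and computes the score over splits; it is a generic implementation of the metric, not the paper's evaluation code.

```python
import numpy as np

def inception_score(probs, n_splits=10):
    """probs: (N, num_classes) softmax outputs for the generated images.
    Returns the mean and standard deviation of the score across splits."""
    scores = []
    for split in np.array_split(probs, n_splits):
        p_y = split.mean(axis=0, keepdims=True)          # marginal p(y)
        kl = split * (np.log(split + 1e-12) - np.log(p_y + 1e-12))
        scores.append(np.exp(kl.sum(axis=1).mean()))     # exp of mean KL
    return float(np.mean(scores)), float(np.std(scores))
```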
Implications and Future Work
The implications of this research span both practical applications and theoretical advancements:
- Practical Applications: The capability to generate images from structured scene graphs can aid artists, designers, and content creators by automating and customizing image generation. It opens up possibilities for creating tailored visual content on demand.
- Theoretical Advancements: This work advances the understanding of how structured representations can improve generative models. It sets the stage for future research to explore more sophisticated graph representations and further improve object interaction modeling.
Conclusion
Johnson et al.'s paper presents a robust methodology for generating images from scene graphs, achieving significant improvements over text-based methods in handling complex scenes with multiple objects. By leveraging GCNs and CRNs, the method shows promise for both practical applications and further theoretical research. Future developments could focus on scaling the approach to higher resolutions and exploring more complex and dynamic scene graph representations.