- The paper introduces GENESIS, a model that improves scene decomposition using autoregressive priors over mask variables to capture spatial dependencies.
- The methodology leverages a sequential amortized inference network and the GECO objective to ensure high-quality reconstructions and robust latent representations.
- Empirical evaluations on datasets like Multi-dSprites and ShapeStacks demonstrate GENESIS's superior scene generation and segmentation performance over baseline models.
An Evaluation of GENESIS: Object-Centric Generative Scene Inference and Sampling
The paper "GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations" proposes GENESIS, an advanced object-centric generative model aimed at enhancing scene decomposition and generation in robotics and reinforcement learning contexts. This summary explores the details of the GENESIS framework, its methodology, empirical assessments, and future research implications.
Overview of the GENESIS Framework
GENESIS distinguishes itself from prior object-centric models by explicitly modeling relationships between scene components. It uses an autoregressive prior over mask latents to capture the spatial dependencies in scenes that are key to generating coherent images (a minimal sketch of such a prior follows below). GENESIS thereby advances over models such as MONet and IODINE: it can generate novel scenes with plausible spatial relationships between objects, which those earlier models cannot.
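The following is a minimal sketch of an autoregressive prior of this kind, written in PyTorch. The LSTM parameterization, layer sizes, and variable names are illustrative assumptions rather than the paper's exact architecture; the point is that each mask latent is sampled conditioned on all previously sampled ones.

```python
import torch
import torch.nn as nn

class AutoregressivePrior(nn.Module):
    """Sketch of an autoregressive prior over mask latents:
    p(z^m_{1:K}) = prod_k p(z^m_k | z^m_{1:k-1}), parameterized by an LSTM.
    Sizes and names are illustrative, not the paper's exact architecture."""

    def __init__(self, latent_dim=64, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTMCell(latent_dim, hidden_dim)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.latent_dim = latent_dim
        self.hidden_dim = hidden_dim

    def sample(self, batch_size, num_steps, device="cpu"):
        z = torch.zeros(batch_size, self.latent_dim, device=device)
        h = torch.zeros(batch_size, self.hidden_dim, device=device)
        c = torch.zeros_like(h)
        samples = []
        for _ in range(num_steps):
            h, c = self.lstm(z, (h, c))
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            # Each step is conditioned on the previously sampled mask latents,
            # so later components can respect space already occupied.
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            samples.append(z)
        return torch.stack(samples, dim=1)  # (batch, K, latent_dim)
```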
The model conceptualizes an image as a spatial Gaussian mixture model (GMM) in which each mixture component corresponds to a distinct scene component. Capturing dependencies between components matters because disregarding them can yield unrealistic scenes, for example components that overlap or are mutually inconsistent. The GENESIS generative process is split into mask latents and component latents, which encode the spatial arrangement and the appearance of the scene components, respectively.
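From this description, the image likelihood takes roughly the following spatial-GMM form, with D pixels and K components. The notation (z^m for mask latents, z^c for component latents, pi for per-pixel mixing weights) mirrors the mask/component split above; the paper's exact parameterization of the mixing weights (e.g., via a stick-breaking construction) is not reproduced here:

```latex
p\!\left(\mathbf{x} \mid \mathbf{z}^{m}_{1:K}, \mathbf{z}^{c}_{1:K}\right)
  = \prod_{i=1}^{D} \sum_{k=1}^{K}
    \pi_{i,k}\!\left(\mathbf{z}^{m}_{1:K}\right)\,
    \mathcal{N}\!\left(x_i \,;\, \mu_{i,k}\!\left(\mathbf{z}^{c}_{k}\right), \sigma^{2}\right),
  \qquad \sum_{k=1}^{K} \pi_{i,k} = 1 .
```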
Inference and Learning
Inference in GENESIS uses an approximate posterior parameterized by a sequential amortized inference network. For learning, the paper contrasts the standard variational objective (the ELBO) with the Generalized ELBO with Constrained Optimization (GECO), which minimizes the KL divergence subject to a constraint on reconstruction error. This keeps sample quality high: the reconstruction constraint must be satisfied before capacity is spent reducing the KL term. A sketch of the GECO update follows.
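Below is a minimal sketch of one common GECO variant, in which a Lagrange multiplier on the reconstruction constraint is adapted from a moving average of the constraint violation. The hyperparameter names (goal, step_size, alpha) and the multiplicative update rule are assumptions for illustration, not the paper's exact implementation.

```python
import torch

class GECO:
    """Sketch of a GECO-style objective: minimize KL subject to
    E[recon_err] <= goal, via an adaptive Lagrange multiplier.
    Hyperparameters and the update rule are illustrative."""

    def __init__(self, goal, step_size=1e-2, alpha=0.99):
        self.goal = goal            # target reconstruction error (the constraint)
        self.step_size = step_size  # speed of the multiplier update
        self.alpha = alpha          # smoothing for the moving-average violation
        self.lagrange = torch.tensor(1.0)
        self.err_ema = None

    def loss(self, recon_err, kl):
        constraint = recon_err - self.goal
        c = constraint.detach()
        self.err_ema = c if self.err_ema is None \
            else self.alpha * self.err_ema + (1 - self.alpha) * c
        # Raise lambda while the constraint is violated (err_ema > 0),
        # lower it once reconstructions are good enough.
        self.lagrange = (self.lagrange
                         * torch.exp(self.step_size * self.err_ema)).clamp(1e-6, 1e6)
        return kl + self.lagrange * constraint
```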
Empirical Evaluation
The empirical performance of GENESIS is assessed on several datasets, including Multi-dSprites, GQN, and ShapeStacks. The results show that GENESIS captures dependencies between scene components more faithfully than competing methods. Its ability to generate coherent scenes from scratch is highlighted, with improved sample quality over baselines such as MONet and standard VAEs.
The paper presents detailed step-by-step scene generation and decomposition results, emphasizing the coherence and semantic detail of scenes generated by GENESIS. The model shows qualitative improvements in separating and reconstructing scene components, corroborated by quantitative metrics such as the Adjusted Rand Index (ARI) and segmentation covering; a minimal ARI computation for segmentations is sketched below.
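As a concrete illustration of the segmentation metric, the sketch below computes ARI between per-pixel label maps using scikit-learn. Treating label 0 as background and excluding it (foreground ARI) is a common convention assumed here, not a detail taken from the paper:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def segmentation_ari(true_labels, pred_labels, ignore_background=True):
    """ARI between ground-truth and predicted per-pixel label maps (H, W).
    Treating label 0 as background is an assumption for this sketch."""
    t, p = true_labels.ravel(), pred_labels.ravel()
    if ignore_background:
        keep = t != 0            # score foreground pixels only (foreground ARI)
        t, p = t[keep], p[keep]
    return adjusted_rand_score(t, p)

# ARI is invariant to permutations of the predicted component labels:
gt   = np.array([[0, 1, 1], [0, 2, 2]])
pred = np.array([[0, 2, 2], [0, 1, 1]])
print(segmentation_ari(gt, pred))  # -> 1.0
```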
Downstream Application Performance
The utility of the latent representations learned by GENESIS is demonstrated on downstream classification tasks on the ShapeStacks dataset, where it outperforms the baselines on tasks that rely on understanding scene physics and object interactions. This suggests that the object-centric decompositions produced by GENESIS support improved reasoning about scenes, a crucial capability for robotics and AI applications. A sketch of such a probing setup follows.
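The sketch below shows one plausible probing setup of this kind: a small classifier head is trained on latents from a frozen, pre-trained encoder. The encoder interface, the pooling over object slots, and the hyperparameters are hypothetical placeholders, not the paper's evaluation protocol.

```python
import torch
import torch.nn as nn

def train_probe(encoder, loader, latent_dim, num_classes, epochs=10):
    """Train a linear classifier on frozen object-centric latents.
    `encoder` is a placeholder for a pre-trained inference network that
    returns (batch, K, D) slot latents; its interface is assumed."""
    encoder.eval()
    head = nn.Linear(latent_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():            # representations stay fixed
                z = encoder(images)          # (batch, K, D) object latents
            logits = head(z.mean(dim=1))     # pool over the K object slots
            loss = ce(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```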
Implications and Future Directions
GENESIS marks a noteworthy contribution to generative modeling in AI by merging object-centric perception with relational reasoning. It opens avenues for research into more complex scene understanding tasks, such as those encountered in real-world robotics or reinforcement learning scenarios. Future work could explore scaling GENESIS to higher resolution images and more intricate scenes while minimizing computational demands.
Ultimately, GENESIS sets a foundation for further exploration into generative models that actively incorporate the complex interplay of scene components, pushing the boundaries of scene understanding toward more human-like visual perception and reasoning.