BlockGAN: Learning 3D Object-aware Scene Representations from Unlabelled Images (2002.08988v4)

Published 20 Feb 2020 in cs.CV

Abstract: We present BlockGAN, an image generative model that learns object-aware 3D scene representations directly from unlabelled 2D images. Current work on scene representation learning either ignores scene background or treats the whole scene as one object. Meanwhile, work that considers scene compositionality treats scene objects only as image patches or 2D layers with alpha maps. Inspired by the computer graphics pipeline, we design BlockGAN to learn to first generate 3D features of background and foreground objects, then combine them into 3D features for the whole scene, and finally render them into realistic images. This allows BlockGAN to reason over occlusion and interaction between objects' appearance, such as shadow and lighting, and provides control over each object's 3D pose and identity, while maintaining image realism. BlockGAN is trained end-to-end, using only unlabelled single images, without the need for 3D geometry, pose labels, object masks, or multiple views of the same scene. Our experiments show that using explicit 3D features to represent objects allows BlockGAN to learn disentangled representations both in terms of objects (foreground and background) and their properties (pose and identity).

Citations (217)

Summary

  • The paper introduces a novel unsupervised GAN that learns 3D object features and compositional scene representations from unlabelled images.
  • It employs a graphics-inspired pipeline with element-wise transformations and perspective projections to handle occlusion and object interactions.
  • Experimental results on synthetic and natural datasets validate its competitive KID scores and flexible object manipulation capabilities.

Analysis of BlockGAN: Learning 3D Object-aware Scene Representations from Unlabelled Images

The paper "BlockGAN: Learning 3D Object-aware Scene Representations from Unlabelled Images" introduces a novel approach to generating 3D-consistent images using a GAN-based architecture that learns object-aware scene compositions directly from unlabelled 2D images. This paper addresses key limitations in existing generative models related to object compositionality and scene representation, proposing BlockGAN as a model that operates analogously to processes in computer graphics.

Using a structure inspired by traditional graphics rendering pipelines, BlockGAN learns to create 3D representations of both foreground and background objects. It then combines these features into a composite scene representation before rendering it as a realistic 2D image. This design enables reasoning about occlusion and about appearance interactions between objects (such as shadows and lighting), and it allows manipulation of object-specific 3D properties such as pose and identity.
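
The following minimal PyTorch-style sketch illustrates this pipeline under assumptions of our own: per-object noise vectors are mapped to coarse 3D feature grids, posed and composed in 3D via an element-wise maximum, then collapsed along depth and decoded into an image. All layer sizes, the depth-collapse "projection", and the class name BlockGANSketch are hypothetical placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockGANSketch(nn.Module):
    """Sketch of an object-aware generator: noise -> per-object 3D feature grids
    -> posed and composed 3D scene features -> depth collapse -> 2D image."""

    def __init__(self, z_dim=128, feat_ch=64, grid=16, img_ch=3):
        super().__init__()
        self.feat_ch, self.grid = feat_ch, grid
        # Hypothetical mapping from a noise vector to a coarse 3D feature grid.
        self.to_3d = nn.Linear(z_dim, feat_ch * grid ** 3)
        # Hypothetical 2D decoder applied after collapsing the depth axis.
        self.to_2d = nn.Sequential(
            nn.Conv2d(feat_ch * grid, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, img_ch, 3, padding=1), nn.Tanh(),
        )

    def object_features(self, z, theta):
        # z: (B, z_dim); theta: (B, 3, 4) affine pose for this object.
        f = self.to_3d(z).view(-1, self.feat_ch, self.grid, self.grid, self.grid)
        # Resample the object's feature grid into scene coordinates under its pose.
        grid = F.affine_grid(theta, list(f.shape), align_corners=False)
        return F.grid_sample(f, grid, align_corners=False)

    def forward(self, z_bg, theta_bg, z_fgs, theta_fgs):
        feats = [self.object_features(z_bg, theta_bg)]
        feats += [self.object_features(z, t) for z, t in zip(z_fgs, theta_fgs)]
        # Element-wise maximum composes background and foreground features in 3D.
        scene = torch.stack(feats, 0).max(0).values
        b, c, d, h, w = scene.shape
        # Fold depth into channels as a crude stand-in for the learned camera projection.
        return self.to_2d(scene.reshape(b, c * d, h, w))


# Toy usage: one background and one foreground object, identity poses, 2 samples.
G = BlockGANSketch()
eye = torch.eye(3, 4).unsqueeze(0).repeat(2, 1, 1)
img = G(torch.randn(2, 128), eye, [torch.randn(2, 128)], [eye])  # (2, 3, 16, 16)
```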

Technical Contributions and Methodology

BlockGAN's architecture introduces several improvements over contemporary approaches:

  1. 3D Object Feature Learning: BlockGAN diverges from the prevalent use of 2D composites in other models by learning 3D object features from noise vectors. It generates deep 3D feature grids for background and foreground elements, which are then subjected to pose-specific transformations (a sketch of building such a pose transform follows this list).
  2. Scene Composition and Rendering: Unlike methods that combine image patches, BlockGAN composes deep 3D scene features with element-wise operations or MLPs. These features are then rendered through a projective mechanism akin to perspective projection in graphics, enabling the model to handle complex spatial correlations and foreshortening robustly.
  3. End-to-End Training on Unlabelled Data: The unsupervised training approach, using only single 2D images, yields a framework that disentangles object-level representations without requiring 3D geometry, pose labels, or multi-view images, underscoring its applicability across diverse datasets.
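
As a sketch of the pose-specific transformation in item 1, the helper below builds a 3x4 affine matrix that rotates an object's feature grid about the vertical axis, scales it, and translates it within the scene volume, in the form F.affine_grid expects. The parameterisation (and whether the matrix acts as a forward or inverse mapping under grid_sample semantics) is an illustrative assumption, not the paper's exact formulation.

```python
import math
import torch

def pose_theta(azimuth_deg, scale, tx, ty, tz):
    """Build a (1, 3, 4) affine matrix for F.affine_grid.
    Translations are in the normalised [-1, 1] coordinates used by grid_sample;
    the convention here is illustrative only."""
    a = math.radians(azimuth_deg)
    rot = torch.tensor([
        [ math.cos(a), 0.0, math.sin(a)],
        [ 0.0,         1.0, 0.0        ],
        [-math.sin(a), 0.0, math.cos(a)],
    ])
    t = torch.tensor([[tx], [ty], [tz]])
    return torch.cat([rot * scale, t], dim=1).unsqueeze(0)


# Example: rotate a foreground object by 30 degrees and shift it to the right.
theta = pose_theta(azimuth_deg=30.0, scale=1.0, tx=0.3, ty=0.0, tz=0.0)
```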

Results and Implications

The empirical results validate BlockGAN's capacity to disentangle and manipulate 3D object features convincingly. Extensive experimentation on synthetic datasets (like Synth-Car and CLEVR) and natural image datasets (e.g., Real-Car) shows that BlockGAN generates images with KID scores comparable to or better than state-of-the-art models such as LR-GAN and HoloGAN. The model is also notably flexible, supporting the addition or removal of objects and fine-grained control over individual object attributes, including configurations not seen during training—critical features for applications in content generation and augmented reality.
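
For readers wanting to reproduce a KID-style comparison, the snippet below computes Kernel Inception Distance with the torchmetrics library; this is a generic evaluation sketch, not the authors' evaluation code, and the tensors shown are random placeholders for real and generated image batches (torch-fidelity must also be installed).

```python
import torch
from torchmetrics.image.kid import KernelInceptionDistance

# subset_size is a hyperparameter of the KID estimator; 50 is an illustrative choice.
kid = KernelInceptionDistance(subset_size=50)

real_imgs = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)  # stand-in for dataset images
fake_imgs = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)  # stand-in for generator output

kid.update(real_imgs, real=True)
kid.update(fake_imgs, real=False)
kid_mean, kid_std = kid.compute()
print(f"KID: {kid_mean:.4f} ± {kid_std:.4f}")
```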

Moreover, the authors demonstrate BlockGAN's performance on tasks like geometric modifications and multi-object scene construction beyond training limits. These capabilities suggest potential for further exploration into more nuanced learning of object interactions and compositions, possibly integrating neural relational models for enhanced inter-object dynamics.

Conclusion and Future Directions

This paper contributes significantly to unsupervised image generation by introducing a GAN framework that learns and leverages 3D feature representations without explicit geometric supervision. It opens avenues for future research in complex scene synthesis and object-specific manipulation. Advances such as learning dynamic distributions of object counts and categories, or scaling to high-resolution scenes in real time, could further improve BlockGAN's applicability and efficiency. Combining BlockGAN with models like BiGAN or ALI could also meaningfully advance scene understanding and reasoning in computer vision, helping to overcome current limitations in disentangled representation learning.