SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

Published 19 Mar 2024 in cs.CV | (2403.13064v1)

Abstract: We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by recent successes in transformers & LLMs, and departs from more traditional methods which commonly describe scenes as meshes, voxel grids, point clouds or radiance fields. Our method infers the set of structured language commands directly from encoded visual data using a scene language encoder-decoder architecture. To train SceneScript, we generate and release a large-scale synthetic dataset called Aria Synthetic Environments consisting of 100k high-quality indoor scenes, with photorealistic and ground-truth annotated renders of egocentric scene walkthroughs. Our method gives state-of-the-art results in architectural layout estimation, and competitive results in 3D object detection. Lastly, we explore an advantage for SceneScript, which is the ability to readily adapt to new commands via simple additions to the structured language, which we illustrate for tasks such as coarse 3D object part reconstruction.



Summary

The paper introduces SceneScript, a method that generates 3D scene models by autoregressively predicting structured language commands from visual inputs.
It employs a novel transformer-based encoder-decoder architecture with multi-modal encoders, achieving state-of-the-art architectural layout estimation.
The study leverages the extensive ASE synthetic dataset to demonstrate strong generalization to real scenes and easy extensibility to new scene commands.

SceneScript: Autoregressive Prediction of Structured Scene Commands from Visual Data

Introduction

This paper introduces SceneScript, a novel approach focused on producing full 3D scene models directly from visual data inputs, articulated as a sequence of structured language commands. This method is inspired by advancements in transformers and LLMs, diverging from traditional scene representation methods like meshes or point clouds. Instead, SceneScript utilizes a scene language encoder-decoder architecture to infer these structured language commands, providing a crisp, compact, and semantically rich scene representation.

SceneScript Overview

SceneScript represents scenes using structured language commands rather than traditional geometric representations. This approach offers several benefits, including memory efficiency, interpretability, and flexibility in adapting to new scene entities or tasks by extending the command set. The core of SceneScript is an autoregressive model that predicts these commands based on encoded visual data inputs.
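The autoregressive prediction described above can be illustrated with a minimal greedy decoding loop. This is a sketch under stated assumptions, not the authors' implementation: the special tokens, the `greedy_decode` function, and the `toy_next_token` stand-in (which replays a fixed script instead of running a trained decoder) are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical special tokens; the paper's actual vocabulary is not given here.
START, STOP = "<start>", "<stop>"


def greedy_decode(
    next_token: Callable[[Sequence[float], List[str]], str],
    scene_latent: Sequence[float],
    max_len: int = 64,
) -> List[str]:
    """Emit tokens one at a time, each conditioned on the encoded scene
    and on every token produced so far."""
    tokens = [START]
    for _ in range(max_len):
        tok = next_token(scene_latent, tokens)
        if tok == STOP:
            break
        tokens.append(tok)
    return tokens[1:]  # drop the start token


def toy_next_token(scene_latent: Sequence[float], prefix: List[str]) -> str:
    """Stand-in for a trained decoder: replays a fixed script, then stops.
    A real model would score the whole vocabulary given latent and prefix."""
    script = ["make_wall", "0", "0", "80", "0", "make_door", "40", "18", STOP]
    return script[len(prefix) - 1]


if __name__ == "__main__":
    print(greedy_decode(toy_next_token, scene_latent=[0.0]))
```

The loop only needs a function that maps (latent, prefix) to the next token, so the same structure would apply regardless of which encoder produced the latent.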

Aria Synthetic Environments Dataset

To facilitate the training and evaluation of SceneScript, the authors introduce the Aria Synthetic Environments (ASE) dataset. ASE is a synthetically generated dataset comprising 100,000 high-quality indoor scenes, each with ground truth expressed as structured language commands. The dataset is central to the method: SceneScript is trained solely on this synthetic data, yet is reported to generalize to diverse real scenes.
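To make the idea of ground truth "in the format of structured language commands" concrete, the sketch below parses a small command listing into Python objects. The line format (`name, key=value, ...`), the command names, and the parameter names are assumptions chosen for illustration; they are not taken from the summary above and may differ from the released ASE files.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Command:
    name: str
    params: Dict[str, float]


def parse_line(line: str) -> Command:
    """Parse one 'name, key=value, key=value' line (assumed format)."""
    name, *fields = [part.strip() for part in line.split(",")]
    params: Dict[str, float] = {}
    for field in fields:
        key, value = field.split("=")
        params[key.strip()] = float(value)
    return Command(name, params)


def parse_scene(text: str) -> List[Command]:
    """Parse a whole scene description, skipping blank lines."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]


# Hypothetical two-command scene: one wall and a door placed along it.
EXAMPLE = """
make_wall, a_x=0.0, a_y=0.0, b_x=4.0, b_y=0.0, height=2.7
make_door, position_x=2.0, width=0.9, height=2.0
"""

if __name__ == "__main__":
    for cmd in parse_scene(EXAMPLE):
        print(cmd.name, cmd.params)
```

A whole scene in this form is a few lines of text, which reflects the memory efficiency and interpretability that the summary attributes to the representation.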

Technical Approach

To encode visual data into a latent representation, SceneScript offers three encoder variants, which take as input point clouds, posed image sets, or a combination of both. Regardless of the input modality, a transformer-based decoder then decodes the encoded scene representation into a sequence of structured language commands. Because the output is a command language rather than a fixed geometric format, extending the method amounts to extending the language. For instance, adding a door command with a parameter for the opening angle shows how SceneScript can adapt to new scene entities or tasks.
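The extensibility argument can be sketched as a command schema in which each command is a name plus an ordered list of parameters, so that adding an opening angle to doors is a one-line change. Everything here is an illustrative assumption: the `SCHEMA` table, the parameter names, the per-parameter bin sizes used to turn continuous values into integer tokens, and the `open_degree` parameter are hypothetical and are not the paper's actual vocabulary or tokenization.

```python
from typing import Dict, List, Tuple

# Hypothetical schema: command name -> ordered (parameter, bin size) pairs.
# Bin sizes are assumed: 0.05 m for lengths, 1 degree for angles.
SCHEMA: Dict[str, List[Tuple[str, float]]] = {
    "make_wall": [("a_x", 0.05), ("a_y", 0.05), ("b_x", 0.05),
                  ("b_y", 0.05), ("height", 0.05)],
    "make_door": [("position_x", 0.05), ("width", 0.05), ("height", 0.05)],
}


def encode(name: str, params: Dict[str, float]) -> List[str]:
    """Turn one command into tokens: its name, then one quantized
    integer token per parameter in schema order."""
    return [name] + [str(round(params[p] / step)) for p, step in SCHEMA[name]]


def decode(tokens: List[str]) -> Tuple[str, Dict[str, float]]:
    """Invert encode(), up to quantization error."""
    name, *bins = tokens
    values = {p: int(b) * step for (p, step), b in zip(SCHEMA[name], bins)}
    return name, values


# Extending the language: doors gain an opening angle (hypothetical name).
SCHEMA["make_door"].append(("open_degree", 1.0))

if __name__ == "__main__":
    door = {"position_x": 2.0, "width": 0.9, "height": 2.0, "open_degree": 30.0}
    tokens = encode("make_door", door)
    print(tokens)
    print(decode(tokens))
```

In this sketch the encoder and decoder functions are untouched by the extension; only the schema grows, which is the property the paragraph above describes. A trained model would still need training data that contains the new parameter.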

Results and Evaluation

SceneScript shows state-of-the-art performance in architectural layout estimation and competitive results in 3D object detection on the ASE dataset. The method's generalization capability is highlighted by its performance on real scenes. Significantly, SceneScript's design facilitates easy adaptation to new tasks or commands, demonstrated by extending the structured language to include object parts or different geometric entities, like curved walls or composite entities.

Implications and Future Directions

The success of SceneScript in employing an autoregressive model for scene representation opens several avenues for future research. The method's adaptability suggests potential for integrating SceneScript with other AI systems or in applications requiring dynamic scene understanding and modification. Moreover, the introduction of ASE contributes a substantial dataset for advancing machine learning in scene understanding, promising further innovations in the field.

Conclusions

SceneScript introduces a paradigm shift in 3D scene representation, leveraging the expressive power of structured language commands for a flexible and semantically rich description of indoor environments. The extensive ASE dataset further enriches research resources, fostering advancements in scene understanding and representation. The potential for future development and integration with emerging AI technologies positions SceneScript as a foundational method in the field of generative AI for 3D modeling.
