- The paper introduces SceneScript, a method that generates 3D scene models via autoregressive prediction of structured language commands from visual inputs.
- It employs a novel transformer-based encoder-decoder architecture with multi-modal encoders, achieving state-of-the-art architectural layout estimation.
- The model is trained on the large synthetic ASE dataset, generalizes well to real scenes, and is easy to extend with new scene commands.
SceneScript: Autoregressive Prediction of Structured Scene Commands from Visual Data
Introduction
This paper introduces SceneScript, an approach that produces full 3D scene models directly from visual input, expressed as a sequence of structured language commands. The method is inspired by advances in transformers and large language models (LLMs) and departs from traditional scene representations such as meshes or point clouds. Instead, SceneScript uses a scene language encoder-decoder architecture to infer the commands, yielding a compact and semantically rich scene representation.
SceneScript Overview
SceneScript represents scenes using structured language commands rather than traditional geometric representations. This approach offers several benefits, including memory efficiency, interpretability, and flexibility in adapting to new scene entities or tasks by extending the command set. The core of SceneScript is an autoregressive model that predicts these commands based on encoded visual data inputs.
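To make the idea of a scene as a command sequence concrete, the following is a minimal sketch rather than the authors' implementation. The command names (`make_wall`, `make_door`), the parameter names, the `<start>`/`<stop>` markers, and the quantization range and bin count are all illustrative assumptions. It shows how a list of commands could be rendered as readable text and flattened into the kind of discrete token sequence an autoregressive model predicts one token at a time.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Token = Union[str, int]


@dataclass
class Command:
    """One structured scene command: a name plus named numeric parameters."""
    name: str
    params: Dict[str, float] = field(default_factory=dict)

    def to_text(self) -> str:
        """Render the command as one human-readable line."""
        args = ", ".join(f"{key}={value}" for key, value in self.params.items())
        return f"{self.name}: {args}"


def quantize(value: float, lo: float = 0.0, hi: float = 20.0, bins: int = 400) -> int:
    """Map a value in [lo, hi] to an integer bin index (assumed scheme)."""
    clipped = min(max(value, lo), hi)
    index = int((clipped - lo) * bins / (hi - lo))
    return min(index, bins - 1)  # keep the upper bound inside the last bin


def to_tokens(scene: List[Command]) -> List[Token]:
    """Flatten a scene into one token sequence for next-token prediction."""
    tokens: List[Token] = ["<start>"]
    for command in scene:
        tokens.append(command.name)
        tokens.extend(quantize(v) for v in command.params.values())
    tokens.append("<stop>")
    return tokens


# Hypothetical two-command scene; coordinates are in meters.
scene = [
    Command("make_wall", {"a_x": 0.0, "a_y": 0.0, "b_x": 4.0, "b_y": 0.0, "height": 2.5}),
    Command("make_door", {"pos_x": 1.0, "pos_y": 0.0, "width": 1.0, "height": 2.0}),
]

for command in scene:
    print(command.to_text())
print(to_tokens(scene))
```

The whole scene here is a handful of tokens, which illustrates the memory-efficiency and interpretability points above: each token is either a command name or a quantized parameter, so the sequence can be read back directly.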
Aria Synthetic Environments Dataset
To facilitate the training and evaluation of SceneScript, the authors introduce the Aria Synthetic Environments (ASE) dataset. ASE is a synthetically generated dataset comprising 100,000 high-quality indoor scenes, each with ground truth in the form of structured language commands. The dataset is pivotal for training SceneScript: the model generalizes to diverse real scenes despite being trained solely on synthetic data.
Technical Approach
To encode visual data into a latent representation, SceneScript provides three encoder variants, which take as input point clouds, posed image sets, or a combination of both. Regardless of the input modality, a transformer-based decoder then decodes the encoded scene representation into a sequence of structured language commands. Because the scene representation is itself a set of structured language commands, it is straightforward to extend. For instance, adding a door command with a parameter for the opening angle shows how SceneScript can adapt to new scene entities or tasks.
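The decode-and-extend pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the schema table, `register_command`, the parameter names, and the `open_angle` field are hypothetical, and `make_replay_model` is a stub that replays a fixed token list in place of the trained transformer decoder conditioned on the encoder output. The sketch shows a greedy autoregressive loop that stops at a `<stop>` token, and a parser that groups the flat tokens into commands using a schema that a new entity can join with one registration call.

```python
from typing import Callable, Dict, List, Sequence, Union

Token = Union[str, int]

# Hypothetical schema: command name -> ordered parameter names.
SCHEMA: Dict[str, List[str]] = {
    "make_wall": ["a_x", "a_y", "b_x", "b_y", "height"],
}


def register_command(name: str, param_names: Sequence[str]) -> None:
    """Extend the scene language with a new command type."""
    SCHEMA[name] = list(param_names)


def greedy_decode(next_token: Callable[[Sequence[Token]], Token],
                  max_len: int = 64) -> List[Token]:
    """Generate tokens one at a time until <stop> or max_len is reached."""
    tokens: List[Token] = ["<start>"]
    while len(tokens) < max_len:
        token = next_token(tokens)  # the model sees everything decoded so far
        tokens.append(token)
        if token == "<stop>":
            break
    return tokens


def parse(tokens: Sequence[Token]) -> List[dict]:
    """Group a flat token sequence into commands using SCHEMA."""
    commands = []
    i = 1  # skip <start>
    while i < len(tokens) and tokens[i] != "<stop>":
        name = tokens[i]
        param_names = SCHEMA[name]
        values = tokens[i + 1 : i + 1 + len(param_names)]
        commands.append({"name": name, **dict(zip(param_names, values))})
        i += 1 + len(param_names)
    return commands


def make_replay_model(script: Sequence[Token]) -> Callable[[Sequence[Token]], Token]:
    """Stub standing in for a trained decoder: replays a fixed script."""
    def next_token(prefix: Sequence[Token]) -> Token:
        return script[len(prefix) - 1]
    return next_token


# Adding doors with an opening angle only requires a new schema entry.
register_command("make_door", ["pos_x", "pos_y", "width", "height", "open_angle"])

script = ["make_wall", 0, 0, 80, 0, 50,
          "make_door", 20, 0, 20, 40, 90,
          "<stop>"]
decoded = greedy_decode(make_replay_model(script))
commands = parse(decoded)
for command in commands:
    print(command)
```

In this sketch the decoding loop and the parser never change when a command is added; only the schema grows. A real model would also need its vocabulary and training data to cover the new command.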
Results and Evaluation
SceneScript achieves state-of-the-art performance in architectural layout estimation and competitive results in 3D object detection on the ASE dataset. Its performance on real scenes highlights the method's generalization capability. The design also makes it easy to adapt to new tasks or commands, as demonstrated by extending the structured language to cover object parts and other geometric entities such as curved walls or composite entities.
Implications and Future Directions
The success of SceneScript in employing an autoregressive model for scene representation opens several avenues for future research. The method's adaptability suggests potential for integrating SceneScript with other AI systems or in applications requiring dynamic scene understanding and modification. Moreover, the introduction of ASE contributes a substantial dataset for advancing machine learning in scene understanding, promising further innovations in the field.
Conclusions
SceneScript introduces a paradigm shift in 3D scene representation, leveraging the expressive power of structured language commands for a flexible and semantically rich description of indoor environments. The extensive ASE dataset further enriches research resources, fostering advancements in scene understanding and representation. The potential for future development and integration with emerging AI technologies positions SceneScript as a foundational method in the field of generative AI for 3D modeling.