
Holodeck: Language Guided Generation of 3D Embodied AI Environments (2312.09067v2)

Published 14 Dec 2023 in cs.CV, cs.AI, cs.CL, and cs.RO

Abstract: 3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation, we present Holodeck, a system that generates 3D environments to match a user-supplied prompt fully automatedly. Holodeck can generate diverse scenes, e.g., arcades, spas, and museums, adjust the designs for styles, and can capture the semantics of complex queries such as "apartment for a researcher with a cat" and "office of a professor who is a fan of Star Wars". Holodeck leverages a LLM (i.e., GPT-4) for common sense knowledge about what the scene might look like and uses a large collection of 3D assets from Objaverse to populate the scene with diverse objects. To address the challenge of positioning objects correctly, we prompt GPT-4 to generate spatial relational constraints between objects and then optimize the layout to satisfy those constraints. Our large-scale human evaluation shows that annotators prefer Holodeck over manually designed procedural baselines in residential scenes and that Holodeck can produce high-quality outputs for diverse scene types. We also demonstrate an exciting application of Holodeck in Embodied AI, training agents to navigate in novel scenes like music rooms and daycares without human-constructed data, which is a significant step forward in developing general-purpose embodied agents.

Authors (14)
  1. Yue Yang (146 papers)
  2. Fan-Yun Sun (18 papers)
  3. Luca Weihs (46 papers)
  4. Eli VanderBilt (10 papers)
  5. Alvaro Herrasti (11 papers)
  6. Winson Han (11 papers)
  7. Jiajun Wu (249 papers)
  8. Nick Haber (48 papers)
  9. Ranjay Krishna (116 papers)
  10. Lingjie Liu (79 papers)
  11. Chris Callison-Burch (102 papers)
  12. Mark Yatskar (38 papers)
  13. Aniruddha Kembhavi (79 papers)
  14. Christopher Clark (27 papers)
Citations (42)

Summary

Holodeck (Yang et al., 2023) is a system designed to automatically generate diverse, customized, and interactive 3D environments for Embodied AI research from natural language descriptions. The creation of realistic and varied environments has traditionally been a significant bottleneck, requiring extensive manual effort or being limited by procedural generation rules and existing datasets. Holodeck addresses this by leveraging the capabilities of LLMs, specifically GPT-4, and a large collection of 3D assets from Objaverse.

The system constructs environments through a modular pipeline guided by the LLM:

  1. Floor & Wall Module: Takes the text prompt (e.g., "a 1b1b apartment of a researcher who has a cat") and uses the LLM to propose a floor plan, including room types, dimensions (as coordinates), and suggested floor and wall materials. Material selection is done by matching LLM descriptions to a library of materials using CLIP-based similarity, incorporating color choices as well. This module handles diverse layouts and styles.
  2. Doorway & Window Module: The LLM suggests connections between rooms (doorways, open passages) and placement of windows. It proposes door styles and window types (from a predefined set of 40 doors and 21 windows) along with parameters like size, quantity, and height, which are then implemented in the generated walls.
  3. Object Selection Module: Based on the room types and overall prompt, the LLM proposes objects to populate the scene, providing descriptions and estimated dimensions. Holodeck retrieves suitable 3D assets from a curated library of over 51,000 assets (from Objaverse 1.0 and ProcTHOR), which have been automatically annotated with details like category, dimensions, and typical placement locations using GPT-4-Vision. The retrieval process combines visual similarity (CLIP), textual similarity (Sentence-BERT), and size discrepancy metrics.
  4. Constraint-based Layout Design Module: Instead of having the LLM directly output absolute coordinates, which can lead to physical implausibility (collisions, out-of-bounds placements), Holodeck prompts the LLM to generate spatial relational constraints between objects (e.g., "coffee table, in front of, sofa", "near"). These constraints are categorized into Global, Distance, Position, Alignment, and Rotation types. An optimization algorithm (either a custom Depth-First-Search solver or a Mixed Integer Linear Programming solver) is used to find object placements that satisfy these soft relational constraints while strictly enforcing hard constraints like no object collisions and keeping objects within room boundaries. This approach allows for the generation of multiple plausible layouts for the same set of objects.
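The object-retrieval step above combines three signals into one ranking. A minimal sketch of such a combined score is shown below; the weights, the log-ratio form of the size penalty, and the function names are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def retrieval_score(visual_sim, text_sim, candidate_dims, target_dims,
                    w_vis=1.0, w_txt=1.0, w_size=1.0):
    """Combine visual similarity (e.g. CLIP), textual similarity
    (e.g. Sentence-BERT), and a size-discrepancy penalty into one score.
    The penalty is the mean absolute log-ratio of candidate vs. target
    dimensions, so a perfectly sized asset incurs zero penalty."""
    size_penalty = np.abs(np.log(np.asarray(candidate_dims, dtype=float) /
                                 np.asarray(target_dims, dtype=float))).mean()
    return w_vis * visual_sim + w_txt * text_sim - w_size * size_penalty

def pick_asset(candidates, target_dims):
    """candidates: list of (asset_id, visual_sim, text_sim, dims);
    returns the id of the highest-scoring asset."""
    return max(candidates,
               key=lambda c: retrieval_score(c[1], c[2], c[3], target_dims))[0]
```

An oversized asset with slightly better similarity scores can thus lose to a correctly sized one, which matches the intent of including a size-discrepancy metric in retrieval.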
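The constraint-based layout step can be sketched as a depth-first search over a discretised grid. The toy solver below, written under simplifying assumptions, enforces the hard constraints (objects in bounds, no collisions); for brevity it also treats "near" constraints as hard and returns the first satisfying layout, whereas the paper's solvers score soft relational constraints and optimise how many are satisfied:

```python
from itertools import product

def solve_layout(room, objects, near_pairs, cell=0.5):
    """room: (width, depth) in metres; objects: {name: (w, d)} footprints;
    near_pairs: [(a, b, max_dist)] centre-to-centre distance constraints.
    Returns {name: (x, y, w, d)} or None if no placement is found."""
    names = list(objects)

    def overlaps(p, q):
        (x1, y1, w1, d1), (x2, y2, w2, d2) = p, q
        return x1 < x2 + w2 and x2 < x1 + w1 and y1 < y2 + d2 and y2 < y1 + d1

    def near_ok(placed):
        for a, b, dmax in near_pairs:
            if a in placed and b in placed:
                (xa, ya, wa, da), (xb, yb, wb, db) = placed[a], placed[b]
                dx = (xa + wa / 2) - (xb + wb / 2)
                dy = (ya + da / 2) - (yb + db / 2)
                if (dx * dx + dy * dy) ** 0.5 > dmax:
                    return False
        return True

    def dfs(i, placed):
        if i == len(names):
            return dict(placed)
        name, (w, d) = names[i], objects[names[i]]
        xs = [k * cell for k in range(int((room[0] - w) / cell) + 1)]
        ys = [k * cell for k in range(int((room[1] - d) / cell) + 1)]
        for x, y in product(xs, ys):  # try each grid cell in turn
            cand = (x, y, w, d)
            if any(overlaps(cand, p) for p in placed.values()):
                continue  # hard constraint: no collisions
            placed[name] = cand
            if near_ok(placed):  # prune branches violating "near"
                result = dfs(i + 1, placed)
                if result:
                    return result
            del placed[name]
        return None  # backtrack: no valid cell for this object

    return dfs(0, {})
```

Because backtracking explores alternative placements, running the solver with a different object order or grid resolution yields different valid layouts for the same object set, mirroring Holodeck's ability to produce multiple plausible layouts.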

Integrating Objaverse assets into the AI2-THOR simulation environment requires an optimization pipeline to reduce mesh counts, generate visibility points and colliders, and handle asset loading and unloading efficiently at runtime for scalability.

Holodeck was evaluated through extensive human studies and its application to an Embodied AI task.

  • Human Evaluation: A comparative study with ProcTHOR on residential scenes showed that humans significantly preferred Holodeck's scenes in terms of asset selection, layout coherence, and overall quality (64.4% overall preference for Holodeck). CLIP scores also indicated that Holodeck scenes were more visually coherent with the scene type prompt, approaching the quality of manually designed iTHOR scenes. Evaluation across 52 diverse scene types from the MIT Scenes dataset demonstrated Holodeck's ability to generate satisfactory outputs for a wide range of indoor environments, outperforming ProcTHOR's residential scenes on over half of the tested types. A layout ablation study confirmed that the constraint-based approach generated layouts preferred by humans over baselines such as direct absolute placement by the LLM or random/edge placements, validating the spatial constraint method.
  • Embodied AI Application: Holodeck's utility was shown in training an ObjectNav agent for zero-shot navigation in novel environments. A standard ObjectNav model pretrained on ProcTHOR was finetuned on scenes automatically generated by Holodeck, prompted only by the novel scene type (e.g., "Music Room", "Daycare"). Evaluating this agent on a new human-designed benchmark, NoveltyTHOR (featuring diverse scene types and novel objects), demonstrated that finetuning on Holodeck-generated scenes significantly improved the agent's performance (Success and SPL) compared to agents trained only on ProcTHOR or a version enhanced with Holodeck's object selection but not its layout. This highlights Holodeck's ability to synthesize relevant training data for improved agent generalization in novel settings.
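The Success and SPL numbers cited above follow the standard ObjectNav evaluation protocol, where SPL (Success weighted by Path Length) discounts successful episodes by how much longer the agent's path was than the shortest path. A minimal implementation of that standard formula:

```python
def spl(episodes):
    """Success weighted by Path Length:
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the
    episode's success flag, l_i the shortest-path length to the goal,
    and p_i the length of the path the agent actually took."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)
```

An agent that succeeds but wanders (p_i much larger than l_i) scores well on Success yet poorly on SPL, which is why improvements on both metrics indicate genuinely better navigation rather than lucky exploration.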

Limitations include potential struggles with scenes requiring highly complex or unusual layouts and reliance on the availability of specific unique assets in the Objaverse library. Cultural biases from the LLM or asset distribution can also appear, though prompting adjustments may help mitigate this.

In summary, Holodeck provides a practical, language-guided approach to generating diverse, interactive 3D environments for Embodied AI, validated by human preference and demonstrating improved agent performance in novel navigation tasks. Future work aims to expand the asset library and explore broader applications.
