
SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code (2403.01248v1)

Published 2 Mar 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: This paper introduces SceneCraft, a LLM Agent converting text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. SceneCraft then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models like GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with generated scenes as intermediary control signal.


Summary

  • The paper presents SceneCraft, a novel LLM agent that converts text descriptions into Blender-executable Python scripts for complex 3D scene rendering.
  • It integrates spatial planning with a scene graph blueprint and library learning, using vision-language models to iteratively refine asset layouts.
  • Evaluation shows 45.1% and 40.9% improvements in CLIP score over BlenderGPT on synthetic and real-world scenes, respectively, highlighting its potential in game development, VR, and cinematic production.

SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code

The paper presents SceneCraft, an LLM agent that converts text descriptions into Blender-executable Python scripts capable of rendering complex scenes with up to a hundred 3D assets. Producing such scenes demands sophisticated spatial planning and arrangement, which SceneCraft achieves through a combination of advanced abstraction, strategic planning, and library learning.

SceneCraft's methodology first models the scene as a graph blueprint that defines the spatial relationships among the assets. It then writes Python scripts that translate these relationships into the numerical constraints governing asset layout. Next, it uses vision-language foundation models such as GPT-V to analyze rendered images and iteratively refine the scene. On top of this loop, SceneCraft maintains a library-learning mechanism that compiles commonly used script functions into a reusable library, enabling continuous self-improvement without costly LLM parameter tuning.
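To make the graph-to-script step concrete, here is a minimal, self-contained sketch rather than the paper's implementation: a toy scene graph whose edges encode spatial relations is lowered to numeric placements and then emitted as a Blender-executable script. The asset sizes, relation names, and the greedy solver are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): lower a relational scene
# graph to concrete positions, then emit a Blender-executable Python script.
# Asset sizes, relation names, and the greedy solver are illustrative.

# Scene graph: nodes are assets with rough bounding-box sizes (x, y, z);
# edges are (subject, relation, object) triples.
assets = {"table": (2.0, 1.0, 0.8), "lamp": (0.3, 0.3, 0.6)}
relations = [("lamp", "on_top_of", "table")]

def solve_layout(assets, relations):
    """Translate relational constraints into (x, y, z) placements.

    A real planner would satisfy all constraints jointly; this greedy pass
    applies each relation once, which is enough to illustrate the idea.
    """
    pos = {name: [0.0, 0.0, size[2] / 2] for name, size in assets.items()}
    for subj, rel, obj in relations:
        if rel == "on_top_of":
            # Rest the subject's base on the object's top face.
            pos[subj][0], pos[subj][1] = pos[obj][0], pos[obj][1]
            pos[subj][2] = pos[obj][2] + assets[obj][2] / 2 + assets[subj][2] / 2
        elif rel == "left_of":
            pos[subj][0] = pos[obj][0] - (assets[obj][0] + assets[subj][0]) / 2
    return pos

def emit_blender_script(positions):
    """Produce a script meant to be run inside Blender (it imports bpy)."""
    lines = ["import bpy"]
    for name, (x, y, z) in positions.items():
        # Unit cubes stand in for retrieved assets in this sketch.
        lines.append(f"bpy.ops.mesh.primitive_cube_add(location=({x}, {y}, {z}))")
        lines.append(f"bpy.context.object.name = {name!r}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(emit_blender_script(solve_layout(assets, relations)))
```

In the full system, the generated script would also retrieve matching assets and set materials and cameras, and the rendered result would be passed back to the vision-language critic for another round of refinement.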

The paper's evaluation shows that SceneCraft surpasses existing LLM-based agents in rendering complex scenes with high fidelity and constraint adherence. Notably, it achieves 45.1% and 40.9% improvements in the CLIP score of generated scenes over BlenderGPT on synthetic and real-world datasets, respectively, along with a significantly higher constraint-passing score.
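For context on the metric, the snippet below is a minimal sketch of a CLIP-based text-image score of the kind such an evaluation relies on; the checkpoint and normalization details are assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of a CLIP-style score between a rendered scene and its text
# description. The checkpoint and scoring details are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, description: str) -> float:
    """Cosine similarity between image and text embeddings (higher is better)."""
    inputs = processor(text=[description], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))

# Example: score a rendered frame against the prompt that produced it.
# print(clip_score("render.png", "a cozy cabin interior with a fireplace"))
```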

The implications of this research are multifaceted. Practically, SceneCraft can revolutionize industries such as game development, virtual reality, and cinematic production by automating the conversion of text-based scene descriptions into detailed 3D environments. Theoretically, it sets a precedent for the fusion of LLMs with 3D rendering tools, providing a novel approach to scene synthesis that circumvents the limitations of data-driven 3D object generation models.

Looking forward, this framework could be extended to reconstruct 3D scenes from existing images or videos, further broadening its application scope. Moreover, its integration with video generative models, as demonstrated with the Sintel movie dataset, hints at its potential to guide dynamic visual content creation with nuanced control.

In conclusion, SceneCraft represents a substantial advancement in text-to-3D scene synthesis. Its dual-loop self-improvement architecture, emphasizing relational graph abstraction and library learning, offers a promising pathway for future developments in AI-driven 3D content generation.
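As a rough illustration of the library-learning idea highlighted above, a verified helper can be cached and replayed into later code-generation prompts instead of being regenerated. This is a sketch under the assumption that skills are stored as plain source strings; all names are illustrative, not the paper's implementation.

```python
# Sketch of a skill library: cache helpers that passed visual review and
# prepend them to future code-generation prompts. All names are illustrative.
import inspect

skill_library: dict[str, str] = {}

def add_skill(fn) -> None:
    """Record a verified helper's source code under its function name."""
    skill_library[fn.__name__] = inspect.getsource(fn)

def library_prompt() -> str:
    """Concatenate cached skills for inclusion in the next LLM call."""
    return "\n\n".join(skill_library.values())

def stack_on(base_top_z: float, obj_height: float) -> float:
    """Example reusable skill: z-coordinate that rests an object on a surface."""
    return base_top_z + obj_height / 2

add_skill(stack_on)
print(library_prompt())
```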
