
SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code

Published 2 Mar 2024 in cs.CV, cs.AI, cs.CL, and cs.LG | (arXiv:2403.01248v1)

Abstract: This paper introduces SceneCraft, an LLM agent converting text descriptions into Blender-executable Python scripts that render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. SceneCraft then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models like GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with generated scenes as an intermediary control signal.


Summary

  • The paper presents SceneCraft, a novel LLM agent that converts text descriptions into Blender-executable Python scripts for complex 3D scene rendering.
  • It integrates spatial planning with a scene graph blueprint and library learning, using vision-language models to iteratively refine asset layouts.
  • Evaluation shows 45.1% and 40.9% improvements in CLIP scores versus BlenderGPT on synthetic and real-world datasets, respectively, highlighting its potential in game development, VR, and cinematic production.

SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code

The paper presents SceneCraft, an LLM agent that converts text descriptions into Blender-executable Python scripts capable of rendering complex 3D scenes with up to a hundred assets, a task that demands careful spatial planning and arrangement. This is accomplished through the integration of advanced abstraction, strategic planning, and library learning mechanisms.

SceneCraft's methodology involves first modeling a scene graph as a blueprint that defines the spatial relationships among the assets within the scene. Following this, it writes Python scripts that translate these spatial relationships into numerical constraints crucial for asset layout. It then utilizes vision-language foundation models, such as GPT-V, to analyze rendered images and iteratively refine the scene output. Furthermore, SceneCraft employs a library learning mechanism that compiles commonly used script functions into a reusable library, aiding in continuous self-improvement without costly LLM parameter adjustments.
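The graph-to-constraint translation described above can be sketched in plain Python. Everything here (the `Asset` class, the `proximity`, `alignment`, and `layout_score` functions, the specific scoring formulas) is an illustrative assumption, not the paper's actual API: the idea is only that each edge of the scene graph becomes a numerical score the agent can maximize.

```python
"""Hypothetical sketch: turn scene-graph relations into numeric layout
constraints. All names and formulas are assumptions for illustration."""
import math

class Asset:
    def __init__(self, name, position):
        self.name = name
        self.position = position  # (x, y, z) in scene units

def proximity(a, b, max_dist=2.0):
    # 1.0 when the assets are within max_dist, decaying linearly beyond it.
    d = math.dist(a.position, b.position)
    return 1.0 if d <= max_dist else max(0.0, 1.0 - (d - max_dist))

def alignment(a, b, axis=1, tol=0.1):
    # 1.0 when the assets share a coordinate along `axis` within tolerance.
    return 1.0 if abs(a.position[axis] - b.position[axis]) <= tol else 0.0

def layout_score(constraints):
    # Mean satisfaction over all graph edges; the agent's refinement loop
    # would iterate on asset positions until this score is high.
    scores = [fn(*args) for fn, *args in constraints]
    return sum(scores) / len(scores)

table = Asset("table", (0.0, 0.0, 0.0))
vase = Asset("vase", (0.5, 0.0, 0.8))
constraints = [(proximity, table, vase), (alignment, table, vase)]
print(layout_score(constraints))  # 1.0: both relations satisfied
```

In this framing, the inner feedback loop amounts to proposing new positions, re-rendering, and re-scoring until the aggregate score (and the visual critique) are acceptable.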

The paper's evaluation reveals that SceneCraft surpasses existing LLM-based agents in rendering complex scenes with high fidelity and constraint adherence. Notably, it achieves 45.1% and 40.9% improvements in generated scenes' CLIP scores compared to BlenderGPT on synthetic and real-world datasets, respectively, along with a significantly better constraint-pass rate.

The implications of this research are multifaceted. Practically, SceneCraft can revolutionize industries such as game development, virtual reality, and cinematic production by automating the conversion of text-based scene descriptions into detailed 3D environments. Theoretically, it sets a precedent for the fusion of LLMs with 3D rendering tools, providing a novel approach to scene synthesis that circumvents the limitations of data-driven 3D object generation models.

Looking forward, this framework could be extended to reconstruct 3D scenes from existing images or videos, further broadening its application scope. Moreover, its integration with video generative models, as demonstrated with the Sintel movie dataset, hints at its potential to guide dynamic visual content creation with nuanced control.

In conclusion, SceneCraft represents a substantial advancement in text-to-3D scene synthesis. Its dual-loop self-improvement architecture, emphasizing relational graph abstraction and library learning, offers a promising pathway for future developments in AI-driven 3D content generation.

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that can be implemented now by leveraging SceneCraft’s text-to-Blender-code agent, its relational scene graphs, constraint-based layout solver, and inner-loop visual feedback with a multimodal LLM.

  • 3D scene blocking and layout from text for DCC tools
    • Sectors: film/animation, VFX, game development, advertising
    • What it enables: Rapid blockouts and first-pass layouts from natural language; auto-arrangement of up to ~100 assets with spatial constraints (alignment, proximity, parallelism).
    • Tools/products/workflows: “SceneCraft for Blender” add-on; inner-loop reviewer as a design QA stage; code export for Blender scenes.
    • Assumptions/dependencies: Access to GPT-4V-class models; curated asset libraries; GPU/CPU for iterative renders; asset licenses.
  • Previsualization and storyboarding from script snippets
    • Sectors: film/animation, episodic content, education (media production courses)
    • What it enables: Convert scene directions (“a marketplace street; lamps in front of each house; hero at center”) into camera-ready blockouts; iterate via visual critiques.
    • Tools/products/workflows: Script-to-subscene decomposition; per-shot constraint scripts; render passes for editorial.
    • Assumptions/dependencies: Script parsing quality; shot metadata (camera, lenses) when needed; predictable naming of assets.
  • Level design scaffolding and kit-bashing for games
    • Sectors: game development (level design, environment art)
    • What it enables: Generate modular layouts (streets, interiors, prop rows) from prompts; enforce gameplay constraints (e.g., cover spacing, sight lines) via scoring functions.
    • Tools/products/workflows: Blender export to Unity/Unreal; constraint-library presets for common gameplay patterns; batch generation for ideation.
    • Assumptions/dependencies: Stable FBX/GLTF pipelines; gameplay constraints encoded as SceneCraft scoring functions; engine-specific tooling.
  • Product staging and virtual merchandising
    • Sectors: e-commerce, retail marketing, real estate
    • What it enables: Auto-place products in stylized rooms (e.g., “vase on round table near window”); bulk scene variants for A/B testing.
    • Tools/products/workflows: CSV-to-scene batch generator; style prompts; CLIP-based asset retriever for SKU variants.
    • Assumptions/dependencies: High-quality product assets; brand/style constraints captured as functions; rights to textures/props.
  • Rapid interior and architectural concepting
    • Sectors: architecture, interior design, real estate
    • What it enables: Early-stage spatial studies from text; generate multiple constraint-compliant options (“desks along glass wall, lamps centered on tables”).
    • Tools/products/workflows: Constraint libraries for adjacency, circulation, daylighting proxies; Blender-BIM exporters for downstream CAD.
    • Assumptions/dependencies: Unit handling and real-world scale; BIM/CAD interoperability; domain validation (not code-compliant by default).
  • Robotics simulation environment generation
    • Sectors: robotics, autonomy, embodied AI
    • What it enables: Procedurally generate household/warehouse scenes with controllable constraints for training and domain randomization.
    • Tools/products/workflows: Export to Isaac Sim, Habitat, PyBullet; curriculum schedules via prompt programs; auto-logging of layout matrices as labels.
    • Assumptions/dependencies: Physics-calibrated assets; sim engine bindings; safety/task constraints encoded as scoring functions.
  • Synthetic data generation for computer vision
    • Sectors: software/AI, autonomous systems, retail analytics
    • What it enables: Programmatically produce labeled 2D/3D datasets (renders + layout matrices) for detection/segmentation/pose/layout models.
    • Tools/products/workflows: “Synthetic Scene Packager” that bundles renders, masks, depth, layout matrices; CLIP-driven asset retrieval to diversify data.
    • Assumptions/dependencies: Render realism and domain gap; compute for batch rendering; licensing for redistribution if needed.
  • Scene-grounded control for video generation
    • Sectors: media/entertainment, marketing, research on generative video
    • What it enables: Use SceneCraft renders as intermediate control signals to condition video models (e.g., VideoPoet) for improved structure and layout.
    • Tools/products/workflows: “Scene-to-Video Controller” that feeds rendered frames as conditioning; light finetuning on in-domain data (as shown on Sintel).
    • Assumptions/dependencies: Access to, and rights for, video models; compute for finetuning; alignment between rendered scene style and video domain.
  • Classroom kits for 3D geometry and code literacy
    • Sectors: education (K–12, higher ed, design schools)
    • What it enables: Students prompt scenes and inspect the generated Python/constraints; learn spatial reasoning and scripting by editing the library.
    • Tools/products/workflows: Prebuilt lesson plans; safe asset packs; “diff viewer” to compare function updates across iterations.
    • Assumptions/dependencies: Classroom hardware; moderated model access; simplified asset sets to reduce render time/cost.
  • Visual design QA assistant inside DCC pipelines
    • Sectors: VFX, advertising, game art
    • What it enables: Use the LLM+V-Reviewer to flag violations of composition rules or brand guidelines; propose code patches to fix layouts.
    • Tools/products/workflows: “Design QA Bot” panel in Blender; rule sets stored as constraint libraries; automated pass/fail reports.
    • Assumptions/dependencies: Reviewer model reliability; governance for auto-applied edits; cost control for repeated critiques.
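Several of the use cases above (gameplay cover spacing, brand guidelines, composition rules) hinge on the same mechanism: a design rule expressed as a scoring function the agent can evaluate against a candidate layout. A minimal sketch, assuming a made-up `cover_spacing` rule and a fractional pass/fail convention, neither of which comes from the paper:

```python
"""Hedged sketch of a domain rule ("cover objects at least min_gap apart
along a corridor") as a SceneCraft-style scoring function. The function
name and the scoring convention are assumptions."""

def cover_spacing(xs, min_gap=3.0):
    # Score the fraction of adjacent cover pairs that respect min_gap.
    xs = sorted(xs)
    pairs = list(zip(xs, xs[1:]))
    if not pairs:
        return 1.0  # nothing to violate
    ok = sum(1 for a, b in pairs if b - a >= min_gap)
    return ok / len(pairs)

# One violation: the 4.0 -> 5.5 gap is only 1.5 units wide.
print(cover_spacing([0.0, 4.0, 5.5, 9.0]))  # prints 0.6666666666666666
```

A "Design QA Bot" of the kind sketched above would bundle many such functions into a rule set, report the per-rule scores, and let the agent (or an artist) decide which violations to fix.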

Long-Term Applications

These opportunities require further research, scaling, domain integration, or validation (e.g., larger skill libraries, open-source multimodal models, better physics and photorealism, regulatory compliance).

  • End-to-end AI preproduction: multi-shot, style-consistent cinematic assembly
    • Sectors: film/animation, episodic content, advertising
    • What it could enable: Narrative-consistent scene generation across sequences; continuity-aware placement; automatic shot lists from scripts.
    • Tools/products/workflows: Cross-shot asset/state tracking; sequence-level constraint solvers; collaborative prompt timelines.
    • Assumptions/dependencies: Larger skill libraries (lighting, camera, crowd); sequence memory; IP/data governance.
  • Digital twins and urban planning with conversational scene editing
    • Sectors: public policy, AEC (architecture/engineering/construction), smart cities
    • What it could enable: Citizens and planners iterate on streetscapes via natural language; constraints encode zoning, accessibility, traffic flow.
    • Tools/products/workflows: BIM/geo integration; compliance-aware constraint functions (codes, ADA/ISO); multi-stakeholder review UIs.
    • Assumptions/dependencies: Accurate GIS/BIM data; regulatory validation; auditability and provenance of generated proposals.
  • Hospital and OR layout optimization via constraint learning
    • Sectors: healthcare, hospital operations, medical device placement
    • What it could enable: Rapid scenario generation that respects workflows (sterile corridors, equipment reachability) to test throughput and safety.
    • Tools/products/workflows: Domain-specific scoring functions; integration with discrete-event simulation; VR prototyping for clinicians.
    • Assumptions/dependencies: Clinical validation; privacy/security; standardized medical facility assets; liability considerations.
  • Safety-critical robotics training worlds with formal guarantees
    • Sectors: robotics, autonomous vehicles, industrial automation
    • What it could enable: Auto-generated curricula with verifiable constraint satisfaction; adversarial scene generation for robustness testing.
    • Tools/products/workflows: Formal verification hooks for layout constraints; coverage metrics; closed-loop sim-to-real toolchains.
    • Assumptions/dependencies: Verified physics and sensor models; standard safety specifications; certification pathways.
  • CAD/BIM-aware generative layout with code compliance checking
    • Sectors: AEC, facilities management
    • What it could enable: Natural-language-driven concepting that compiles to BIM; automatic code checks (egress, occupancy, fire codes) as constraints.
    • Tools/products/workflows: IFC/Revit connectors; unit- and tolerance-aware solvers; “constraint library marketplace” curated by domain experts.
    • Assumptions/dependencies: Rich BIM semantics; up-to-date code libraries; human-in-the-loop approvals; legal frameworks.
  • Real-time conversational AR/VR world editing
    • Sectors: consumer AR/VR, enterprise training, social platforms
    • What it could enable: Voice-driven placement and re-layout in headsets; multiplayer co-editing with constraint consistency.
    • Tools/products/workflows: On-device/lightweight multimodal models; GPU streaming; latency-aware iteration loops.
    • Assumptions/dependencies: Efficient open multimodal LLMs; device compute; safety filters for shared spaces.
  • Scene-grounded, production-scale video generation
    • Sectors: media/entertainment, sports, education
    • What it could enable: Long-form video generation where structural fidelity is governed by evolving 3D scenes (sets, crowds, choreography).
    • Tools/products/workflows: Bidirectional sync between scene states and frames; differentiable renderers for tighter control; asset rights management.
    • Assumptions/dependencies: Scalable video models; strong temporal consistency; high-fidelity rendering at scale.
  • Academic benchmarks for spatial reasoning and code-based planning
    • Sectors: academia (AI/graphics/HCI)
    • What it could enable: Standardized tasks linking language, scene graphs, and constraint satisfaction; evaluation of inner/outer-loop learning in agents.
    • Tools/products/workflows: Public synthetic/real datasets; reference libraries of relations; reproducible pipelines for Blender-based agents.
    • Assumptions/dependencies: Open models or accessible APIs; community curation and licensing of assets; compute grants.
  • Retail/warehouse micro-fulfillment layout optimization
    • Sectors: logistics, retail, manufacturing
    • What it could enable: “Describe operations” → candidate layouts respecting throughput, safety, and ergonomics; simulation-informed selection.
    • Tools/products/workflows: Constraint libraries for aisle widths, pick-face access, robot/human paths; coupling with discrete-event sims.
    • Assumptions/dependencies: Accurate demand/process data; safety regulations; integration with WMS/ERPs.
  • Governance and provenance of constraint libraries
    • Sectors: policy, standards bodies, platform governance
    • What it could enable: Auditable, versioned libraries of spatial rules (e.g., accessibility, sustainability) with traceable updates and testing suites.
    • Tools/products/workflows: Signed libraries; continuous integration for constraint tests; marketplace/review boards.
    • Assumptions/dependencies: Standards adoption; legal recognition; secure distribution and licensing.
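The governance bullet above imagines versioned, auditable constraint libraries with testing suites. One way to make that concrete is to fingerprint each constraint so updates are traceable and to require a self-test before an entry is accepted; the schema below (`name`, `version`, `sha256`, the `min_aisle_width` rule itself) is entirely an assumption, not anything proposed in the paper:

```python
"""Hedged sketch of an auditable, versioned constraint-library entry.
The schema and the example rule are illustrative assumptions."""
import hashlib

def min_aisle_width(widths, minimum=1.2):
    # Pass only if every aisle meets the minimum width (meters).
    return all(w >= minimum for w in widths)

def package(fn, version):
    # Fingerprint the compiled bytecode so any change to the rule's
    # logic produces a new, traceable hash.
    digest = hashlib.sha256(fn.__code__.co_code).hexdigest()
    return {"name": fn.__name__, "version": version,
            "sha256": digest, "fn": fn}

entry = package(min_aisle_width, "1.0.0")
# A CI-style self-test a review board could require before signing:
assert entry["fn"]([1.5, 2.0]) and not entry["fn"]([1.5, 0.9])
print(entry["name"], entry["version"], entry["sha256"][:8])
```

Signing, distribution, and review-board workflows would sit on top of such entries; the hash only gives a tamper-evident anchor for them.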

Notes on cross-cutting assumptions and dependencies

  • Model access and costs: Current pipeline depends on GPT-4V-class perception and reasoning; cost and rate limits affect scalability. Viable open multimodal alternatives would broaden access.
  • Asset quality and licensing: Output quality and legal deployability hinge on curated, licensed 3D asset repositories with consistent scale/orientation metadata.
  • Compute and performance: Iterative render-review cycles require reliable rendering infrastructure (headless Blender) and potentially GPUs for throughput.
  • Domain adaptation: Constraint functions must be extended and validated for each domain (healthcare, AEC, robotics), with expert-in-the-loop review for safety and compliance.
  • Interoperability: Robust exporters (GLTF/FBX/IFC) and engine connectors (Unity/Unreal/Isaac/Revit) are critical to integrate SceneCraft into existing pipelines.
  • Evaluation and QA: Beyond CLIP scores, domain-specific acceptance tests, human preference studies, and (where relevant) formal verification increase trust in deployment.
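The headless-rendering dependency noted above amounts to running Blender in background mode from the command line, which is how an agent's render-review loop would typically invoke it. The flags below are standard Blender CLI options; `scene.blend` and `layout.py` are placeholder names for a scene file and an agent-generated script:

```shell
# Open scene.blend without a GUI, run the generated layout script,
# then render frame 1 to ./out/ (paths relative to the .blend file).
blender --background scene.blend --python layout.py \
        --render-output //out/frame_ --render-frame 1
```

Blender processes these arguments in order, so the script runs before the render; a batch pipeline would loop this command over many generated scripts.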
