SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code
Abstract: This paper introduces SceneCraft, an LLM agent that converts text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement, which we tackle through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. It then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models such as GPT-4V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library-learning mechanism that compiles common script functions into a reusable library, enabling continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and by guiding a video generative model with generated scenes as an intermediate control signal.
- Vision-language models as a source of rewards. CoRR, abs/2312.09187, 2023. doi: 10.48550/ARXIV.2312.09187. URL https://doi.org/10.48550/arXiv.2312.09187.
- RT-2: vision-language-action models transfer web knowledge to robotic control. CoRR, abs/2307.15818, 2023. doi: 10.48550/ARXIV.2307.15818. URL https://doi.org/10.48550/arXiv.2307.15818.
- Learning spatial knowledge for text to 3d scene generation. In Moschitti, A., Pang, B., and Daelemans, W. (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 2028–2038. ACL, 2014. doi: 10.3115/V1/D14-1217. URL https://doi.org/10.3115/v1/d14-1217.
- Text to 3d scene generation with rich lexical grounding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 53–62. The Association for Computer Linguistics, 2015. doi: 10.3115/V1/P15-1006. URL https://doi.org/10.3115/v1/p15-1006.
- Sceneseer: 3d scene design with natural language. CoRR, abs/1703.00050, 2017. URL http://arxiv.org/abs/1703.00050.
- Universal self-consistency for large language model generation. CoRR, abs/2311.17311, 2023. doi: 10.48550/ARXIV.2311.17311. URL https://doi.org/10.48550/arXiv.2311.17311.
- Wordseye: an automatic text-to-scene conversion system. In Pocock, L. (ed.), Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2001, Los Angeles, California, USA, August 12-17, 2001, pp. 487–496. ACM, 2001. doi: 10.1145/383259.383316. URL https://doi.org/10.1145/383259.383316.
- Mind2web: Towards a generalist agent for the web. CoRR, abs/2306.06070, 2023. doi: 10.48550/ARXIV.2306.06070. URL https://doi.org/10.48550/arXiv.2306.06070.
- AVIS: autonomous visual information seeking with large language models. CoRR, abs/2306.08129, 2023. doi: 10.48550/ARXIV.2306.08129. URL https://doi.org/10.48550/arXiv.2306.08129.
- Videopoet: A large language model for zero-shot video generation. CoRR, abs/2312.14125, 2023. doi: 10.48550/ARXIV.2312.14125. URL https://doi.org/10.48550/arXiv.2312.14125.
- Magic3d: High-resolution text-to-3d content creation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp. 300–309. IEEE, 2023. doi: 10.1109/CVPR52729.2023.00037. URL https://doi.org/10.1109/CVPR52729.2023.00037.
- Agentbench: Evaluating llms as agents. CoRR, abs/2308.03688, 2023. doi: 10.48550/ARXIV.2308.03688. URL https://doi.org/10.48550/arXiv.2308.03688.
- Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning. arXiv preprint arXiv:2311.12631, 2023.
- Language-driven synthesis of 3d scenes from scene databases. ACM Trans. Graph., 37(6):212, 2018. doi: 10.1145/3272127.3275035. URL https://doi.org/10.1145/3272127.3275035.
- Eureka: Human-level reward design via coding large language models. CoRR, abs/2310.12931, 2023. doi: 10.48550/ARXIV.2310.12931. URL https://doi.org/10.48550/arXiv.2310.12931.
- OpenAI. Gpt-4v(ision) system card. System Card, 2023. Version 1.0.
- Advances in data-driven analysis and synthesis of 3d indoor scenes. CoRR, abs/2304.03188, 2023. doi: 10.48550/ARXIV.2304.03188. URL https://doi.org/10.48550/arXiv.2304.03188.
- Unsupervised question decomposition for question answering. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 8864–8880. Association for Computational Linguistics, 2020. doi: 10.18653/V1/2020.EMNLP-MAIN.713. URL https://doi.org/10.18653/v1/2020.emnlp-main.713.
- Dreamfusion: Text-to-3d using 2d diffusion. CoRR, abs/2209.14988, 2022. doi: 10.48550/ARXIV.2209.14988. URL https://doi.org/10.48550/arXiv.2209.14988.
- Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 8748–8763. PMLR, 2021. URL http://proceedings.mlr.press/v139/radford21a.html.
- Infinite photorealistic worlds using procedural generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp. 12630–12641. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01215. URL https://doi.org/10.1109/CVPR52729.2023.01215.
- Vision-language models are zero-shot reward models for reinforcement learning. CoRR, abs/2310.12921, 2023. doi: 10.48550/ARXIV.2310.12921. URL https://doi.org/10.48550/arXiv.2310.12921.
- Unsupervised traffic scene generation with synthetic 3d scene graphs. CoRR, abs/2303.08473, 2023. doi: 10.48550/ARXIV.2303.08473. URL https://doi.org/10.48550/arXiv.2303.08473.
- Real-time automatic 3d scene generation from natural language voice and text descriptions. In Nahrstedt, K., Turk, M. A., Rui, Y., Klas, W., and Mayer-Patel, K. (eds.), Proceedings of the 14th ACM International Conference on Multimedia, Santa Barbara, CA, USA, October 23-27, 2006, pp. 61–64. ACM, 2006. doi: 10.1145/1180639.1180660. URL https://doi.org/10.1145/1180639.1180660.
- Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023.
- Roomdreamer: Text-driven 3d indoor scene synthesis with coherent geometry and texture. In El-Saddik, A., Mei, T., Cucchiara, R., Bertini, M., Vallejo, D. P. T., Atrey, P. K., and Hossain, M. S. (eds.), Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023, pp. 6898–6906. ACM, 2023. doi: 10.1145/3581783.3611800. URL https://doi.org/10.1145/3581783.3611800.
- 3d-gpt: Procedural 3d modeling with large language models. CoRR, abs/2310.12945, 2023. doi: 10.48550/ARXIV.2310.12945. URL https://doi.org/10.48550/arXiv.2310.12945.
- FVD: A new metric for video generation. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rylgEULtdN.
- Voyager: An open-ended embodied agent with large language models. CoRR, abs/2305.16291, 2023. doi: 10.48550/ARXIV.2305.16291. URL https://doi.org/10.48550/arXiv.2305.16291.
- Lego-net: Learning regular rearrangements of objects in rooms. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp. 19037–19047. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01825. URL https://doi.org/10.1109/CVPR52729.2023.01825.
- GODIVA: generating open-domain videos from natural descriptions. CoRR, abs/2104.14806, 2021. URL https://arxiv.org/abs/2104.14806.
- Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. arXiv preprint arXiv:2401.11708, 2024.
- Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.
- Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023a. URL https://openreview.net/pdf?id=WZH7099tgfM.
- Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023b. URL https://webarena.dev.
- Learning the visual interpretation of sentences. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pp. 1681–1688. IEEE Computer Society, 2013. doi: 10.1109/ICCV.2013.211. URL https://doi.org/10.1109/ICCV.2013.211.
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that can be implemented now by leveraging SceneCraft’s text-to-Blender-code agent, its relational scene graphs, constraint-based layout solver, and inner-loop visual feedback with a multimodal LLM.
- 3D scene blocking and layout from text for DCC tools
- Sectors: film/animation, VFX, game development, advertising
- What it enables: Rapid blockouts and first-pass layouts from natural language; auto-arrangement of up to ~100 assets with spatial constraints (alignment, proximity, parallelism).
- Tools/products/workflows: “SceneCraft for Blender” add-on; inner-loop reviewer as a design QA stage; code export for Blender scenes.
- Assumptions/dependencies: Access to GPT-4V-class models; curated asset libraries; GPU/CPU for iterative renders; asset licenses.
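The spatial constraints named above (alignment, proximity) are, per the paper, compiled into numerical terms that a layout solver can score. A minimal sketch of what such scoring terms might look like — the function names and the Gaussian/exponential decay shapes are illustrative assumptions, not SceneCraft's actual implementation:

```python
import math

def proximity_score(pos_a, pos_b, target=1.0, tolerance=0.5):
    # Illustrative constraint term: 1.0 when two assets sit at the
    # target distance, decaying smoothly as the gap deviates.
    dist = math.dist(pos_a, pos_b)
    return math.exp(-((dist - target) ** 2) / (2 * tolerance ** 2))

def alignment_score(positions, axis=0):
    # Illustrative constraint term: 1.0 when all assets share the same
    # coordinate on one axis (e.g., lamps aligned along a wall).
    coords = [p[axis] for p in positions]
    spread = max(coords) - min(coords)
    return math.exp(-spread)

# A candidate layout is scored by combining per-relation terms.
layout = {"table": (0.0, 0.0, 0.0), "vase": (0.0, 0.9, 0.75)}
score = proximity_score(layout["table"], layout["vase"], target=1.0)
```

A solver (or the agent's self-refinement loop) would then search for asset placements that maximize the combined score across all scene-graph relations.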
- Previsualization and storyboarding from script snippets
- Sectors: film/animation, episodic content, education (media production courses)
- What it enables: Convert scene directions (“a marketplace street; lamps in front of each house; hero at center”) into camera-ready blockouts; iterate via visual critiques.
- Tools/products/workflows: Script-to-subscene decomposition; per-shot constraint scripts; render passes for editorial.
- Assumptions/dependencies: Script parsing quality; shot metadata (camera, lenses) when needed; predictable naming of assets.
- Level design scaffolding and kit-bashing for games
- Sectors: game development (level design, environment art)
- What it enables: Generate modular layouts (streets, interiors, prop rows) from prompts; enforce gameplay constraints (e.g., cover spacing, sight lines) via scoring functions.
- Tools/products/workflows: Blender export to Unity/Unreal; constraint-library presets for common gameplay patterns; batch generation for ideation.
- Assumptions/dependencies: Stable FBX/GLTF pipelines; gameplay constraints encoded as SceneCraft scoring functions; engine-specific tooling.
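To make "gameplay constraints encoded as SceneCraft scoring functions" concrete, here is a hypothetical example of one such function — a cover-spacing rule for level blockouts. The rule, its name, and the pass/fail fraction are assumptions for illustration:

```python
import itertools
import math

def cover_spacing_score(cover_positions, min_gap=3.0):
    # Hypothetical gameplay constraint: every pair of cover props
    # should be at least `min_gap` units apart. Returns the fraction
    # of pairs satisfying the rule, so 1.0 means fully compliant.
    pairs = list(itertools.combinations(cover_positions, 2))
    if not pairs:
        return 1.0
    ok = sum(1 for a, b in pairs if math.dist(a, b) >= min_gap)
    return ok / len(pairs)

covers = [(0, 0), (4, 0), (8, 0), (9, 0)]
# The (8,0)-(9,0) pair violates the 3-unit gap; 5 of 6 pairs pass.
```

Batch generation would then keep only layouts whose gameplay scores exceed a designer-chosen threshold.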
- Product staging and virtual merchandising
- Sectors: e-commerce, retail marketing, real estate
- What it enables: Auto-place products in stylized rooms (e.g., “vase on round table near window”); bulk scene variants for A/B testing.
- Tools/products/workflows: CSV-to-scene batch generator; style prompts; CLIP-based asset retriever for SKU variants.
- Assumptions/dependencies: High-quality product assets; brand/style constraints captured as functions; rights to textures/props.
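The "CSV-to-scene batch generator" mentioned above could be as simple as mapping each SKU row to a natural-language prompt for the agent to compile. A minimal sketch — the column names and prompt template are assumptions:

```python
import csv
import io

# Hypothetical SKU sheet; in practice this would be read from a file.
SHEET = """sku,product,room_style,placement
A1,ceramic vase,scandinavian,on the round table near the window
B2,table lamp,industrial,on the side table beside the sofa
"""

def rows_to_prompts(csv_text):
    # Turn each CSV row into a scene prompt that the agent can
    # compile into a Blender script; one prompt per product variant.
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        f"A {row['room_style']} room with a {row['product']} "
        f"{row['placement']} (SKU {row['sku']})."
        for row in reader
    ]

prompts = rows_to_prompts(SHEET)
```

Running the agent over `prompts` yields one scene variant per row, which supports the A/B-testing workflow described above.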
- Rapid interior and architectural concepting
- Sectors: architecture, interior design, real estate
- What it enables: Early-stage spatial studies from text; generate multiple constraint-compliant options (“desks along glass wall, lamps centered on tables”).
- Tools/products/workflows: Constraint libraries for adjacency, circulation, daylighting proxies; Blender-BIM exporters for downstream CAD.
- Assumptions/dependencies: Unit handling and real-world scale; BIM/CAD interoperability; domain validation (not code-compliant by default).
- Robotics simulation environment generation
- Sectors: robotics, autonomy, embodied AI
- What it enables: Procedurally generate household/warehouse scenes with controllable constraints for training and domain randomization.
- Tools/products/workflows: Export to Isaac Sim, Habitat, PyBullet; curriculum schedules via prompt programs; auto-logging of layout matrices as labels.
- Assumptions/dependencies: Physics-calibrated assets; sim engine bindings; safety/task constraints encoded as scoring functions.
- Synthetic data generation for computer vision
- Sectors: software/AI, autonomous systems, retail analytics
- What it enables: Programmatically produce labeled 2D/3D datasets (renders + layout matrices) for detection/segmentation/pose/layout models.
- Tools/products/workflows: “Synthetic Scene Packager” that bundles renders, masks, depth, layout matrices; CLIP-driven asset retrieval to diversify data.
- Assumptions/dependencies: Render realism and domain gap; compute for batch rendering; licensing for redistribution if needed.
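The "Synthetic Scene Packager" idea above amounts to pairing each render with its layout matrix as a machine-readable label. A sketch of one possible record format — the schema and field names are assumptions, and a real packager would also attach masks and depth passes:

```python
import json

def package_sample(scene_id, render_path, layout):
    # Bundle one render with its layout matrix as a labeled sample.
    # `layout` maps asset names to (x, y, z) positions; rows of the
    # matrix follow the sorted asset-name order for reproducibility.
    record = {
        "scene_id": scene_id,
        "render": render_path,
        "assets": sorted(layout),
        "layout_matrix": [list(layout[name]) for name in sorted(layout)],
    }
    return json.dumps(record)

sample = package_sample(
    "scene_0001", "renders/scene_0001.png",
    {"sofa": (0.0, 2.0, 0.0), "lamp": (1.5, 2.0, 0.0)},
)
```

Emitting one such JSON line per generated scene produces a dataset that detection or layout models can consume directly.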
- Scene-grounded control for video generation
- Sectors: media/entertainment, marketing, research on generative video
- What it enables: Use SceneCraft renders as intermediate control signals to condition video models (e.g., VideoPoet) for improved structure and layout.
- Tools/products/workflows: “Scene-to-Video Controller” that feeds rendered frames as conditioning; light finetuning on in-domain data (as shown on Sintel).
- Assumptions/dependencies: Access to, and rights for, video models; compute for finetuning; alignment between rendered scene style and video domain.
- Classroom kits for 3D geometry and code literacy
- Sectors: education (K–12, higher ed, design schools)
- What it enables: Students prompt scenes and inspect the generated Python/constraints; learn spatial reasoning and scripting by editing the library.
- Tools/products/workflows: Prebuilt lesson plans; safe asset packs; “diff viewer” to compare function updates across iterations.
- Assumptions/dependencies: Classroom hardware; moderated model access; simplified asset sets to reduce render time/cost.
- Visual design QA assistant inside DCC pipelines
- Sectors: VFX, advertising, game art
- What it enables: Use the LLM+V-Reviewer to flag violations of composition rules or brand guidelines; propose code patches to fix layouts.
- Tools/products/workflows: “Design QA Bot” panel in Blender; rule sets stored as constraint libraries; automated pass/fail reports.
- Assumptions/dependencies: Reviewer model reliability; governance for auto-applied edits; cost control for repeated critiques.
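The Design QA stage above can be sketched as a rule-checking pass over the current layout. In the real pipeline a vision-language model critiques the *render*; here a programmatic stand-in plays that role, and the rule names and checker shape are assumptions:

```python
def review_layout(rules, layout):
    # Stand-in for the multimodal reviewer: run each named rule
    # against the layout and collect (rule_name, message) violations.
    return [(name, msg)
            for name, check in rules.items()
            for ok, msg in [check(layout)]
            if not ok]

def max_height_rule(limit=2.0):
    # Hypothetical brand/composition rule: no asset above `limit` meters.
    def check(layout):
        tall = [n for n, (_, _, z) in layout.items() if z > limit]
        return (not tall, f"assets above {limit}m: {tall}")
    return check

rules = {"max_height": max_height_rule(2.0)}
layout = {"lamp": (0.0, 0.0, 2.5), "sofa": (1.0, 0.0, 0.4)}
report = review_layout(rules, layout)  # one violation: the lamp
```

A pass/fail report like this is what the bot would surface in a Blender panel, with governance deciding whether proposed code patches are auto-applied or queued for human review.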
Long-Term Applications
These opportunities require further research, scaling, domain integration, or validation (e.g., larger skill libraries, open-source multimodal models, better physics and photorealism, regulatory compliance).
- End-to-end AI preproduction: multi-shot, style-consistent cinematic assembly
- Sectors: film/animation, episodic content, advertising
- What it could enable: Narrative-consistent scene generation across sequences; continuity-aware placement; automatic shot lists from scripts.
- Tools/products/workflows: Cross-shot asset/state tracking; sequence-level constraint solvers; collaborative prompt timelines.
- Assumptions/dependencies: Larger skill libraries (lighting, camera, crowd); sequence memory; IP/data governance.
- Digital twins and urban planning with conversational scene editing
- Sectors: public policy, AEC (architecture/engineering/construction), smart cities
- What it could enable: Citizens and planners iterate on streetscapes via natural language; constraints encode zoning, accessibility, traffic flow.
- Tools/products/workflows: BIM/geo integration; compliance-aware constraint functions (codes, ADA/ISO); multi-stakeholder review UIs.
- Assumptions/dependencies: Accurate GIS/BIM data; regulatory validation; auditability and provenance of generated proposals.
- Hospital and OR layout optimization via constraint learning
- Sectors: healthcare, hospital operations, medical device placement
- What it could enable: Rapid scenario generation that respects workflows (sterile corridors, equipment reachability) to test throughput and safety.
- Tools/products/workflows: Domain-specific scoring functions; integration with discrete-event simulation; VR prototyping for clinicians.
- Assumptions/dependencies: Clinical validation; privacy/security; standardized medical facility assets; liability considerations.
- Safety-critical robotics training worlds with formal guarantees
- Sectors: robotics, autonomous vehicles, industrial automation
- What it could enable: Auto-generated curricula with verifiable constraint satisfaction; adversarial scene generation for robustness testing.
- Tools/products/workflows: Formal verification hooks for layout constraints; coverage metrics; closed-loop sim-to-real toolchains.
- Assumptions/dependencies: Verified physics and sensor models; standard safety specifications; certification pathways.
- CAD/BIM-aware generative layout with code compliance checking
- Sectors: AEC, facilities management
- What it could enable: Natural-language-driven concepting that compiles to BIM; automatic code checks (egress, occupancy, fire codes) as constraints.
- Tools/products/workflows: IFC/Revit connectors; unit- and tolerance-aware solvers; “constraint library marketplace” curated by domain experts.
- Assumptions/dependencies: Rich BIM semantics; up-to-date code libraries; human-in-the-loop approvals; legal frameworks.
- Real-time conversational AR/VR world editing
- Sectors: consumer AR/VR, enterprise training, social platforms
- What it could enable: Voice-driven placement and re-layout in headsets; multiplayer co-editing with constraint consistency.
- Tools/products/workflows: On-device/lightweight multimodal models; GPU streaming; latency-aware iteration loops.
- Assumptions/dependencies: Efficient open multimodal LLMs; device compute; safety filters for shared spaces.
- Scene-grounded, production-scale video generation
- Sectors: media/entertainment, sports, education
- What it could enable: Long-form video generation where structural fidelity is governed by evolving 3D scenes (sets, crowds, choreography).
- Tools/products/workflows: Bidirectional sync between scene states and frames; differentiable renderers for tighter control; asset rights management.
- Assumptions/dependencies: Scalable video models; strong temporal consistency; high-fidelity rendering at scale.
- Academic benchmarks for spatial reasoning and code-based planning
- Sectors: academia (AI/graphics/HCI)
- What it could enable: Standardized tasks linking language, scene graphs, and constraint satisfaction; evaluation of inner/outer-loop learning in agents.
- Tools/products/workflows: Public synthetic/real datasets; reference libraries of relations; reproducible pipelines for Blender-based agents.
- Assumptions/dependencies: Open models or accessible APIs; community curation and licensing of assets; compute grants.
- Retail/warehouse micro-fulfillment layout optimization
- Sectors: logistics, retail, manufacturing
- What it could enable: “Describe operations” → candidate layouts respecting throughput, safety, and ergonomics; simulation-informed selection.
- Tools/products/workflows: Constraint libraries for aisle widths, pick-face access, robot/human paths; coupling with discrete-event sims.
- Assumptions/dependencies: Accurate demand/process data; safety regulations; integration with WMS/ERPs.
- Governance and provenance of constraint libraries
- Sectors: policy, standards bodies, platform governance
- What it could enable: Auditable, versioned libraries of spatial rules (e.g., accessibility, sustainability) with traceable updates and testing suites.
- Tools/products/workflows: Signed libraries; continuous integration for constraint tests; marketplace/review boards.
- Assumptions/dependencies: Standards adoption; legal recognition; secure distribution and licensing.
Notes on cross-cutting assumptions and dependencies
- Model access and costs: Current pipeline depends on GPT-4V-class perception and reasoning; cost and rate limits affect scalability. Viable open multimodal alternatives would broaden access.
- Asset quality and licensing: Output quality and legal deployability hinge on curated, licensed 3D asset repositories with consistent scale/orientation metadata.
- Compute and performance: Iterative render-review cycles require reliable rendering infrastructure (headless Blender) and potentially GPUs for throughput.
- Domain adaptation: Constraint functions must be extended and validated for each domain (healthcare, AEC, robotics), with expert-in-the-loop review for safety and compliance.
- Interoperability: Robust exporters (GLTF/FBX/IFC) and engine connectors (Unity/Unreal/Isaac/Revit) are critical to integrate SceneCraft into existing pipelines.
- Evaluation and QA: Beyond CLIP scores, domain-specific acceptance tests, human preference studies, and (where relevant) formal verification increase trust in deployment.
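The headless-Blender dependency noted above is typically satisfied by invoking Blender from the command line in background mode. A sketch that only *builds* the command (the flags are standard Blender CLI options; the file paths are placeholders):

```python
import shlex

def headless_render_cmd(blend_file, script, output_pattern, frame=1):
    # Build argv for a headless render: --background suppresses the UI,
    # --python runs the agent-generated script, --render-output and
    # --render-frame control where and what gets rendered.
    return [
        "blender", "--background", blend_file,
        "--python", script,
        "--render-output", output_pattern,
        "--render-frame", str(frame),
    ]

cmd = headless_render_cmd("scene.blend", "build_scene.py", "//out_####")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run` on a render node is then enough to drive the iterative render-review cycle without a display server.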