
SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code (2403.01248v1)

Published 2 Mar 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: This paper introduces SceneCraft, a LLM Agent converting text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. SceneCraft then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models like GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with generated scenes as intermediary control signal.


Summary

  • The paper presents SceneCraft, a novel LLM agent that converts text descriptions into Blender-executable Python scripts for complex 3D scene rendering.
  • It integrates spatial planning with a scene graph blueprint and library learning, using vision-language models to iteratively refine asset layouts.
  • Evaluation shows 45.1% and 40.9% improvements in CLIP score over BlenderGPT on synthetic and real-world scenes, respectively, highlighting its potential in game development, VR, and cinematic production.

SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code

The paper presents SceneCraft, an LLM agent that converts text descriptions into Blender-executable Python scripts capable of rendering complex scenes with up to a hundred 3D assets. Producing such scenes demands sophisticated spatial planning and arrangement, which SceneCraft achieves through a combination of advanced abstraction, strategic planning, and library learning.

SceneCraft's methodology first models the scene as a graph blueprint that defines the spatial relationships among the assets. It then writes Python scripts that translate these relationships into the numerical constraints governing asset layout. Next, it uses vision-language foundation models such as GPT-V to analyze rendered images and iteratively refine the scene. On top of this loop, SceneCraft maintains a library-learning mechanism that compiles commonly used script functions into a reusable library, enabling continuous self-improvement without costly LLM parameter tuning.
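To make the graph-to-script step concrete, here is a minimal, self-contained sketch rather than the paper's implementation: a toy scene graph whose edges encode spatial relations is lowered to numeric placements and then emitted as a Blender-executable script. The asset sizes, relation names, and the greedy solver are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): lower a relational scene
# graph to concrete positions, then emit a Blender-executable Python script.
# Asset sizes, relation names, and the greedy solver are illustrative.

# Scene graph: nodes are assets with rough bounding-box sizes (x, y, z);
# edges are (subject, relation, object) triples.
assets = {"table": (2.0, 1.0, 0.8), "lamp": (0.3, 0.3, 0.6)}
relations = [("lamp", "on_top_of", "table")]

def solve_layout(assets, relations):
    """Translate relational constraints into (x, y, z) placements.

    A real planner would satisfy all constraints jointly; this greedy pass
    applies each relation once, which is enough to illustrate the idea.
    """
    pos = {name: [0.0, 0.0, size[2] / 2] for name, size in assets.items()}
    for subj, rel, obj in relations:
        if rel == "on_top_of":
            # Rest the subject's base on the object's top face.
            pos[subj][0], pos[subj][1] = pos[obj][0], pos[obj][1]
            pos[subj][2] = pos[obj][2] + assets[obj][2] / 2 + assets[subj][2] / 2
        elif rel == "left_of":
            pos[subj][0] = pos[obj][0] - (assets[obj][0] + assets[subj][0]) / 2
    return pos

def emit_blender_script(positions):
    """Produce a script meant to be run inside Blender (it imports bpy)."""
    lines = ["import bpy"]
    for name, (x, y, z) in positions.items():
        # Unit cubes stand in for retrieved assets in this sketch.
        lines.append(f"bpy.ops.mesh.primitive_cube_add(location=({x}, {y}, {z}))")
        lines.append(f"bpy.context.object.name = {name!r}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(emit_blender_script(solve_layout(assets, relations)))
```

In the full system, the generated script would also retrieve matching assets and set materials and cameras, and the rendered result would be passed back to the vision-language critic for another round of refinement.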

The paper's evaluation shows that SceneCraft surpasses existing LLM-based agents in rendering complex scenes with high fidelity and constraint adherence. Notably, it achieves 45.1% and 40.9% improvements in the CLIP score of generated scenes over BlenderGPT on synthetic and real-world datasets, respectively, along with a significantly higher constraint-passing score.
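For context on the metric, the snippet below is a minimal sketch of a CLIP-based text-image score of the kind such an evaluation relies on; the checkpoint and normalization details are assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of a CLIP-style score between a rendered scene and its text
# description. The checkpoint and scoring details are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, description: str) -> float:
    """Cosine similarity between image and text embeddings (higher is better)."""
    inputs = processor(text=[description], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))

# Example: score a rendered frame against the prompt that produced it.
# print(clip_score("render.png", "a cozy cabin interior with a fireplace"))
```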

The implications of this research are multifaceted. Practically, SceneCraft can revolutionize industries such as game development, virtual reality, and cinematic production by automating the conversion of text-based scene descriptions into detailed 3D environments. Theoretically, it sets a precedent for the fusion of LLMs with 3D rendering tools, providing a novel approach to scene synthesis that circumvents the limitations of data-driven 3D object generation models.

Looking forward, this framework could be extended to reconstruct 3D scenes from existing images or videos, further broadening its application scope. Moreover, its integration with video generative models, as demonstrated with the Sintel movie dataset, hints at its potential to guide dynamic visual content creation with nuanced control.

In conclusion, SceneCraft represents a substantial advancement in text-to-3D scene synthesis. Its dual-loop self-improvement architecture, emphasizing relational graph abstraction and library learning, offers a promising pathway for future developments in AI-driven 3D content generation.
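As a rough illustration of the library-learning idea highlighted above, a verified helper can be cached and replayed into later code-generation prompts instead of being regenerated. This is a sketch under the assumption that skills are stored as plain source strings; all names are illustrative, not the paper's implementation.

```python
# Sketch of a skill library: cache helpers that passed visual review and
# prepend them to future code-generation prompts. All names are illustrative.
import inspect

skill_library: dict[str, str] = {}

def add_skill(fn) -> None:
    """Record a verified helper's source code under its function name."""
    skill_library[fn.__name__] = inspect.getsource(fn)

def library_prompt() -> str:
    """Concatenate cached skills for inclusion in the next LLM call."""
    return "\n\n".join(skill_library.values())

def stack_on(base_top_z: float, obj_height: float) -> float:
    """Example reusable skill: z-coordinate that rests an object on a surface."""
    return base_top_z + obj_height / 2

add_skill(stack_on)
print(library_prompt())
```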
