SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code (2403.01248v1)
Abstract: This paper introduces SceneCraft, an LLM agent that converts text descriptions into Blender-executable Python scripts capable of rendering complex scenes with up to a hundred 3D assets. This process demands complex spatial planning and arrangement, which we tackle through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. It then writes Python scripts based on this graph, translating the relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models such as GPT-4V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, enabling continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and by guiding a video generative model with generated scenes as an intermediary control signal.
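The abstract's core idea, translating a scene graph's symbolic spatial relations into numerical constraints for asset placement, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the relation names (`left_of`, `on_top_of`), the `Asset` structure, and the greedy placement order are all assumptions made for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """A 3D asset with an axis-aligned bounding box (hypothetical schema)."""
    name: str
    size: tuple  # (width, depth, height) in scene units
    position: list = field(default_factory=lambda: [0.0, 0.0, 0.0])

def apply_relation(relation, subject, anchor, gap=0.1):
    """Translate one scene-graph edge into numeric coordinates.

    Positions `subject` relative to the already-placed `anchor` asset,
    using bounding-box half-extents plus a small clearance gap.
    """
    if relation == "left_of":
        subject.position = [
            anchor.position[0] - anchor.size[0] / 2 - subject.size[0] / 2 - gap,
            anchor.position[1],
            anchor.position[2],
        ]
    elif relation == "on_top_of":
        subject.position = [
            anchor.position[0],
            anchor.position[1],
            anchor.position[2] + anchor.size[2] / 2 + subject.size[2] / 2,
        ]
    else:
        raise ValueError(f"unknown relation: {relation}")
    return subject

# Scene graph as (relation, subject, anchor) edges, processed so that
# every anchor is positioned before the assets placed relative to it.
table = Asset("table", (2.0, 1.0, 0.8))
lamp = Asset("lamp", (0.3, 0.3, 0.5))
chair = Asset("chair", (0.6, 0.6, 1.0))

apply_relation("on_top_of", lamp, table)   # lamp sits on the tabletop
apply_relation("left_of", chair, table)    # chair to the table's left
```

In the system described by the paper, coordinates like these would be emitted as Blender Python (`bpy`) calls inside the generated script; the sketch above keeps the constraint-translation step separate so the geometry is easy to follow.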