
ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary (2506.00742v1)

Published 31 May 2025 in cs.CV and cs.AI

Abstract: Designing 3D scenes is traditionally a challenging task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. First, we generate 2D images from a scene description, then extract the shape and appearance of objects to create 3D models. These models are assembled into the final scene using geometry, position, and pose information derived from the same intermediary image. Being generalizable to a wide range of scenes and styles, ArtiScene outperforms state-of-the-art benchmarks by a large margin in layout and aesthetic quality by quantitative metrics. It also averages a 74.89% winning rate in extensive user studies and 95.07% in GPT-4o evaluation. Project page: https://artiscene-cvpr.github.io/

Summary

  • The paper presents ArtiScene, a training-free method that generates 3D scenes from text descriptions by using 2D images as intermediaries, leveraging strong text-to-image models.
  • Quantitative results show ArtiScene reduces object overlap significantly (6-10x), achieves higher CLIP scores, and is strongly preferred in user studies and GPT-4o evaluations.
  • ArtiScene reduces reliance on scarce 3D datasets, making 3D content creation more accessible for applications in VR, smart home design, and automated design processes.

Language-Driven Artistic 3D Scene Generation Through Image Intermediary

The paper presents ArtiScene, a method that generates 3D scenes from textual descriptions via 2D image intermediaries. ArtiScene leverages recent progress in text-to-image models, which produce reliable spatial layouts and stylistically consistent results thanks to their training on web-scale 2D data. By using generated 2D images as intermediaries, ArtiScene sidesteps a key limitation of existing text-to-3D methods, which require substantial 3D data for training.

The authors introduce a pipeline that combines advanced 2D synthesis capabilities with 3D scene generation, avoiding the need for additional 3D training datasets or model fine-tuning. The process begins with generating a 2D image from the scene description. Subsequent steps involve object detection and segmentation within the image to extract layout and style information, which are used to reconstruct 3D models of individual objects. These models are then positioned in a 3D scene using geometric cues from the 2D intermediary.
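The pipeline described above can be sketched in Python as follows. Every helper function here is a hypothetical stub standing in for an off-the-shelf model (a text-to-image generator, an open-vocabulary detector/segmenter, a single-image 3D reconstructor); the names, interfaces, and dummy return values are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    label: str            # category from 2D detection, e.g. "sofa"
    bbox_2d: tuple        # (x, y, w, h) in the intermediary image
    position_3d: tuple    # estimated (x, y, z) placement in the scene
    yaw_deg: float        # estimated orientation from the 2D view

# --- Hypothetical stage stubs (stand-ins for off-the-shelf models) ---

def text_to_image(description):
    # Placeholder for a text-to-image model generating the intermediary.
    return {"prompt": description, "size": (1024, 1024)}

def detect_and_segment(image):
    # Placeholder for object detection + segmentation on the 2D image.
    return [{"label": "sofa", "bbox": (100, 400, 300, 200)},
            {"label": "lamp", "bbox": (450, 300, 60, 180)}]

def image_to_3d(detection):
    # Placeholder for per-object 3D model reconstruction from its crop.
    return {"mesh": detection["label"] + ".obj"}

def estimate_pose(detection, image):
    # Placeholder: lift the 2D box to a 3D position and yaw using
    # geometric cues (e.g. depth, floor plane) from the intermediary.
    x, y, w, h = detection["bbox"]
    return (x + w / 2, 0.0, y + h / 2), 0.0

def generate_scene(description):
    """Training-free text -> 3D scene flow: generate a 2D intermediary,
    extract per-object shape/appearance, then assemble via geometric cues."""
    image = text_to_image(description)            # 1. 2D intermediary
    objects = []
    for det in detect_and_segment(image):         # 2. detect + segment
        image_to_3d(det)                          # 3. per-object 3D model
        pos, yaw = estimate_pose(det, image)      # 4. placement from 2D cues
        objects.append(SceneObject(det["label"], det["bbox"], pos, yaw))
    return objects
```

The key structural point is that no stage is trained for this task: each step delegates to a pretrained model, and only the intermediary image ties them together.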

The quantitative results reported in the paper show notable improvements in layout and aesthetic quality over state-of-the-art models. The method reduces object overlap rates by a factor of 6-10 and yields higher CLIP scores than competing approaches. In user studies, ArtiScene is preferred 74.89% of the time on average, and it achieves a 95.07% win rate in GPT-4o evaluations.
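To make the overlap metric concrete, one common way to measure it is the fraction of object pairs whose 3D bounding boxes intersect. The sketch below uses axis-aligned boxes for simplicity; the paper's exact evaluation protocol may differ, so treat this as an assumed formulation.

```python
def boxes_overlap(a, b):
    """Axis-aligned 3D boxes as (min_corner, max_corner) tuples of xyz;
    True if their interiors intersect on all three axes."""
    return all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(3))

def overlap_rate(boxes):
    """Fraction of object pairs whose bounding boxes intersect.
    Lower is better: overlapping furniture indicates an invalid layout."""
    n = len(boxes)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    hits = sum(boxes_overlap(boxes[i], boxes[j]) for i, j in pairs)
    return hits / len(pairs)
```

Under this formulation, a 6-10x reduction means that for the same scenes, ArtiScene's layouts produce proportionally far fewer intersecting object pairs than the baselines'.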

Theoretically, ArtiScene opens new avenues for 3D scene generation by extrapolating from language-informed 2D generation models to 3D representations. Practically, it reduces reliance on scarce 3D datasets, democratizing 3D content creation for applications in virtual reality, smart home planning, and automated design. Its adaptability to a wide range of styles and object categories underscores its potential utility in creative industries and AI-assisted design.

The modularity of the approach, which generates each 3D object independently, also improves scene editability: objects can be adjusted dynamically after generation, a capability particularly valuable for digital content creators and architects. Furthermore, by avoiding the need for intensive model training or domain-specific dataset preparation, ArtiScene accommodates a broader spectrum of visually and functionally intricate scenes.
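Because each object is an independent asset with its own transform, a post-generation edit reduces to modifying one entry in the scene representation. The dictionary-based scene format below is an assumed illustration, not the authors' data structure.

```python
scene = [
    {"id": "sofa_01", "asset": "sofa.obj", "position": (2.0, 0.0, 1.0), "yaw_deg": 90.0},
    {"id": "lamp_01", "asset": "lamp.obj", "position": (0.5, 0.0, 3.0), "yaw_deg": 0.0},
]

def move_object(scene, obj_id, new_position):
    # Only the targeted entry changes; every other asset is untouched,
    # so no part of the scene needs to be regenerated.
    for obj in scene:
        if obj["id"] == obj_id:
            obj["position"] = new_position
            return obj
    raise KeyError(obj_id)
```

This contrasts with monolithic scene generators, where changing one object would typically require re-running the whole generation.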

Future work inspired by ArtiScene includes adapting it to more complex and dynamic scenes, enabling real-time generation and use in simulation environments. Further research could also improve speed and efficiency, and incorporate more sophisticated style transfer mechanisms to further blur the line between visual aesthetics and functional scene layout.

In conclusion, ArtiScene represents a significant step in language-driven 3D scene generation, showcasing how intermediaries can bridge the capabilities of 2D and 3D modeling. This method highlights the growing symbiosis between different forms of AI, creating avenues for more accessible, diverse, and contextually rich digital experiences.
