SceneCraft: Layout-Guided 3D Generation
- SceneCraft is a family of data-driven 3D scene generation methods that transform explicit, user-defined layouts into detailed and physically coherent scenes.
- It combines layout-guided 2D diffusion and neural radiance field optimization to ensure semantic fidelity and photorealistic textures.
- The system employs explicit depth and semantic constraints with dual GPU scheduling to rapidly synthesize complex, multi-room indoor environments.
SceneCraft encompasses a family of data-driven 3D scene generation methods and systems, unified by their focus on transforming explicit user- or agent-specified layouts into physically and semantically coherent 3D scenes. The term "SceneCraft" denotes both specific frameworks for layout-guided 3D content synthesis (notably, "SceneCraft: Layout-Guided 3D Scene Generation" (Yang et al., 2024)) and, contextually, related LLM-driven scene construction agents ("SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code" (Hu et al., 2024)). SceneCraft methods address the limitations of prior text-to-3D or unstructured generative approaches by incorporating precise spatial control, semantic fidelity, and scalability to complex spaces (multi-room interiors, large asset sets) through explicit intermediate representations and rendering-supervised optimization.
1. SceneCraft Workflow: Layout-Guided 3D Scene Generation
SceneCraft (Yang et al., 2024) implements a two-stage framework designed to generate detailed 3D indoor scenes from sparse, user-editable layouts known as "bounding-box scenes" (BBS). The pipeline accepts:
- A BBS—comprising axis-aligned or voxelized 3D boxes with semantic labels—representing the input arrangement of rooms and objects.
- A textual style or content prompt modulating the overall scene appearance.
- A camera trajectory through the scene volume.
Each BBS is rendered from the trajectory viewpoints into "bounding-box images" (BBIs) that contain per-pixel semantic (one-hot class encoding) and depth maps. These BBIs, together with the text prompt, are fed to a high-fidelity, multi-view 2D diffusion model (SceneCraft2D) to synthesize layout- and style-consistent RGB images and refined depth estimates. The resulting images then supervise the optimization of a neural radiance field (NeRF) through a distilled, Instruct-NeRF2NeRF-style ("in2n") iterative dataset-update pipeline for the volumetric scene representation.
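The BBS-to-BBI rendering step can be illustrated with a small, self-contained sketch. The `Box` dataclass, pinhole camera model, and resolution below are assumptions for exposition, not the paper's implementation:

```python
# Minimal sketch of the BBS -> BBI rendering step: labelled axis-aligned boxes
# are rasterized into a per-pixel semantic-id map and a proxy depth map via
# ray/AABB intersection. The Box dataclass, pinhole camera, and resolution are
# illustrative assumptions, not the SceneCraft implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class Box:
    label: int        # semantic class id (0 is reserved for background)
    lo: np.ndarray    # (3,) minimum corner
    hi: np.ndarray    # (3,) maximum corner

def ray_aabb(o, d, lo, hi):
    """Slab-method ray/AABB test; returns the entry depth along d, or inf."""
    inv = 1.0 / np.where(np.abs(d) < 1e-9, 1e-9, d)
    t0, t1 = (lo - o) * inv, (hi - o) * inv
    tmin, tmax = np.minimum(t0, t1).max(), np.maximum(t0, t1).min()
    return max(tmin, 0.0) if tmax >= max(tmin, 0.0) else np.inf

def render_bbi(boxes, cam_pos, H=64, W=64, fov=np.pi / 2):
    """Render a semantic-id map and depth map for a camera looking down -z."""
    sem = np.zeros((H, W), dtype=np.int64)
    depth = np.full((H, W), np.inf)
    focal = 0.5 * W / np.tan(0.5 * fov)
    for v in range(H):
        for u in range(W):
            d = np.array([(u - W / 2) / focal, (v - H / 2) / focal, -1.0])
            d /= np.linalg.norm(d)
            for box in boxes:
                t = ray_aabb(cam_pos, d, box.lo, box.hi)
                if t < depth[v, u]:
                    depth[v, u], sem[v, u] = t, box.label
    return sem, depth

# One labelled box (class 3, e.g. "bed") seen from a camera at the origin.
boxes = [Box(3, np.array([-1.0, -0.5, -4.0]), np.array([1.0, 0.5, -2.0]))]
sem_map, depth_map = render_bbi(boxes, cam_pos=np.zeros(3))
```

A one-hot encoding of `sem_map`, paired with `depth_map`, forms the per-view conditioning consumed by the diffusion stage described next.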
2. Diffusion Model Conditioning and Objectives
SceneCraft2D augments a Stable Diffusion backbone with two ControlNets (see the conditioning sketch after this list):
- A semantic ControlNet for the one-hot semantic layout maps.
- A depth ControlNet for the proxy BBS depth estimates.
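A dual-ControlNet conditioning setup of this kind can be sketched with the Hugging Face `diffusers` multi-ControlNet API; the public segmentation and depth ControlNet checkpoints, file names, and conditioning scales below are stand-ins for SceneCraft2D's fine-tuned branches, not the released system:

```python
# Hedged sketch: conditioning a Stable Diffusion backbone on a semantic map and
# a depth map through two ControlNets via the diffusers multi-ControlNet API.
# Public checkpoints and file names stand in for SceneCraft2D's fine-tuned branches.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

sem_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16)
depth_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[sem_controlnet, depth_controlnet],    # one branch per modality
    torch_dtype=torch.float16,
).to("cuda")

sem_bbi = Image.open("bbi_semantic.png")    # hypothetical colorized semantic BBI
depth_bbi = Image.open("bbi_depth.png")     # hypothetical proxy-depth BBI

image = pipe(
    "a cozy bedroom in Scandinavian style",
    image=[sem_bbi, depth_bbi],                    # one control image per ControlNet
    controlnet_conditioning_scale=[1.0, 0.8],      # illustrative weights
    num_inference_steps=30,
).images[0]
```

In the actual pipeline both branches are fine-tuned on BBI/RGB pairs rather than used off the shelf, as described next.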
During training, the network is fine-tuned on a corpus of ~24,000 views from the HyperSim and ScanNet++ datasets, pairing BBIs with ground-truth RGB renders. The base prompt is held constant during training to encourage adherence to geometry rather than caption artifacts; inference admits arbitrary style prompts. The denoising loss follows the standard DDPM objective,

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\Big[\big\lVert \epsilon - \epsilon_\theta(x_t, t, c_{\text{text}}, c_{\text{sem}}, c_{\text{depth}}) \big\rVert_2^2\Big],$$

where conditioning on both semantic and depth modalities is essential for aligning the generated images with the specified layout while retaining photorealistic textures.
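This objective can be made concrete with a toy training step. The small convolutional denoiser and the channel-wise concatenation of conditions are deliberate simplifications (SceneCraft2D routes the conditions through ControlNet branches into a Stable Diffusion UNet):

```python
# Toy sketch of the conditioned DDPM objective above. A tiny ConvNet that takes
# the conditions as extra channels replaces the Stable Diffusion UNet +
# ControlNet branches; shapes and hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    def __init__(self, img_ch=4, sem_ch=32, depth_ch=1):
        super().__init__()
        in_ch = img_ch + sem_ch + depth_ch + 1       # +1 for a timestep map
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, img_ch, 3, padding=1),
        )

    def forward(self, x_t, t, sem, depth):
        t_map = t.float().view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[-2:]) / 1000.0
        return self.net(torch.cat([x_t, sem, depth, t_map], dim=1))

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0, sem, depth):
    """L = E_{x0, eps, t} || eps - eps_theta(x_t, t, c_sem, c_depth) ||^2."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return F.mse_loss(model(x_t, t, sem, depth), eps)

model = ToyDenoiser()
x0 = torch.randn(2, 4, 32, 32)       # latents of ground-truth RGB views
sem = torch.randn(2, 32, 32, 32)     # one-hot semantic BBI channels (toy values)
depth = torch.randn(2, 1, 32, 32)    # proxy BBS depth channel
ddpm_loss(model, x0, sem, depth).backward()
```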
3. Volumetric Rendering and 3D Optimization
The volumetric NeRF (Nerfacto from Nerfstudio) is trained on the multi-view image set synthesized by SceneCraft2D, combining four loss terms (a combined-objective sketch follows the list):
- RGB loss penalizes deviation from 2D diffusion model predictions.
- Latent-space loss (SDS-style) ensures high-level feature consistency.
- Depth constraint enforces the rendered depth to lie within a threshold of the BBS-derived proxy depth.
- VGG-based perceptual/stylization loss consolidates texture without enforcing exact pixel-level matches, mitigating blurring.
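A minimal sketch of how these four terms might be combined is given below; the loss weights, the depth margin `tau`, and the hinge form of the depth penalty are illustrative assumptions rather than the paper's exact formulation:

```python
# Hedged sketch of the combined NeRF supervision. Loss weights, the depth
# margin `tau`, and the hinge form of the depth term are illustrative
# assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def depth_hinge(rendered_depth, proxy_depth, tau=0.1):
    """Penalize rendered depth only when it strays more than tau from the proxy."""
    return F.relu((rendered_depth - proxy_depth).abs() - tau).mean()

def nerf_losses(rendered_rgb, target_rgb,
                rendered_latent, target_latent,
                rendered_depth, proxy_depth,
                features,                       # callable: image -> list of feature maps
                w_rgb=1.0, w_lat=0.1, w_depth=0.5, w_vgg=0.1):
    l_rgb = F.mse_loss(rendered_rgb, target_rgb)            # match diffusion outputs
    l_lat = F.mse_loss(rendered_latent, target_latent)      # latent-space (SDS-style) term, simplified
    l_depth = depth_hinge(rendered_depth, proxy_depth)      # layout-aware depth constraint
    l_vgg = sum(F.l1_loss(a, b)                             # perceptual/stylization term
                for a, b in zip(features(rendered_rgb), features(target_rgb)))
    return w_rgb * l_rgb + w_lat * l_lat + w_depth * l_depth + w_vgg * l_vgg

# Toy usage with a stand-in feature extractor in place of VGG.
feat = lambda img: [img]
loss = nerf_losses(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                   torch.rand(1, 4, 8, 8), torch.rand(1, 4, 8, 8),
                   torch.rand(1, 64, 64), torch.rand(1, 64, 64), feat)
```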
A duo-GPU schedule assigns the 2D diffusion process and the NeRF optimization to separate devices, using approximately 6 GB and 28 GB of GPU memory, respectively. The full scene generation process (e.g., 150–300 frames) requires 3–6 hours, markedly faster than panorama-based counterparts.
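One way to realize such a duo-GPU schedule is a producer/consumer split, assuming two visible CUDA devices; the queue hand-off, device assignment, and placeholder tensors below are assumptions, not SceneCraft's actual scheduler:

```python
# Hedged sketch of a producer/consumer duo-GPU split: the diffusion model
# renders supervision views on one device while the NeRF optimizer consumes
# them on another. Function bodies are placeholders, not SceneCraft code.
import queue
import threading

import torch

DIFFUSION_DEV = torch.device("cuda:0")   # ~6 GB process in the paper's setup
NERF_DEV = torch.device("cuda:1")        # ~28 GB process in the paper's setup
view_queue: queue.Queue = queue.Queue(maxsize=8)

def diffusion_worker(num_views: int) -> None:
    for _ in range(num_views):
        # Placeholder for a SceneCraft2D inference call producing one RGB view.
        rgb = torch.rand(3, 512, 512, device=DIFFUSION_DEV)
        view_queue.put(rgb.cpu())        # hand off via host memory
    view_queue.put(None)                 # sentinel: no more views

def nerf_trainer() -> None:
    while (rgb := view_queue.get()) is not None:
        rgb = rgb.to(NERF_DEV)
        # Placeholder for one Nerfacto optimization step supervised by `rgb`.
        _ = rgb.mean()

threading.Thread(target=diffusion_worker, args=(150,), daemon=True).start()
nerf_trainer()
```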
4. Quantitative and Qualitative Evaluation
The SceneCraft pipeline demonstrates substantial improvements over recent text-to-3D baselines (Text2Room, MVDiffusion, Set-the-Scene):
| Metric | SceneCraft | Text2Room | MVDiffusion | Set-the-Scene |
|---|---|---|---|---|
| CLIP Score (CS) | 24.34 | 23.85 | 22.98 | 21.32 |
| 3D Consistency (3DC) | 3.71 | 3.20 | 3.11 | 3.53 |
| Visual Quality (VQ) | 3.56 | 3.35 | 3.06 | 2.41 |
SceneCraft supports multi-room, complex spatial layouts (including irregular apartments), surpassing panorama-based proxies that are limited to single-room or axis-aligned environments. Qualitative analysis shows superior object placement fidelity, architectural adherence, and texture diversity, while avoiding common artifacts such as object repetition and geometric inconsistency.
5. Key Design Ablations and Insights
SceneCraft ablation studies isolate contributions critical to performance:
- Stable base prompt during diffusion model training prevents overfitting to spurious, non-layout-aligned captions and improves geometric fidelity.
- The explicit layout-aware depth loss is mandatory for rapid, accurate geometry convergence; omission results in floating or misaligned objects.
- The VGG texture loss is necessary to avoid excessive smoothing and blurring in the final 3D representation.
These components collectively underpin SceneCraft's ability to translate abstract layouts and simple prompts into detailed, visually coherent 3D scenes.
6. Broader Context: SceneCraft Agents, Procedural Pipelines, and Related Systems
A distinct thread within SceneCraft research (Hu et al., 2024) explores LLM-driven scene synthesis, in which a dual-loop agent chains scene parsing, constraint graph construction, script generation (in Blender Python), vision-language critique/refinement, and library-based self-improvement. The scene is modeled as a bipartite relational graph over assets and relations (proximity, alignment, parallelism, etc.), supporting direct mapping from high-level text queries to numerically constrained layouts, which are then rendered and iteratively refined via multimodal feedback. The agent aggregates constraint-scoring utilities through a persistent global library, continuously improving geometric reasoning.
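A bipartite asset/relation graph of this kind can be sketched as follows; the dataclasses, relation kinds, and scoring heuristics are illustrative assumptions rather than the agent's actual data structures:

```python
# Minimal sketch of a bipartite asset/relation scene graph with scored
# constraints (proximity, parallelism, ...). Dataclasses, relation kinds, and
# scoring heuristics are illustrative assumptions, not the agent's internals.
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    position: tuple[float, float, float] = (0.0, 0.0, 0.0)
    rotation_deg: float = 0.0            # yaw about the vertical axis

@dataclass
class Relation:
    kind: str                            # "proximity", "alignment", "parallelism", ...
    members: list[str]                   # names of the assets it constrains

def score_relation(rel: Relation, assets: dict[str, Asset]) -> float:
    """Return a score in [0, 1]; higher means the constraint is better satisfied."""
    a, b = assets[rel.members[0]], assets[rel.members[1]]
    if rel.kind == "proximity":          # reward small ground-plane distance
        d = ((a.position[0] - b.position[0]) ** 2 +
             (a.position[1] - b.position[1]) ** 2) ** 0.5
        return max(0.0, 1.0 - d / 2.0)
    if rel.kind == "parallelism":        # reward similar yaw (mod 180 degrees)
        diff = abs(a.rotation_deg - b.rotation_deg) % 180.0
        return 1.0 - min(diff, 180.0 - diff) / 90.0
    return 0.0

assets = {"sofa": Asset("sofa", (0.0, 0.0, 0.0), 0.0),
          "table": Asset("table", (0.8, 0.2, 0.0), 5.0)}
graph = [Relation("proximity", ["sofa", "table"]),
         Relation("parallelism", ["sofa", "table"])]
layout_score = sum(score_relation(r, assets) for r in graph) / len(graph)
```

In the agent, constraint scores of this sort would be aggregated and fed back through the vision-language critique loop to adjust asset placements.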
This procedural blueprint is echoed in other work, such as FilmSceneDesigner (Xie et al., 2025), which implements a modular chain of role-specific LLM agents (from floorplan specification to object placement) with staged procedural geometry/material algorithms and a curated asset library. The core scene modeling concepts (scene graphs, multi-agent parameter chaining, and explicit constraint optimization) align with key architectural aspects of SceneCraft agents.
7. Comparison to Contemporaneous Methods and Future Directions
SceneCraft sits within a spectrum of recent 3D scene-generation strategies:
- SceneCraft (Yang et al., 2024) utilizes NeRFs distilled from multi-view diffusion proxies with explicit BBS conditioning, excelling in textured, layout-controlled indoor scene generation.
- SceneCraft agents (Hu et al., 2024) and FilmSceneDesigner (Xie et al., 2025) leverage LLM planners and multi-stage procedural synthesis for cinematic and asset-rich content.
- Competing approaches such as SPATIALGEN (Fang et al., 2025) and DreamScene (Li et al., 2025) advance multi-modal or explicit-Gaussian representations, emphasizing direct scene-graph input, modularity, or compositional editing granularity.
Open questions remain concerning scalable asset retrieval and domain transfer, efficient NeRF/field training for real-world layouts, richer semantic inference from abstract prompts, and unified text/layout-based 3D generation for open environments. SceneCraft's explicit layout guidance, staged optimization, and 2D-to-3D distillation establish foundational paradigms for controllable and realistic virtual environment synthesis.