Instruction-Driven 3D Scene Generation

Updated 16 December 2025
  • Instruction-driven 3D scene generation translates high-level user instructions into detailed, semantically consistent 3D environments, enabling interactive and scalable scene design.
  • Approaches span graph-based methods, hybrid language–layout pipelines, and blueprint-driven techniques that ensure spatial coherence and semantic alignment across objects.
  • Key implementations use diffusion, inpainting, and neural fields to optimize object placement, collision avoidance, and realistic rendering in complex 3D scenes.

Instruction-driven 3D scene generation refers to the family of algorithms, models, and interactive systems that translate high-level user-provided instructions—such as natural language, scene graphs, semantic layouts, text+2D blueprints, or hybrid forms—into complete, spatially consistent, and semantically aligned 3D scene representations. These systems prioritize semantic controllability and user intent, enabling the generation, manipulation, and refinement of complex 3D environments without the need for traditional, manual, low-level asset editing.

1. Core Paradigms of Instruction-to-3D Scene Generation

Instruction-driven 3D scene generation research has crystallized into several architectural paradigms, distinguished by the representation of inputs and the strategies for semantic-to-geometry grounding.

  1. Graph-based approaches encode abstract scene specifications as labeled graphs, with nodes for objects (carrying category and potential style/feature labels) and edges for explicit semantic or geometric relations (such as "left_of," "support," "on_top_of") (Dhamo et al., 2021, Liu et al., 10 Mar 2025, Lin et al., 7 Feb 2024). These are propagated to 3D assets and layouts via graph convolutional or transformer-variant networks.
  2. Hybrid language–layout pipelines first parse natural language or dialogue into explicit object lists, spatial anchors, and pairwise constraints, forming either a hybrid graph (Li et al., 18 Jul 2025) or structured JSON. Collision-free, spatially coherent placements are then generated by procedural algorithms or iterative sampling, often leveraging large pretrained LLMs and vision–language fusion (Ling et al., 5 May 2025, Hao et al., 26 Sep 2025).
  3. Blueprint or layout-driven techniques use 2D bounding-box layouts, either specified by users or synthesized from text, to explicitly control spatial organization (Zhou et al., 20 Oct 2024, Chen et al., 5 Jan 2025, Yang et al., 11 Oct 2024). These blueprints are then "lifted" to 3D via inpainting, single- or multi-view guided reconstruction, and subsequent geometry/appearance refinement.
  4. Panoramic and neural field methods translate high-level text or text+layout+sketches into 2D panoramas or depth-augmented proxy images, using scene decomposition (segmentation, masking) and implicit neural fields (NeRF, 3D Gaussian splatting) to reconstruct and refine the 3D environment (Yang et al., 10 Aug 2024, Dominici et al., 25 Jun 2025, Li et al., 3 Aug 2024).

Hence, instruction-driven 3D generation encompasses a spectrum from explicit, discrete semantic control (scene graphs, blueprints) to end-to-end implicit pipelines guided by language and image–text priors.

2. Graph- and Layout-Guided 3D Scene Generation

Scene Graphs as Instruction Carriers: Early work such as Graph-to-3D (Dhamo et al., 2021) establishes the formalism in which a user’s instruction is represented by a directed labeled graph G = (V, E), with nodes V for objects (a category label o_i per node) and edges E for relations r_ij (e.g. “left_of”). Graph Convolutional Networks (GCNs) propagate object and relational context, enabling end-to-end training of a variational autoencoder (VAE) over both layout and shape. The VAE encodes object positions, sizes, and orientations (via bounding boxes b_i) together with object-level shape codes (from pretrained mesh autoencoders). Graph-to-3D’s decoder generates 3D bounding boxes and reconstructs per-object shapes, producing scenes that respect the input graph’s semantic constraints.
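
The following sketch illustrates this graph-as-instruction formalism at toy scale: a labeled graph is embedded and one round of relation-conditioned message passing mixes context into per-object features, which a decoder would then map to boxes b_i and shape codes. The embeddings, weights, and dimensions are illustrative assumptions, not the Graph-to-3D implementation.

```python
# Minimal illustration (not the Graph-to-3D code): a labeled scene graph and
# one round of message passing that mixes relational context into per-object
# features before layout/shape decoding.
import numpy as np

rng = np.random.default_rng(0)

# Instruction graph: nodes are object categories, edges are (src, relation, dst).
nodes = ["bed", "nightstand", "lamp"]
edges = [(1, "left_of", 0), (2, "on_top_of", 1)]

# Illustrative embedding tables (in practice these are learned).
D = 8
cat_emb = {c: rng.normal(size=D) for c in set(nodes)}
rel_emb = {r: rng.normal(size=D) for r in {"left_of", "on_top_of"}}

h = np.stack([cat_emb[c] for c in nodes])            # (N, D) node features

# One GCN-style step: each edge sends a message built from
# (endpoint feature, relation embedding) in both directions.
W = rng.normal(size=(2 * D, D)) * 0.1                # illustrative weights
msg = np.zeros_like(h)
deg = np.ones(len(nodes))
for s, r, t in edges:
    m_st = np.concatenate([h[s], rel_emb[r]]) @ W    # message along the edge
    m_ts = np.concatenate([h[t], rel_emb[r]]) @ W    # reverse message
    msg[t] += m_st; deg[t] += 1
    msg[s] += m_ts; deg[s] += 1

h_ctx = np.tanh(h + msg / deg[:, None])              # relation-aware node features

# Downstream, a decoder would map h_ctx to per-object boxes and shape codes.
print(h_ctx.shape)   # (3, 8)
```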

Differentiable Graph Diffusion: InstructScene (Lin et al., 7 Feb 2024) employs a two-stage conditional diffusion process. The first stage generates a semantic graph (with discrete object categories, quantized features, and spatial edge relations) via a Graph Transformer diffusion prior conditioned on a CLIP embedding of the instruction. The second stage uses Gaussian diffusion to output continuous layouts (position, extent, orientation) per object, with the graph carrying all instructional context. This approach achieves an iRecall of up to 74% for bedrooms, strongly outperforming autoregressive and single-stage diffusion baselines.
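
A structural sketch of this two-stage conditioning flow is given below; the function bodies are deliberate placeholders (the instruction enters only through the first stage), and none of the names correspond to InstructScene's actual code.

```python
# Structural sketch (not InstructScene's code) of the two-stage conditional
# sampling: a discrete graph prior followed by a continuous layout diffusion,
# with the instruction entering only through the first stage's text conditioning.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SemanticGraph:
    categories: List[str]                   # discrete object labels
    features: List[int]                     # quantized appearance codes
    relations: List[Tuple[int, str, int]]   # (src, relation, dst) edges

def encode_instruction(text: str):
    """Placeholder for the CLIP text encoder."""
    ...

def sample_semantic_graph(cond) -> SemanticGraph:
    """Placeholder for the graph-transformer discrete diffusion prior."""
    ...

def sample_layout(graph):
    """Placeholder for the Gaussian diffusion that emits per-object
    (position, extent, orientation), conditioned only on the graph."""
    ...

def generate_scene(instruction: str):
    cond = encode_instruction(instruction)    # stage 0: text -> embedding
    graph = sample_semantic_graph(cond)       # stage 1: discrete semantics
    layout = sample_layout(graph)             # stage 2: continuous layout
    return graph, layout

# generate_scene("a cozy bedroom with a nightstand left of the bed")
```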

Hybrid Graph–Layout Synthesis: DreamScene (Li et al., 18 Jul 2025) uses a GPT-4 agent to infer a hybrid constraint graph from text or dialogue—including per-object semantic anchors and pairwise region relations—then employs a graph-based breadth-first spatial placement algorithm to synthesize a collision-free, structured 3D layout. The collision-check and anchor semantics guarantee that user-intended spatial relationships are realized.
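
A minimal sketch of such breadth-first, collision-checked placement is shown below; the object sizes, relation offsets, and search step are assumptions for illustration rather than DreamScene's actual procedure.

```python
# Hedged sketch of breadth-first placement over a constraint graph with
# axis-aligned collision checks; sizes and relation offsets are illustrative.
from collections import deque

sizes = {"table": (1.6, 0.9), "chair": (0.5, 0.5), "lamp": (0.3, 0.3)}
# Constraint graph: child -> (anchor, relation); "table" is the root anchor.
constraints = {"chair": ("table", "front_of"), "lamp": ("table", "right_of")}

OFFSETS = {"front_of": (0.0, -1.0), "behind": (0.0, 1.0),
           "right_of": (1.0, 0.0), "left_of": (-1.0, 0.0)}

def overlaps(a, b):
    (ax, ay), (aw, ad) = a
    (bx, by), (bw, bd) = b
    return abs(ax - bx) < (aw + bw) / 2 and abs(ay - by) < (ad + bd) / 2

def place_bfs(root="table"):
    placed = {root: ((0.0, 0.0), sizes[root])}
    children = {}
    for obj, (anchor, rel) in constraints.items():
        children.setdefault(anchor, []).append((obj, rel))
    queue = deque([root])
    while queue:
        anchor = queue.popleft()
        (ax, ay), (aw, ad) = placed[anchor]
        for obj, rel in children.get(anchor, []):
            dx, dy = OFFSETS[rel]
            w, d = sizes[obj]
            gap, step = 0.05, 1.0
            # Step outward along the relation direction until collision-free.
            while True:
                x = ax + dx * (aw / 2 + w / 2 + gap) * step
                y = ay + dy * (ad / 2 + d / 2 + gap) * step
                cand = ((x, y), (w, d))
                if not any(overlaps(cand, p) for p in placed.values()):
                    placed[obj] = cand
                    queue.append(obj)
                    break
                step += 0.5
            # (A full system would also respect room bounds and orientation.)
    return placed

print(place_bfs())
```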

Evaluation: Metrics such as constraint accuracy (satisfaction of geometric rules), diversity (variability of scene attributes across samples), and cycle consistency (regenerating the graph from the predicted scene and comparing it to the input) are standard for these systems (Dhamo et al., 2021). User studies and recall measures of instruction adherence remain central to evaluation (Lin et al., 7 Feb 2024, Li et al., 18 Jul 2025).
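
As a concrete illustration, constraint accuracy can be computed as the fraction of instructed relations that the predicted boxes satisfy; the geometric tests below are simplified assumptions, not the exact rules used in the cited papers.

```python
# Hedged sketch of a constraint-accuracy check: the fraction of instructed
# relations that the predicted layout actually satisfies.
def satisfied(relation, box_a, box_b):
    """box = (cx, cy, cz, w, d, h); x is right, y is forward, z is up."""
    ax, ay, az, aw, ad, ah = box_a
    bx, by, bz, bw, bd, bh = box_b
    if relation == "left_of":
        return ax + aw / 2 <= bx - bw / 2
    if relation == "right_of":
        return ax - aw / 2 >= bx + bw / 2
    if relation == "on_top_of":
        return abs((az - ah / 2) - (bz + bh / 2)) < 0.05
    return False

def constraint_accuracy(relations, boxes):
    """relations: [(i, rel, j)]; boxes: index -> box tuple."""
    hits = sum(satisfied(rel, boxes[i], boxes[j]) for i, rel, j in relations)
    return hits / max(len(relations), 1)

boxes = {0: (0.0, 0.0, 0.25, 1.6, 2.0, 0.5),    # bed
         1: (1.2, 0.0, 0.25, 0.4, 0.4, 0.5),    # nightstand
         2: (1.2, 0.0, 0.65, 0.2, 0.2, 0.3)}    # lamp
relations = [(1, "right_of", 0), (2, "on_top_of", 1)]
print(constraint_accuracy(relations, boxes))     # 1.0 if both relations hold
```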

3. Diffusion, Inpainting, and Neural Field Mapping

Semantic–Geometry–Guided Diffusion: Layout2Scene (Chen et al., 5 Jan 2025) exploits 3D semantic layout as a precise instruction. The approach decouples object and background, representing each via canonical Gaussians or polygons, and then performs (1) semantic-guided geometry diffusion for shape/pose refinement, and (2) semantic-geometry-guided appearance diffusion. Both diffusion paths are conditioned on 2D semantic maps and normal/depth signals. The resulting optimization controls object locations and category fidelity, with geometry and appearance both derived from a single user-provided layout. Structured fusion (per-pixel foreground/background blending and multi-stage loss terms) enforces fidelity to the input layout and text prompt.
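
The per-pixel fusion idea can be sketched as a masked composite of foreground and background renders with region-wise loss terms; the shapes, mask, and weighting below are illustrative assumptions, not Layout2Scene's implementation.

```python
# Hedged sketch of per-pixel foreground/background fusion: a rendered object
# layer is composited over a background layer with a mask, and separate loss
# terms can then be applied to each region.
import numpy as np

H, W = 64, 64
fg_rgb = np.random.rand(H, W, 3)       # rendered foreground objects
bg_rgb = np.random.rand(H, W, 3)       # rendered background
alpha = np.zeros((H, W, 1)); alpha[16:48, 16:48] = 1.0   # object mask

composite = alpha * fg_rgb + (1.0 - alpha) * bg_rgb

# Region-wise loss terms against a diffusion-guided target image.
target = np.random.rand(H, W, 3)
fg_loss = np.mean((alpha * (composite - target)) ** 2)
bg_loss = np.mean(((1.0 - alpha) * (composite - target)) ** 2)
total = fg_loss + 0.5 * bg_loss        # illustrative weighting
print(round(total, 4))
```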

Video and MAE-augmented Completion: Scene123 (Yang et al., 10 Aug 2024) combines warping-based view simulation, consistency-enhanced masked autoencoders (MAEs), and neural radiance field (NeRF) optimization. Starting from a reference image, adjacent views are simulated by depth-image-based rendering; unseen regions are inpainted using a codebook-augmented MAE. Multi-view photometric, depth, and transmittance priors, together with GAN-based adversarial losses against a video-diffusion generator (SVD-XT), enforce both 3D consistency and texture realism.
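
The warping step can be illustrated with a minimal depth-image-based rendering routine: reference pixels are unprojected with depth and intrinsics, rigidly transformed, and re-projected, and the uncovered pixels mark the regions handed to the MAE for inpainting. The intrinsics, pose, and depth below are synthetic assumptions.

```python
# Minimal sketch of depth-image-based rendering used to simulate an adjacent
# view; pixels left unfilled are the "unseen" regions a masked autoencoder
# would inpaint.
import numpy as np

H, W = 48, 64
K = np.array([[60.0, 0, W / 2], [0, 60.0, H / 2], [0, 0, 1]])   # intrinsics
depth = np.full((H, W), 2.0)                                    # flat depth
ref = np.random.rand(H, W, 3)                                   # reference image

# Rigid transform taking reference-frame points into the adjacent camera frame.
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])

# Unproject every pixel to a 3D point in the reference camera frame.
u, v = np.meshgrid(np.arange(W), np.arange(H))
pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # (3, N)
pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # (3, N)

# Transform into the new view and project back to pixels.
pts_new = R @ pts + t[:, None]
proj = K @ pts_new
uv = (proj[:2] / proj[2]).round().astype(int)                       # (2, N)

warped = np.zeros_like(ref)
mask = np.zeros((H, W), dtype=bool)          # True where the new view is covered
valid = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
warped[uv[1, valid], uv[0, valid]] = ref.reshape(-1, 3)[valid]
mask[uv[1, valid], uv[0, valid]] = True

print(mask.mean())   # covered fraction of the adjacent view; the rest is inpainted
```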

Blueprint Lifting and Collision-Aware Refinement: Layout-Your-3D (Zhou et al., 20 Oct 2024) leverages a 2D layout and a prompt to produce 3D Gaussian-splat scenes. A feed-forward reconstruction model (LGM) lifts individual blueprinted objects, with collision-aware layout optimization and reference-feature losses anchoring instance arrangement. Instance-wise refinement applies per-object stylization and geometric smoothing, enabling rapid generation with high object-interaction plausibility.

Panoramic and Hybrid Inpainting: DreamAnywhere (Dominici et al., 25 Jun 2025) directly generates a 360° panorama from text, uses instance segmentation to extract objects and backgrounds, and applies hybrid 2D+3D inpainting modules to handle occlusions and disocclusions. Detailed object 3D models are obtained by reference resynthesis and multiview NeRF reconstruction, followed by conversion to Gaussian splat representations. The final scene composition enables real-time navigation and localized editing, with every stage of the pipeline kept modular.

4. Interactive, Hierarchical, and Task-Driven Systems

Interactive and Hierarchical Interfaces: Systems like SceneSeer (Chang et al., 2017), Canvas3D (Duan et al., 10 Aug 2025), and iControl3D (Li et al., 3 Aug 2024) provide user-facing frameworks for composition via natural language (SceneSeer), guided manipulation in a 3D canvas (Canvas3D), or stepwise 2D diffusion-guided construction (iControl3D). SceneSeer uses a spatial knowledge base extracted from a 12,500-scene corpus, combining deterministic semantic parsing and learned priors (object occurrence, supports, attachment, and relative positions) to suggest and lay out models, with incremental improvement via language commands.
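
A toy sketch of prior-driven suggestion in this spirit follows; the occurrence and support tables are invented for illustration, not learned from SceneSeer's 12,500-scene corpus.

```python
# Hedged sketch of prior-driven object suggestion: sample which objects
# co-occur with a described scene type and which surfaces support them.
import random
random.seed(0)

occurrence = {"desk": 0.9, "chair": 0.85, "monitor": 0.7, "plant": 0.3}
support = {"desk": "floor", "chair": "floor", "monitor": "desk", "plant": "desk"}

def suggest_objects():
    # Keep each candidate object with its occurrence probability.
    return [o for o, p in occurrence.items() if random.random() < p]

def layout_suggestions():
    # Attach each suggested object to its most likely supporting surface.
    return [(obj, support[obj]) for obj in suggest_objects()]

print(layout_suggestions())
```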

Hierarchical Representation and Amodal Completion: HiScene (Dong et al., 17 Apr 2025) treats scenes as a two-level hierarchy—rooms as super-objects, with contents as manipulatable sub-objects. It employs an isometric-view generation module (text-to-image diffusion), video diffusion for amodal completion (recovering occluded geometry), and shape prior injection to ensure spatial alignment. Each node in the scene graph corresponds to an object or context, supporting granular editing and full compositionality.
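
A minimal hierarchy of this kind can be sketched as a recursive node structure in which sub-objects are located and edited without disturbing their siblings; the field names are assumptions, not HiScene's representation.

```python
# Hedged sketch of a hierarchical scene representation: the room acts as a
# super-object whose children are individually editable sub-objects.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneNode:
    name: str
    pose: Tuple[float, float, float] = (0.0, 0.0, 0.0)   # position in parent frame
    asset: str = ""                                       # mesh / splat reference
    children: List["SceneNode"] = field(default_factory=list)

    def find(self, name: str):
        """Locate a sub-object for granular editing."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit:
                return hit
        return None

room = SceneNode("living_room", asset="room_shell", children=[
    SceneNode("sofa", pose=(1.0, 0.0, 0.0), asset="sofa_01"),
    SceneNode("coffee_table", pose=(0.0, 1.2, 0.0), asset="table_03"),
])

# Editing one sub-object does not disturb the rest of the hierarchy.
room.find("sofa").pose = (1.2, 0.0, 0.0)
```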

Task-driven Layout via Reasoning Chains: MesaTask (Hao et al., 26 Sep 2025) is specialized for task-oriented tabletop scene generation. It pairs a large-scale annotated dataset with an LLM chain of thought (the "Spatial Reasoning Chain"): the chain first infers the full set of items required for a task, then reasons explicitly over their spatial relations, and finally produces a scene graph that is sampled into a physically valid 3D layout (collision checks, grid-cell assignment, asset retrieval), with Direct Preference Optimization (DPO) enforcing collision-free and relation-conforming outputs.
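
The grid-cell assignment stage can be sketched as follows: the table surface is discretized, each inferred object claims a block of free cells near its relation anchor, and placement fails loudly when a relation cannot be honored. The cell size, footprints, and anchors are illustrative assumptions, not MesaTask's implementation.

```python
# Hedged sketch of grid-cell assignment with collision checks for a tabletop layout.
import numpy as np

CELL = 0.05                                   # 5 cm grid resolution
table = np.zeros((int(0.6 / CELL), int(1.0 / CELL)), dtype=bool)   # occupancy grid

footprints = {"plate": (4, 4), "mug": (2, 2), "fork": (4, 1)}       # cells (h, w)
order = [("plate", "center"), ("mug", "right_of_plate"), ("fork", "left_of_plate")]

def free(grid, r, c, h, w):
    return (r >= 0 and c >= 0 and r + h <= grid.shape[0] and c + w <= grid.shape[1]
            and not grid[r:r + h, c:c + w].any())

def place(grid, name, anchor_cells, placed):
    h, w = footprints[name]
    for r, c in anchor_cells:               # try candidate cells for this relation
        if free(grid, r, c, h, w):
            grid[r:r + h, c:c + w] = True
            placed[name] = (r, c)
            return True
    return False

placed = {}
rows, cols = table.shape
anchors = {
    "center": [(rows // 2 - 2, cols // 2 - 2)],
    "right_of_plate": [(rows // 2 - 1, cols // 2 + 3 + d) for d in range(6)],
    "left_of_plate": [(rows // 2 - 2, cols // 2 - 4 - d) for d in range(6)],
}
for name, relation in order:
    assert place(table, name, anchors[relation], placed), f"cannot place {name}"
print(placed)
```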

5. Physics-aware and Editable Scene Synthesis

Pose and Spatial Realism Optimization: Scenethesis (Ling et al., 5 May 2025) demonstrates an agentic, zero-training pipeline where an LLM drafts an initial layout, which is then refined by vision-guidance (2D diffusion, Grounded-SAM segmentation, monocular depth). An optimization module enforces geometry–image alignment via 2D/3D correspondence (RoMa), SDF-based collision penalties, and a stability constraint. A final judge module (GPT-4o) automatically validates object category presence, orientation, and spatial coherence, re-invoking layout and optimization as needed.
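
The collision term can be illustrated with a differentiable penetration penalty between axis-aligned boxes that a layout refiner can minimize; this is a sketch of the idea rather than Scenethesis' SDF formulation.

```python
# Hedged sketch of a differentiable collision penalty: pairwise penetration
# depth between axis-aligned boxes, minimized over object positions so that
# penetrating objects are pushed apart.
import torch

def box_overlap_penalty(centers, half_sizes):
    """centers, half_sizes: (N, 3). Penalize pairwise penetration volume."""
    n = centers.shape[0]
    penalty = centers.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            # Per-axis penetration; positive on every axis means the boxes overlap.
            pen = (half_sizes[i] + half_sizes[j]) - (centers[i] - centers[j]).abs()
            penalty = penalty + torch.clamp(pen, min=0.0).prod()
    return penalty

centers = torch.tensor([[0.0, 0.0, 0.5], [0.3, 0.0, 0.5]], requires_grad=True)
half = torch.tensor([[0.4, 0.4, 0.5], [0.4, 0.4, 0.5]])

opt = torch.optim.Adam([centers], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    loss = box_overlap_penalty(centers, half)
    loss.backward()
    opt.step()

print(centers.detach())   # the boxes have been pushed apart until the overlap vanishes
```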

Editable Object-centric Representations: Both DreamScene (Li et al., 18 Jul 2025) and DreamAnywhere (Dominici et al., 25 Jun 2025) adopt object-centric Gaussian splat representations, enabling per-object manipulation post-generation. DreamScene supports direct object relocation, appearance modification (with prompt editing), and 4D motion via keyframe affine trajectory assignment. DreamAnywhere’s modular design facilitates intuitive object transformations and live background edits through inpainting and re-initialization of Gaussian splats.
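
The keyframe-trajectory idea can be sketched by interpolating affine transforms between two keyframes and applying them to an object's Gaussian centers; the two-keyframe setup and linear interpolation of the linear part are simplifying assumptions, not DreamScene's 4D module.

```python
# Hedged sketch of keyframe-driven motion for an object-centric splat group:
# per-keyframe affine transforms are linearly interpolated and applied to the
# object's Gaussian means.
import numpy as np

centers = np.random.rand(500, 3)              # Gaussian means of one object

# Two keyframes: (time, 3x3 linear part, translation).
keyframes = [
    (0.0, np.eye(3), np.zeros(3)),
    (1.0, np.eye(3), np.array([0.0, 0.5, 0.0])),   # slide 0.5 m along y
]

def pose_at(t):
    (t0, A0, b0), (t1, A1, b1) = keyframes[0], keyframes[-1]
    w = np.clip((t - t0) / (t1 - t0), 0.0, 1.0)
    return (1 - w) * A0 + w * A1, (1 - w) * b0 + w * b1

def animate(centers, t):
    A, b = pose_at(t)
    return centers @ A.T + b

frame = animate(centers, 0.5)     # object midway along its trajectory
print(frame.mean(axis=0) - centers.mean(axis=0))   # ~[0, 0.25, 0]
```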

Interactive Editing Pipelines: iControl3D (Li et al., 3 Aug 2024) allows users to iteratively edit content via camera/viewpoint changes, inpainting windows, and control signals (scribbles, masks, depth hints), merging 2D diffusion with mesh fusion and environment map modeling for depth-discrepant remote regions.

6. Technical Evaluation and Comparative Performance

Empirical comparisons are grounded in both perceptual and quantitative metrics.

Leading systems surpass prior art in instruction compliance, geometric/semantic consistency, and editing support. DreamScene, for example, assembles a full environment in 1.5 hours versus 7.5–13 hours for previous state of the art, with higher object R-precision and rated superiority in human studies (Li et al., 18 Jul 2025). Layout2Scene improves CLIPScore by +6.45 over layout-guided baselines, and Layout-Your-3D attains high rationality (8.7/10) and quality (8.4/10) scores in user studies (Chen et al., 5 Jan 2025, Zhou et al., 20 Oct 2024). Physics-aware constraints materially reduce collision and instability rates in Scenethesis (Ling et al., 5 May 2025).
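
For reference, a CLIPScore-style prompt-adherence metric is typically computed as the scaled cosine similarity between CLIP image and text embeddings of rendered views; the model choice and scaling below are common defaults and may differ from the exact setups in the cited evaluations.

```python
# Hedged sketch of a CLIPScore-style metric: clamped, 100-scaled cosine
# similarity between CLIP image and text embeddings of a rendered view.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float(torch.clamp(100 * (img * txt).sum(), min=0))

# Averaging clip_score over several rendered views gives a scene-level score.
```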

7. Open Challenges and Future Directions

Remaining limitations across the field include compositional scalability (handling more than ten object instances, multi-room layouts, and outdoor scenes at urban scale); the realism gap between synthetic and real-world prompts or asset libraries; and generalized 3D asset generation beyond retrieval-based pipelines, especially for novel categories and fine-grained attributes. Handling PBR materials and relighting, expressing dynamic interactions, and replacing discrete cell-based placement with continuous spatial reasoning distributions remain open research fronts.

Future directions, such as expanding pipeline modularity (DreamAnywhere), integrating LLM-driven semantic attribute controllers (HiScene), extending from tabletop to full room/scene (MesaTask), or learning continuous spatial distributions from large corpora, promise further improvements in semantic controllability and scene authenticity.

