SceneFoundry: Language-Driven 3D Synthesis

Updated 16 January 2026
  • SceneFoundry is a generative framework that synthesizes large-scale, interactive 3D apartment environments using language-guided floor-plan generation and diffusion-based sampling.
  • It integrates differentiable, physically motivated constraints to enforce object usability, prevent articulated collisions, and guarantee sufficient walkable space.
  • The approach leverages high-resolution 3D asset repositories and robust neural models to produce semantically diverse, robot-navigable scenes that advance embodied AI research.

SceneFoundry is a language-driven generative framework for synthesizing interactive, physically plausible, and large-scale 3D apartment environments, specifically designed to support embodied AI research and robotic learning. Its core innovation lies in combining language-guided floor-plan generation, diffusion-based posterior sampling, and differentiable, physically motivated constraints to produce functionally articulated and robot-navigable 3D worlds of significant structural and semantic diversity. SceneFoundry particularly addresses the need for spatial layouts containing articulated objects with movable parts and enforces explicit requirements for object usability and navigable free space, thereby setting a new standard for controllable 3D scene synthesis (Chen et al., 9 Jan 2026).

1. System Architecture and Module Composition

SceneFoundry employs a four-stage modular pipeline, illustrated in Fig. 2 of (Chen et al., 9 Jan 2026), which integrates semantic and physical principles into the generation process:

  1. LLM-Driven Floor-Plan Generator: An LLM, such as GPT-3.5, parses free-form natural language prompts (e.g., “two-bedroom apartment with a large living room and open kitchen”) and maps them to twelve simulation reward weights compatible with Infinigen’s procedural floor-plan engine. This produces a detailed undirected graph of rooms (nodes) representing types, counts, and adjacency relations, directly preserving prompt-driven semantics and enabling prompt-based layout diversity.
  2. Diffusion-Based Posterior Sampler: A scene is parameterized as $N$ object slots $x = \{o_1, \ldots, o_N\}$ with $o_i = [l_i, s_i, \theta_i, c_i, f_i]$ capturing object location, size, orientation, class logit vector, and a VAE latent shape feature (a minimal slot data structure is sketched after this list). Training utilizes an unconditional denoising diffusion probabilistic model (DDPM; details in Appendix Sec. A.1) over latent vectors. At inference, posterior sampling incorporates composite, physically motivated guidance terms $\phi(x)$ for object count and articulation.
  3. Differentiable Guidance Functions: Novel, gradient-propagated constraints regulate the number of instantiated objects ($\phi_\mathrm{quantity}$), prevent articulated-part collisions ($\phi_\mathrm{articoll}$), and maintain a specified walkable-area fraction ($\tau$) through a post-sampling adjustment loop (Algorithm 2).
  4. Asset Retrieval and Assembly: Static and articulated assets are retrieved from large-scale, semantically labeled repositories (3D-FRONT, 3D-FUTURE, GAPartNet) via nearest-neighbor search in VAE latent feature space. This ensures high-resolution, functionally annotated object geometry in the final scene assembly.
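To make the slot parameterization in step 2 concrete, the following is a minimal sketch; the field layout and the shape-latent dimension are illustrative assumptions rather than values reported in the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectSlot:
    """One of the N object slots, o_i = [l_i, s_i, theta_i, c_i, f_i]."""
    location: np.ndarray      # l_i: (3,) position in the floor-plan frame
    size: np.ndarray          # s_i: (3,) bounding-box extents
    orientation: float        # theta_i: yaw angle about the vertical axis
    class_logits: np.ndarray  # c_i: per-class logits (length = number of object classes)
    shape_latent: np.ndarray  # f_i: VAE latent shape feature (dimension assumed, e.g. 32)

def pack_scene(slots: list[ObjectSlot]) -> np.ndarray:
    """Flatten the slots into the vector x = {o_1, ..., o_N} that the DDPM denoises."""
    return np.stack([
        np.concatenate([s.location, s.size, [s.orientation], s.class_logits, s.shape_latent])
        for s in slots
    ])
```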

2. Diffusion-Based Posterior Sampling with Gradient Guidance

The generative backbone is an unconditional DDPM, following a standard forward process:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The reverse process samples from:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)$$

where constraint gradients are incorporated into the sampling trajectory:

$$x_{t-1} \sim \mathcal{N}\left( \mu_\theta(x_t, t) + \lambda\, \Sigma_\theta(x_t, t)\, \nabla_x \phi(x_t),\; \Sigma_\theta(x_t, t) \right)$$

(Eq. 2 in Chen et al., 9 Jan 2026). The composite guidance $\phi(x)$ is the sum of differentiable objectives enforcing object quantity, articulation, and (post hoc) walkable area. The DDPM training loss incorporates guidance during denoising:

$$L = \mathbb{E}_{t, x_0, \epsilon}\left[\left\lVert (\epsilon - \lambda \Sigma g) - \epsilon_\theta(x_t, t)\right\rVert^2\right]$$

where $g = \nabla_x \phi(x_t)$, enforcing that the network not only denoises but also respects structural constraints throughout the diffusion trajectory.
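A minimal PyTorch sketch of the guided reverse step (Eq. 2) is shown below. The denoiser `eps_model`, the noise-schedule tensors, and the simple DDPM posterior parameterization are assumptions for illustration; the paper's exact implementation may differ.

```python
import torch

def guided_reverse_step(x_t, t, eps_model, phi, alphas, alpha_bars, sigmas, lam=1.0):
    """One reverse-diffusion step with constraint-gradient guidance (sketch of Eq. 2).

    x_t       : (N_max, D) noisy scene tensor at step t
    eps_model : network predicting the noise eps_theta(x_t, t)
    phi       : differentiable scalar guidance phi(x), e.g. quantity + articulation terms
                (terms can be switched on per step, e.g. quantity for t < 100, articulation for t < 10)
    lam       : guidance strength lambda
    """
    # Gradient of the composite guidance with respect to the current sample.
    x_req = x_t.detach().requires_grad_(True)
    grad_phi = torch.autograd.grad(phi(x_req), x_req)[0]

    # Standard DDPM posterior mean computed from the predicted noise.
    eps = eps_model(x_t, t)
    alpha_t, alpha_bar_t, sigma_t = alphas[t], alpha_bars[t], sigmas[t]
    mu = (x_t - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)

    # Shift the mean by lambda * Sigma * grad(phi), then sample x_{t-1} (sign as written in Eq. 2).
    mu_guided = mu + lam * (sigma_t ** 2) * grad_phi
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mu_guided + sigma_t * noise
```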

3. Differentiable Guidance: Physical and Semantic Constraints

SceneFoundry defines and schedules constraint-guidance terms:

  • Object Quantity Control $\phi_\mathrm{quantity}$ (Eq. 3):

The sampler steers towards exactly $N_\mathrm{target}$ active slots by minimizing the binary cross-entropy between class logits $\{c_i\}$ and a binary mask $T \in \{0,1\}^{N_\mathrm{max}}$. The guiding gradient $\nabla_x \phi_\mathrm{quantity}$ enables precise slot activation consistent with the layout plan.
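A hedged sketch of this term follows; it assumes the first column of each slot's class logits plays the role of an "empty slot" class, which is an illustrative convention rather than the paper's exact logit layout.

```python
import torch
import torch.nn.functional as F

def phi_quantity(class_logits: torch.Tensor, n_target: int) -> torch.Tensor:
    """Sketch of the object-quantity guidance (Eq. 3).

    class_logits : (N_max, C) per-slot class logits c_i; column 0 is assumed to be
                   an 'empty slot' class (illustrative assumption).
    n_target     : desired number of active slots N_target.
    """
    n_max = class_logits.shape[0]
    # Binary target mask T in {0,1}^{N_max}: the first n_target slots should be active.
    target = torch.zeros(n_max, device=class_logits.device)
    target[:n_target] = 1.0
    # Per-slot occupancy probability = 1 - p(empty class), clamped for numerical safety.
    p_active = (1.0 - class_logits.softmax(dim=-1)[:, 0]).clamp(1e-6, 1 - 1e-6)
    # Binary cross-entropy between slot occupancy and the mask T; its gradient w.r.t. the
    # logits is the term grad_x phi_quantity used during guided sampling.
    return F.binary_cross_entropy(p_active, target)
```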

  • Articulated Collision Constraint $\phi_\mathrm{articoll}$ (Eq. 4):

For each slot $i$ classified as articulated, its bounding box is expanded along the axis of articulation to $b_i'$. The pairwise 3D Intersection-over-Union (IoU) penalties

$$\phi_\mathrm{articoll}(x) = \sum_{i \neq j} \mathrm{IoU}_\mathrm{3D}(b_i', b_j)$$

prevent overlap of movable elements, ensuring functional usability for manipulation tasks.
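A sketch of this penalty for axis-aligned boxes is given below; the (min, max)-corner box representation and the assumption that the expanded boxes $b_i'$ are supplied precomputed are simplifications for illustration.

```python
import torch

def iou_3d(b1: torch.Tensor, b2: torch.Tensor) -> torch.Tensor:
    """IoU of two axis-aligned 3D boxes, each given as (min_xyz, max_xyz) with shape (2, 3)."""
    lo = torch.maximum(b1[0], b2[0])
    hi = torch.minimum(b1[1], b2[1])
    inter = torch.clamp(hi - lo, min=0).prod()
    vol1 = (b1[1] - b1[0]).prod()
    vol2 = (b2[1] - b2[0]).prod()
    return inter / (vol1 + vol2 - inter + 1e-8)

def phi_articoll(boxes: torch.Tensor, expanded: torch.Tensor, articulated: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. 4: sum of IoU_3D(b_i', b_j) over articulated slots i and all other slots j.

    boxes       : (N, 2, 3) axis-aligned boxes b_j for every slot
    expanded    : (N, 2, 3) boxes b_i' pre-expanded along each object's articulation axis
    articulated : (N,) boolean mask marking articulated slots
    """
    penalty = boxes.new_zeros(())
    for i in torch.nonzero(articulated).flatten().tolist():
        for j in range(boxes.shape[0]):
            if j != i:
                penalty = penalty + iou_3d(expanded[i], boxes[j])
    return penalty
```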

  • Walkable-Area Control:

Post-sampling refinement iteratively adjusts object sizes using a replacement loop (Algorithm 2) to guarantee that a target fraction $\tau$ of the floor area is unobstructed, supporting robotic navigation.
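The paper's Algorithm 2 is not reproduced here; the loop below is only a schematic of the post-sampling idea, with the walkable-fraction estimator and the shrink/replace step passed in as user-supplied callables.

```python
from typing import Callable, TypeVar

Scene = TypeVar("Scene")

def enforce_walkable_area(
    scene: Scene,
    walkable_fraction: Callable[[Scene], float],
    shrink_or_replace: Callable[[Scene], Scene],
    tau: float,
    max_iters: int = 50,
) -> Scene:
    """Schematic post-sampling loop in the spirit of Algorithm 2.

    walkable_fraction : estimates the unobstructed fraction of the floor area
    shrink_or_replace : shrinks the most obstructive object or swaps it for a smaller asset
    tau               : required walkable-area fraction
    """
    for _ in range(max_iters):
        if walkable_fraction(scene) >= tau:
            break  # the target fraction tau is already met
        scene = shrink_or_replace(scene)
    return scene
```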

Guidance is scheduled progressively: object-quantity guidance dominates for diffusion steps $t < 100$, articulation constraints for $t < 10$, and walkable area is enforced post hoc at $t = 0$.

4. Asset Repositories and Scene Assembly

SceneFoundry leverages comprehensive 3D model datasets:

| Asset Repository | Content | Purpose |
|---|---|---|
| 3D-FRONT | 14,629 furnished rooms | Room types, static furniture |
| 3D-FUTURE | 16,563 models | Textured static furniture |
| GAPartNet | 1,166 objects, 8,489 parts | Articulated furniture & part poses |

Each populated object slot is mapped to a geometrically and semantically appropriate CAD asset via nearest-neighbor retrieval in VAE latent space. Articulated assets have part-level annotation and pose parameters, allowing for functional placement and articulation logic.
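A sketch of the retrieval step is given below; Euclidean distance in the VAE latent space is assumed here, and the paper may use a different similarity measure or an approximate-nearest-neighbour index.

```python
import numpy as np

def retrieve_asset(slot_latent: np.ndarray, asset_latents: np.ndarray, asset_ids: list[str]) -> str:
    """Nearest-neighbour asset retrieval in VAE latent space (sketch).

    slot_latent   : (D,) shape feature f_i of a populated object slot
    asset_latents : (M, D) precomputed latents for 3D-FUTURE / GAPartNet assets
    asset_ids     : M asset identifiers aligned with asset_latents
    """
    dists = np.linalg.norm(asset_latents - slot_latent[None, :], axis=1)
    return asset_ids[int(np.argmin(dists))]
```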

5. Natural Language Interface and Layout Semantics

Natural language prompts are interpreted by the LLM to produce Infinigen-compatible reward weight vectors $\{w_1, \ldots, w_{12}\}$ that directly steer the reward-driven simulated-annealing process for floor-plan synthesis. This mapping encodes room type, count, aspect ratio, adjacency, and area ratios, ensuring the generated layouts match the semantic requirements of the prompt while incorporating geometric diversity. Adjusting penalties (e.g., a low aspect-ratio penalty for free-form rooms, a high penalty for orthogonal layouts) affords direct, language-driven control over global scene semantics and composition.
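A sketch of this prompt-to-weights interface is shown below; the weight names, the JSON contract, and the `query_llm` callable are hypothetical placeholders, not the paper's actual prompt or Infinigen's real parameter names.

```python
import json
from typing import Callable

# Twelve reward-weight slots; these names are illustrative placeholders only.
WEIGHT_NAMES = [
    "room_count", "bedroom_count", "living_room_area", "kitchen_openness",
    "adjacency_bonus", "aspect_ratio_penalty", "orthogonality_penalty", "area_ratio",
    "corridor_penalty", "window_bonus", "door_adjacency", "layout_compactness",
]

def prompt_to_reward_weights(prompt: str, query_llm: Callable[[str], str]) -> dict[str, float]:
    """Map a free-form prompt to twelve reward weights {w_1, ..., w_12} (sketch)."""
    instruction = (
        "Return a JSON object with exactly these keys and float values: "
        + ", ".join(WEIGHT_NAMES)
        + f". Choose weights so the generated floor plan matches: '{prompt}'"
    )
    weights = json.loads(query_llm(instruction))  # query_llm is a user-supplied LLM client
    return {name: float(weights[name]) for name in WEIGHT_NAMES}
```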

6. Training Protocols and Inference Performance

  • Datasets:
    • Floor-plan generator: 3D-FRONT (14,629 scenes)
    • Furniture/Asset population: 3D-FUTURE and GAPartNet
  • Optimization:

Training utilizes the Adam optimizer with learning rate $2 \times 10^{-4}$, batch size 128, no weight decay, and gradient-norm clipping at 10. It employs a step decay schedule ($\gamma = 0.5$ every 20,000 iterations) for 130,000 epochs (approx. 1,500 GPU-hours on an NVIDIA RTX 3090); a configuration sketch follows after this list.

  • Inference:

Generating a typical 3-room apartment (about 50 objects) with the full guidance pipeline requires approximately 300 seconds on a 3090 GPU with an i9 CPU.
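The optimizer settings above translate into roughly the following PyTorch configuration; pairing StepLR with per-iteration stepping is an assumption about how the reported decay schedule is implemented.

```python
import torch

def build_optimizer(model: torch.nn.Module):
    """Adam with lr 2e-4, no weight decay, and lr decay by 0.5 every 20,000 iterations (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=0.0)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20_000, gamma=0.5)
    return optimizer, scheduler

def training_step(model, optimizer, scheduler, loss: torch.Tensor):
    """One update with gradient-norm clipping at 10 (batch size 128 is handled by the dataloader)."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
    scheduler.step()  # stepped per iteration so gamma = 0.5 applies every 20,000 iterations
```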

7. Evaluation Metrics, Benchmarking, and Ablations

SceneFoundry employs both perceptual and controllability-oriented metrics:

  • Perceptual Metrics:
    • Fréchet Inception Distance (FID)
    • Kernel Inception Distance (KID)
    • Scene Classification Accuracy (SCA)
    • Conditional KL Divergence (CKL)
  • Controllability Metrics:
    • LLM-Guided Layout Score: Node match $S_\mathrm{node} = 0.989$, Constraint match $S_\mathrm{cons} = 0.923$, Edge match $S_\mathrm{edge} = 0.954$
    • Object Quantity Success Rate (SR): $\geq 95\%$ for $N_\mathrm{target} = 5$–$16$
    • Articulation Collision Ratio $R_\mathrm{acoll}$: Improved from 0.191 (unguided) to 0.109 (with $\phi_\mathrm{articoll}$)
    • Walkable-area SR($\tau$): Outperforms baselines for $\tau = 0.6$–$0.95$
  • Ablation Studies:

Guidance terms are removed or modified to measure their impact on object collision occurrence ($\mathrm{Col}_\mathrm{obj}$), walkable-area ratio ($R_\mathrm{walkable}$), and object reachability ($R_\mathrm{reach}$). Each component is validated for its quantitative contribution to final scene quality and physical plausibility.

  • Visual Assessment:

Qualitative results confirm that the system reliably produces semantically coherent, functionally interactive layouts where articulated parts are unobstructed, global free space is sufficient for navigation, and perceptual fidelity matches user-specified conditions.

8. Comparative Context and Research Significance

SceneFoundry advances beyond previous scene synthesis systems, including SceneCraft (Yang et al., 2024), by explicitly incorporating robotic usability constraints (articulation collision and walkable area), yielding environments that are not only visually plausible but also directly amenable to physically grounded embodied AI experimentation. Compared to semantic and depth-guided diffusion models like SceneCraft, which focus primarily on spatial layout and style control, SceneFoundry’s integration of language, diffusion, and physical constraint optimization enables the generation of robot-manipulable, structurally varied indoor worlds at apartment scale.

A plausible implication is that SceneFoundry’s guidance framework (differentiable constraint integration in diffusion models) provides a generalizable foundation for controllable, physically meaningful scene synthesis in robotics, simulation, and embodied AI domains, while retaining extensibility for further semantic and functional constraints.
