SceneFoundry: Language-Driven 3D Synthesis
- SceneFoundry is a generative framework that synthesizes large-scale, interactive 3D apartment environments using language-guided floor-plan generation and diffusion-based sampling.
- It integrates differentiable, physically motivated constraints to enforce object usability, prevent articulated collisions, and guarantee sufficient walkable space.
- The approach leverages high-resolution 3D asset repositories and robust neural models to produce semantically diverse, robot-navigable scenes that advance embodied AI research.
SceneFoundry is a language-driven generative framework for synthesizing interactive, physically plausible, and large-scale 3D apartment environments, specifically designed to support embodied AI research and robotic learning. Its core innovation lies in combining language-guided floor-plan generation, diffusion-based posterior sampling, and differentiable, physically motivated constraints to produce functionally articulated and robot-navigable 3D worlds of significant structural and semantic diversity. SceneFoundry particularly addresses the need for spatial layouts containing articulated objects with movable parts and enforces explicit requirements for object usability and navigable free space, thereby setting a new standard for controllable 3D scene synthesis (Chen et al., 9 Jan 2026).
1. System Architecture and Module Composition
SceneFoundry employs a four-stage modular pipeline, illustrated in Fig. 2 of (Chen et al., 9 Jan 2026), which integrates semantic and physical principles into the generation process:
- LLM-Driven Floor-Plan Generator: An LLM, such as GPT-3.5, parses free-form natural-language prompts (e.g., “two-bedroom apartment with a large living room and open kitchen”) and maps them to twelve simulation reward weights compatible with Infinigen’s procedural floor-plan engine. This produces an undirected room graph whose nodes encode room types and counts and whose edges encode adjacency relations, directly preserving prompt-driven semantics and enabling prompt-based layout diversity.
- Diffusion-Based Posterior Sampler: A scene is parameterized as a fixed set of object slots, each capturing object location, size, orientation, a class-logit vector, and a VAE latent shape feature (a minimal slot-layout sketch follows this list). Training utilizes an unconditional denoising diffusion probabilistic model (DDPM; details in Appendix Sec. A.1) over the latent slot vectors. At inference, posterior sampling incorporates composite, physically motivated guidance terms for object count and articulation.
- Differentiable Guidance Functions: Novel, gradient-propagated constraints regulate the number of instantiated objects, prevent articulated-part collisions, and maintain a specified walkable-area fraction through a post-sampling adjustment loop (Algorithm 2).
- Asset Retrieval and Assembly: Static and articulated assets are retrieved from large-scale, semantically labeled repositories (3D-FRONT, 3D-FUTURE, GAPartNet) via nearest-neighbor search in VAE latent feature space. This ensures high-resolution, functionally annotated object geometry in the final scene assembly.
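To make the sampler's slot parameterization concrete, the following minimal sketch lays out one plausible per-slot vector; the field widths (3-D location, 3-D size, sin/cos yaw, 30 class logits, a 32-D VAE shape latent) and the helper `split_slot` are illustrative assumptions, not the paper's exact dimensions.

```python
import numpy as np

NUM_CLASSES = 30   # assumed number of object categories
SHAPE_DIM = 32     # assumed VAE shape-latent dimension

# One slot concatenates: location (3), size (3), yaw as sin/cos (2),
# class logits (NUM_CLASSES), and a VAE latent shape feature (SHAPE_DIM).
SLOT_DIM = 3 + 3 + 2 + NUM_CLASSES + SHAPE_DIM

def split_slot(slot: np.ndarray) -> dict:
    """Split a flat slot vector into its named components."""
    assert slot.shape == (SLOT_DIM,)
    return {
        "location": slot[0:3],
        "size": slot[3:6],
        "orientation": slot[6:8],                  # (sin yaw, cos yaw)
        "class_logits": slot[8:8 + NUM_CLASSES],
        "shape_latent": slot[8 + NUM_CLASSES:],
    }

# Example: a scene represented as 16 randomly initialized slots.
scene = np.random.randn(16, SLOT_DIM)
print({k: v.shape for k, v in split_slot(scene[0]).items()})
```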
2. Diffusion-Based Posterior Sampling with Gradient Guidance
The generative backbone is an unconditional DDPM with the standard forward process

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\big(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\big).$$

The reverse process samples from

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\big(\mathbf{x}_{t-1};\ \mu_\theta(\mathbf{x}_t, t),\ \Sigma_\theta(\mathbf{x}_t, t)\big),$$

where the constraint gradients are incorporated into the sampling trajectory by shifting the posterior mean:

$$\tilde{\mu}_\theta(\mathbf{x}_t, t) = \mu_\theta(\mathbf{x}_t, t) - \lambda\,\Sigma_\theta(\mathbf{x}_t, t)\,\nabla_{\mathbf{x}_t}\mathcal{G}(\mathbf{x}_t)$$

(Eq. 2, (Chen et al., 9 Jan 2026)). The composite guidance $\mathcal{G}$ is the sum of differentiable objectives enforcing object quantity, articulation, and later walkable area. The DDPM training loss incorporates guidance during denoising,

$$\mathcal{L} = \mathbb{E}_{\mathbf{x}_0,\,\epsilon,\,t}\big[\|\epsilon - \epsilon_\theta(\mathbf{x}_t, t)\|^2\big] + \lambda_{\mathcal{G}}\,\mathcal{G}(\hat{\mathbf{x}}_0),$$

where $\hat{\mathbf{x}}_0$ denotes the denoised estimate of the clean scene at step $t$, enforcing that the network not only denoises but also respects structural constraints throughout the diffusion trajectory.
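To illustrate how a constraint gradient enters the sampling trajectory, the sketch below implements a generic classifier-guidance-style reverse step; the linear noise schedule, the dummy denoiser in the usage example, and the guidance weight `scale` are placeholder assumptions, not SceneFoundry's trained components.

```python
import torch

def guided_reverse_step(x_t, t, eps_model, guidance_fn, alphas, alphas_bar, betas, scale=1.0):
    """One reverse DDPM step with a constraint gradient shifting the posterior mean.

    x_t:         (B, N, D) noisy object slots at step t
    eps_model:   callable predicting the noise eps_theta(x_t, t)
    guidance_fn: differentiable composite constraint G(x) to be minimized
    """
    a_t, ab_t, b_t = alphas[t], alphas_bar[t], betas[t]

    # Standard DDPM posterior mean recovered from the predicted noise.
    eps = eps_model(x_t, t)
    mean = (x_t - b_t / torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(a_t)

    # Constraint guidance: move the mean downhill on G, scaled by the step variance.
    x_req = x_t.detach().requires_grad_(True)
    grad = torch.autograd.grad(guidance_fn(x_req), x_req)[0]
    mean = mean - scale * b_t * grad

    if t == 0:
        return mean
    return mean + torch.sqrt(b_t) * torch.randn_like(x_t)

# Toy usage: a zero-noise "denoiser" and a quadratic guidance term.
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas, alphas_bar = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)
x = torch.randn(2, 16, 70)
x = guided_reverse_step(x, T - 1, lambda x_, t_: torch.zeros_like(x_),
                        lambda x_: (x_ ** 2).mean(), alphas, alphas_bar, betas)
print(x.shape)
```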
3. Differentiable Guidance: Physical and Semantic Constraints
SceneFoundry defines and schedules constraint-guidance terms:
- Object Quantity Control (Eq. 3): the sampler steers toward exactly $N^\ast$ active slots by minimizing the binary cross-entropy between the slots' class logits $\mathbf{c}$ and a binary activity mask $\mathbf{m}$ with $N^\ast$ entries set to one,
$$\mathcal{G}_{\text{count}} = \mathrm{BCE}\big(\sigma(\mathbf{c}),\ \mathbf{m}\big).$$
The guiding gradient $\nabla_{\mathbf{x}_t}\mathcal{G}_{\text{count}}$ enables precise slot activation consistent with the layout plan.
- Articulated Collision Constraint (Eq. 4): for each slot classified as articulated, its bounding box $b_i$ is expanded along the axis of articulation to a swept box $\tilde{b}_i$ covering the movable part's range of motion. The pairwise 3D Intersection-over-Union (IoU) penalty
$$\mathcal{G}_{\text{artic}} = \sum_{i \neq j} \mathrm{IoU}_{3\mathrm{D}}\big(\tilde{b}_i, \tilde{b}_j\big)$$
prevents overlap of movable elements, ensuring functional usability for manipulation tasks (a minimal sketch of both terms follows this list).
- Walkable-Area Control:
Post-sampling refinement iteratively adjusts object sizes using a replacement loop (Algorithm 2) to guarantee that a target fraction of the floor area is unobstructed, supporting robotic navigation.
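The following is a minimal sketch of the two differentiable guidance terms under simplifying assumptions (axis-aligned boxes, a fixed sweep vector per articulated object, uniform weighting); `count_guidance` and `articulation_guidance` are illustrative stand-ins for Eqs. 3 and 4, not the paper's exact formulations.

```python
import torch
import torch.nn.functional as F

def count_guidance(active_logits: torch.Tensor, n_target: int) -> torch.Tensor:
    """Binary cross-entropy steering exactly n_target slots to be active.

    active_logits: (B, N) logit that each slot holds a real object (assumed layout).
    """
    mask = torch.zeros_like(active_logits)
    mask[:, :n_target] = 1.0                      # mark n_target slots as active
    return F.binary_cross_entropy_with_logits(active_logits, mask)

def articulation_guidance(centers: torch.Tensor, sizes: torch.Tensor,
                          sweep: torch.Tensor) -> torch.Tensor:
    """Pairwise 3D-IoU penalty between articulation-swept, axis-aligned boxes.

    centers, sizes: (N, 3) box centres and extents; sweep: (N, 3) extra extent
    along each object's articulation axis (zero for static objects).
    """
    ext = sizes + sweep                           # swept (expanded) extents
    lo, hi = centers - ext / 2, centers + ext / 2
    # Pairwise intersection volume between all boxes i, j.
    inter = (torch.minimum(hi[:, None], hi[None, :])
             - torch.maximum(lo[:, None], lo[None, :])).clamp(min=0).prod(-1)
    vol = ext.prod(-1)
    union = vol[:, None] + vol[None, :] - inter
    iou = inter / union.clamp(min=1e-8)
    off_diag = iou * (1.0 - torch.eye(len(ext), device=ext.device))
    return off_diag.sum() / 2                     # each unordered pair counted once

# Toy usage: two overlapping swept boxes, and an 8-slot scene with a target of 5 objects.
centers = torch.tensor([[0.0, 0.0, 0.0], [0.6, 0.0, 0.0]], requires_grad=True)
sizes = torch.ones(2, 3)
sweep = torch.tensor([[0.5, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(articulation_guidance(centers, sizes, sweep).item())
print(count_guidance(torch.zeros(1, 8), n_target=5).item())
```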
Guidance is scheduled progressively: object-quantity guidance dominates the early diffusion steps, articulation constraints take over in the later steps, and the walkable-area constraint is enforced post hoc once sampling completes.
4. Asset Repositories and Scene Assembly
SceneFoundry leverages comprehensive 3D model datasets:
| Asset Repository | Content | Purpose |
|---|---|---|
| 3D-FRONT | 14,629 furnished rooms | Room types, static furniture |
| 3D-FUTURE | 16,563 models | Textured static furniture |
| GAPartNet | 1,166 objects, 8,489 parts | Articulated furniture & part poses |
Each populated object slot is mapped to a geometrically and semantically appropriate CAD asset via nearest-neighbor retrieval in VAE latent space. Articulated assets have part-level annotation and pose parameters, allowing for functional placement and articulation logic.
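Retrieval thus reduces to a nearest-neighbour lookup in the VAE latent space; the sketch below assumes a precomputed matrix of repository latents and a squared-Euclidean distance, neither of which the source specifies.

```python
import numpy as np

def retrieve_assets(slot_latents: np.ndarray, asset_latents: np.ndarray) -> np.ndarray:
    """For each generated slot, return the index of the closest repository asset.

    slot_latents:  (M, D) VAE shape latents of the sampled object slots
    asset_latents: (K, D) precomputed latents of 3D-FUTURE / GAPartNet assets
    """
    # Squared Euclidean distance via the expansion |a - b|^2 = |a|^2 + |b|^2 - 2ab.
    d2 = ((slot_latents ** 2).sum(1)[:, None]
          + (asset_latents ** 2).sum(1)[None, :]
          - 2.0 * slot_latents @ asset_latents.T)
    return d2.argmin(axis=1)

# Example: 50 sampled slots matched against a 16,563-asset static-furniture library.
rng = np.random.default_rng(0)
print(retrieve_assets(rng.standard_normal((50, 32)),
                      rng.standard_normal((16563, 32)))[:5])
```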
5. Natural Language Interface and Layout Semantics
Natural language prompts are interpreted by the LLM to produce Infinigen-compatible reward weight vectors that directly steer the reward-driven simulated annealing process for floor-plan synthesis. This mapping encodes room type, count, aspect ratio, adjacency, and area ratios, ensuring the generated layouts match the semantic requirements of the prompt while incorporating geometric diversity. Adjusting penalties (e.g., low aspect-ratio penalty for free-form rooms, high penalty for orthogonal layouts) affords direct, language-driven control over global scene semantics and composition.
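A minimal sketch of how such a prompt-to-weights mapping could be wired up follows; the twelve weight names in `REWARD_KEYS`, the system prompt, and the JSON protocol are hypothetical placeholders, since the source only states that the LLM emits twelve Infinigen-compatible reward weights.

```python
import json

# Hypothetical names for the twelve Infinigen reward weights the LLM must emit;
# the real weight vocabulary is defined by Infinigen's floor-plan engine.
REWARD_KEYS = [
    "bedroom_count", "bathroom_count", "kitchen_count", "livingroom_count",
    "room_area_ratio", "aspect_ratio_penalty", "adjacency_bonus",
    "corridor_penalty", "openness_bonus", "orthogonality_penalty",
    "window_bonus", "door_penalty",
]

SYSTEM_PROMPT = (
    "You map an apartment description to a JSON object with exactly these keys: "
    + ", ".join(REWARD_KEYS)
    + ". Values are floats; higher means the property is rewarded more strongly."
)

def parse_reward_weights(llm_reply: str) -> dict:
    """Validate the LLM's JSON reply into a complete reward-weight vector."""
    weights = json.loads(llm_reply)
    missing = [k for k in REWARD_KEYS if k not in weights]
    if missing:
        raise ValueError(f"LLM reply is missing weights: {missing}")
    return {k: float(weights[k]) for k in REWARD_KEYS}

# Example reply illustrating the expected JSON shape for a two-bedroom prompt.
reply = json.dumps({k: 1.0 for k in REWARD_KEYS} | {"bedroom_count": 2.0})
print(parse_reward_weights(reply)["bedroom_count"])
```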
6. Training Protocols and Inference Performance
- Datasets:
- Floor-plan generator: 3D-FRONT (14,629 scenes)
- Furniture/Asset population: 3D-FUTURE and GAPartNet
- Optimization:
Training uses the Adam optimizer with batch size 128, no weight decay, and gradient-norm clipping at 10; the learning rate follows a step-decay schedule (decayed every 20,000 iterations) over a 130,000-iteration run (approximately 1,500 GPU-hours on an NVIDIA RTX 3090). A minimal configuration sketch follows this list.
- Inference:
Generating a typical 3-room apartment (about 50 objects) with the full guidance pipeline requires approximately 300 seconds on a 3090 GPU with an i9 CPU.
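The configuration sketch below reproduces the reported optimization settings; the base learning rate, the decay factor, the placeholder model, and the MSE stand-in loss are assumptions, since the section elides those values.

```python
import torch

model = torch.nn.Linear(70, 70)  # placeholder for the DDPM denoiser network

# Reported settings: Adam, batch size 128, no weight decay,
# gradient-norm clipping at 10, learning-rate decay every 20,000 iterations.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)  # base lr assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20_000, gamma=0.5)  # gamma assumed

def training_step(batch: torch.Tensor, target: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch), target)  # stands in for the DDPM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
    scheduler.step()  # StepLR counts optimizer steps, i.e. training iterations here
    return loss.item()

print(training_step(torch.randn(128, 70), torch.randn(128, 70)))
```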
7. Evaluation Metrics, Benchmarking, and Ablations
SceneFoundry employs both perceptual and controllability-oriented metrics:
- Perceptual Metrics:
- Fréchet Inception Distance (FID)
- Kernel Inception Distance (KID)
- Scene Classification Accuracy (SCA)
- Conditional KL Divergence (CKL)
- Controllability Metrics:
- LLM-Guided Layout Score: node-match, constraint-match, and edge-match rates between the prompt-specified room graph and the generated layout
- Object Quantity Success Rate (SR): measured across target object counts up to 16
- Articulation Collision Ratio: reduced from 0.191 (unguided) to 0.109 with the articulated-collision guidance enabled
- Walkable-area Success Rate: outperforms baselines for target walkable-area fractions up to 0.95 (one plausible reading of this metric is sketched after this list)
- Ablation Studies:
Guidance terms are removed or modified individually to measure their impact on collision occurrence, walkable-area ratio, and object reachability. Each component is validated for its quantitative contribution to final scene quality and physical plausibility.
- Visual Assessment:
Qualitative results confirm that the system reliably produces semantically coherent, functionally interactive layouts where articulated parts are unobstructed, global free space is sufficient for navigation, and perceptual fidelity matches user-specified conditions.
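The walkable-area criterion can be read as a simple occupancy computation over the floor plan; the sketch below is one plausible implementation of such a metric on a 2-D grid, not the paper's exact evaluation protocol.

```python
import numpy as np

def walkable_fraction(floor_mask: np.ndarray,
                      boxes: list[tuple[float, float, float, float]],
                      cell: float = 0.05) -> float:
    """Fraction of the floor area left unobstructed by object footprints.

    floor_mask: (H, W) boolean grid of the floor plan at `cell` metres per pixel
    boxes:      axis-aligned object footprints as (xmin, ymin, xmax, ymax) in metres
    """
    occupied = np.zeros_like(floor_mask, dtype=bool)
    for xmin, ymin, xmax, ymax in boxes:
        i0, i1 = int(ymin / cell), int(np.ceil(ymax / cell))
        j0, j1 = int(xmin / cell), int(np.ceil(xmax / cell))
        occupied[max(i0, 0):i1, max(j0, 0):j1] = True
    floor_cells = int(floor_mask.sum())
    free_cells = int((floor_mask & ~occupied).sum())
    return free_cells / max(floor_cells, 1)

# Example: a 5 m x 4 m room with a single 2 m x 1 m footprint leaves ~90% walkable.
room = np.ones((int(4 / 0.05), int(5 / 0.05)), dtype=bool)
print(round(walkable_fraction(room, [(1.0, 1.0, 3.0, 2.0)]), 3))
```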
8. Comparative Context and Research Significance
SceneFoundry advances beyond previous scene synthesis systems, including SceneCraft (Yang et al., 2024), by explicitly incorporating robotic usability constraints (articulation collision and walkable area), yielding environments that are not only visually plausible but also directly amenable to physically grounded embodied AI experimentation. Compared to semantic and depth-guided diffusion models like SceneCraft, which focus primarily on spatial layout and style control, SceneFoundry’s integration of language, diffusion, and physical constraint optimization enables the generation of robot-manipulable, structurally varied indoor worlds at apartment scale.
A plausible implication is that SceneFoundry’s guidance framework (differentiable constraint integration in diffusion models) provides a generalizable foundation for controllable, physically meaningful scene synthesis in robotics, simulation, and embodied AI domains, while retaining extensibility for further semantic and functional constraints.