SpatialGen: Layout-Guided 3D Indoor Scene Synthesis
- SpatialGen is a multi-view multi-modal latent diffusion model that generates high-fidelity 3D indoor scenes from explicit semantic layouts, enabling controllable environment synthesis.
- It employs a layout-guided diffusion process with multi-view and cross-modal alternating attention alongside a dedicated SCM-VAE to ensure strong geometric and semantic consistency.
- The model leverages a large, richly annotated synthetic dataset, achieving improved PSNR, SSIM, and FID metrics compared to prior approaches in 3D scene synthesis.
SpatialGen is a multi-view, multi-modal latent diffusion model designed for high-fidelity 3D indoor scene generation, guided by explicit 3D semantic layout constraints and trained on a large-scale, synthetically rendered, richly annotated dataset. The approach aims to automate 3D scene synthesis while reconciling visual realism, geometric consistency, semantic alignment, and controllability from structured layout inputs and reference images. Its methodology, dataset, architectural innovations, benchmark results, and implications for the broader research community are detailed below.
1. Model Architecture and Methodology
SpatialGen’s architecture is anchored on a latent diffusion model that is explicitly guided by a 3D semantic layout of the indoor environment. The input consists of structured 3D semantic information—encoded as collections of semantic bounding boxes parameterized by position, size, orientation, and object category—and one or more posed source images (e.g., images rendered from text prompts or manually composed references).
Key architectural components include:
- Layout-guided Diffusion Model: The semantic layout is rendered per viewpoint into a coarse semantic map and a scene coordinate map. The scene coordinate map $\mathbf{S}_i$ is computed via
$$\mathbf{S}_i(u) = \mathbf{T}_i^{-1}\big(D_i(u)\,\mathbf{K}_i^{-1}\,\tilde{u}\big),$$
where $\mathbf{T}_i$ and $\mathbf{K}_i$ are the camera (world-to-camera) transformation and intrinsic matrices, $\tilde{u}$ is a pixel location in homogeneous coordinates, and $D_i$ is the depth map of the layout for the $i$-th view (a minimal back-projection sketch appears after this list).
- Multi-view Latent Diffusion: The model is trained to represent the conditional distribution
$$p\big(\mathbf{I}^{\mathrm{tgt}}, \mathbf{M}, \mathbf{S} \,\big|\, \mathbf{I}^{\mathrm{src}}, \boldsymbol{\pi}\big),$$
where $\mathbf{I}^{\mathrm{src}}$ are the input source views, $\mathbf{I}^{\mathrm{tgt}}$ are the target novel views, $\mathbf{M}$ are semantic maps, $\mathbf{S}$ are scene coordinate maps, and $\boldsymbol{\pi}$ denotes the camera poses.
- Multi-view Multi-modal Alternating Attention: The core transformer backbone alternates between (1) cross-view attention, which aggregates features across viewpoints by concatenating tokens from all views, and (2) cross-modal attention within each view, which aligns features across modalities (color images $\mathbf{I}$, semantic maps $\mathbf{M}$, and scene coordinate maps $\mathbf{S}$) for each viewpoint, ensuring both inter-view and intra-view semantic and geometric consistency (a schematic sketch of this alternating pattern appears below).
- Scene Coordinate Map VAE (SCM-VAE): To encourage plausible and detailed geometry, a dedicated VAE is trained to reconstruct scene coordinate maps. The encoder maps a coordinate image $\mathbf{S}$ to a latent vector $\mathbf{z}$; the decoder reconstructs both the scene coordinates $\hat{\mathbf{S}}$ and a per-pixel uncertainty map $\boldsymbol{\sigma}$. A reconstruction loss and a multi-scale gradient loss,
$$\mathcal{L}_{\mathrm{SCM}} = \mathcal{L}_{\mathrm{rec}}\big(\hat{\mathbf{S}}, \mathbf{S}, \boldsymbol{\sigma}\big) + \lambda\,\mathcal{L}_{\mathrm{grad}}\big(\hat{\mathbf{S}}, \mathbf{S}\big),$$
are used to encourage sharp boundaries and precise geometry.
- Iterative Dense View Generation: Rather than generating all novel views simultaneously, the model produces scene coordinate maps and RGB images for subsets of views in each iteration. A global point cloud—aggregated from scene coordinate maps—is updated at each step and projected into guidance images for subsequent view synthesis cycles.
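The back-projection used to form the scene coordinate maps above can be illustrated with a short sketch. This is a minimal, self-contained example assuming a pinhole camera model with intrinsics `K` and a world-to-camera transform `T_world_to_cam`; the function name and conventions are illustrative rather than the released implementation.

```python
import numpy as np

def backproject_layout_depth(depth, K, T_world_to_cam):
    """Back-project a layout depth map into a per-pixel scene coordinate map S_i.

    depth: (H, W) layout depth for view i, in metres.
    K: (3, 3) camera intrinsic matrix.
    T_world_to_cam: (4, 4) rigid transform taking world points into the camera frame.
    Returns: (H, W, 3) world-space coordinates for every pixel.
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)

    # Camera-frame points: X_cam = D_i(u) * K^{-1} * u_tilde.
    cam = (np.linalg.inv(K) @ pix) * depth.ravel()[None, :]

    # World-frame points: X_world = T^{-1} * X_cam (homogeneous lift for the rigid transform).
    cam_h = np.vstack([cam, np.ones((1, H * W))])
    world = np.linalg.inv(T_world_to_cam) @ cam_h
    return world[:3].T.reshape(H, W, 3)
```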
This design supports synthesis pipelines in which arbitrary subsets of views, corresponding to different camera positions, are generated while recursively maintaining consistency with previously generated outputs.
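To make the alternating attention pattern concrete, the following PyTorch sketch interleaves cross-view attention (tokens concatenated across views, per modality) with cross-modal attention (tokens concatenated across modalities, per view). The tensor layout, head count, and module names are assumptions for exposition, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One block of cross-view attention followed by cross-modal attention.

    Input x has shape (B, V, M, N, D): batch, views, modalities
    (e.g. RGB / semantic map / scene coordinate map), tokens per view, channels.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.modal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, V, M, N, D = x.shape

        # (1) Cross-view attention: for each modality, attend over tokens of all views.
        t = x.permute(0, 2, 1, 3, 4).reshape(B * M, V * N, D)
        t = t + self.view_attn(self.norm1(t), self.norm1(t), self.norm1(t))[0]
        x = t.reshape(B, M, V, N, D).permute(0, 2, 1, 3, 4)

        # (2) Cross-modal attention: within each view, attend over tokens of all modalities.
        t = x.reshape(B * V, M * N, D)
        t = t + self.modal_attn(self.norm2(t), self.norm2(t), self.norm2(t))[0]
        return t.reshape(B, V, M, N, D)

# Example: 2 views, 3 modalities, 256 tokens per view per modality, 64-dim features.
block = AlternatingAttentionBlock(dim=64)
tokens = torch.randn(1, 2, 3, 256, 64)
out = block(tokens)  # same shape; tokens have now attended across views and modalities
```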
2. Dataset Description
SpatialGen is trained on a custom synthetic dataset constructed to address the limitations of existing resources in both scale and annotation fidelity:
- Scale and Structure: The dataset comprises 12,328 unique scenes spanning 57,440 rooms and approximately 4.7 million panoramic RGB renderings. Each scene is further divided into rooms, each with physical layout metadata.
- Rich Annotations: For every view, there are corresponding depth maps, semantic segmentation maps, scene coordinate maps, instance segmentation masks, surface normal maps, and albedo images. Semantic layouts (2D) and structural layouts (3D) are also provided.
- Viewpoint Diversity: The dataset simulates diverse camera trajectories (e.g., forward motion, inward/outward orbits, random walks), with cameras sampled at 0.5 m intervals along physically plausible, collision-free paths (a pose-sampling sketch follows this list). This supports robust training for free-viewpoint synthesis.
- Quality Assurance: Comprehensive filtering is used to ensure viewpoint validity, lighting realism, absence of overexposures, and avoidance of camera-object collisions—yielding a training corpus with high photometric and semantic reliability.
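As a rough illustration of the fixed-interval sampling above, the sketch below resamples an arbitrary camera path at 0.5 m arc-length spacing. Collision checking and orientation sampling are omitted, and the function name and parameters are illustrative rather than the dataset's actual generation pipeline.

```python
import numpy as np

def sample_camera_positions(path_xyz: np.ndarray, spacing: float = 0.5) -> np.ndarray:
    """Resample a polyline camera path at fixed arc-length intervals.

    path_xyz: (P, 3) ordered waypoints of a collision-free path.
    spacing: distance in metres between consecutive camera positions.
    Returns: (K, 3) camera positions spaced `spacing` metres apart along the path.
    """
    seg = np.diff(path_xyz, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])  # arc length at each waypoint
    targets = np.arange(0.0, cum[-1], spacing)         # desired arc lengths

    # Linearly interpolate each coordinate as a function of arc length.
    return np.stack([np.interp(targets, cum, path_xyz[:, k]) for k in range(3)], axis=1)

# Example: a 4 m straight corridor at 1.5 m camera height yields a camera every 0.5 m.
path = np.array([[0.0, 0.0, 1.5], [4.0, 0.0, 1.5]])
cams = sample_camera_positions(path)  # shape (8, 3)
```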
This resource represents one of the largest structured datasets for indoor scene synthesis, addressing prior bottlenecks in dataset scarcity and annotation inconsistency.
3. Performance Metrics and Results
SpatialGen is evaluated via both quantitative and qualitative benchmarks:
- Metrics:
- CLIP Similarity: Quantifies semantic alignment between text prompts (if images were generated from text) and the generated images.
- Image Reward: Measures alignment to human aesthetic judgments.
- PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), and LPIPS (Learned Perceptual Image Patch Similarity): Evaluate photometric and perceptual similarity of generated images to ground-truth renderings (a minimal PSNR/SSIM computation is sketched after this list).
- FID (Fréchet Inception Distance): Assesses distributional similarity between generated and reference images.
- Experimental Results:
- SpatialGen demonstrates higher CLIP similarity and more favorable (less negative) image reward relative to baselines such as Set-the-Scene and SceneCraft, especially on combined benchmarks (Structured3D and Hypersim).
- Direct integration of layout guidance notably improves fidelity and geometric consistency, with consistent increases in PSNR and SSIM, and reductions in LPIPS and FID across all benchmark camera trajectories.
- Qualitative evaluations show that both geometry and semantics in synthesized views adhere closely to specified layouts, and visual artifacts are minimized compared to methods not leveraging explicit scene coordinate supervision.
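For reference, PSNR and SSIM against ground-truth renderings can be computed with standard library routines, as in the minimal sketch below (LPIPS and FID additionally require pretrained feature extractors and are omitted). This is an illustrative snippet, not the paper's evaluation script.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def photometric_scores(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compare a generated view against its ground-truth rendering.

    pred, gt: (H, W, 3) float arrays in [0, 1].
    """
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    return {"psnr": psnr, "ssim": ssim}

# Example with random placeholders standing in for rendered and generated views.
gt = np.random.rand(256, 256, 3)
pred = np.clip(gt + 0.05 * np.random.randn(256, 256, 3), 0.0, 1.0)
print(photometric_scores(pred, gt))
```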
4. Applications and Implications
SpatialGen’s design supports multiple high-impact applications:
- Interior and Architectural Design: Rapid synthesis of visually plausible, semantically annotated, and physically consistent room and scene visualizations from user-specified layouts, enabling efficient design cycles and prototyping.
- Virtual and Augmented Reality: SpatialGen-generated content enhances free-navigation and environment simulation by ensuring inter-view and inter-modal consistency, key for VR/AR immersion.
- Robotics and Embodied AI: The method enables the automatic generation of realistic, semantically diverse simulated environments critical for training navigation, perception, and manipulation policies where accurate geometry and semantic segmentation are required.
- Content Creation and Media: By conditioning on layout and example images (optionally derived from text prompts), users can iteratively refine indoor scenes, transfer styles, or synthesize entirely novel environments for visual effects and simulation.
The explicit modeling of scene coordinate geometry and modality-aligned attention provides stronger semantic and geometric alignment than prior latent models, suggesting improved robustness in downstream planning and navigation tasks.
5. Technical Contributions and Open Source Release
SpatialGen advances layout-guided 3D synthesis through several innovations:
- Alternating attention modules for joint cross-view and cross-modal feature fusion in the transformer backbone, achieving consistent geometry and semantics across multi-view predictions.
- Dedicated SCM-VAE to ensure high-fidelity, uncertainty-aware 3D coordinate reconstructions.
- Iterative dense-view generation, enabling efficient and scalable synthesis of large numbers of novel views with explicit multi-view feedback via point cloud fusion (a loop sketch follows this list).
- Open-source availability of both the dataset (rich multimodal renderings, geometric, and semantic annotations) and model training/inference code, supporting transparency and reproducible research.
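The iterative dense-view generation loop referenced above can be summarised with the following pseudocode-style sketch: each pass generates a subset of views, back-projects their scene coordinate maps into a shared point cloud, and reprojects that cloud as guidance for the next pass. The `diffusion_model`, `backproject`, and `render_guidance` callables are hypothetical placeholders, not the released API.

```python
import numpy as np

def generate_dense_views(diffusion_model, backproject, render_guidance,
                         source_images, layout, all_poses, views_per_step=8):
    """Iteratively synthesise many novel views while keeping them mutually consistent.

    diffusion_model: callable returning (rgb, semantic, scene_coord) maps for a batch
                     of target poses, conditioned on layout, sources, and guidance images.
    backproject:     callable turning scene coordinate maps into world-space points.
    render_guidance: callable projecting the fused point cloud into the given poses.
    """
    fused_points = []   # global point cloud, grown after every iteration
    generated = []      # (pose, rgb, semantic, scene_coord) tuples

    for start in range(0, len(all_poses), views_per_step):
        poses = all_poses[start:start + views_per_step]

        # Project everything generated so far into the new viewpoints as guidance.
        guidance = render_guidance(fused_points, poses) if fused_points else None

        # Generate RGB, semantic, and scene coordinate maps for this subset of views.
        rgb, sem, scm = diffusion_model(source_images, layout, poses, guidance)

        # Fuse the new geometry into the global point cloud for later iterations.
        fused_points.append(backproject(scm, poses))
        generated.extend(zip(poses, rgb, sem, scm))

    return generated, np.concatenate(fused_points, axis=0)
```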
This combination of architectural advances, dataset scale/quality, and community resource release positions SpatialGen as a reference framework for future research in 3D indoor scene understanding, multi-modal synthesis, and controllable generative modeling.
6. Outlook and Community Impact
By addressing limitations in existing scene synthesis benchmarks and model architectures, SpatialGen lays the groundwork for further investigation in diverse areas:
- Customizable Scene Synthesis: Researchers can control scene generation at multiple levels (layout, semantics, style reference) and produce coherent 3D assets aligned with real-world specifications.
- Transfer Learning and Domain Adaptation: The explicit scene coordinate map and multi-modal alignment components support extensions to real scans, domain-adapted simulation, or photorealistic-to-simulated environment translation.
- Methodological Reproducibility: Open access to both the synthetic dataset and all training resources will facilitate the rigorous evaluation and comparison of newly proposed architectures or modalities for indoor scene generation.
In summary, SpatialGen defines a scalable, layout-guided multi-modal diffusion model and complementary dataset for generating photorealistic, semantically annotated, and geometrically consistent 3D indoor scenes. The approach establishes performance improvements over previous art and is positioned as a central resource for machine learning research in scene synthesis, robotics, and 3D vision (Fang et al., 18 Sep 2025).