3D Scene Generation: A Survey (2505.05474v1)

Published 8 May 2025 in cs.CV

Abstract: 3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. Recent advances like diffusion models bridge 3D scene synthesis and photorealism by reframing generation as image or video synthesis problems. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/hzxie/Awesome-3D-Scene-Generation.

Summary

  • The paper presents a comprehensive review of four 3D scene generation paradigms, emphasizing their methodologies and trade-offs.
  • It details key 3D representations, generative models, datasets, and evaluation metrics, providing actionable insights for diverse applications.
  • It identifies challenges and future directions, advocating for enhanced fidelity, controllability, and interactive scene synthesis.

This paper, "3D Scene Generation: A Survey" (2505.05474), provides a comprehensive overview of the state-of-the-art in synthesizing realistic, spatially structured, and semantically meaningful 3D environments. The authors highlight the growing importance of 3D scene generation across diverse fields like immersive media, gaming, robotics, autonomous driving, and embodied AI, while also emphasizing the unique challenges it presents compared to object or avatar generation, particularly in terms of scale, structural complexity, data availability, and fine-grained control.

The survey is structured into four main paradigms for 3D scene generation: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. It also covers the foundational concepts, relevant datasets, evaluation methodologies, downstream applications, and discusses future challenges and directions.

Preliminaries

The paper first establishes the task definition: generating a 3D scene representation S from an input x (noise, text, image, etc.) using a generative model G, i.e., S = G(x). It then reviews key 3D scene representations:

  • Voxel Grid: A 3D array storing per-cell properties such as occupancy, density, or semantic labels, suitable for structured volumetric data.
  • Point Cloud: An unordered set of 3D points, sparse and memory-efficient.
  • Mesh: Defines surfaces through vertices, edges, and faces, providing explicit connectivity.
  • Neural Fields (NeRF, SDF): Continuous implicit functions parameterized by neural networks, enabling high-fidelity rendering but often computationally intensive.
  • 3D Gaussians: Represent scenes as a set of anisotropic Gaussians with learnable attributes (position, covariance, opacity, color), offering efficient, high-quality rendering (see the sketch after this list).
  • Image Sequence: A set of images from different viewpoints, implicitly encoding 3D structure and used in image/video-based methods.
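
The 3D Gaussian representation above is concrete enough to sketch in code. The following is a minimal illustration of the per-primitive parameters used in 3D Gaussian splatting; the field names and helper function are illustrative assumptions, not code from the survey.

```python
# Illustrative per-primitive parameters for 3D Gaussian splatting (names are assumptions).
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    mean: np.ndarray       # (3,) center position in world space
    scale: np.ndarray      # (3,) per-axis extent; with `rotation` it defines an anisotropic covariance
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z) orienting the Gaussian
    opacity: float         # alpha used during compositing
    sh_coeffs: np.ndarray  # (k, 3) spherical-harmonic coefficients for view-dependent color

def covariance(g: Gaussian3D) -> np.ndarray:
    """Sigma = R diag(s) diag(s) R^T, the anisotropic covariance rasterized at render time."""
    w, x, y, z = g.rotation
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(g.scale)
    return R @ S @ S @ R.T

# A scene is simply a collection of such primitives.
scene: list[Gaussian3D] = []
```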

Common generative models used in this domain are also summarized:

  • Autoregressive Models: Generate data sequentially, conditioning on previous elements.
  • VAEs: Encode data into a probabilistic latent space for smooth variations, but can yield blurry outputs.
  • GANs: Use a generator and discriminator in adversarial training, capable of high realism but prone to training instability.
  • Diffusion Models: Gradually corrupt data with noise and learn to reverse the process, producing high-quality results but at a high computational cost (a minimal denoising-loop sketch follows this list).
  • Procedural Generators: Synthesize scenes through iterative application of parametric rules and mathematical operations, offering control and scalability but limited diversity without tuning.
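
Since diffusion models recur throughout the survey, a minimal DDPM-style sketch of the add-noise/denoise idea may help; `denoiser` is a placeholder network and all names are illustrative assumptions, not an implementation from the paper.

```python
# Minimal DDPM-style sketch: forward noising q(x_t | x_0) and ancestral sampling.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumption)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products of alphas

def forward_noise(x0: torch.Tensor, t: int):
    """Corrupt a clean sample x0 to noise level t."""
    eps = torch.randn_like(x0)
    xt = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps
    return xt, eps

@torch.no_grad()
def sample(denoiser, shape):
    """Reverse process: start from pure noise and iteratively denoise to a sample."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t)  # network predicts the noise added at step t
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```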

Methods: A Hierarchical Taxonomy

The core of the survey lies in its classification of methods:

  1. Procedural Generation: Creates scenes using predefined rules, constraints, or LLM guidance. These methods excel in efficiency and spatial consistency but may lack realism and require tuning.
    • Rule-based Generation: Uses explicit algorithms such as fractals, grammars (e.g., the L-systems behind CityEngine [DBLP:conf/siggraph/ParishM01]), or simulations (erosion, plant growth, city evolution) to generate geometry; see the L-system sketch after this taxonomy. Example-based methods improve controllability by expanding from user examples.
    • Optimization-based Generation: Formulates scene synthesis as minimizing cost functions encoding physical (e.g., collision) or layout constraints. Uses stochastic optimization or learns statistical patterns for object arrangement (e.g., using Bayesian networks or graph-based models). Systems like ProcTHOR [DBLP:conf/nips/DeitkeVHWESHKKM22] and Infinigen Indoors [DBLP:conf/cvpr/RaistrickMKY0HW24] allow user-defined constraints.
    • LLM-based Generation: Leverages LLMs to enable text-driven synthesis. LLMs can generate scene layouts (object parameters, scene graphs [DBLP:conf/nips/FengZFJAHBWW23]) for retrieval/generation or act as agents controlling procedural software (e.g., generating scripts for Blender [DBLP:conf/3dim/abs-2310-12945]).
  2. Neural 3D-based Generation: Employs 3D-aware generative architectures to synthesize intermediate representations or directly generate 3D structures. They offer high view and semantic consistency but face limitations in controllability and efficiency.
    • Scene Parameters: Generates compact object parameters (location, size, class) as an intermediate step before retrieval or generation (a minimal layout sketch follows this taxonomy). Uses CNNs, transformers (e.g., ATISS [DBLP:conf/nips/PaschalidouKSKG21]), or diffusion models (e.g., DiffuScene [DBLP:conf/cvpr/TangNMDTN24]). Text prompts or human motion (e.g., MIME [DBLP:conf/cvpr/YiHTHTB23]) can condition generation.
    • Scene Graph: Uses structured graphs (nodes=objects, edges=relationships) as blueprints. Graph-based VAEs (Graph-to-3D [DBLP:conf/iccv/DhamoMNT21]) or diffusion models (CommonScenes [DBLP:conf/nips/ZhaiOWDTNB23]) generate graphs that guide 3D creation via retrieval or shape generation. Text inputs can be integrated (InstructScene [DBLP:conf/iclr/LinM24]).
    • Semantic Layout: Employs 2D (top-down semantic maps, height maps) or 3D (voxels, bounding boxes) layouts as guidance. Generative models (GANs, diffusion) synthesize 3D representations (NeRF, SDF, voxel grids) conditioned on these layouts (e.g., CC3D [DBLP:conf/iccv/BahmaniPPYWGT23], BlockFusion [DBLP:journals/tog/WuLYSSWCLSLJ24]). Text guidance can be integrated via Score Distillation Sampling (SDS).
    • Implicit Layout: Learns latent feature maps encoding spatial structure. Encoders map scene info to latent space, decoders generate 3D scenes (NeRF, 3D Gaussians, voxels). Uses GANs (GSN [DBLP:conf/iccv/DeVries0STS21]), VAEs, or diffusion models (GAUDI [DBLP:conf/nips/0001GATTCDZGUDS22], Director3D [DBLP:conf/nips/LiLXQCZ0J24]). Handles computational challenges with voxel grids via hierarchical approaches.
  3. Image-based Generation: Bridges 2D and 3D by using 2D image generators, sometimes with 3D reconstruction. Offers photorealism and diversity but struggles with depth accuracy and long-range consistency.
    • Holistic Generation: Creates entire scene images (often panoramas) in one step using GANs or diffusion models (e.g., MVDiffusion [DBLP:conf/nips/TangZCWF23], PanFusion [DBLP:conf/cvpr/ZhangWGH0O024]). Text-to-image models enable text-driven panorama generation, addressing boundary continuity issues. Can be followed by 3D reconstruction (NeRF, 3D Gaussians) for multi-view exploration.
    • Iterative Generation: Starts with an initial image and progressively extrapolates the scene along a camera trajectory by warping and outpainting images ("render-refine-repeat"; see the sketch after this taxonomy). Uses GANs (Infinite Nature [DBLP:conf/iccv/LiuM0SJK21]) or diffusion models (DiffDreamer [DBLP:conf/iccv/CaiCPSOGW23]). Can integrate depth estimation and 3D representations (meshes, NeRFs [DBLP:journals/tvcg/ZhangLWWL24], 3D Gaussians [DBLP:journals/corr/abs-2311-13384]) for consistency. Text guidance is common.
  4. Video-based Generation: Leverages video diffusion models to produce image sequences, enabling dynamic environments. These methods achieve high realism and diversity and inherit strong temporal coherence, but are challenged by cross-view alignment.
    • Two-stage Generation: Generates multi-view videos in two steps, aiming for spatial consistency then temporal coherence. May optimize a dynamic 3D representation (4D Gaussians) afterwards (e.g., 4Real [DBLP:conf/nips/YuWZMSCJT024], DimensionX [DBLP:journals/corr/abs-2411-04928]).
    • One-stage Generation: Consolidates generation into a single process, implicitly capturing spatio-temporal consistency to produce videos from any viewpoint/timestep (e.g., GenXD [DBLP:journals/corr/abs-2411-02319]). Can optimize dynamic 3D representations (4D Gaussians [DBLP:conf/iclr/abs-2406-13527]). Important for applications like autonomous driving (MagicDrive [DBLP:conf/iclr/0001CXHLY024], Vista [DBLP:conf/nips/GaoY0CQ0Z0Z024]) and gaming (DIAMOND [DBLP:conf/nips/AlonsoJMKSPF24]), where models predict future frames based on control signals or user actions.
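
As referenced under rule-based procedural generation, an L-system reduces to iterative string rewriting. The sketch below uses a classic branching-plant rule purely for illustration; real engines (e.g., CityEngine's road grammars) attach geometric interpretations to each symbol.

```python
# Minimal L-system: rule-based procedural generation as parallel string rewriting.
def l_system(axiom: str, rules: dict[str, str], iterations: int) -> str:
    s = axiom
    for _ in range(iterations):
        s = "".join(rules.get(ch, ch) for ch in s)  # rewrite every symbol in parallel
    return s

# Classic branching rule: F = "draw forward", [ and ] = push/pop state, + and - = turn.
rules = {"F": "F[+F]F[-F]F"}
print(l_system("F", rules, iterations=2))
# The expanded string is then interpreted by a turtle-graphics or geometry engine
# to produce branches, road networks, or building footprints.
```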
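
The "scene parameters" intermediate representation used by neural 3D-based methods is likewise easy to make concrete: each object is a class label plus an oriented 3D box, which a prior such as ATISS or DiffuScene generates before retrieval or shape synthesis. The field names below are assumptions for illustration only.

```python
# Illustrative object-parameter layout generated before mesh retrieval/synthesis.
from dataclasses import dataclass, field

@dataclass
class ObjectParams:
    category: str                         # e.g., "sofa", "coffee_table"
    position: tuple[float, float, float]  # box center in scene coordinates (meters)
    size: tuple[float, float, float]      # box extents along each axis (meters)
    yaw: float                            # rotation about the vertical axis (radians)

@dataclass
class SceneLayout:
    room_type: str
    objects: list[ObjectParams] = field(default_factory=list)

layout = SceneLayout("living_room", [
    ObjectParams("sofa", (0.0, 0.0, 0.4), (2.0, 0.9, 0.8), yaw=0.0),
    ObjectParams("coffee_table", (0.0, 1.2, 0.25), (1.0, 0.6, 0.5), yaw=0.0),
])
```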
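
Finally, the "render-refine-repeat" loop behind iterative image-based generation can be summarized as below. Every callable is a placeholder for a module discussed in the survey (depth estimator, warper, 2D outpainting model, 3D fusion); no real API is implied.

```python
# Schematic render-refine-repeat loop for iterative image-based scene generation.
def render_refine_repeat(init_image, cameras, estimate_depth, warp, outpaint, fuse, scene=None):
    image = init_image
    for camera in cameras:
        depth = estimate_depth(image)               # monocular depth for the current view
        warped, holes = warp(image, depth, camera)  # reproject into the next camera pose
        image = outpaint(warped, holes)             # fill disoccluded regions with a 2D generator
        scene = fuse(scene, image, depth, camera)   # lift into a mesh / NeRF / 3D Gaussians
    return scene
```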

Datasets and Evaluation

The paper provides a comprehensive overview of datasets, categorized by scene type (indoor, natural, urban) and source (real-world scans/images, synthetic from game engines/CAD models). Real-world datasets (e.g., ScanNet [DBLP:conf/cvpr/DaiCSHFN17], Matterport3D [DBLP:conf/3dim/ChangDFHNSSZZ17], KITTI [DBLP:conf/cvpr/GeigerLU12]) offer realism but limited diversity and annotations. Synthetic datasets (e.g., SUNCG [DBLP:conf/cvpr/SongYZCSF17], Structured3D [DBLP:conf/eccv/ZhengZLTGZ20], CARLA [DBLP:conf/corl/DosovitskiyRCLK17]) provide scale, diversity, and rich annotations but may lack photorealism.

Evaluation metrics are discussed across dimensions:

  • Fidelity: Using metrics like FID, KID, IS from image/video generation, sometimes adapted for 3D (F3D); a sketch of the underlying Frechet distance follows this list.
  • Spatial Consistency: Measuring depth error or camera pose error relative to pseudo ground truth or SfM reconstructions.
  • Temporal Coherence: Evaluating video stability using metrics like Flow Warping Error (FE), Frechet Video Distance (FVD), or Frechet Video Motion Distance (FVMD).
  • Controllability: Assessing alignment with text prompts using metrics like CLIP Score.
  • Diversity: Measuring distribution similarity (MMD, COV, 1-NNA) or category distributions (CKL).
  • Plausibility: Checking for physical constraints like collisions or out-of-bounds objects.
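
As referenced under Fidelity, FID/FVD-style metrics reduce to a Frechet distance between Gaussians fitted to real and generated embeddings (feature extraction with an Inception or I3D backbone is omitted here). A minimal sketch:

```python
# Frechet distance between feature distributions, the core of FID/FVD.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """real_feats, fake_feats: (N, D) arrays of embedding vectors."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```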

Benchmark suites (Q-Align, VideoScore, VBench, WorldScore [DBLP:journals/corr/abs-2504-00983]) are highlighted as efforts towards unified, comprehensive evaluation. Human evaluation remains crucial for subjective aspects like realism and aesthetic quality.

Applications and Tasks

3D scene generation powers various applications:

  • 3D Scene Editing: Enabling texture stylization (e.g., RoomTex [DBLP:conf/eccv/WangLXWWDZX24]) and layout arrangement/rearrangement (e.g., SceneDirector [DBLP:journals/tvcg/ZhangTLRFZ24], compositional NeRFs).
  • Human-Scene Interaction: Generating scenes compatible with human motion (e.g., Pose2Room [DBLP:conf/eccv/NieD0N22]) or scaling HSI data generation by combining human models with diverse scenes.
  • Embodied AI: Creating large-scale, diverse, and physically plausible simulation environments (e.g., ProcTHOR [DBLP:conf/nips/DeitkeVHWESHKKM22], Holodeck [DBLP:conf/cvpr/YangSWVHH0HKLCY24]).
  • Robotics: Supporting simulation-based learning of manipulation/control skills (e.g., RoboGen [DBLP:conf/icml/WangXCWWFEHG24]) or serving as world models for predicting future states (e.g., using NeRFs/dynamic Gaussians [DBLP:conf/corl/ZeYWMGY0LW23]).
  • Autonomous Driving: Creating world models for prediction and planning (e.g., using video or 3D occupancy generation) and synthesizing diverse, safety-critical scenarios for data augmentation and training.

Challenges and Future Directions

Key challenges identified are:

  • Generative Capacity: Difficulty in jointly achieving photorealism, 3D consistency, and controllability, especially for complex scenes.
  • 3D Representation: Lack of a single representation that is compact, physically meaningful, and visually realistic.
  • Data and Annotations: Limited availability of high-quality, annotated 3D scene datasets, especially with rich metadata like physics or interaction cues.
  • Evaluation: Absence of unified protocols and benchmarks that fully capture 3D geometry, physical plausibility, and diverse input conditions.

Promising future directions include:

  • Better Fidelity: Developing models that jointly reason about geometry, texture, lighting, and multi-view consistency, capturing subtle details and maintaining coherence.
  • Physical-aware Generation: Incorporating physics priors, constraints, or differentiable simulators to ensure physical plausibility for object placement, interaction, and dynamics.
  • Interactive Scene Generation: Creating scenes with interactive objects that respond meaningfully to user inputs or environmental changes.
  • Unified Perception-Generation: Developing models that integrate perception and generation capabilities, leveraging common priors and enabling bidirectional benefits for scene understanding and synthesis.

The survey concludes by emphasizing that 3D scene generation is a rapidly evolving field with significant challenges but also immense potential for advancing generative AI, 3D vision, and embodied intelligence.
