SceneDM: Dynamic Scene Modeling

Updated 24 March 2026

SceneDM is a computational framework that models dynamic scenes using generative diffusion processes, enabling realistic multi-agent simulations and semantic scene editing.
It integrates diverse components such as traffic simulation, 3D asset creation, and real-time scene graph management for comprehensive scene modeling.
SceneDM employs transformer-based denoisers, neural fields, and dual-representation geometry to achieve scalable, physically plausible simulation, with reported gains on realism, rendering-quality, and efficiency benchmarks.

SceneDM refers to a family of computational models, primarily based on diffusion processes or neural fields, designed for the comprehensive generation, simulation, manipulation, or updating of complex dynamic environments at the scene level. These models are widely applied in scenarios requiring global coherence, multi-agent consistency, realistic physics, dynamic interactions, and flexible editing. SceneDM spans traffic world modeling for large-scale driving simulation, physically-plausible multi-agent future prediction, general 3D asset generation, and real-time semantic scene graph management.

1. Core Definitions and Problem Scope

SceneDM approaches unify the tasks required for synthetic scene simulation. The central object is a high-dimensional scene representation that captures all dynamic and static elements: vehicles, pedestrians, other agents, environmental controls (e.g., traffic lights), and the physical layout. For semantic mapping, the representation is instead a hierarchical scene graph with typed objects and relations. The primary objectives include:

Generative modeling of physically-plausible future states, globally-consistent scene layouts, or 3D structures, often via diffusion or autoregressive models
Dynamic updates such as spawning and removing agents, modeling occlusion, or reflecting changes from multimodal observations (vision, actions, language)
Multi-agent motion simulation with explicit modeling of agent-agent interactions and adherence to map/environment constraints
Scene editing and manipulation, including object-level repositioning, deformation, and scene inpainting
End-to-end training: integrating all aspects (scene population, behavior, environmental logic) in a single loss-minimized architecture

This scope is exemplified in city-scale traffic simulation by SceneDiffuser++ (Tan et al., 27 Jun 2025), indoor/outdoor semantic graph updating (Olivastri et al., 2024), multi-modal trajectory modeling (Guo et al., 2023), vectorized driving scene generation (Rowe et al., 28 Mar 2025), and 3D scene generation (Kim et al., 2023).
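To make the scene-graph side of this scope concrete, the following is a minimal sketch of a hierarchical graph with typed nodes and add, remove, and move updates. The class and method names (`Node`, `SceneGraph`, `children`) are hypothetical and chosen for illustration; they are not taken from any of the cited systems.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str  # e.g. "room" or "object" (illustrative types)


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    parent: dict = field(default_factory=dict)  # child id -> parent id

    def add(self, node, parent_id=None):
        """Insert a node, optionally attaching it below an existing parent."""
        if parent_id is not None and parent_id not in self.nodes:
            raise KeyError(parent_id)
        self.nodes[node.node_id] = node
        if parent_id is not None:
            self.parent[node.node_id] = parent_id

    def remove(self, node_id):
        """Delete a node and detach any children that pointed to it."""
        del self.nodes[node_id]
        self.parent.pop(node_id, None)
        for child in [c for c, p in self.parent.items() if p == node_id]:
            del self.parent[child]

    def move(self, node_id, new_parent_id):
        """Re-attach a node below a different parent."""
        if node_id not in self.nodes:
            raise KeyError(node_id)
        if new_parent_id not in self.nodes:
            raise KeyError(new_parent_id)
        self.parent[node_id] = new_parent_id

    def children(self, parent_id):
        return sorted(c for c, p in self.parent.items() if p == parent_id)


graph = SceneGraph()
graph.add(Node("kitchen", "room"))
graph.add(Node("living_room", "room"))
graph.add(Node("mug", "object"), parent_id="kitchen")
graph.move("mug", "living_room")
```

A real updater would drive these operations from perception and language inputs; here they are called by hand to show the data structure only.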

2. Mathematical Foundations of SceneDM

Most SceneDM methods are built upon generative diffusion processes, characterized by iterative denoising steps in a high-dimensional latent space. Typical mathematical structures are as follows:

Scene tensor representation (for world models): The full environment is encoded as tensors,

$x_i \in \mathbb{R}^{E_i \times T \times D_i}, \quad i \in \{\text{agents, lights}\}$

where $E_i$ is the maximum slot count for each entity type, $T$ is a time horizon, and $D_i$ encodes per-slot features.
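As a rough illustration of this layout, the sketch below packs variable-length agent tracks into a fixed $E \times T \times D$ nested list with zero padding. The accompanying validity mask is an assumption added here to mark which slot/time entries are occupied; the function name `pack_entities` and the input format are hypothetical.

```python
def pack_entities(tracks, max_slots, horizon, feat_dim):
    """Pack tracks into an (E, T, D) nested list plus an (E, T) validity mask.

    tracks: list of dicts mapping time step -> feature list (hypothetical format).
    """
    if len(tracks) > max_slots:
        raise ValueError("more tracks than available slots")
    x = [[[0.0] * feat_dim for _ in range(horizon)] for _ in range(max_slots)]
    valid = [[False] * horizon for _ in range(max_slots)]
    for slot, track in enumerate(tracks):
        for t, feats in track.items():
            if not 0 <= t < horizon:
                raise ValueError("time step outside horizon")
            if len(feats) != feat_dim:
                raise ValueError("wrong feature dimension")
            x[slot][t] = list(feats)
            valid[slot][t] = True
    return x, valid


# Two agents with (x, y) features: the second one appears only from t = 2.
agent_tracks = [
    {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0]},
    {2: [5.0, 1.0], 3: [5.0, 2.0]},
]
x_agents, valid_agents = pack_entities(agent_tracks, max_slots=3, horizon=4, feat_dim=2)
```

Unused slots stay zero-filled with an all-false mask, which is one simple way to represent a variable number of entities in a fixed-size tensor.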

Diffusion process (Tan et al., 27 Jun 2025, Guo et al., 2023):
- Forward (noising):
$q(z_t \mid x) = \mathcal{N}(z_t \mid \alpha_t x, \sigma_t^2 I)$

with $\alpha_t$, $\sigma_t$ following cosine or variance-preserving schedules.
- Reverse (denoising): A neural network predicts either $v_t$ (v-prediction) or the noise vector directly at each step, to iteratively reconstruct $x$.
Losses: Weighted MSE over all slots and features, often including masking for unoccupied slots or out-of-distribution elements.
Scene graphs (Olivastri et al., 2024): Formalized as $G_t = (V_t, E_t)$, with nodes and edges typed by roles (room, object) and hierarchical relations updated via multimodal proposals and confidence fusion.
Consistency constraints: Overlapping and temporal constraints (e.g., consistent noise injection for neighbors in trajectory sequences) enforce smoothness and scene-level metric adherence (Guo et al., 2023).
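The forward process and the masked loss listed above can be sketched on a flat feature vector as follows. This assumes the common cosine parameterization $\alpha_t = \cos(\pi t / 2)$, $\sigma_t = \sin(\pi t / 2)$ and the standard v-prediction target $v_t = \alpha_t \epsilon - \sigma_t x$, from which $x = \alpha_t z_t - \sigma_t v_t$ follows. The cited papers may use different schedules or loss weightings, and all function names here are illustrative.

```python
import math
import random


def cosine_schedule(t):
    """Return (alpha_t, sigma_t) for t in [0, 1], with alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def noise_sample(x, t, rng):
    """Draw z_t ~ N(alpha_t * x, sigma_t^2 * I) and return it with the noise used."""
    alpha, sigma = cosine_schedule(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    z = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]
    return z, eps


def v_target(x, eps, t):
    """v-prediction regression target: v = alpha_t * eps - sigma_t * x."""
    alpha, sigma = cosine_schedule(t)
    return [alpha * ei - sigma * xi for xi, ei in zip(x, eps)]


def x_from_v(z, v, t):
    """Recover the clean sample from z_t and v_t: x = alpha_t * z - sigma_t * v."""
    alpha, sigma = cosine_schedule(t)
    return [alpha * zi - sigma * vi for zi, vi in zip(z, v)]


def masked_mse(pred, target, mask):
    """Mean squared error over entries whose mask is 1 (unweighted simplification)."""
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(m * (p - q) ** 2 for p, q, m in zip(pred, target, mask)) / count


rng = random.Random(0)
x_clean = [1.0, -2.0, 0.5, 0.0]
slot_mask = [1, 1, 1, 0]  # last feature belongs to an unoccupied slot
z_t, eps = noise_sample(x_clean, 0.3, rng)
v_t = v_target(x_clean, eps, 0.3)
x_rec = x_from_v(z_t, v_t, 0.3)
```

In training, a denoiser's output would take the place of `v_t` in `masked_mse`; the round trip through `x_from_v` is included only to show that the v-parameterization is invertible.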

Distinct architectures may include hierarchical latent diffusion (Kim et al., 2023), neural field objects (Wang et al., 2022), or dual SDF–3DGS representations (Tourani et al., 15 Oct 2025).
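For readers unfamiliar with signed distance fields, the following generic sketch shows the basic idea on analytic spheres: the SDF is negative inside a shape, positive outside, and the union of shapes is the pointwise minimum. It is a textbook illustration under these assumptions, not the learned SDF of any cited method, and the function names are hypothetical.

```python
import math


def sphere_sdf(point, center, radius):
    """Signed distance from point to a sphere surface (negative inside)."""
    return math.dist(point, center) - radius


def scene_sdf(point, spheres):
    """Signed distance to a union of spheres: the minimum over all shapes."""
    return min(sphere_sdf(point, c, r) for c, r in spheres)


def placement_is_free(center, radius, spheres):
    """A new sphere does not overlap the scene if the scene SDF at its
    center is at least its own radius."""
    return scene_sdf(center, spheres) >= radius


scene = [((0.0, 0.0, 0.0), 1.0)]
far_ok = placement_is_free((3.0, 0.0, 0.0), 1.0, scene)
near_ok = placement_is_free((1.5, 0.0, 0.0), 1.0, scene)
```

The same query pattern (evaluate the field at a candidate location and compare against a clearance) is what makes SDF-style geometry convenient for collision-aware edits.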

3. Model Architectures and Algorithms

SceneDM solutions display a diversity of architectures depending on application domain:

Transformer-based denoisers: Used for both temporal (multi-agent trajectory) and spatial (scene tensor slot) reasoning, with axial or attention blocks capturing inter-agent interactions (Tan et al., 27 Jun 2025, Guo et al., 2023).
Autoregressive inpainting and rollout: At each step, partial prediction of future states is carried out, then used as input for re-planning and simulation continuation (Tan et al., 27 Jun 2025).
Multi-tensor diffusion: Agents and environmental distractors (e.g., traffic lights) are jointly diffused by concatenating slot-embedded tensors, allowing interaction learning between object types (Tan et al., 27 Jun 2025).
Vectorized and graph-based encodings: Scene structure is directly embedded and then diffused, as in Scenario Dreamer, achieving high parameter efficiency for scene generation (Rowe et al., 28 Mar 2025), or updated in real time via scene graphs and multimodal fusion (Olivastri et al., 2024).
Neural fields (NF): SceneDM for 3D environments leverages latent NeRF-style field representations for photorealistic, editable scene synthesis and manipulation (Kim et al., 2023, Wang et al., 2022).
Dual-representation geometry: Dynamic scene modeling may combine 3D Gaussian splatting with signed distance fields (SDFs), supporting both photorealism and watertight geometry for editing and decomposition (Tourani et al., 15 Oct 2025).
Algorithmic workflows: Unified simulation pipelines typically alternate between a world-model step (diffusion), an external planning stack, and recombination of input/output tensors (Tan et al., 27 Jun 2025).
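The alternation between a world-model step, an external planner, and recombination can be sketched as a simple closed-loop rollout. In this sketch the world model is replaced by a constant-velocity placeholder rather than a diffusion denoiser, the planner is a trivial stub, and the world model does not condition on the planner's output within the same step as a true inpainting setup would. All function names and the dictionary-based state format are assumptions for illustration.

```python
def world_model_step(history, horizon):
    """Placeholder for the generative world model: constant-velocity
    extrapolation of every agent for `horizon` future steps."""
    future = {}
    for agent, states in history.items():
        (x0, y0), (x1, y1) = states[-2], states[-1]
        vx, vy = x1 - x0, y1 - y0
        future[agent] = [(x1 + vx * k, y1 + vy * k) for k in range(1, horizon + 1)]
    return future


def planner_step(ego_states, steps):
    """Placeholder external planner: the ego advances one unit in x per step."""
    x, y = ego_states[-1]
    return [(x + k, y) for k in range(1, steps + 1)]


def closed_loop_rollout(history, total_steps, horizon=4, replan_every=2, ego="ego"):
    """Alternate prediction, planning, and recombination until total_steps is reached."""
    if replan_every > horizon:
        raise ValueError("replan_every must not exceed horizon")
    history = {agent: list(states) for agent, states in history.items()}
    done = 0
    while done < total_steps:
        keep = min(replan_every, total_steps - done)
        future = world_model_step(history, horizon)
        future[ego] = planner_step(history[ego], horizon)  # planner owns the ego slot
        for agent, states in future.items():
            history[agent].extend(states[:keep])  # commit only the first `keep` steps
        done += keep
    return history


initial = {"ego": [(0, 0), (1, 0)], "car_1": [(0, 5), (0, 4)]}
rollout = closed_loop_rollout(initial, total_steps=5)
```

Only the first `replan_every` predicted steps are committed before the loop re-predicts, which is the usual receding-horizon pattern for this kind of simulation.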

4. SceneDM for Simulation, Trajectory Generation, and Editing

SceneDM enables several critical capabilities:

City-scale traffic simulation: SceneDiffuser++ performs end-to-end city simulation with joint agent behavior, traffic lights, dynamic spawn/remove operations, and supports closed-loop rollouts for AV stack evaluation (Tan et al., 27 Jun 2025).
Multi-agent future prediction: SceneDM generates consistent scene-level trajectories across all agent types, with scene-level scoring and filtering to enforce physical realism (e.g., collision avoidance, road adherence) (Guo et al., 2023).
Semantic graph updating: Multi-modal fusion of vision, language, robot action, and time priors enables continuous update of dynamic scene graphs for robotics environments (Olivastri et al., 2024).
3D scene generation: Hierarchical latent diffusion and neural-field decoders produce complete editable 3D environments with scene-level constraints, supporting conditional generation, inpainting, and style transfer (Kim et al., 2023).
Scene editing/decomposition: Dual SDF+3DGS methods and object fields allow object-level manipulation, duplication, insertion, and collision-aware editing within continuous 3D representation (Tourani et al., 15 Oct 2025, Wang et al., 2022).
Behavioral simulation: Return-conditioned, multi-agent autoregressive models generate closed-loop, diverse agent behaviors challenging for downstream RL planning agents (Rowe et al., 28 Mar 2025).
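The scene-level scoring and filtering step mentioned for multi-agent prediction can be illustrated with a simple collision check over sampled candidate scenes. The scoring rule used here (minimum same-time distance between any two agents, with a fixed collision radius) is an assumption chosen for brevity; the cited work may score scenes differently, and the function names are hypothetical.

```python
import math


def min_pairwise_distance(scene):
    """Smallest distance between any two agents at the same time step.

    scene: dict mapping agent id -> list of (x, y) positions of equal length.
    """
    agents = list(scene)
    best = math.inf
    for i in range(len(agents)):
        for j in range(i + 1, len(agents)):
            for p, q in zip(scene[agents[i]], scene[agents[j]]):
                best = min(best, math.dist(p, q))
    return best


def filter_scenes(candidates, collision_radius):
    """Drop candidate scenes with a collision and rank the rest by clearance.

    Returns indices of surviving candidates, largest clearance first.
    """
    scored = [(min_pairwise_distance(scene), idx) for idx, scene in enumerate(candidates)]
    kept = [(dist, idx) for dist, idx in scored if dist > collision_radius]
    kept.sort(reverse=True)
    return [idx for _, idx in kept]


ego_path = [(0.0, 0.0), (1.0, 0.0)]
candidates = [
    {"a": ego_path, "b": [(0.0, 3.0), (1.0, 3.0)]},  # clearance 3.0
    {"a": ego_path, "b": [(2.0, 0.0), (1.0, 0.5)]},  # clearance 0.5: collision
    {"a": ego_path, "b": [(0.0, 5.0), (1.0, 5.0)]},  # clearance 5.0
]
ranking = filter_scenes(candidates, collision_radius=1.0)
```

A fuller scorer would also include road adherence and kinematic feasibility terms; only the collision term is shown here.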

5. Quantitative Performance and Evaluation

Reported benchmark results for the cited methods span several metrics:

Traffic simulation realism: SceneDiffuser++ achieves a composite JS-divergence score of 0.29 (vs 0.48 for IDM and the original SceneDiffuser), with 50% lower divergence on agent spawn/remove metrics and realistic agent counts, speed profiles, and traffic light state transitions (Tan et al., 27 Jun 2025).
Trajectory prediction: On the Waymo Sim Agents Benchmark, SceneDM achieves a realism meta-metric of 0.5000–0.5060, outperforming prior methods including MTR_E and Multipath, with top or near-top values on kinematic, interaction, and map adherence metrics (Guo et al., 2023).
Scene graph consistency: Multi-modal updaters achieve 100% success on add operations and 66.7% on remove/move operations (limiting factor: RGB-D sensor misses), and maintain event-driven, atomic update consistency (Olivastri et al., 2024).
3D scene generation: Hierarchical LDMs outperform prior state-of-the-art on FID/FVD and perceptual metrics, with FID=19.5 vs 33.7 (VizDoom) and FVD=91.8 vs 134.9 (Carla) (Kim et al., 2023). Dual SDF+3DGS achieves a PSNR of about 33.98 dB, SSIM 0.944, and LPIPS 0.104 (Waymo NOTR) without 3D tracklets or LiDAR (Tourani et al., 15 Oct 2025).
Simulation efficiency: Scenario Dreamer attains 2x parameter reduction, 6x lower generation latency, and 10x GPU hour savings versus rasterized scene-generation methods, along with lower collision and JS divergence in both Waymo and nuPlan benchmarks (Rowe et al., 28 Mar 2025).

Method	Domain	Quantitative Lead
SceneDiffuser++	City traffic sim	JS-div 0.29 vs 0.48 baseline (WOMD-XLMap)
SceneDM	Multi-agent pred.	Realism 0.5000–0.5060 (Waymo Sim Agents)
NeuralField-LDM	3D generation	FID=19.5 (VizDoom), FVD=91.8 (Carla)
UGSDF (SDF+3DGS)	Urban rendering	PSNR~34 dB, SSIM~0.94 (Waymo NOTR)
Scenario Dreamer	Driving sim	2x fewer params, 6x lower latency, 10x fewer GPU h
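Several of the results above are reported as Jensen-Shannon (JS) divergences between simulated and logged distributions. As a hedged illustration of the metric itself, the sketch below computes the JS divergence between two histograms using base-2 logarithms, so the value lies in $[0, 1]$. The binning, the clipping of out-of-range samples, and the function names are assumptions; how the cited benchmarks aggregate per-feature divergences into a composite score is not shown.

```python
import math


def histogram(samples, bins, lo, hi):
    """Normalized histogram over [lo, hi); out-of-range samples go to the edge bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        idx = min(bins - 1, max(0, int((s - lo) / width)))
        counts[idx] += 1
    return [c / len(samples) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


logged_speeds = histogram([0.5, 1.5, 1.6, 3.9], bins=4, lo=0.0, hi=4.0)
simulated_speeds = histogram([0.4, 0.6, 2.5, 3.5], bins=4, lo=0.0, hi=4.0)
speed_js = js_divergence(logged_speeds, simulated_speeds)
```

Lower values indicate that the simulated distribution is closer to the logged one; identical histograms give 0 and histograms with disjoint support give 1.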

6. Limitations and Future Directions

Current SceneDM approaches encounter model- and domain-specific challenges:

Drift over extended rollouts: Simulated trajectories diverge from data after long horizons ($> 300$ s), suggesting the need for online correction or tighter planner/world coupling (Tan et al., 27 Jun 2025).
Edge case handling: Agent collisions/offroad events during agent spawn, rare event modeling (emergency vehicles), and explicit rule compliance (stop signs, yield) require further architectural augmentation (Tan et al., 27 Jun 2025, Guo et al., 2023).
Sensor fidelity bottlenecks: Dynamic scene graph updaters’ accuracy is limited by vision system coverage and dynamic object recall (Olivastri et al., 2024).
Editability/extrapolation limits: Expressivity bottlenecks for highly nonrigid objects in SDF/deformation models, extrapolation beyond observed viewpoints, and generalization to low-data or novel domains remain open (Tourani et al., 15 Oct 2025).
Computational constraints: Reducing inference costs per sample (e.g., via distillation or accelerated samplers) is an active research direction (Tan et al., 27 Jun 2025).
Joint world/planner learning: Integrated, bi-directional learning between planning and the world model for robust closed-loop simulation is underexplored (Tan et al., 27 Jun 2025).
Autonomous goal-conditioning: Directly injecting explicit route or goal specification into diffusion-based simulators could improve coherence in trip-level simulation (Tan et al., 27 Jun 2025).

Advances in efficient latent representation, hierarchical modeling, cross-modal fusion, and explicit constraint integration are expected to further enhance the flexibility, realism, and generalizability of SceneDM.

7. Broader Implications and Research Impact

SceneDM stands as a unifying methodological axis for synthetic environment generation, dynamic simulation, and semantic mapping across several research disciplines:

Autonomous driving and robotics: Enables scalable simulation for AV evaluation, policy learning, and safety validation in diverse, realistic world models (Tan et al., 27 Jun 2025, Rowe et al., 28 Mar 2025).
3D vision and graphics: Supports high-fidelity, editable 3D asset creation, novel view synthesis, and object-centric manipulation (Kim et al., 2023, Wang et al., 2022, Tourani et al., 15 Oct 2025).
Semantic scene understanding: Advances real-time adaptive mapping and belief update for long-term robotic operation in dynamic, multi-agent environments (Olivastri et al., 2024).
Foundational research: Provides scalable benchmarks and methodologies for representation learning, generative modeling, and neural simulation.

Collectively, these methods offer a systematic, data-driven toolkit at the core of modern simulation, digital twin creation, and robot-environment interaction, with ongoing evolution towards broader generality, real-time operation, and interpretability.