
SceneDiffuser++: City-Scale Traffic Simulation

Updated 3 July 2025
  • SceneDiffuser++ is a generative diffusion model that synthesizes complex urban scenes by integrating dynamic agents, traffic lights, and environment elements.
  • It employs a unified, multi-tensor architecture with Transformer-based denoising to model spatial-temporal dependencies across complete traffic scenarios.
  • The system enables realistic, controllable trip simulation with dynamic agent spawning, occlusion management, and continuous state evolution for urban planning and AV validation.

SceneDiffuser++ denotes a class of generative models based on diffusion processes for comprehensive scene synthesis and simulation, including city-scale, trip-level traffic simulation, unified agent and environment modeling, and dynamic scene generation. The SceneDiffuser++ approach, as exemplified in "SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model" (2506.21976), advances the field by integrating the generation and evolution of all main traffic scene components (vehicles, pedestrians, traffic lights) within a single, end-to-end differentiable diffusion model, leveraging a multi-tensor sparse architecture and Transformer-based denoising. This positions SceneDiffuser++ as the first system to demonstrate realistic, controllable, and continuous trip simulation from point A to point B across full city maps, superseding earlier paradigms that lacked dynamic spawning/removal, occlusion management, or coherent environment control.

1. Unified Generative World Model

SceneDiffuser++ employs a generative world model architecture constructed around denoising diffusion probabilistic models (DDPMs). At its core, the model represents the state of an urban or traffic scene as a set of heterogeneous tensors $\bm{x}_i \in \mathbb{R}^{E_i \times T \times D_i}$, where $E_i$ is the set size (agents, traffic lights), $T$ the number of simulated timesteps, and $D_i$ the feature dimensionality per type (e.g., agent position, heading, type; traffic light state, location). The collection of different tensors is referred to as the multi-tensor scene state $\mathcal{X} := \{\bm{x}_i\}_i$. All scene types (dynamic agents and environment elements) are projected to a common latent dimension and concatenated for processing.
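As a minimal sketch of this representation (the entity counts, feature widths, and random projections below are made up for illustration; the paper's actual dimensions and learned projections differ), the multi-tensor state and the common-latent concatenation look like:

```python
import numpy as np

# Hypothetical sizes for a toy scene: E_i entities per type, T timesteps,
# D_i features per type (these numbers are illustrative, not the paper's).
T = 16
sizes = {"agents": (64, 7), "traffic_lights": (32, 4)}  # name -> (E_i, D_i)

# Multi-tensor scene state X = {x_i}: one tensor of shape (E_i, T, D_i) per type.
scene = {name: np.zeros((E, T, D)) for name, (E, D) in sizes.items()}

# Project every type to a common latent width and concatenate along the
# entity axis, so a single denoiser can attend across all scene elements.
D_latent = 8
rng = np.random.default_rng(0)
proj = {name: rng.normal(size=(D, D_latent)) for name, (_, D) in sizes.items()}
latent = np.concatenate(
    [scene[name] @ proj[name] for name in sizes], axis=0
)  # shape (sum_i E_i, T, D_latent)

print(latent.shape)  # (96, 16, 8)
```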

Central to the architecture is a Transformer-based denoiser with axial-attention, enabling cross-type and temporal dependencies to be modeled holistically. Each entity (agent or environment element) includes a validity (or visibility) channel, supporting dynamic population, removal, and occlusion handling.
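The axial-attention pattern can be illustrated with a toy single-head sketch that attends along one axis at a time (no learned query/key/value projections here, which the real denoiser certainly has; this only shows how temporal and cross-entity attention factorize):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def axial_self_attention(x, axis):
    """Toy single-head self-attention along one axis of x (entities, T, D).

    Queries/keys/values are the raw features themselves; a real model
    would use learned projections and multiple heads.
    """
    x_moved = np.moveaxis(x, axis, -2)  # put the attended axis at position -2
    scores = x_moved @ np.swapaxes(x_moved, -1, -2) / np.sqrt(x.shape[-1])
    out = softmax(scores, axis=-1) @ x_moved
    return np.moveaxis(out, -2, axis)

E, T, D = 6, 10, 8
x = np.random.default_rng(1).normal(size=(E, T, D))
h = axial_self_attention(x, axis=1)  # temporal: each entity attends over time
h = axial_self_attention(h, axis=0)  # cross-entity: each timestep across entities
print(h.shape)  # (6, 10, 8)
```

Factoring attention along the two axes keeps cost at O(E·T²+T·E²) rather than O((E·T)²) for joint attention over all entity-timestep pairs.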

The entire system is trained end-to-end under a single, unified diffusion denoising loss, thus obviating the need for task-specific modules or disparate objectives for different scene components.

2. Simulation and Dynamic Scene Evolution Mechanism

SceneDiffuser++ supports full trip simulation by repeated autoregressive rollouts of the generative diffusion model over city-scale maps. The process is as follows:

  1. Scene Initialization: Given a large map region, the model generates an initial scene containing agents and environment configuration.
  2. Autoregressive Rollout: At each simulation timestep, the model processes the current (possibly partially masked) scene and predicts the next state of all scene elements, including features and validity channels.
  3. Agent Spawning and Removal: Validity predictions allow the model to dynamically introduce new agents as the AV traverses new map regions, and remove or occlude agents as required by physical layout or traffic evolution.
  4. Traffic Light and Environment Interaction: Traffic lights are not statically placed but have their own state transitions, location control, and can be correlated with agent behavior. The model learns realistic phase diagrams and state transition likelihoods for traffic signals based on data.
  5. Sparse Tensor Generation: During inference, soft clipping is applied to the output tensors according to the predicted validity:

$\hat{\bm{x}}_t \leftarrow \mathrm{Concat}(\mathcal{V}(\hat{\bm{x}}_t) * \mathcal{M}(\hat{\bm{x}}_t),\ \mathcal{M}(\hat{\bm{x}}_t))$

where $\mathcal{V}$ extracts valid feature values and $\mathcal{M}$ is the validity mask, enabling continuous and reliable scene evolution even when most slots are invalid due to dynamic changes.
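The soft-clipping step can be sketched in a few lines (the feature layout below, with the validity logit as the last channel of each entity, is an assumption for the example):

```python
import numpy as np

# Validity-based soft clipping at inference: zero out features of invalid
# slots and re-append the mask so downstream rollout steps see which slots
# are live. Layout assumption: last channel is the predicted validity logit.
rng = np.random.default_rng(2)
x_hat = rng.normal(size=(5, 4, 3 + 1))  # (entities, T, features + validity)

feats, validity_logit = x_hat[..., :-1], x_hat[..., -1:]
mask = (validity_logit > 0).astype(x_hat.dtype)  # hard validity for the sketch

x_clipped = np.concatenate([feats * mask, mask], axis=-1)
print(x_clipped.shape)  # (5, 4, 4)
```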

3. Training Procedure and Losses

The training paradigm is fully end-to-end and leverages masked, multi-task diffusion denoising:

  • Masked Diffusion Loss: At each step, parts of history or future are masked (randomly, enabling context inpainting), and the network must correctly inpaint masked entities or features.
  • Loss Function:

$\mathbb{E}_{(\bm{x}, \mathcal{C}) \sim \mathcal{D},\, t \sim \mathcal{U}(0,1),\, \bm{m} \sim \mathcal{M},\, \bm{\epsilon}_t \sim \mathcal{N}(0, \bm{I})} \Big[ \| (\hat{\bm{v}}_\theta(\bm{z}_t, t, \mathcal{C}) - \bm{v}_t(\bm{\epsilon}_t, \bm{x})) \cdot \bm{w} \|_2^2 \Big]$

where $\bm{w}$ selectively applies the loss to valid features, or just to the validity score when the element should be inactive.

  • Sparse Tensor Masking: The training process is made stable and efficient by applying losses only to valid features, thus directly learning when and how to spawn, deactivate, or occlude scene elements.
  • Diffusion Denoising Formulation:

$q(\bm{z}_s \mid \bm{z}_t, \bm{x}) = \mathcal{N}(\bm{z}_s;\ \bm{\mu}_{t\rightarrow s},\ \sigma_{t\rightarrow s}^2 \bm{I})$
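A toy version of this weighted, validity-aware loss can be written as follows. The v-prediction target matches the loss above; the cosine noise schedule and all shapes are illustrative assumptions, and the random `v_pred` stands in for the denoiser output:

```python
import numpy as np

# Illustrative shapes; D includes a final validity channel.
rng = np.random.default_rng(3)
E, T, D = 4, 8, 5
x = rng.normal(size=(E, T, D))
valid = rng.random((E, T)) > 0.3     # ground-truth validity per slot

t = 0.4                               # diffusion time in [0, 1]
alpha, sigma = np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)  # assumed schedule
eps = rng.normal(size=x.shape)
z_t = alpha * x + sigma * eps         # noised sample
v_target = alpha * eps - sigma * x    # v-prediction target

v_pred = rng.normal(size=x.shape)     # stand-in for the denoiser output

# Weight w: feature channels count only on valid slots; the validity
# channel itself is always supervised, so the model learns when to spawn
# or deactivate elements.
w = np.repeat(valid[..., None], D, axis=-1).astype(float)
w[..., -1] = 1.0

loss = np.mean((w * (v_pred - v_target)) ** 2)
print(float(loss))
```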

4. Evaluation Framework and Metrics

SceneDiffuser++ is assessed on a city-scale version of the Waymo Open Motion Dataset (WOMD-XLMap), in which the simulation tracks trips over extended periods (hundreds of seconds, kilometers of real and synthetic travel). Evaluation employs a windowed, sliding measurement of statistical fidelity against held-out log data, capturing both short- and long-term realism.

Metrics include:

  • Jensen-Shannon (JS) divergence between simulated and logged distributions for agent counts, entering/exiting events, speed, offroad/collision rates, and traffic light statistics.
  • Composite scores averaging multiple metrics, as well as specific scenario statistics (trajectory smoothness, agent diversity, environment alignment).
  • Qualitative analysis demonstrates the model’s ability to maintain plausible traffic flow, realistic traffic light cycles, absence of agent/traffic “freezing,” and reasonable agent spawning/removal for continuous simulation.
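The distributional metrics above can be illustrated with a small Jensen-Shannon divergence helper (the histogram bins below are invented for the example; the actual evaluation compares windowed simulation statistics against logged data):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()   # normalize histograms to distributions
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: simulated vs. logged agent-count histograms (made-up bins).
sim = [10, 25, 40, 20, 5]
log_ = [12, 22, 38, 22, 6]
print(round(js_divergence(sim, log_), 4))
```

With base-2 logarithms the JS divergence lies in [0, 1], so per-metric scores are directly comparable before averaging into a composite score.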

5. Challenges Addressed and Technical Distinctions

SceneDiffuser++ resolves several historical limitations:

  • Dynamic population: Previous models could not spawn or remove agents; SceneDiffuser++ uses a validity channel and sparse mask loss for this ability.
  • Environment simulation: It models not just agents but also dynamic environment elements (traffic lights) and their evolving states.
  • Long-horizon, city-scale realism: The model sustains realistic, non-repetitive simulation over entire trips, unlike models that degenerate over long rollouts.
  • Unified Objective and Architecture: Joint training and inference over all scene elements—agents and environment—using a single backbone and loss greatly improves synergy, efficiency, and maintainability compared to modular or hand-engineered pipeline designs.

| Model Aspect | SceneDiffuser++ Contribution |
| --- | --- |
| Architecture | Multi-tensor, Transformer-based, sparse diffusion model |
| Entity Modeling | Dynamic agents, traffic lights, and occlusion/validity |
| Dynamic Ability | Spawning/removal of agents; dynamic traffic light animation |
| Training | Single, end-to-end masked diffusion loss |
| Evaluation | Long-horizon, city-scale, distributional, and composite metrics |
| Application | Closed-loop AV simulation, urban planning, synthetic city analytics |

6. Applications and Impact

SceneDiffuser++ holds substantial implications for research and industry:

  • Autonomous vehicle development: Enables closed-loop, city-scale, end-to-end scenario generation for safety validation and policy testing, mitigating the limitations of log-replay or event-based simulation.
  • Urban analytics and planning: Supports investigation of infrastructure policies, emergent behaviors, and rare events under realistic, variable synthetic conditions.
  • Sim-to-real transfer research: High-fidelity, procedural world generation aids data augmentation, robustness analysis, and domain adaptation testing in safety-critical systems.
  • Benchmarking and Methodological Advances: Offers standardized metrics and protocols for rigorous assessment of generative urban simulation models and their long-horizon stability—a previously neglected domain.

In summary, SceneDiffuser++ provides a comprehensive, scalable, and unified generative world model for dynamic, city-scale traffic simulation. It establishes state-of-the-art performance on all major criteria for autonomous vehicle and urban environment simulation, and paves the way for new research in generative AI for real-world systems.
