- The paper presents a comprehensive survey of Driving World Models, categorizing approaches into 2D scene evolution, 3D occupancy prediction, and scene-free paradigms.
- It details generative techniques for simulation, data augmentation, and anticipative driving, along with evaluation metrics such as FID, FVD, and IoU.
- The survey identifies challenges like data scarcity and inference efficiency while proposing future research directions for robust, end-to-end autonomous driving solutions.
This survey paper presents a comprehensive overview of recent advancements in Driving World Models (DWM), a paradigm focused on predicting scene evolution to enhance autonomous driving systems. The authors categorize existing DWM approaches based on the modalities of predicted scenes (2D, 3D, and scene-free) and summarize their contributions to autonomous driving. Additionally, the paper reviews high-impact datasets and metrics tailored to different tasks within DWM research, discusses current limitations, and proposes future research directions.
The authors categorize DWM approaches based on predicted scene modalities:
- 2D Scene Evolution: These models employ generative techniques, such as autoregressive transformers and diffusion models, to predict photorealistic 2D scene evolution while ensuring physical plausibility. GAIA-1, for example, casts scene evolution as next-token prediction over discretized visual tokens and renders the tokens back into frames with a video diffusion decoder. DriveDreamer advances conditional diffusion frameworks for multi-modal control and synthetic data generation. Fidelity is enhanced by Vista, which builds on Stable Video Diffusion with novel loss functions, and by DrivePhysica, which introduces 3D bounding box coordinate conditions. Temporal consistency is addressed by InfinityDrive, with its multi-resolution spatiotemporal modeling framework, and by DrivingWorld, which uses temporal-aware tokenization and balanced attention strategies. Control conditions, including low-level (actions, trajectories, layouts) and high-level (text, destinations) inputs, are integrated to steer the generated future scenarios, as in GEM and DriveDreamer-2.
- 3D Scene Evolution: These models focus on predicting 3D scene evolution using occupancy grids and point clouds. Occupancy-based methods, such as OccWorld, employ spatial-temporal transformers. OccLLaMA integrates a multi-modal LLM, while RenderWorld tokenizes air and non-air grids. Diffusion-based approaches like OccSora and DOME further improve controllability and generation quality. Efficiency improvements are explored in DFIT-OccWorld and GaussianWorld. Some methods reconstruct occupancy from images, such as DriveWorld and Drive-OccWorld, or derive occupancy pseudo-labels from point clouds, as in UnO, UniWorld, and NeMo. Point cloud-based methods, such as Copilot4D and LidarDM, address the challenges posed by the sparse and unstructured nature of point clouds. Visual point cloud forecasting is explored by ViDAR and HERMES. Multi-sensor fusion is achieved in MUVO, BEVWorld, and HoloDrive.
- Scene-Free Paradigms: These models explore predictions without detailed scenes, focusing on latent state transitions and agent-centric motion dynamics. Reinforcement Learning-based planners often leverage latent DWM, as seen in Think2Drive and LAW. Multi-agent behavior prediction is addressed in TrafficBots, CarFormer, and AdaptiveDriver.
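The autoregressive, next-token formulation used by 2D methods such as GAIA-1 can be sketched as below. Everything here is a toy stand-in: the uniform `toy_model`, the codebook size, and the tokens-per-frame count are illustrative assumptions, not any published model's design.

```python
import random

VOCAB_SIZE = 256       # size of the discrete visual codebook (illustrative)
TOKENS_PER_FRAME = 16  # tokens per frame after quantization (illustrative)

def toy_model(sequence):
    """Stand-in for a learned transformer: uniform next-token distribution.
    A real model would condition on the token history and control inputs."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

def generate_frames(context_tokens, n_frames, rng):
    """Autoregressively extend a flattened token sequence, frame by frame."""
    seq = list(context_tokens)
    frames = []
    for _ in range(n_frames):
        frame = []
        for _ in range(TOKENS_PER_FRAME):
            probs = toy_model(seq)
            tok = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
            seq.append(tok)
            frame.append(tok)
        frames.append(frame)
    return frames

rng = random.Random(0)
observed = [rng.randrange(VOCAB_SIZE) for _ in range(TOKENS_PER_FRAME)]  # one past frame
future = generate_frames(observed, n_frames=3, rng=rng)
```

In the real setting, a diffusion decoder then maps each predicted token frame back to pixels; conditioning on actions or text amounts to prepending control tokens to the context.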
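To make the 3D occupancy-forecasting task concrete, the sketch below advances a bird's-eye-view occupancy slice under a constant-velocity shift. This baseline dynamics model is an illustrative assumption, not a method from the survey; learned models replace the shift with spatial-temporal transformers or diffusion.

```python
def shift_occupancy(grid, dx):
    """One-step occupancy forecast under a constant-velocity assumption:
    every occupied cell moves dx cells along x; cells leaving the grid vanish.
    grid is a 2D list of 0/1 values (a BEV slice of a 3D occupancy volume)."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x] and 0 <= x + dx < w:
                out[y][x + dx] = 1
    return out

def rollout(grid, dx, steps):
    """Multi-step forecast: repeatedly apply the one-step model."""
    preds = []
    for _ in range(steps):
        grid = shift_occupancy(grid, dx)
        preds.append(grid)
    return preds
```

Forecasts of this form are what IoU-style metrics score against ground-truth future occupancy.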
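Scene-free latent prediction reduces to a transition model z_{t+1} = f(z_t, a_t) that a planner rolls out over candidate action sequences. The linear dynamics below are a deliberately simple stand-in for the learned networks used in practice.

```python
def step(z, a, A, B):
    """One latent transition z' = A @ z + B @ a (linear toy dynamics;
    learned latent DWMs replace this with a neural network)."""
    n = len(z)
    return [
        sum(A[i][j] * z[j] for j in range(n))
        + sum(B[i][k] * a[k] for k in range(len(a)))
        for i in range(n)
    ]

def imagine(z0, actions, A, B):
    """Roll out a latent trajectory for a candidate action sequence,
    as an RL-based planner would when evaluating plans."""
    traj = [z0]
    for a in actions:
        traj.append(step(traj[-1], a, A, B))
    return traj
```

Because no pixels or voxels are decoded, such rollouts are cheap, which is why RL-based planners favor this paradigm.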
The paper also discusses applications of DWM in autonomous driving:
- Simulation: DWM can simulate the driving process conditioned on various input forms, such as actions and captions, to address the limitations of traditional simulators.
- Data Generation: DWM can generate diverse driving videos using the same annotations to augment datasets, improving scenario coverage and mitigating the gap with real-world data.
- Anticipative Driving: By predicting future states, DWM enhances a vehicle's planning capabilities and improves safety and adaptability.
- 4D Pre-training: DWM leverages large amounts of unlabeled multi-modal data for 4D pre-training, enhancing the performance of downstream driving tasks.
Datasets reviewed include CARLA, KITTI, KITTI-360, Waymo Open, nuScenes, Argoverse2, OpenScene, OpenDV-2K, and DrivingDojo. Metrics discussed include Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), Chamfer Distance (CD), Intersection over Union (IoU), Mean IoU (mIoU), Displacement Error (L2), Collision Rate, Route Completion (RC), Infraction Score (IS), and Driving Score (DS).
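Two of the geometric metrics listed above are easy to state directly. The sketch below implements IoU over occupied-cell sets and one common variant of the symmetric Chamfer Distance (mean squared nearest-neighbor distance, summed over both directions); note that CD conventions vary across papers.

```python
def iou(pred, gt):
    """Intersection over Union between two sets of occupied cell indices."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0

def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer Distance between two 2D point sets:
    mean squared nearest-neighbor distance, accumulated in both directions."""
    def one_way(src, dst):
        return sum(
            min((sx - dx) ** 2 + (sy - dy) ** 2 for dx, dy in dst)
            for sx, sy in src
        ) / len(src)
    return one_way(pts_a, pts_b) + one_way(pts_b, pts_a)
```

IoU (per class, then averaged as mIoU) scores occupancy forecasts, while CD scores point cloud forecasts; FID and FVD, by contrast, compare feature distributions and require a pretrained network.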
The authors identify limitations of current DWM research: data scarcity, inference inefficiency, the difficulty of reliable simulation, hallucinations and unrealistic physics, the lack of unified task understanding (e.g., incorporating language tasks), limited multi-sensor modeling, and vulnerability to adversarial attacks. Future research directions include data augmentation, efficient representations, robust generalization, cross-modal validation, end-to-end DWM, leveraging unaligned multi-sensor data, and developing defense strategies against adversarial attacks.