Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models (2506.09042v3)

Published 10 Jun 2025 in cs.CV

Abstract: Collecting and annotating real-world data for safety-critical physical AI systems, such as autonomous vehicles (AVs), is time-consuming and costly. It is especially challenging to capture rare edge cases, which play a critical role in training and testing of an AV system. To address this challenge, we introduce Cosmos-Drive-Dreams, a synthetic data generation (SDG) pipeline that aims to generate challenging scenarios to facilitate downstream tasks such as perception and driving policy training. Powering this pipeline is Cosmos-Drive, a suite of models specialized from the NVIDIA Cosmos world foundation model for the driving domain that are capable of controllable, high-fidelity, multi-view, and spatiotemporally consistent driving video generation. We showcase the utility of these models by applying Cosmos-Drive-Dreams to scale the quantity and diversity of driving datasets with high-fidelity and challenging scenarios. Experimentally, we demonstrate that our generated data helps in mitigating long-tail distribution problems and enhances generalization in downstream tasks such as 3D lane detection, 3D object detection and driving policy learning. We open source our pipeline toolkit, dataset and model weights through NVIDIA's Cosmos platform. Project page: https://research.nvidia.com/labs/toronto-ai/cosmos_drive_dreams

Summary

The paper introduces a scalable synthetic data pipeline that creates spatiotemporally consistent driving videos for robust autonomous vehicle training.
It details specialized Cosmos-Drive models that generate single-view video from HDMap and LiDAR depth conditions, expand single-view video to multi-view, annotate in-the-wild video with HDMap and LiDAR depth, and generate LiDAR data.
Empirical results show improved performance in 3D lane detection, object detection, and trajectory prediction under challenging conditions.

Overview of "Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models"

The paper introduces "Cosmos-Drive-Dreams," a synthetic data generation pipeline designed to address the challenges of training autonomous vehicle (AV) systems, particularly in capturing rare edge scenarios. This pipeline leverages "Cosmos-Drive," an advanced suite of models derived from NVIDIA's Cosmos World Foundation Models, specifically optimized for the driving domain. The focus is on generating high-fidelity, multi-view, and spatiotemporally consistent driving videos that are crucial for improving downstream tasks such as perception modeling, 3D lane detection, 3D object detection, and driving policy learning.

Cosmos-Drive Models

The Cosmos-Drive suite comprises several models, each with distinct functionalities tailored to the driving domain:

Cosmos-Transfer1-7B-Sample-AV: Specializes in single-view video generation controlled by precise layouts, such as HDMaps and LiDAR depth videos. This model ensures geometric fidelity and flexibility in simulating various driving scenarios.
Cosmos-7B-Single2Multiview-Sample-AV: Facilitates the expansion of single-view videos into multi-view formats, maintaining visual consistency across multiple perspectives. This capability is vital for comprehensive AV training datasets.
Cosmos-7B-Annotate-Sample-AV: Capable of annotating in-the-wild driving videos with HDMap and LiDAR depth, broadening data accessibility by converting raw video inputs into rich semantic representations.
Cosmos-7B-LiDAR-GEN-Sample-AV: Extends Cosmos models to generate high-quality LiDAR data, enhancing simulation fidelity, especially for scenarios affected by environmental factors like weather.

Synthetic Data Generation Pipeline

The Cosmos-Drive-Dreams pipeline generates synthetic datasets through a structured process:

Generation Control: Starting from HDMap projections or annotated in-the-wild video, the system conditions video generation on structured inputs.
Prompt Rewriting: Uses an LLM to increase scenario diversity by varying environmental attributes such as weather and time of day.
Multi-view Expansion: Utilizes Cosmos models to generate multi-perspective videos essential for reliable AV systems.
Quality Assurance: Uses a vision-language model (VLM) for automated rejection sampling, filtering out generated clips that fall short on realism or quality.
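
The four stages above can be sketched as a small Python driver. This is a minimal illustration rather than the released toolkit: every function name, the `Clip` record, the attribute lists, and the six-view default are hypothetical placeholders, the model calls are stubbed out, and the stage order simply follows the list above.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List

# Hypothetical attribute values; the source only names weather and time of day.
WEATHER = ["sunny", "rainy", "foggy", "snowy"]
TIME_OF_DAY = ["morning", "night"]


@dataclass(frozen=True)
class Clip:
    condition_id: str  # HDMap projection or annotated video that conditions generation
    prompt: str        # text prompt after rewriting
    views: int         # number of camera views in the clip


def rewrite_prompts(base_prompt: str) -> List[str]:
    """Stand-in for the LLM rewriter: enumerate weather x time-of-day variants."""
    return [f"{base_prompt}, {w}, {t}" for w, t in product(WEATHER, TIME_OF_DAY)]


def generate_single_view(condition_id: str, prompt: str) -> Clip:
    """Stub for layout-conditioned single-view video generation."""
    return Clip(condition_id, prompt, views=1)


def expand_to_multiview(clip: Clip, num_views: int) -> Clip:
    """Stub for single-view to multi-view expansion."""
    return Clip(clip.condition_id, clip.prompt, views=num_views)


def run_pipeline(
    condition_id: str,
    base_prompt: str,
    accept: Callable[[Clip], bool],
    num_views: int = 6,  # assumed camera count, not taken from the source
) -> List[Clip]:
    """Generate one clip per rewritten prompt and keep those the filter accepts."""
    kept = []
    for prompt in rewrite_prompts(base_prompt):
        clip = expand_to_multiview(generate_single_view(condition_id, prompt), num_views)
        if accept(clip):  # stand-in for the VLM accept/reject decision
            kept.append(clip)
    return kept


if __name__ == "__main__":
    clips = run_pipeline(
        "hdmap_0001",
        "A car drives through an intersection",
        accept=lambda c: "foggy" not in c.prompt,  # toy filter for demonstration
    )
    for c in clips:
        print(c.views, c.prompt)
```

Passing the filter in as a callable keeps the sketch independent of any particular VLM; a real filter would inspect the generated video rather than the prompt text.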

Empirical Evaluation

The paper reports performance improvements across key AV tasks:

3D Lane Detection: Incorporating synthetic data enhances detection accuracy, particularly in challenging conditions like rain or fog.
3D Object Detection: Utilizes synthetic data to augment training sets, leading to improved detection metrics, as demonstrated through experiments on large real-world datasets such as Waymo Open and RDS-HQ.
Policy Learning: Demonstrates measurable gains in trajectory prediction accuracy, suggesting the synthetic data's efficacy in improving model robustness.
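
As a rough illustration of augmenting a real training set with synthetic clips, the sketch below keeps every real sample and adds synthetic ones until they make up a chosen fraction of the mix. The function name, the `synth_ratio` parameter, and the file-name strings are assumptions for illustration; this summary does not state which mixing ratios the paper uses.

```python
import random
from typing import List, Sequence


def mix_training_set(
    real: Sequence[str],
    synthetic: Sequence[str],
    synth_ratio: float,  # hypothetical knob: target fraction of synthetic samples
    seed: int = 0,
) -> List[str]:
    """Return all real samples plus enough synthetic ones to reach synth_ratio."""
    if not 0.0 <= synth_ratio < 1.0:
        raise ValueError("synth_ratio must be in [0, 1)")
    # Solve n_synth / (n_real + n_synth) = synth_ratio for n_synth.
    n_synth = round(len(real) * synth_ratio / (1.0 - synth_ratio))
    n_synth = min(n_synth, len(synthetic))  # cannot draw more than are available
    rng = random.Random(seed)
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    real_clips = [f"real_{i}.mp4" for i in range(100)]
    synth_clips = [f"synth_{i}.mp4" for i in range(200)]
    mixed = mix_training_set(real_clips, synth_clips, synth_ratio=0.5)
    print(len(mixed))
```

Fixing the seed makes the sampled subset reproducible, so runs that differ only in `synth_ratio` can be compared directly.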

Implications and Future Directions

The Cosmos-Drive-Dreams pipeline exemplifies how synthetic data generation can alleviate data scarcity issues in AV training, particularly for long-tail, safety-critical conditions. The availability of customizable tools and open-source resources further supports the practical deployment and continuous enhancement of AV systems. Future developments might focus on optimizing the computational efficiency of diffusion-based generation processes and broadening the application of Cosmos models to other domains requiring high-fidelity video synthesis.

Overall, the work underscores the promise of using foundation models to generate diverse, high-quality synthetic datasets that propel advancements in autonomous vehicle technologies and broader AI systems.
