- The paper introduces a scalable synthetic data pipeline that creates spatiotemporally consistent driving videos for robust autonomous vehicle training.
- It details specialized Cosmos-Drive models that generate video from HDMap projections, expand single-view video into multi-view video, annotate in-the-wild footage, and generate high-quality LiDAR data.
- Empirical results show improved performance in 3D lane detection, object detection, and trajectory prediction under challenging conditions.
Overview of "Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models"
The paper introduces "Cosmos-Drive-Dreams," a synthetic data generation pipeline designed to address the challenges of training autonomous vehicle (AV) systems, particularly in capturing rare edge scenarios. This pipeline leverages "Cosmos-Drive," an advanced suite of models derived from NVIDIA's Cosmos World Foundation Models, specifically optimized for the driving domain. The focus is on generating high-fidelity, multi-view, and spatiotemporally consistent driving videos that are crucial for improving downstream tasks such as perception modeling, 3D lane detection, 3D object detection, and driving policy learning.
Cosmos-Drive Models
The Cosmos-Drive suite comprises several models, each with distinct functionalities tailored to the driving domain:
- Cosmos-Transfer1-7B-Sample-AV: Specializes in single-view video generation controlled by precise layouts, such as HDMaps and LiDAR depth videos. This model ensures geometric fidelity and flexibility in simulating various driving scenarios.
- Cosmos-7B-Single2Multiview-Sample-AV: Facilitates the expansion of single-view videos into multi-view formats, maintaining visual consistency across multiple perspectives. This capability is vital for comprehensive AV training datasets.
- Cosmos-7B-Annotate-Sample-AV: Annotates in-the-wild driving videos with HDMap and LiDAR depth, broadening data accessibility by converting raw video inputs into rich semantic representations.
- Cosmos-7B-LiDAR-GEN-Sample-AV: Extends Cosmos models to generate high-quality LiDAR data, enhancing simulation fidelity, especially for scenarios affected by environmental factors like weather.
Synthetic Data Generation Pipeline
The Cosmos-Drive-Dreams pipeline generates synthetic datasets through a structured process:
- Generation Control: Starting from HDMap projections or annotated in-the-wild video, the system conditions video generation on structured inputs.
- Prompt Rewriting: An LLM rewrites the text prompt to vary environmental attributes such as weather and time of day, which increases scenario diversity.
- Multi-view Expansion: Expands each generated single-view video into multiple consistent camera views, which multi-camera AV systems require.
- Quality Assurance: Uses a vision-language model (VLM) for automated rejection sampling, discarding generated videos that fall short on realism or quality.
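The four stages listed above can be sketched as a simple generate-and-filter loop. This is a minimal illustration, not the paper's implementation: every function name, the six-view default, the acceptance threshold, and the retry limit are hypothetical, and the random `quality` score stands in for a real VLM judgment. The stages run in the order the list gives them.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    layout_id: str
    prompt: str
    views: int
    quality: float  # hypothetical stand-in for a VLM realism score in [0, 1)


def rewrite_prompt(base_prompt: str, rng: random.Random) -> str:
    """Stand-in for LLM prompt rewriting: vary weather and time of day."""
    weather = rng.choice(["rain", "fog", "snow", "clear sky"])
    time_of_day = rng.choice(["dawn", "noon", "dusk", "night"])
    return f"{base_prompt}, {weather}, {time_of_day}"


def generate_single_view(layout_id: str, prompt: str, rng: random.Random) -> Clip:
    """Stand-in for layout-conditioned single-view video generation."""
    return Clip(layout_id=layout_id, prompt=prompt, views=1, quality=rng.random())


def expand_to_multiview(clip: Clip, num_views: int) -> Clip:
    """Stand-in for single-view to multi-view expansion."""
    return Clip(clip.layout_id, clip.prompt, num_views, clip.quality)


def vlm_accepts(clip: Clip, threshold: float) -> bool:
    """Stand-in for the VLM quality check used in rejection sampling."""
    return clip.quality >= threshold


def generate_dataset(
    layout_ids: List[str],
    base_prompt: str,
    num_views: int = 6,       # assumed camera count, not from the source
    threshold: float = 0.5,   # assumed acceptance threshold
    max_attempts: int = 5,    # assumed retry budget per layout
    seed: int = 0,
) -> List[Clip]:
    rng = random.Random(seed)
    accepted: List[Clip] = []
    for layout_id in layout_ids:
        # Rejection sampling: regenerate until the filter accepts or the budget runs out.
        for _ in range(max_attempts):
            prompt = rewrite_prompt(base_prompt, rng)
            clip = generate_single_view(layout_id, prompt, rng)
            clip = expand_to_multiview(clip, num_views)
            if vlm_accepts(clip, threshold):
                accepted.append(clip)
                break
    return accepted
```

Calling `generate_dataset(["scene_a", "scene_b"], "a car drives through an intersection")` returns at most one accepted multi-view clip per layout; layouts whose samples are all rejected are simply skipped.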
Empirical Evaluation
The paper reports performance improvements across key AV tasks:
- 3D Lane Detection: Incorporating synthetic data enhances detection accuracy, particularly in challenging conditions like rain or fog.
- 3D Object Detection: Augmenting training sets with synthetic data improves detection metrics, as demonstrated through experiments on large real-world datasets such as Waymo Open and RDS-HQ.
- Policy Learning: Training with synthetic data yields measurable gains in trajectory prediction accuracy, suggesting that it improves model robustness.
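All three results rest on augmenting a real training set with synthetic clips. The sketch below shows one simple way such a mix could be assembled. It is an assumption-laden illustration rather than the paper's procedure: the function name `mix_training_set` and the `synthetic_per_real` parameter are hypothetical, and this summary does not state which mixing ratios the authors used.

```python
import random
from typing import List, Sequence, Tuple


def mix_training_set(
    real: Sequence[str],
    synthetic: Sequence[str],
    synthetic_per_real: float,  # hypothetical knob: synthetic clips per real clip
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return a shuffled list of (source, clip_id) pairs.

    All real clips are kept; synthetic clips are subsampled without
    replacement, capped by how many are available.
    """
    if synthetic_per_real < 0:
        raise ValueError("synthetic_per_real must be non-negative")
    rng = random.Random(seed)
    n_synthetic = min(len(synthetic), int(synthetic_per_real * len(real)))
    picked = rng.sample(list(synthetic), n_synthetic)
    mixed = [("real", clip) for clip in real]
    mixed += [("synthetic", clip) for clip in picked]
    rng.shuffle(mixed)
    return mixed
```

Sweeping `synthetic_per_real` over a few values and retraining at each one is a straightforward way to check how much synthetic data actually helps a given task.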
Implications and Future Directions
The Cosmos-Drive-Dreams pipeline exemplifies how synthetic data generation can alleviate data scarcity issues in AV training, particularly for long-tail, safety-critical conditions. The availability of customizable tools and open-source resources further supports the practical deployment and continuous enhancement of AV systems. Future developments might focus on optimizing the computational efficiency of diffusion-based generation processes and broadening the application of Cosmos models to other domains requiring high-fidelity video synthesis.
Overall, the work underscores the promise of using foundation models to generate diverse, high-quality synthetic datasets for autonomous vehicle development and, potentially, for broader AI systems.