- The paper introduces a novel pipeline that integrates video diffusion with hybrid Gaussian splatting to generate geometrically consistent 4D driving scenes.
- The method employs self-supervised scene decomposition, separating static backgrounds from dynamic objects, and improves visual fidelity metrics (FID and FVD) by roughly 30%.
- Results on the nuScenes dataset demonstrate practical benefits for autonomous driving, including potential collision metric reductions of up to 25%.
Generative 4D Scene Modeling for Autonomous Driving
The paper "DreamDrive: Generative 4D Scene Modeling from Street View Images" introduces a method for synthesizing the dynamic driving scenes needed to train autonomous driving systems. It addresses key limitations of existing generative and reconstruction-based approaches, which have struggled with scalability and scene consistency.
At the core of the DreamDrive framework is the combination of generative video diffusion models and Gaussian splatting, yielding a pipeline that can generate and render 4D (spatial-temporal) scenes with strong geometric consistency and visual fidelity. The approach first uses video diffusion models trained on street-view data to generate a sequence of 2D visual references, which are then lifted into a 4D scene using a novel hybrid Gaussian representation. This representation separates static backgrounds from dynamic objects, and the separation is learned in a self-supervised manner, removing the need for manually annotated data and allowing the method to generalize to diverse in-the-wild driving scenarios.
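The exact parameterization is best taken from the paper itself, but a minimal sketch helps make the hybrid idea concrete: static background Gaussians carry no time dependence, while dynamic-object Gaussians are warped to each frame before the two sets are merged for rendering. The names below (StaticGaussians, DynamicGaussians, at_time, gaussians_at_time) are illustrative, and the per-object rigid-motion model is an assumption rather than DreamDrive's actual deformation scheme.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class StaticGaussians:
    """Time-independent Gaussians modeling the static background."""
    means: np.ndarray        # (N, 3) world-space centers
    covariances: np.ndarray  # (N, 3, 3) anisotropic extents
    colors: np.ndarray       # (N, 3) RGB (real systems typically use SH coefficients)
    opacities: np.ndarray    # (N,)


@dataclass
class DynamicGaussians:
    """Gaussians whose pose depends on the frame index t (per-object rigid motion assumed)."""
    canonical_means: np.ndarray  # (M, 3) centers in a canonical frame
    covariances: np.ndarray      # (M, 3, 3)
    colors: np.ndarray           # (M, 3)
    opacities: np.ndarray        # (M,)
    rotations: np.ndarray        # (T, 3, 3) assumed rigid rotation per frame
    translations: np.ndarray     # (T, 3) assumed rigid translation per frame

    def at_time(self, t: int) -> np.ndarray:
        """Warp canonical centers to their pose at frame t."""
        return self.canonical_means @ self.rotations[t].T + self.translations[t]


def gaussians_at_time(static: StaticGaussians, dynamic: DynamicGaussians, t: int):
    """Union of static and time-warped dynamic Gaussians, ready for splatting frame t."""
    means = np.concatenate([static.means, dynamic.at_time(t)], axis=0)
    covariances = np.concatenate([static.covariances, dynamic.covariances], axis=0)
    colors = np.concatenate([static.colors, dynamic.colors], axis=0)
    opacities = np.concatenate([static.opacities, dynamic.opacities], axis=0)
    return means, covariances, colors, opacities
```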
The paper reports strong numerical results on the nuScenes dataset and a variety of in-the-wild street views, highlighting the system's ability to generate high-quality, 3D-consistent novel driving scenes. This is achieved through the self-supervised scene decomposition, in which static features remain time-invariant while dynamic elements are modeled with explicit time dependence. The method shows substantial gains over prior work, including roughly 30% better visual quality as measured by FID and FVD, which the authors attribute to precise geometric detail and robust dynamic-object modeling.
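The paper's precise objective is not reproduced here, but the sketch below shows how such a decomposition can be supervised purely photometrically, assuming a differentiable Gaussian renderer whose static-only and hybrid outputs are compared against the diffusion-generated reference frames. The function name decomposition_loss, the dyn_logits gating, and the sparsity term are assumptions of this sketch, not the authors' exact formulation.

```python
import torch


def decomposition_loss(render_static, render_hybrid, frames, dyn_logits,
                       sparsity_weight=0.01):
    """Illustrative self-supervised decomposition objective (not the paper's exact loss).

    render_static: (T, H, W, 3) frames rendered from time-invariant Gaussians only
    render_hybrid: (T, H, W, 3) frames rendered from static + dynamic Gaussians
    frames:        (T, H, W, 3) reference frames produced by the video diffusion model
    dyn_logits:    (N,) per-Gaussian logits that the (assumed) renderer uses to gate
                   each Gaussian between the static and dynamic branches
    """
    # The full hybrid rendering must reproduce the generated reference video.
    recon = torch.mean((render_hybrid - frames) ** 2)

    # The static branch alone should explain whatever does not move; its residual
    # concentrates on moving regions and pushes those Gaussians to the dynamic branch.
    static_residual = torch.mean((render_static - frames) ** 2)

    # Prefer assigning as few Gaussians as possible to the dynamic branch, so time
    # dependence is introduced only where photometric evidence demands it.
    sparsity = torch.sigmoid(dyn_logits).mean()

    return recon + static_residual + sparsity_weight * sparsity
```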
Practical implications for autonomous driving include richer synthetic data for training self-driving perception and planning models. Because the system can synthesize scenarios from diverse geographical areas, it supports training beyond controlled environments. DreamDrive also supports trajectory planning by providing a consistent way to evaluate a planner's outputs against dynamically generated scenes, potentially reducing collision metrics by up to 25% when applied to existing planning models.
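As a rough illustration of how a generated 4D scene might be used to score planner outputs, the hypothetical helper below checks a planned ego trajectory against per-frame dynamic-object boxes extracted from the scene; collision_rate, ego_radius, and the box format are assumptions of this sketch, not the paper's evaluation protocol.

```python
import numpy as np


def collision_rate(planned_xy, boxes_per_frame, ego_radius=1.0):
    """Fraction of frames in which the planned ego position intersects a dynamic
    object box taken from the generated 4D scene (illustrative metric only).

    planned_xy:      (T, 2) ego positions output by the planner
    boxes_per_frame: list of T arrays, each (K_t, 4) as [x_min, y_min, x_max, y_max]
    """
    hits = 0
    for t, boxes in enumerate(boxes_per_frame):
        x, y = planned_xy[t]
        # Inflate each box by the ego radius and test point containment.
        inside = ((boxes[:, 0] - ego_radius <= x) & (x <= boxes[:, 2] + ego_radius) &
                  (boxes[:, 1] - ego_radius <= y) & (y <= boxes[:, 3] + ego_radius))
        hits += int(inside.any())
    return hits / max(len(boxes_per_frame), 1)
```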
Theoretically, the DreamDrive framework offers new insight into combining generative priors with explicit scene rendering techniques. Its self-supervised decomposition and hybrid Gaussian representation point toward resolving long-standing challenges in dynamic scene generation, particularly in settings that lack rich annotations.
Looking forward, this research could pave the way for broader applications in AI, particularly as autonomous systems move into increasingly complex and less predictable environments. Future work might integrate this approach with real-time sensor data on board autonomous vehicles, further improving scene fidelity and system reliability in operation. Extending the framework to more varied terrains and lighting conditions could also improve its robustness and applicability in outdoor AI systems. The synergy between strong generative models and precise scene rendering demonstrated by DreamDrive is poised to be a valuable asset in the ongoing development of autonomous systems.