- The paper presents StreetCrafter, a LiDAR-conditioned video diffusion model that synthesizes precise street views for autonomous driving.
- Its generative prior can be distilled into a dynamic scene representation that supports real-time rendering, and it achieves superior novel view synthesis results on the Waymo Open Dataset and PandaSet.
- The approach enables dynamic scene editing by manipulating LiDAR points, offering flexibility for simulation and testing.
Overview of "StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models"
The paper "StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models" addresses the problem of photorealistic view synthesis from vehicle sensor data, particularly in the domain of autonomous driving. It focuses on overcoming the limitations of previous methods in rendering high-quality autonomous driving scenes, which often deteriorate when the viewpoint significantly deviates from the training trajectory.
Key Contributions
The core innovation of the paper is StreetCrafter, a controllable video diffusion model that uses LiDAR point cloud renderings as pixel-level conditioning. This leverages the geometric accuracy of LiDAR data to provide precise camera control during view synthesis, even along novel trajectories, mitigating the artifacts that earlier models produce when extrapolating away from the recorded path.
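To make the conditioning mechanism concrete, the sketch below shows one plausible way to turn a colored LiDAR point cloud into a pixel-level condition image: points are projected into the target camera and rasterized with a z-buffer, yielding a sparse color image that a video diffusion model can consume alongside its noisy latents. The function and variable names, and the exact procedure, are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def lidar_condition_image(points_world, colors, K, world_to_cam, H, W):
    """Rasterize a colored LiDAR point cloud into the target camera to form a
    sparse pixel-level condition image (illustrative sketch, not the paper's code).

    points_world : (N, 3) LiDAR points in world coordinates
    colors       : (N, 3) per-point RGB colors in [0, 1]
    K            : (3, 3) camera intrinsics
    world_to_cam : (4, 4) world-to-camera extrinsic matrix
    """
    # Transform points into the camera frame.
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (world_to_cam @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    front = pts_cam[:, 2] > 0.1
    pts_cam, colors = pts_cam[front], colors[front]

    # Perspective projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    depth = pts_cam[:, 2]

    # Discard projections that fall outside the image.
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, depth, colors = u[ok], v[ok], depth[ok], colors[ok]

    # Z-buffer so the nearest point wins when several points share a pixel.
    cond = np.zeros((H, W, 3), dtype=np.float32)
    zbuf = np.full((H, W), np.inf, dtype=np.float32)
    for ui, vi, di, ci in zip(u, v, depth, colors):
        if di < zbuf[vi, ui]:
            zbuf[vi, ui] = di
            cond[vi, ui] = ci
    return cond  # sparse image passed to the video diffusion model as conditioning
```

In this reading, a condition image would be rendered for every frame of the target clip and for every novel camera pose, which is what gives the model pixel-level control over the synthesized views.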
- Integration of LiDAR Conditions: The model incorporates pixel-level LiDAR conditions to provide precise camera control when synthesizing novel views. The same conditioning also enables accurate pixel-level modification of target scenes, and such fine-grained control is particularly valuable in dynamic street environments with moving vehicles and pedestrians.
- Novel View Synthesis and Real-time Rendering: StreetCrafter's generative prior can be distilled into a dynamic scene representation, enabling real-time rendering. This is crucial for practical autonomous driving applications, where street scenarios change dynamically and require immediate adaptation.
- Strong Numerical Results: Through experimental evaluations on the Waymo Open Dataset and PandaSet, StreetCrafter demonstrated superior performance over existing methods. It showed consistent improvements in scenarios requiring flexible viewpoint changes, with impressive results in rendering quality for both interpolation and extrapolation tasks.
- Scene Editing: Without per-scene optimization, StreetCrafter supports scene editing operations such as object removal, replacement, and translation solely by manipulating LiDAR points (a minimal sketch follows this list). This flexibility highlights its potential utility in simulation and testing of autonomous driving systems.
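Because the conditioning is just a rendering of the point cloud, edits reduce to point-set operations. The minimal sketch below, assuming axis-aligned boxes around objects and the same illustrative conventions as the projection sketch above, shows how removal and translation of an object could be expressed; the edited cloud would then be re-rendered into the target cameras and fed to the diffusion model as the new condition.

```python
import numpy as np

def remove_object(points, colors, box_min, box_max):
    """Delete all LiDAR points inside an axis-aligned box (e.g. an unwanted vehicle)."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[~inside], colors[~inside]

def translate_object(points, box_min, box_max, offset):
    """Shift an object's points by `offset` (e.g. move a car into another lane)."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    moved = points.copy()
    moved[inside] += offset
    return moved
```

Replacement would follow the same pattern: delete the original object's points and insert the point cloud of the new object before re-rendering the condition image.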
Implications and Theoretical Impact
The theoretical contribution of StreetCrafter lies in adapting diffusion models, traditionally used for open-ended generation tasks, to controllable synthesis under real-world data constraints. This bridges a crucial gap in realistic scene synthesis by combining the geometric reliability of physical LiDAR measurements with sophisticated generative modeling.
By distilling the diffusion prior into a dynamic 3D Gaussian Splatting representation, StreetCrafter improves both the fidelity and the speed of view synthesis, opening avenues for more reliable and efficient autonomous vehicle simulation. Moreover, the model's success in handling diverse dynamic scenarios, such as multi-lane synthesis, indicates its potential for deployment in urban simulation environments, providing rich data for training and validating autonomous driving algorithms.
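The distillation step can be pictured as an ordinary reconstruction loop in which the diffusion model supplies pseudo ground truth for viewpoints the vehicle never visited. The sketch below is a schematic under assumed interfaces: `gaussian_scene`, `streetcrafter`, `camera`, and `lpips_fn` are hypothetical wrappers, not the released API.

```python
import torch

def distillation_step(gaussian_scene, streetcrafter, camera, timestamp, optimizer, lpips_fn):
    """One step of distilling the diffusion prior into a dynamic 3DGS scene
    (schematic; every object here is a hypothetical wrapper, not the paper's code)."""
    if camera.is_recorded:
        # Recorded viewpoints are supervised by the captured sensor frame.
        target = camera.captured_image(timestamp)
    else:
        # Novel viewpoints are supervised by the LiDAR-conditioned diffusion model.
        lidar_condition = camera.render_lidar_condition(timestamp)
        with torch.no_grad():
            target = streetcrafter.generate(lidar_condition)

    pred = gaussian_scene.render(camera, timestamp)  # fast rasterized rendering
    loss = torch.nn.functional.l1_loss(pred, target) + lpips_fn(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once fitted, only the Gaussian representation is needed at test time, which is what yields real-time rendering while retaining the diffusion model's ability to fill in extrapolated views.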
Speculations on Future Developments
The introduction of StreetCrafter suggests several areas for potential future research and development. One direction could focus on reducing the computational overhead associated with high-resolution LiDAR data and video diffusion processes, striving to meet more stringent real-time performance requirements. Additionally, extending this framework to incorporate additional sensor modalities beyond LiDAR, such as radar, could further enhance scene understanding and robustness in occluded environments.
Furthermore, the implications of pixel-level controllability suggest that more interactive applications could emerge, whereby simulation environments for autonomous systems can be dynamically updated in response to real-world changes or learning objectives. As AI progresses, such methods could contribute to more adaptive, safe, and intelligent autonomous systems, capable of operating in increasingly complex urban landscapes.