- The paper introduces DrivingRecon, a model that predicts dynamic 4D Gaussian reconstructions from multi-view video inputs in a single pass.
- It employs a novel Prune and Dilate Block to optimize multi-view integration and achieve superior PSNR, SSIM, and LPIPS scores compared to state-of-the-art methods.
- The framework demonstrates robust scalability and generalization, highlighting its potential for unsupervised pre-training and adaptable vehicle deployment in autonomous driving.
Insights into DrivingRecon: A Large 4D Gaussian Reconstruction Model for Autonomous Driving
This essay discusses DrivingRecon, a 4D Gaussian reconstruction framework aimed at enhancing scene simulation for autonomous driving. By predicting 4D Gaussians from surround-view video inputs in a single feed-forward pass, DrivingRecon advances dynamic scene reconstruction for self-driving vehicles. Despite rapid progress in autonomous driving, robust, photorealistic reconstruction of large-scale, dynamic environments remains challenging. The paper details how DrivingRecon addresses these challenges, evaluating its efficacy across multiple tasks and datasets, including model pre-training, vehicle adaptation, and scene editing.
Technical Framework
Overall Architecture
DrivingRecon is a feed-forward large 4D Gaussian model that leverages temporal multi-view images. Its core architecture comprises a 2D convolutional encoder that processes the multi-view inputs, followed by a depth estimation module and a temporal cross-attention mechanism. Together, these components fuse spatial and temporal information, which is crucial for producing high-fidelity 4D reconstructions.
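The PyTorch sketch below illustrates how such a feed-forward pipeline could be wired together. The module names, channel counts, and the 14-channel Gaussian parameterization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FeedForward4DGS(nn.Module):
    def __init__(self, feat_dim=128, num_heads=8):
        super().__init__()
        # Shared 2D convolutional encoder applied to every view and timestep
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Per-pixel depth head used to lift image features into 3D
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)
        # Temporal cross-attention fuses tokens across views and timesteps
        self.temporal_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Gaussian head: position offset, scale, rotation, opacity, color (14 channels assumed)
        self.gaussian_head = nn.Conv2d(feat_dim, 14, 1)

    def forward(self, frames):
        # frames: (B, T, V, 3, H, W) surround-view video
        B, T, V, _, H, W = frames.shape
        feats = self.encoder(frames.flatten(0, 2))               # (B*T*V, D, h, w)
        depth = self.depth_head(feats).sigmoid()                 # coarse normalized depth
        D, h, w = feats.shape[1:]
        tokens = feats.flatten(2).transpose(1, 2)                # (B*T*V, h*w, D)
        tokens = tokens.reshape(B, T * V * h * w, D)             # all views/timesteps as one sequence
        fused, _ = self.temporal_attn(tokens, tokens, tokens)    # spatio-temporal fusion
        fused = fused.reshape(B * T * V, h, w, D).permute(0, 3, 1, 2)
        gaussians = self.gaussian_head(fused)                    # per-pixel Gaussian parameters
        return gaussians, depth
```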
Prune and Dilate Block (PD-Block)
A pivotal innovation, the Prune and Dilate Block (PD-Block), improves multi-view integration by pruning redundant Gaussian points and dilating Gaussian representations around complex regions. This reduces computational redundancy, concentrating capacity on complex scene areas while merging similar or overlapping content seen from different views. The block partitions input feature maps into smaller regions and combines range-view cues with cosine similarity under a threshold-based selection mechanism to aggregate and consolidate features.
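A minimal sketch of the prune-and-dilate idea is given below; the region size, similarity threshold, and dilation rule are assumptions chosen for illustration, not the paper's exact operators.

```python
import torch
import torch.nn.functional as F

def prune_and_dilate(feats, region=4, sim_thresh=0.9):
    # feats: (N, D) Gaussian features flattened from a feature map,
    # grouped into contiguous regions of `region` points each.
    N, D = feats.shape
    feats = feats[: (N // region) * region].view(-1, region, D)       # (R, region, D)
    centroid = feats.mean(dim=1, keepdim=True)                         # (R, 1, D)
    sim = F.cosine_similarity(feats, centroid, dim=-1)                 # (R, region)
    redundant = sim > sim_thresh                                       # near-duplicate points
    # Prune: drop points that are highly similar to their region centroid.
    kept = [f[~m] for f, m in zip(feats, redundant)]
    # Dilate: duplicate the most dissimilar point of each region so
    # complex areas receive additional Gaussians.
    dilated = feats[torch.arange(feats.shape[0]), sim.argmin(dim=1)]   # (R, D)
    return torch.cat(kept + [dilated], dim=0)
```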
Rendering
DrivingRecon combines dynamic and static rendering strategies, using optical flow prediction for dynamic scene components. This dual approach yields accurate Gaussian representations of both moving and static objects and provides the temporal supervision needed to predict and render motion across sequences.
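The sketch below shows one plausible way to split Gaussians into static and dynamic sets and advance the dynamic centers with a predicted flow before rasterization; `render_gaussians` is a stand-in for any differentiable Gaussian renderer, and the flow parameterization is an assumption.

```python
import torch

def render_frame(means, colors, opacities, dynamic_mask, flow, dt, render_gaussians):
    # means: (N, 3) Gaussian centers; flow: (N, 3) predicted 3D motion per unit time
    moved = means.clone()
    moved[dynamic_mask] = means[dynamic_mask] + dt * flow[dynamic_mask]
    # Static Gaussians stay at their original positions; dynamic ones follow the
    # flow field, so rendering at several timesteps supervises motion temporally.
    return render_gaussians(moved, colors, opacities)
```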
Empirical Evaluation
The empirical evaluation used the NOTR subset of the Waymo Open Dataset as well as Diverse-56 to assess the proposed model. Training ran for 50,000 iterations on NVIDIA A100 GPUs with bfloat16 precision. This setup allowed comprehensive testing across the varied environmental conditions and scenarios typical of autonomous driving.
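A minimal bfloat16 training loop of this kind might look as follows; the optimizer, learning rate, loss, and batch keys are placeholders rather than the paper's actual hyperparameters.

```python
import torch

def train(model, dataloader, num_iters=50_000, lr=1e-4, device="cuda"):
    model = model.to(device)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    data_iter = iter(dataloader)
    for step in range(num_iters):
        try:
            batch = next(data_iter)
        except StopIteration:              # cycle the dataloader
            data_iter = iter(dataloader)
            batch = next(data_iter)
        frames = batch["frames"].to(device)
        target = batch["target"].to(device)
        # bfloat16 autocast keeps fp32 dynamic range while halving memory traffic
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            pred = model(frames)
            loss = torch.nn.functional.l1_loss(pred, target)
        optim.zero_grad(set_to_none=True)
        loss.backward()
        optim.step()
```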
In-Scene and Cross-Scene Evaluations
DrivingRecon was benchmarked against several state-of-the-art methods, including LGM, pixelSplat, MVSplat, and L4GM. The comparisons show strong quantitative results in PSNR, SSIM, and LPIPS, with the model significantly outperforming baselines on both static and dynamic scene content. Cross-scene evaluations further underscore its adaptability: it handles novel environments without significant performance degradation, a testament to its generalization capability.
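For reference, PSNR, the simplest of the three reported metrics, can be computed directly as below; SSIM and LPIPS are typically taken from standard packages such as scikit-image and lpips.

```python
import torch

def psnr(pred, target, max_val=1.0):
    # pred, target: (B, 3, H, W) rendered and ground-truth images in [0, max_val]
    mse = torch.mean((pred - target) ** 2, dim=(1, 2, 3))
    return 10.0 * torch.log10(max_val ** 2 / mse)   # per-image PSNR in dB
```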
Scalability and Generalization
DrivingRecon was subjected to extensive ablation studies, which confirm the contribution of individual components such as the PD-Block and the 3D-aware position encoding. The model also maintains its effectiveness as the training dataset size grows, indicating a design robust enough for generalization tasks.
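One common realization of a 3D-aware position encoding lifts each pixel to a world-space camera ray and embeds it with a small MLP, as sketched below; this illustrates the general technique under assumed names and shapes, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class Ray3DPositionEncoding(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))

    def forward(self, K_inv, cam2world, H, W):
        # K_inv: (3, 3) inverse intrinsics; cam2world: (4, 4) camera-to-world extrinsics
        v, u = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")
        pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)   # (H, W, 3) pixel coords
        dirs = pix @ K_inv.T                                                # camera-space ray directions
        dirs = dirs @ cam2world[:3, :3].T                                   # rotate rays into world space
        origin = cam2world[:3, 3].expand_as(dirs)                           # camera center per pixel
        rays = torch.cat([origin, torch.nn.functional.normalize(dirs, dim=-1)], dim=-1)
        return self.mlp(rays)                                               # (H, W, feat_dim) encoding
```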
Practical and Theoretical Implications
Pre-training and Vehicle Adaptation
One notable implication of DrivingRecon is its potential for unsupervised pre-training: pre-trained models transfer well to downstream tasks, suggesting broader use for network initialization before task-specific fine-tuning. The model's ability to handle varying camera parameters further underlines its adaptability, which is crucial for scalable deployment across diverse vehicle configurations.
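A simple way to make a reconstruction network tolerant to per-vehicle camera differences is to embed normalized intrinsics as a per-view conditioning vector, as in the hypothetical sketch below; this conveys the general idea rather than the paper's mechanism.

```python
import torch
import torch.nn as nn

class CameraConditioner(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(4, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))

    def forward(self, feats, K, image_size):
        # feats: (V, D, h, w) per-view features; K: (V, 3, 3) camera intrinsics
        H, W = image_size
        # Normalize focal lengths and principal point by image size so the
        # conditioning is resolution-independent across vehicles.
        params = torch.stack([K[:, 0, 0] / W, K[:, 1, 1] / H,
                              K[:, 0, 2] / W, K[:, 1, 2] / H], dim=-1)  # (V, 4)
        cond = self.embed(params)[:, :, None, None]                      # (V, D, 1, 1)
        return feats + cond                                              # additive per-view conditioning
```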
Future Developments
Notwithstanding these advances, future work might improve the efficiency of the PD-Block or explore alternative geometric representations. Incorporating real-time constraints and reducing computational demands are essential for practical deployment in complex, real-world environments. Extending the framework to integrate emerging sensor technologies could further refine its prediction and reconstruction accuracy.
The authors' release of the codebase on GitHub ensures transparency and facilitates community-driven advancements in autonomous driving simulation, reflecting both the practical impact and the theoretical contributions of DrivingRecon.