UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction for Autonomous Driving (2305.18829v5)

Published 30 May 2023 in cs.CV, cs.MM, and cs.RO

Abstract: Multi-camera 3D perception has emerged as a prominent research field in autonomous driving, offering a viable and cost-effective alternative to LiDAR-based solutions. The existing multi-camera algorithms primarily rely on monocular 2D pre-training. However, the monocular 2D pre-training overlooks the spatial and temporal correlations among the multi-camera system. To address this limitation, we propose the first multi-camera unified pre-training framework, called UniScene, which involves initially reconstructing the 3D scene as the foundational stage and subsequently fine-tuning the model on downstream tasks. Specifically, we employ Occupancy as the general representation for the 3D scene, enabling the model to grasp geometric priors of the surrounding world through pre-training. A significant benefit of UniScene is its capability to utilize a considerable volume of unlabeled image-LiDAR pairs for pre-training purposes. The proposed multi-camera unified pre-training framework demonstrates promising results in key tasks such as multi-camera 3D object detection and surrounding semantic scene completion. When compared to monocular pre-training methods on the nuScenes dataset, UniScene shows a significant improvement of about 2.0% in mAP and 2.0% in NDS for multi-camera 3D object detection, as well as a 3% increase in mIoU for surrounding semantic scene completion. By adopting our unified pre-training method, a 25% reduction in 3D training annotation costs can be achieved, offering significant practical value for the implementation of real-world autonomous driving. Codes are publicly available at https://github.com/chaytonmin/UniScene.

Authors (5)
  1. Chen Min (17 papers)
  2. Liang Xiao (80 papers)
  3. Dawei Zhao (22 papers)
  4. Yiming Nie (9 papers)
  5. Bin Dai (60 papers)

Summary

Multi-Camera Unified Pre-Training via 3D Scene Reconstruction

The paper introduces a new approach to multi-camera 3D perception in autonomous driving, a setting that has become a focal point of the field because it offers a cost-efficient alternative to LiDAR-based solutions. Current state-of-the-art methods rely heavily on monocular 2D pre-training, which neglects the spatial and temporal correlations inherent in a multi-camera rig. To bridge this gap, the authors propose UniScene, a unified pre-training framework that uses 3D scene reconstruction as its foundational pre-training phase.

The essence of UniScene lies in using 3D geometric occupancy as the primary scene representation during pre-training. This lets the model absorb geometric priors about the surrounding environment, a meaningful step beyond current methods that rely mainly on monocular depth estimation. Because pre-training runs on unlabeled image-LiDAR pairs, UniScene also cuts the manual annotation costs associated with 3D training, with a reported reduction of 25%.
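
To make the occupancy representation concrete, the sketch below voxelizes a LiDAR point cloud into the kind of binary occupancy grid used as a pre-training target. It is illustrative only; the grid extents, voxel size, and function name are assumptions, not the authors' implementation.

```python
import numpy as np

def occupancy_labels(points,
                     pc_range=(-50.0, -50.0, -5.0, 50.0, 50.0, 3.0),
                     voxel_size=0.5):
    """Voxelize an (N, 3) LiDAR point cloud into a binary occupancy grid.

    pc_range is (x_min, y_min, z_min, x_max, y_max, z_max) in metres
    (illustrative values, not the paper's). Returns a bool array of shape
    (X, Y, Z) where True marks an occupied voxel.
    """
    lo, hi = np.array(pc_range[:3]), np.array(pc_range[3:])
    shape = np.ceil((hi - lo) / voxel_size).astype(np.int64)
    grid = np.zeros(shape, dtype=bool)
    # Keep only points inside the region of interest, then bucket them.
    inside = np.all((points >= lo) & (points < hi), axis=1)
    idx = ((points[inside] - lo) / voxel_size).astype(np.int64)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid
```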

The experimental evaluation uses the nuScenes dataset and shows a marked improvement over monocular pre-training methods. Specifically, UniScene gains about 2.0% in both mAP and NDS for multi-camera 3D object detection and 3.0% in mIoU for surrounding semantic scene completion. The authors attribute these gains to the model's capacity to exploit the rich spatial and temporal information in the image-LiDAR data: reconstructing entire 3D occupancy grids, rather than relying solely on depth estimated from monocular views, is the pivotal change.
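
For readers unfamiliar with the scene-completion metric, mIoU is the intersection-over-union averaged over semantic classes on the voxel grid. A minimal sketch follows; the class handling is simplified and not the benchmark's exact protocol (e.g., it omits ignore labels).

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU between predicted and ground-truth voxel label grids.

    pred and gt are integer arrays of the same shape holding class indices.
    Classes absent from both grids are skipped to avoid division by zero.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```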

Methodologically, UniScene's pre-training stage reconstructs the 3D scene from multi-camera input. A noteworthy design choice is multi-frame fusion, in which LiDAR point clouds from several frames are combined to produce denser and more accurate occupancy labels. Training employs a focal loss to counter the severe class imbalance in occupancy prediction, ensuring accurate predictions for the comparatively rare occupied voxels.
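
Both ingredients can be sketched briefly: the fusion step accumulates past LiDAR sweeps in the current ego frame before voxelization, and the focal loss down-weights the abundant empty voxels so that the rare occupied ones drive the gradient. This is a hedged illustration; the pose convention, hyperparameters, and function names are assumptions rather than the paper's exact code.

```python
import numpy as np
import torch
import torch.nn.functional as F

def fuse_frames(frames, poses):
    """Map each LiDAR sweep into the current ego frame and stack them.

    frames: list of (N_i, 3) point arrays; poses: list of 4x4 homogeneous
    transforms taking each sweep's frame to the current frame (assumed
    convention). The fused cloud can then be voxelized into labels.
    """
    fused = [pts @ T[:3, :3].T + T[:3, 3] for pts, T in zip(frames, poses)]
    return np.concatenate(fused, axis=0)

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss over per-voxel occupancy logits; targets are float 0/1.

    The (1 - p_t) ** gamma factor suppresses the gradient from the abundant,
    easily classified empty voxels, so occupied voxels dominate training.
    alpha and gamma values here are common defaults, not the paper's.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    prob = torch.sigmoid(logits)
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```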

Beyond the empirical evidence of improved 3D perception accuracy, the authors discuss potential applications and extensions of their framework. UniScene lays the groundwork for future research on autonomous perception, particularly through the integration of more sophisticated 3D reconstruction methods such as NeRF or multi-view stereo (MVS).

The implications of this research are substantial both theoretically and practically. From a theoretical standpoint, this work challenges the traditional reliance on 2D single-view pre-training in multi-camera systems by presenting a robust alternative that captures more comprehensive spatial and temporal dynamics. Practically, the adoption of such frameworks in real-world autonomous driving scenarios can catalyze advancements in system performance while simultaneously reducing data annotation overheads.

In conclusion, UniScene’s paradigm shift towards 3D scene reconstruction for multi-camera systems sets a new direction for autonomous vehicle perception systems. This methodology not only exhibits superior performance in benchmark tests but also offers a scalable, cost-effective approach for future developments in AI-driven perception systems. The framework’s advancement could play a pivotal role in fulfilling the increasing demands for efficient, precise, and scalable autonomous systems.