- The paper introduces a unified framework that jointly addresses 3D object detection, semantic map construction, and motion prediction using spatio-temporal BEV representations.
- It employs a spatio-temporal BEV encoder to extract features from multi-view images, and an iterative-flow scheme that makes future prediction both more accurate and more memory-efficient.
- Experimental results on the nuScenes dataset demonstrate significant improvements in NDS, mIoU, and VPQ over traditional single-task methods.
Overview of BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving
The paper "BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving" introduces an integrated framework designed for autonomous driving scenarios using multi-camera systems. This framework distinguishes itself by simultaneously addressing 3D perception and prediction tasks in a vision-centric autonomous driving environment.
BEVerse departs from the conventional paradigm of tackling single tasks separately and instead emphasizes joint reasoning through spatio-temporal Birds-Eye-View (BEV) representations. This unified approach seeks to optimize the overall efficiency and effectiveness of various driving-related tasks, such as 3D object detection, semantic map construction, and motion prediction.
Methodological Framework
The BEVerse framework involves several critical components:
- Shared Feature Extraction: A shared image backbone processes multi-camera input across timestamps, and the extracted image features are lifted into 4D (spatio-temporal) BEV representations.
- Spatio-Temporal Encoder: After ego-motion alignment of the past BEV features, this component encodes them jointly, building the spatial and temporal understanding needed for the downstream tasks.
- Task Decoders: Parallel decoders then interpret the shared features, enabling joint reasoning across tasks. Notably, a "grid sampler" resamples the BEV features to the granularity and range each task requires, and an "iterative flow" scheme rolls the BEV state forward for memory-efficient future prediction; a minimal sketch of both ideas follows this list.
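To make the two decoder-side ideas concrete, here is a minimal PyTorch sketch. It is an illustrative reconstruction, not the authors' code: the function names, tensor shapes, and the single-convolution flow network are assumptions, and BEVerse's actual grid sampler and prediction head are more elaborate.

```python
# Minimal PyTorch sketch of the two decoder-side ideas described above.
# Names, shapes, and the tiny flow network are illustrative assumptions,
# not the BEVerse authors' actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def grid_sampler(bev_feat, src_range, dst_range, dst_size):
    """Resample shared BEV features into a task-specific range/resolution.

    bev_feat:  (B, C, H, W) shared BEV feature map
    src_range: (xmin, xmax, ymin, ymax) metric extent covered by bev_feat
    dst_range: metric extent the task decoder wants
    dst_size:  (H_out, W_out) output resolution
    """
    xmin, xmax, ymin, ymax = src_range
    dxmin, dxmax, dymin, dymax = dst_range
    H_out, W_out = dst_size
    # Target-cell centres in metric coordinates, normalised to [-1, 1]
    # of the source extent as required by F.grid_sample.
    ys = torch.linspace(dymin, dymax, H_out, device=bev_feat.device)
    xs = torch.linspace(dxmin, dxmax, W_out, device=bev_feat.device)
    gy = (ys - ymin) / (ymax - ymin) * 2 - 1
    gx = (xs - xmin) / (xmax - xmin) * 2 - 1
    grid = torch.stack(torch.meshgrid(gy, gx, indexing="ij"), dim=-1)[..., [1, 0]]
    grid = grid.unsqueeze(0).expand(bev_feat.size(0), -1, -1, -1)
    return F.grid_sample(bev_feat, grid, align_corners=True)


class IterativeFlow(nn.Module):
    """Roll the BEV state forward one step at a time with a predicted 2D flow.

    Warping the previous state instead of decoding every future frame
    from scratch is what keeps the prediction head memory-friendly.
    """

    def __init__(self, channels):
        super().__init__()
        # Predicts a per-cell (x, y) offset in normalised grid units.
        self.flow_net = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, state, n_future):
        B, C, H, W = state.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=state.device),
            torch.linspace(-1, 1, W, device=state.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
        futures = []
        for _ in range(n_future):
            flow = self.flow_net(state).permute(0, 2, 3, 1)  # (B, H, W, 2)
            state = F.grid_sample(state, base + flow, align_corners=True)
            futures.append(state)
        return torch.stack(futures, dim=1)  # (B, T_future, C, H, W)
```

In this layout, `grid_sampler` would be called once per task with its own range and resolution, which is how a single shared BEV map can serve tasks with very different spatial requirements.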
Experimental Validation
The BEVerse framework was experimentally validated on the nuScenes dataset, where it outperformed existing single-task methods on 3D object detection, semantic map construction, and motion prediction. Reported results include 53.1% NDS for object detection, 51.7% mIoU for semantic mapping (7.1 points above the previous best), and improved motion prediction metrics (40.9% IoU and 36.1% VPQ).
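For readers unfamiliar with the mapping metric: mIoU here is per-class intersection-over-union averaged over the map classes. Below is a minimal sketch under that standard convention; the exact class set and BEV rasterization details come from the paper's evaluation protocol and are assumed rather than reproduced here.

```python
import torch


def semantic_map_miou(pred, gt, num_classes):
    """Mean IoU over BEV semantic-map classes.

    pred, gt: (N, H, W) integer class maps (0 assumed to be background).
    Standard per-class IoU averaged over non-background classes; a common
    convention, assumed here rather than taken from the paper.
    """
    ious = []
    for c in range(1, num_classes):
        p, g = pred == c, gt == c
        inter = (p & g).sum().float()
        union = (p | g).sum().float()
        if union > 0:  # skip classes absent from both prediction and GT
            ious.append(inter / union)
    return torch.stack(ious).mean() if ious else torch.tensor(0.0)
```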
Contributions and Implications
The paper makes several significant contributions:
- Introduction of the first comprehensive framework for simultaneous 3D perception and prediction using BEV with vision-centric systems.
- Development of innovative components such as iterative flow, which improves both efficiency and prediction quality.
- Demonstration that a multi-task approach leveraging temporal and spatial information not only reaches state-of-the-art performance but is also more efficient than handling tasks sequentially, since the expensive BEV features are computed once and shared (see the sketch after this list).
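The efficiency argument is easiest to see in a skeleton of the layout: one shared encoder pass feeds every task head. This is a minimal sketch with assumed module names and shapes, not BEVerse's actual heads, which are full detection, mapping, and motion decoders.

```python
import torch.nn as nn


class MultiTaskBEVModel(nn.Module):
    """Sketch of the shared-backbone, multi-head layout described above.

    Channel counts and head structures are illustrative assumptions.
    """

    def __init__(self, bev_channels=64, map_classes=4, det_outputs=10):
        super().__init__()
        # Stand-in for the spatio-temporal BEV encoder.
        self.bev_encoder = nn.Sequential(
            nn.Conv2d(bev_channels, bev_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.det_head = nn.Conv2d(bev_channels, det_outputs, 1)  # 3D detection
        self.map_head = nn.Conv2d(bev_channels, map_classes, 1)  # semantic map
        self.motion_head = nn.Conv2d(bev_channels, 2, 1)         # future flow

    def forward(self, bev):
        shared = self.bev_encoder(bev)  # computed once ...
        return {                        # ... consumed by every task head
            "detection": self.det_head(shared),
            "semantic_map": self.map_head(shared),
            "motion": self.motion_head(shared),
        }
```

Running three single-task models would repeat the encoder pass three times; here it happens once, which is the source of the efficiency gain the paper reports.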
Future Directions in AI
BEVerse opens up several potential avenues for future exploration. By leveraging shared information across tasks, the approach could be extended or adapted to other areas where multi-modal data and concurrent task processing can be beneficial. Moreover, refining techniques to improve feature extraction efficiency could further enhance real-time processing capabilities in autonomous systems.
In conclusion, BEVerse represents a significant step forward in enhancing the effectiveness of autonomous driving systems through integrated task handling, indicating a promising direction for future research in vision-centric autonomous technology.