
Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline (2301.12511v2)

Published 29 Jan 2023 in cs.CV

Abstract: Recently, perception tasks based on the Bird's-Eye View (BEV) representation have drawn increasing attention, and the BEV representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources for on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing faster BEV perception on on-vehicle chips. Toward this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive transformer-based transformation or depth representation. Fast-BEV consists of five parts: (1) a lightweight, deployment-friendly view transformation that quickly transfers 2D image features to 3D voxel space, (2) a multi-scale image encoder that leverages multi-scale information for better performance, (3) an efficient BEV encoder specifically designed to speed up on-vehicle inference, (4) a strong data augmentation strategy for both image and BEV space to avoid over-fitting, and (5) a multi-frame feature fusion mechanism to leverage temporal information. In experiments on a 2080Ti platform, our R50 model runs at 52.6 FPS with 47.3% NDS on the nuScenes validation set, exceeding the 41.3 FPS and 47.5% NDS of BEVDepth-R50 and the 30.2 FPS and 45.7% NDS of BEVDet4D-R50. Our largest model (R101@900x1600) establishes a competitive 53.5% NDS on the nuScenes validation set. We further develop a benchmark with considerable accuracy and efficiency on currently popular on-vehicle chips. The code is released at: https://github.com/Sense-GVT/Fast-BEV.

Authors (11)
  1. Yangguang Li (44 papers)
  2. Bin Huang (56 papers)
  3. Zeren Chen (8 papers)
  4. Yufeng Cui (12 papers)
  5. Feng Liang (61 papers)
  6. Mingzhu Shen (14 papers)
  7. Fenggang Liu (8 papers)
  8. Enze Xie (84 papers)
  9. Lu Sheng (63 papers)
  10. Wanli Ouyang (358 papers)
  11. Jing Shao (109 papers)
Citations (31)

Summary

Fast-BEV: A Comprehensive Analysis

This paper introduces Fast-BEV, a framework for efficient Bird's-Eye View (BEV) perception in autonomous driving. The authors observe that existing BEV solutions suffer from one of two shortcomings: they either demand considerable computational resources for on-vehicle inference or deliver only modest performance. Fast-BEV is designed to combine high accuracy, rapid inference speed, and easy deployment on standard on-vehicle chips.

Core Components of Fast-BEV

Fast-BEV comprises five primary components, each contributing to its enhanced capabilities:

  1. Fast-Ray Transformation: This approach removes the transformer-based transformation and depth estimation used by prior BEV methods. Instead, it projects 2D image features into 3D voxel space via a Look-Up-Table precomputed from the fixed camera calibration, which makes the view transformation both fast and straightforward to deploy on on-vehicle chips (a minimal sketch follows this list).
  2. Multi-Scale Image Encoder: Fast-BEV leverages a multi-scale image encoder that utilizes a 3-layer feature pyramid network (FPN) structure. By processing and amalgamating multi-scale information, the encoder enhances the framework's performance without incurring excessive computational costs.
  3. Efficient BEV Encoder: The BEV encoder is optimized for rapid on-vehicle inference. It employs dimension-reduction techniques such as space-to-channel (S2C) folding and multi-frame concatenation fusion (MFCF), substantially reducing the computational load (both are sketched after this list).
  4. Data Augmentation: To address overfitting and bolster performance, Fast-BEV implements robust data augmentation strategies for both image and BEV spaces. Techniques include flipping, rotation, and resizing, performed separately in image and BEV spaces.
  5. Multi-Frame Feature Fusion: Fast-BEV enriches its spatial-temporal understanding with a multi-frame feature fusion mechanism that integrates features from several past frames, aligned to the current ego pose, to refine perception of the current frame.
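
The Fast-Ray transformation exploits the fact that, once the camera rig is calibrated, the voxel-to-pixel correspondence never changes: it can be tabulated once offline, reducing the runtime view transformation to a single index lookup. The following is a minimal PyTorch sketch of that idea under simplified assumptions (one camera, nearest-pixel lookup); the function names and tensor layouts are illustrative, not the repository's API.

```python
import torch

def build_lut(voxel_centers, intrinsics, extrinsics, img_h, img_w):
    """Offline step: map each 3D voxel center (ego frame, shape (N, 3)) to a
    flat image-pixel index for one camera; -1 marks voxels that fall outside."""
    # Camera-frame points: x_cam = R @ x_ego + t
    pts = extrinsics[:3, :3] @ voxel_centers.T + extrinsics[:3, 3:4]  # (3, N)
    pix = intrinsics @ pts                                            # (3, N)
    z = pix[2].clamp(min=1e-6)
    u = (pix[0] / z).round().long()
    v = (pix[1] / z).round().long()
    valid = (pix[2] > 0) & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    return torch.where(valid, v * img_w + u, torch.full_like(u, -1))  # (N,)

def view_transform(img_feat, lut):
    """Runtime step: lift flattened 2D features (C, H*W) into voxels (C, N)
    with a single gather -- no depth network, no attention."""
    c = img_feat.shape[0]
    voxel_feat = img_feat.new_zeros(c, lut.numel())
    hit = lut >= 0
    voxel_feat[:, hit] = img_feat[:, lut[hit]]
    return voxel_feat  # reshape to (C, Z, X, Y) downstream
```

In the paper's multi-camera setting, each voxel keeps the feature of the camera ray that hits it; the handling of overlapping camera views is omitted from this sketch.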

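The S2C and MFCF reductions in the BEV encoder amount to tensor reshapes. Here is a sketch under an assumed voxel layout of (B, C, Z, X, Y), with past-frame BEV features taken to be already warped into the current ego frame (the repository may order axes differently):

```python
import torch

def space_to_channel(voxel_feat):
    """S2C: fold the vertical (Z) axis into channels, turning the 3D feature
    volume into a 2D BEV map so cheap 2D convolutions replace 3D ones.
    (B, C, Z, X, Y) -> (B, C*Z, X, Y)"""
    b, c, z, x, y = voxel_feat.shape
    return voxel_feat.reshape(b, c * z, x, y)

def multi_frame_concat(bev_feats):
    """MFCF: fuse T ego-motion-aligned frames by channel concatenation, so a
    single 2D-conv BEV encoder processes all frames jointly.
    list of T tensors (B, C*Z, X, Y) -> (B, T*C*Z, X, Y)"""
    return torch.cat(bev_feats, dim=1)
```
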
Experimentation and Results

The experimental validation highlights the efficiency of Fast-BEV on the nuScenes validation set. On an Nvidia 2080Ti, the R50 model achieves 52.6 FPS at 47.3% NDS, surpassing BEVDepth-R50 (41.3 FPS, 47.5% NDS) and BEVDet4D-R50 (30.2 FPS, 45.7% NDS) in inference speed at comparable accuracy. The largest model (R101@900x1600) reaches a competitive 53.5% NDS.

Implications and Future Directions

Fast-BEV addresses critical barriers in deploying BEV perception on commercial autonomous vehicles by combining efficiency with competitive performance. The elimination of depth-based processing and expensive transformer operations makes it a viable candidate for real-world deployment even on resource-constrained in-vehicle chips like Xavier and Orin.

The paper also contributes a benchmark for evaluating BEV models on existing on-vehicle chips, offering insights into latency and performance across devices with varying computational capacities. This benchmark sets a reference for industry professionals aiming to deploy BEV perception systems effectively.

Looking forward, Fast-BEV's success lays a foundation for further explorations into extending BEV perception models with more modalities, such as LiDAR or radar, and supporting additional tasks like 3D tracking or motion prediction. Given its modular design, Fast-BEV could be pivotal in developing comprehensive perception systems that cater to the multifaceted needs of autonomous driving in complex environments.
