Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection (2012.15712v2)

Published 31 Dec 2020 in cs.CV

Abstract: Recent advances on 3D object detection heavily rely on how the 3D data are represented, i.e., voxel-based or point-based representation. Many existing high performance 3D detectors are point-based because this structure can better retain precise point positions. Nevertheless, point-level features lead to high computation overheads due to unordered storage. In contrast, the voxel-based structure is better suited for feature extraction but often yields lower accuracy because the input data are divided into grids. In this paper, we take a slightly different viewpoint: we find that precise positioning of raw points is not essential for high performance 3D object detection and that the coarse voxel granularity can also offer sufficient detection accuracy. Bearing this view in mind, we devise a simple but effective voxel-based framework, named Voxel R-CNN. By taking full advantage of voxel features in a two stage approach, our method achieves comparable detection accuracy with state-of-the-art point-based models, but at a fraction of the computation cost. Voxel R-CNN consists of a 3D backbone network, a 2D bird-eye-view (BEV) Region Proposal Network and a detect head. A voxel RoI pooling is devised to extract RoI features directly from voxel features for further refinement. Extensive experiments are conducted on the widely used KITTI Dataset and the more recent Waymo Open Dataset. Our results show that compared to existing voxel-based methods, Voxel R-CNN delivers a higher detection accuracy while maintaining a real-time frame processing rate, i.e., at a speed of 25 FPS on an NVIDIA RTX 2080 Ti GPU. The code is available at https://github.com/djiajunustc/Voxel-R-CNN.

Authors (6)

Jiajun Deng
Shaoshuai Shi
Peiwei Li
Wengang Zhou
Yanyong Zhang
Houqiang Li

Citations (738)

Summary

Voxel R-CNN: A Two-Stage Approach to Efficient 3D Object Detection

This paper introduces Voxel R-CNN, a two-stage framework for 3D object detection that works entirely on voxel-based representations. It reaches accuracy comparable to state-of-the-art point-based detectors at a fraction of their computation cost, and it is more accurate than existing voxel-based detectors while still running in real time.

Background and Motivation

3D object detection has significant applications in autonomous driving and robotics, where it must balance precision against computational efficiency. Existing methods use either point-based or voxel-based representations. Point-based methods retain precise point positions but incur high computational costs because point clouds are unordered and unstructured. Voxel-based methods extract features efficiently on a regular grid, but they often yield lower accuracy because dividing the input into grid cells discards the exact position of each point.
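To make the trade-off concrete, here is a minimal NumPy sketch of voxelization. It is an illustration under stated assumptions rather than the authors' code: the function name `voxelize`, the voxel size, and the sample points are all hypothetical. It shows that nearby points collapse into one voxel and that the position lost per axis is at most half the voxel size.

```python
import numpy as np

def voxelize(points, voxel_size, range_min):
    """Assign each point to a voxel and return the occupied voxels."""
    points = np.asarray(points, dtype=np.float64)
    voxel_size = np.asarray(voxel_size, dtype=np.float64)
    range_min = np.asarray(range_min, dtype=np.float64)
    # Integer grid coordinates of the voxel that contains each point.
    idx = np.floor((points - range_min) / voxel_size).astype(np.int64)
    # Occupied voxels; `inverse` maps every point to its voxel row.
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    # Geometric center of each occupied voxel.
    centers = range_min + (voxels + 0.5) * voxel_size
    return voxels, centers, inverse

# Hypothetical points (metres) and an illustrative voxel size.
points = np.array([[0.12, 0.03, 0.01],
                   [0.14, 0.04, 0.02],
                   [1.37, 0.52, 0.11]])
voxel_size = (0.05, 0.05, 0.1)
voxels, centers, inverse = voxelize(points, voxel_size, range_min=(0.0, 0.0, 0.0))

# Per-axis distance between each point and the center of its voxel.
error = np.abs(points - centers[inverse])
print(len(voxels))  # 2: the first two points share a voxel
```

The first two points fall into the same voxel, so a voxel-based detector can no longer tell them apart by position alone. The paper's claim is that this loss does not prevent high detection accuracy.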

The authors' key observation is that precise positioning of raw points is not essential for high-performance 3D detection, and that coarse voxel granularity can offer sufficient accuracy. This observation motivates Voxel R-CNN, which relies on voxel features alone to achieve high accuracy with reduced computational overhead.

Methodology

Voxel R-CNN employs a two-stage detection framework:

3D and 2D Backbone Networks: The framework utilizes a 3D backbone network to abstract voxel features and convert these into bird-eye-view (BEV) features. Subsequently, a 2D backbone network alongside a Region Proposal Network (RPN) generates region proposals.
Voxel RoI Pooling: A novel pooling operation aggregates features from 3D voxel feature volumes for further refinement. This operation capitalizes on voxel locality to efficiently gather feature information, markedly reducing computational costs.
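The idea behind voxel RoI pooling can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's operator. It assumes an axis-aligned RoI (no heading angle), a single feature volume, a neighborhood defined by Manhattan distance in voxel-index space, and plain max pooling in place of any learned aggregation. The function name `voxel_roi_pool` and all parameters are hypothetical.

```python
import itertools
import numpy as np

def voxel_roi_pool(voxel_coords, voxel_feats, roi_min, roi_max,
                   voxel_size, range_min, grid_size=2, max_dist=1):
    """Max-pool features of non-empty voxels around each RoI grid point."""
    voxel_size = np.asarray(voxel_size, dtype=np.float64)
    range_min = np.asarray(range_min, dtype=np.float64)
    roi_min = np.asarray(roi_min, dtype=np.float64)
    roi_max = np.asarray(roi_max, dtype=np.float64)
    # Sparse lookup: integer voxel coordinate -> row in voxel_feats.
    lookup = {tuple(c): i for i, c in enumerate(voxel_coords.tolist())}
    # Integer offsets whose Manhattan (L1) norm is at most max_dist.
    r = range(-max_dist, max_dist + 1)
    offsets = [o for o in itertools.product(r, r, r)
               if sum(abs(v) for v in o) <= max_dist]
    # Fractional centers of the grid_size^3 sub-cells of the RoI.
    steps = (np.arange(grid_size) + 0.5) / grid_size
    pooled = np.zeros((grid_size ** 3, voxel_feats.shape[1]),
                      dtype=voxel_feats.dtype)
    for g, frac in enumerate(itertools.product(steps, steps, steps)):
        center = roi_min + np.asarray(frac) * (roi_max - roi_min)
        # Voxel that contains this grid point.
        base = np.floor((center - range_min) / voxel_size).astype(int)
        rows = []
        for o in offsets:
            key = tuple((base + np.asarray(o)).tolist())
            if key in lookup:  # only non-empty voxels contribute
                rows.append(lookup[key])
        if rows:
            pooled[g] = voxel_feats[rows].max(axis=0)
    return pooled

# Three non-empty voxels with 2-channel features (made-up values).
coords = np.array([[0, 0, 0], [1, 0, 0], [3, 3, 3]])
feats = np.array([[1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
pooled = voxel_roi_pool(coords, feats, roi_min=(0, 0, 0), roi_max=(2, 2, 2),
                        voxel_size=(1, 1, 1), range_min=(0, 0, 0))
print(pooled.shape)  # (8, 2)
```

Because voxels sit on a regular grid, each neighbor lookup is a constant-time index computation instead of a search over all points. This is the locality that the pooling operation exploits.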

The detection head then takes these RoI features and refines the proposals, producing the final box localization and classification confidence. This structure lets the system work with coarse voxel granularity while keeping enough spatial context for accurate detection.
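The summary does not say how the 3D voxel features are converted to BEV features in the first stage. One common approach in voxel-based detectors is to fold the height axis into the channel axis. The sketch below assumes that approach and a dense (C, D, H, W) layout; `to_bev` is a hypothetical name.

```python
import numpy as np

def to_bev(volume):
    """Fold the height axis of a (C, D, H, W) feature volume into channels."""
    c, d, h, w = volume.shape
    # Row-major reshape: output channel c * d_size + z holds volume[c, z].
    return volume.reshape(c * d, h, w)

# Toy volume: 2 channels, 3 height bins, 4 x 5 ground-plane cells.
volume = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
bev = to_bev(volume)
print(bev.shape)  # (6, 4, 5)
```

The result is an ordinary 2D feature map, so a standard 2D convolutional backbone and RPN can run on it.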

Experimental Results

The method is evaluated on the KITTI dataset and the Waymo Open Dataset. On the KITTI test set, Voxel R-CNN achieves 81.62% 3D AP for the car class at the moderate difficulty level. This is comparable to state-of-the-art point-based methods at a much lower computational cost: the model runs at 25 FPS on an NVIDIA RTX 2080 Ti.
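As a quick check on what 25 FPS means in practice, the frame rate converts to a per-frame latency as follows. The 10 Hz sensor rate used for comparison is an assumption (a typical LiDAR sweep rate), not a figure from the paper.

```latex
t_{\text{frame}} = \frac{1}{25\ \text{frames/s}} = 0.04\ \text{s} = 40\ \text{ms},
\qquad
t_{\text{sweep}} = \frac{1}{10\ \text{Hz}} = 100\ \text{ms}
\;\Rightarrow\; t_{\text{frame}} < t_{\text{sweep}}
```

Under that assumption, the detector finishes each frame before the next sweep arrives.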

On the Waymo Open Dataset, Voxel R-CNN reaches an overall LEVEL_1 3D mAP of 75.59% for vehicle detection and performs particularly well in the long-range band (50m-Inf). The paper reports that this exceeds the previous best methods by a clear margin, which supports the claim that coarse voxel representations are sufficient when paired with an efficient voxel pooling strategy.

Implications and Future Directions

The results presented in the paper underscore the potential of voxel-based methods to deliver efficient 3D object detection without compromising accuracy. By refining voxel representations rather than relying on point-level precision, Voxel R-CNN lays the groundwork for future exploration into optimizing voxel-based frameworks for real-time applications.

Furthermore, the incorporation of voxel RoI pooling may inspire novel pooling strategies that further exploit voxel regularity, contributing to even greater computational efficiencies. Future work may focus on extending this framework to integrate seamlessly with full 360-degree detection systems and expanding its application across varying environments and object densities.

The approach is also relevant beyond autonomous driving, for example in robotics and augmented reality, where 3D understanding of the environment is required. Voxel R-CNN provides a strong baseline for further work on balancing 3D detection precision against efficiency.

GitHub

open-mmlab/OpenPCDet: OpenPCDet Toolbox for LiDAR-based 3D Object Detection.