- The paper introduces a Hybrid Voxel Feature Encoder that decouples feature extraction from projection scales to enable effective multi-scale fusion.
- It employs an attentive voxel encoding mechanism that prioritizes relevant features, outperforming traditional point-wise methods like PointNet.
- Combined with a Feature Fusion Pyramid Network, HVNet achieves the best mean Average Precision on the KITTI benchmark at real-time speed, with particularly strong cyclist detection.
HVNet: Hybrid Voxel Network for LiDAR Based 3D Object Detection
The paper introduces HVNet, a one-stage unified network for point cloud-based 3D object detection in autonomous driving. HVNet targets the trade-off inherent in voxel size selection for LiDAR data: fine voxels preserve geometric detail but are computationally expensive, while coarse voxels are efficient but lose accuracy, so any single fixed size compromises between computational efficiency and detection quality.
Key Contributions
- Hybrid Voxel Feature Encoder (HVFE): The paper proposes a novel encoder that fuses voxel features from multiple scales at a point-wise level, addressing the voxel size selection problem. The encoder decouples the feature extraction scales from the feature map projection scales, allowing efficient multi-scale aggregation that improves detection performance without sacrificing inference speed (a minimal encoding sketch follows this list).
- Attentive Voxel Feature Encoding: HVNet introduces an attentive feature encoding mechanism that outperforms standard point-wise encoders such as PointNet. The attention mechanism selectively emphasizes relevant voxel features, refining detection accuracy with minimal computational overhead.
- Feature Fusion Pyramid Network (FFPN): A pyramid network aggregates multi-scale information, strengthening object representations across spatial resolutions and improving detection accuracy for objects of diverse sizes (see the pyramid fusion sketch after this list).
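To make the attentive, multi-scale voxel encoding idea concrete, the following PyTorch sketch shows one plausible simplification. It is not the paper's exact AVFE/HVFE implementation: the module name, the centroid-offset attention cue, and the attention-weighted sum aggregation are assumptions chosen to illustrate attention over the points of a voxel and the gather-back of voxel features to points for point-wise fusion.

```python
import torch
import torch.nn as nn


class AttentiveVoxelEncoder(nn.Module):
    """Illustrative sketch (not the paper's code): attention-weighted voxel
    feature encoding with a point-wise gather-back of the voxel feature."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.point_mlp = nn.Linear(in_dim, out_dim)
        # attention score from the raw point feature plus its offset (3-D)
        # to the centroid of the voxel it falls into -- an assumed cue
        self.attn_mlp = nn.Linear(in_dim + 3, 1)

    def forward(self, points: torch.Tensor, voxel_ids: torch.Tensor, num_voxels: int):
        # points:    (N, in_dim) raw point features (x, y, z, intensity, ...)
        # voxel_ids: (N,) index of the voxel each point falls into at this scale
        xyz = points[:, :3]
        dev = points.device

        # per-voxel centroid of the assigned points
        ones = torch.ones(points.size(0), 1, device=dev)
        counts = torch.zeros(num_voxels, 1, device=dev).index_add_(0, voxel_ids, ones).clamp(min=1)
        centroid = torch.zeros(num_voxels, 3, device=dev).index_add_(0, voxel_ids, xyz) / counts

        offset = xyz - centroid[voxel_ids]                        # (N, 3)
        feat = torch.relu(self.point_mlp(points))                 # (N, out_dim)
        score = self.attn_mlp(torch.cat([points, offset], 1))     # (N, 1)

        # normalise attention scores over the points of each voxel (softmax per voxel)
        score = score.exp()
        denom = torch.zeros(num_voxels, 1, device=dev).index_add_(0, voxel_ids, score)[voxel_ids]
        weight = score / denom.clamp(min=1e-6)

        # attention-weighted sum -> one feature per voxel
        voxel_feat = torch.zeros(num_voxels, feat.size(1), device=dev).index_add_(0, voxel_ids, weight * feat)
        # gather back so every point carries its voxel's feature at this scale
        return voxel_feat, voxel_feat[voxel_ids]
```

Running such an encoder at several voxel scales and concatenating the per-point gathered features yields a point-level representation that mixes all scales, which is the essence of the hybrid fusion and of decoupling encoding scales from the scales at which features are later projected to the feature map.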
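The FFPN is only described at a high level here, so the sketch below shows a generic pyramid-style fusion of multi-scale bird's-eye-view feature maps: coarser maps are upsampled to the finest resolution, mixed with a 1x1 convolution, and re-projected to one map per detection scale. The module name, layer choices, and strides are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureFusionPyramid(nn.Module):
    """Illustrative sketch of pyramid-style fusion of multi-scale BEV maps."""

    def __init__(self, channels: int, num_scales: int = 3):
        super().__init__()
        self.fuse = nn.Conv2d(channels * num_scales, channels, kernel_size=1)
        self.heads = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, stride=2 ** i, padding=1)
            for i in range(num_scales)
        ])

    def forward(self, feature_maps):
        # feature_maps: list of (B, C, H / 2^i, W / 2^i) tensors, finest first
        target = feature_maps[0].shape[-2:]
        upsampled = [
            F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in feature_maps
        ]
        # fuse all scales at the finest resolution, then split back into
        # one feature map per detection scale for the heads
        fused = F.relu(self.fuse(torch.cat(upsampled, dim=1)))
        return [head(fused) for head in self.heads]
```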
Numerical Results
On the KITTI benchmark, HVNet achieves the best mean Average Precision (mAP) among existing methods, both one-stage and two-stage, while running in real time at 31 Hz. It is particularly strong on the cyclist class, outperforming competing methods in both mAP and detection speed.
Implications and Future Directions
The practical implications of HVNet extend to various autonomous driving applications, where real-time 3D detection is crucial for safe navigation. The hybrid approach to voxel feature encoding sets a precedent for further research into scalable and efficient 3D detection systems. Theoretically, this paper enriches the understanding of feature extraction and representation in sparse point cloud environments, a topic increasingly relevant in machine learning and robotics.
Future work could explore extending HVNet to other sensor modalities, such as radar or camera data. Integrating HVNet with emerging neural architectures or unsupervised learning techniques could further improve the robustness and adaptability of autonomous systems.
In summary, HVNet represents a notable advance in 3D object detection, offering valuable insights and practical solutions for autonomous driving technology.