
End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds (1910.06528v2)

Published 15 Oct 2019 in cs.CV

Abstract: Recent work on 3D object detection advocates point cloud voxelization in birds-eye view, where objects preserve their physical dimensions and are naturally separable. When represented in this view, however, point clouds are sparse and have highly variable point density, which may cause detectors difficulties in detecting distant or small objects (pedestrians, traffic signs, etc.). On the other hand, perspective view provides dense observations, which could allow more favorable feature encoding for such cases. In this paper, we aim to synergize the birds-eye view and the perspective view and propose a novel end-to-end multi-view fusion (MVF) algorithm, which can effectively learn to utilize the complementary information from both. Specifically, we introduce dynamic voxelization, which has four merits compared to existing voxelization methods, i) removing the need of pre-allocating a tensor with fixed size; ii) overcoming the information loss due to stochastic point/voxel dropout; iii) yielding deterministic voxel embeddings and more stable detection outcomes; iv) establishing the bi-directional relationship between points and voxels, which potentially lays a natural foundation for cross-view feature fusion. By employing dynamic voxelization, the proposed feature fusion architecture enables each point to learn to fuse context information from different views. MVF operates on points and can be naturally extended to other approaches using LiDAR point clouds. We evaluate our MVF model extensively on the newly released Waymo Open Dataset and on the KITTI dataset and demonstrate that it significantly improves detection accuracy over the comparable single-view PointPillars baseline.

Authors (9)
  1. Yin Zhou (32 papers)
  2. Pei Sun (49 papers)
  3. Yu Zhang (1400 papers)
  4. Dragomir Anguelov (73 papers)
  5. Jiyang Gao (28 papers)
  6. Tom Ouyang (4 papers)
  7. James Guo (3 papers)
  8. Jiquan Ngiam (17 papers)
  9. Vijay Vasudevan (24 papers)
Citations (324)

Summary

  • The paper introduces a novel MVF framework that integrates birds-eye and perspective views to enhance 3D object detection in LiDAR data.
  • It presents dynamic voxelization to overcome fixed-size limitations, resulting in improved data utilization and more consistent voxel embeddings.
  • Experimental results on the Waymo and KITTI datasets show significant accuracy gains in detecting vehicles and pedestrians, especially at long ranges.

End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds: An Expert Analysis

This paper introduces an advanced approach to 3D object detection in LiDAR point clouds, termed End-to-End Multi-View Fusion (MVF). The authors address limitations of current 3D detection methodologies and propose solutions that effectively improve detection performance, particularly in sparse point cloud scenarios.

Overview of Contributions

The MVF method centers on leveraging complementary information from multiple viewpoints. Historically, 3D object detection in LiDAR point clouds has relied largely on either the birds-eye view (BEV) or the perspective view, each with intrinsic advantages and disadvantages. The BEV preserves objects' physical dimensions and keeps them naturally separable, which favors shape and localization, but its point distribution is sparse and highly variable in density, hurting detection of distant or small objects. Conversely, the perspective view provides dense observations, but it does not preserve physical object dimensions, since apparent scale varies with range. This paper proposes integrating the two views in a comprehensive end-to-end fusion approach to maximize detection accuracy.

Key contributions include:

  1. Introduction of Dynamic Voxelization: The authors propose a dynamic voxelization (DV) approach that overcomes limitations of existing hard voxelization (HV) methods. Unlike HV, which pre-allocates fixed-size tensors and therefore drops points and voxels once capacities are exceeded, DV assigns every point to a voxel without predefined limits, yielding better data utilization and deterministic, more stable voxel embeddings (a minimal sketch follows this list).
  2. Multi-View Fusion Architecture: The MVF model fuses voxel features from both the BEV and the perspective view at the point level, enriching each point's feature with contextual information from both views. The bi-directional point-voxel relationship established by dynamic voxelization makes this cross-view gathering natural (see the fusion sketch after this list).
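
To make the dynamic-voxelization idea concrete, here is a minimal NumPy sketch of the point-to-voxel assignment it implies: every in-range point is kept and mapped to a voxel, and the resulting index provides both the point-to-voxel and voxel-to-point mappings, with no pre-allocated buffer or capacity-based dropout. The function name, array layout, and parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dynamic_voxelize(points, voxel_size, pc_range):
    """Assign every point to a voxel without a fixed-capacity buffer (sketch).

    points:     (N, 3+) LiDAR points (x, y, z, ...).
    voxel_size: (3,) voxel edge lengths in metres.
    pc_range:   (6,) [x_min, y_min, z_min, x_max, y_max, z_max].
    """
    voxel_size = np.asarray(voxel_size, dtype=np.float32)
    lo = np.asarray(pc_range[:3], dtype=np.float32)
    hi = np.asarray(pc_range[3:], dtype=np.float32)

    # Keep only points inside the range; no point is dropped due to capacity.
    mask = np.all((points[:, :3] >= lo) & (points[:, :3] < hi), axis=1)
    pts = points[mask]

    # Integer voxel coordinates per point.
    coords = np.floor((pts[:, :3] - lo) / voxel_size).astype(np.int64)

    # Occupied voxels plus, for each point, the id of its voxel
    # (the bi-directional point<->voxel mapping used for fusion).
    voxel_coords, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return pts, voxel_coords, point_to_voxel
```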

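Building on that mapping, the following PyTorch sketch illustrates point-level cross-view fusion under the same assumptions: per-view voxel features are obtained by pooling point features with each view's point-to-voxel index, gathered back to the points, and concatenated, so each point carries context from both the BEV and the perspective grouping. The names and the mean-pooling choice are illustrative; the paper's voxel feature encoders are learned.

```python
import torch

def fuse_views(point_feats, bev_ids, persp_ids):
    """Concatenate each point's feature with the pooled feature of the voxel
    it falls into in each view (BEV and perspective). Sketch only.

    point_feats: (N, C) per-point features.
    bev_ids / persp_ids: (N,) voxel id per point in the two views,
        e.g. the `point_to_voxel` output of the sketch above.
    """
    def voxel_mean(ids):
        n_vox = int(ids.max()) + 1
        summed = torch.zeros(n_vox, point_feats.shape[1],
                             dtype=point_feats.dtype).index_add_(0, ids, point_feats)
        counts = torch.bincount(ids, minlength=n_vox).clamp(min=1).unsqueeze(1)
        return summed / counts                        # mean feature per occupied voxel

    bev_ctx = voxel_mean(bev_ids)[bev_ids]            # voxel -> point gather (BEV)
    persp_ctx = voxel_mean(persp_ids)[persp_ids]      # voxel -> point gather (perspective)
    return torch.cat([point_feats, bev_ctx, persp_ctx], dim=1)   # (N, 3C)
```
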
Key Findings and Results

The MVF method was evaluated using the Waymo Open Dataset and the KITTI dataset, both prominent standards in autonomous driving research. Results demonstrated that MVF consistently outperforms single-view baselines:

  • Waymo Open Dataset: MVF achieved a notable improvement in average precision (AP) for vehicle and pedestrian detection, with significant accuracy gains observed particularly at longer ranges where single-view methods typically degrade in performance.
  • KITTI Dataset: On the well-established KITTI dataset for 3D car detection, MVF achieved competitive results, showcasing superior performance over baselines and comparable results to existing state-of-the-art methods.

Implications and Future Developments

The proposed MVF framework illustrates significant improvements in 3D object detection accuracy, particularly in scenarios characterized by sparse, long-range LiDAR data. The adoption of dynamic voxelization addresses critical constraints of traditional voxel-based methods, providing more stable and reliable detections. These advances are highly relevant for the autonomous driving domain, where detecting small and distant objects, such as pedestrians and signage, is crucial for safe navigation.

While the current results demonstrate the efficacy of MVF with LiDAR point clouds, further integration with other sensor modalities, such as camera data, may enhance detection performance. Future developments could explore temporal fusion techniques and cross-modal learning to capture dynamic environmental interactions more accurately.

Overall, this paper's contributions represent a promising avenue for the progression of LiDAR-based object detection systems, suggesting broader implications for real-time applications in complex environments.