Deep Hough Voting for 3D Object Detection in Point Clouds
The paper Deep Hough Voting for 3D Object Detection in Point Clouds presents VoteNet, an end-to-end deep learning framework for detecting objects directly in point clouds. VoteNet fuses state-of-the-art deep point set networks with the classical Hough voting mechanism to detect and localize objects in 3D space efficiently.
Problem Context and Motivations
Prior methods for 3D object detection have largely adapted 2D detectors, either voxelizing point clouds into regular grids or using 2D image-based detection to seed 3D bounding box proposals. Both strategies have drawbacks: 3D convolutions over voxel grids are computationally expensive and quantization sacrifices fine geometric detail, while cascaded pipelines depend on 2D detectors that can miss objects under occlusion or poor illumination. VoteNet sidesteps these issues by operating directly on raw point clouds, exploiting the geometric information they inherently contain.
Key Contributions and Methodology
VoteNet recasts Hough voting as an end-to-end differentiable architecture built on the following components:
- Point Cloud Feature Learning: A PointNet++ backbone extracts features directly from the raw points, without converting them into a regular grid structure, thereby preserving spatial detail and exploiting the sparsity of the data.
- Voting Mechanism: The core contribution is a voting step that transforms seed points into votes: each seed predicts an offset toward its object's center together with a feature residual, so votes generated from the same object cluster near that object's center. This makes context far easier to aggregate from sparse 3D data than the surface points themselves (a minimal sketch of this module appears after this list).
- Proposal Generation: Inspired by classical Hough voting, a learned module clusters the votes and aggregates each cluster into a 3D bounding box proposal. The module is fully differentiable, so the entire pipeline can be trained jointly (see the grouping sketch after this list).
- End-to-End Optimization: Unlike traditional Hough voting pipelines, which chain separately tuned modules, VoteNet's unified deep learning framework optimizes all components jointly, yielding a detection process that is both efficient and accurate.
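The voting step can be made concrete with a short sketch. The PyTorch module below applies a shared MLP to every seed point to predict a center offset and a feature residual; the name VotingModule, the layer widths, and feat_dim are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class VotingModule(nn.Module):
    """Minimal sketch of vote generation: each seed point predicts an
    offset to its object's center plus a feature residual. Layer widths
    are illustrative, not the paper's exact configuration."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared MLP applied independently to every seed point
        # (1x1 convolutions over the point dimension).
        self.mlp = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
            nn.Conv1d(feat_dim, 3 + feat_dim, 1),  # -> (xyz offset, feature residual)
        )

    def forward(self, seed_xyz, seed_feats):
        # seed_xyz: (B, N, 3) seed positions; seed_feats: (B, C, N) seed features.
        out = self.mlp(seed_feats)                  # (B, 3 + C, N)
        offset = out[:, :3, :].transpose(1, 2)      # (B, N, 3) predicted offsets
        vote_xyz = seed_xyz + offset                # votes cluster near object centers
        vote_feats = seed_feats + out[:, 3:, :]     # residual feature update
        return vote_xyz, vote_feats

# Example: a batch of 2 scenes, 1024 seeds with 256-dim features.
# vote_xyz, vote_feats = VotingModule()(torch.rand(2, 1024, 3), torch.rand(2, 256, 1024))
```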
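Likewise, vote clustering can be sketched as farthest point sampling plus a ball query, mirroring a PointNet++ set abstraction layer. The version below is a simplified, unbatched sketch under stated assumptions: it omits the learned per-vote MLP the paper applies before pooling, starts sampling deterministically at index 0, and uses illustrative values for k and radius:

```python
import torch

def farthest_point_sample(xyz, k):
    # Greedy farthest point sampling (deterministic start at index 0).
    # xyz: (N, 3) -> indices of k well-spread points.
    n = xyz.shape[0]
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    cur = 0
    for i in range(k):
        idx[i] = cur
        d = ((xyz - xyz[cur]) ** 2).sum(dim=1)  # squared distance to newest sample
        dist = torch.minimum(dist, d)           # distance to nearest sample so far
        cur = int(dist.argmax())                # next sample: farthest remaining point
    return idx

def aggregate_votes(vote_xyz, vote_feats, k=256, radius=0.3):
    # vote_xyz: (N, 3), vote_feats: (N, C) -> (k, 3) centers, (k, C) features.
    centers = vote_xyz[farthest_point_sample(vote_xyz, k)]            # cluster seeds
    d2 = ((vote_xyz[None, :, :] - centers[:, None, :]) ** 2).sum(-1)  # (k, N)
    mask = d2 < radius ** 2                                           # ball query
    feats = vote_feats[None].expand(k, -1, -1).clone()                # (k, N, C)
    feats[~mask] = float('-inf')                                      # drop outside votes
    return centers, feats.max(dim=1).values                          # pool per cluster
```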
Experimental Evaluation
The performance of VoteNet was benchmarked on two comprehensive 3D detection datasets, SUN RGB-D and ScanNet. On both, VoteNet achieved state-of-the-art results, notably outperforming prior methods that used both geometric and RGB data while itself relying on geometry alone. On SUN RGB-D, VoteNet reached 57.7% mAP (at a 3D IoU threshold of 0.25), surpassing F-PointNet, the previous best method (which used both RGB and geometric input), by 3.7 points. On ScanNet, VoteNet reported 58.6% mAP, an improvement of 18.4 points over the previous state of the art.
Numerical Results and Analysis
Ablation analysis highlighted the contribution of the proposed voting mechanism:
- The effectiveness of voting was underscored by comparing VoteNet against a baseline model (BoxNet) that predicts 3D boxes directly from seed points without voting. VoteNet showed a notable performance boost, particularly in object categories where the object's geometric center lies far from its surface points.
- VoteNet's use of local vote geometry for more accurate bounding box proposals was assessed by comparing vote aggregation mechanisms: learned aggregation with a PointNet-style module outperformed simpler max or average pooling of the grouped votes (a sketch of this contrast follows).
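As a rough illustration of that contrast, the sketch below transforms each grouped vote, together with its position relative to the cluster center, through a shared MLP before max pooling, whereas the plain baselines pool the raw features directly. The class name LearnedAggregation, the tensor shapes, and the layer widths are assumptions in the spirit of the PointNet design, not the paper's exact module:

```python
import torch
import torch.nn as nn

class LearnedAggregation(nn.Module):
    """Sketch contrasting learned vote aggregation with plain pooling:
    a shared MLP sees each vote's feature and its offset from the
    cluster center, so local vote geometry informs the proposal."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(feat_dim + 3, feat_dim, 1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 1), nn.ReLU(),
        )

    def forward(self, rel_xyz, group_feats):
        # rel_xyz: (B, K, M, 3) vote positions relative to the K cluster centers
        # group_feats: (B, K, M, C) features of the M votes in each cluster
        x = torch.cat([rel_xyz, group_feats], dim=-1)  # append local geometry
        x = x.permute(0, 3, 1, 2)                      # (B, C + 3, K, M)
        x = self.mlp(x)                                # shared per-vote transform
        return x.max(dim=-1).values                    # (B, C, K) pooled proposals

# Plain pooling baselines skip the transform and ignore per-vote geometry:
#   max_pool = group_feats.max(dim=2).values
#   avg_pool = group_feats.mean(dim=2)
```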
Implications and Future Directions
Practically, VoteNet's accuracy and efficiency have significant implications for autonomous navigation, robotic vision, and augmented reality, where real-time, accurate 3D object detection is critical. The model's compact size and rapid inference (reported as roughly 4x smaller and 20x faster than a competing method) make it particularly suitable for deployment in real-time systems.
Theoretically, the synergy of Hough voting with deep learning architectures opens several avenues for future research. Potential directions include extending the VoteNet framework to incorporate color information from RGB-D datasets, exploring more sophisticated feature extraction backbones, and generalizing the approach to other 3D recognition tasks such as 6D pose estimation and 3D instance segmentation.
In conclusion, this paper contributes a robust new framework for 3D object detection grounded in the principles of deep learning and Hough voting, yielding significant advancements in both performance and computational efficiency.