
Weakly Supervised 3D Object Detection from Point Clouds (2007.13970v1)

Published 28 Jul 2020 in cs.CV

Abstract: A crucial task in scene understanding is 3D object detection, which aims to detect and localize the 3D bounding boxes of objects belonging to specific classes. Existing 3D object detectors heavily rely on annotated 3D bounding boxes during training, while these annotations could be expensive to obtain and only accessible in limited scenarios. Weakly supervised learning is a promising approach to reducing the annotation requirement, but existing weakly supervised object detectors are mostly for 2D detection rather than 3D. In this work, we propose VS3D, a framework for weakly supervised 3D object detection from point clouds without using any ground truth 3D bounding box for training. First, we introduce an unsupervised 3D proposal module that generates object proposals by leveraging normalized point cloud densities. Second, we present a cross-modal knowledge distillation strategy, where a convolutional neural network learns to predict the final results from the 3D object proposals by querying a teacher network pretrained on image datasets. Comprehensive experiments on the challenging KITTI dataset demonstrate the superior performance of our VS3D in diverse evaluation settings. The source code and pretrained models are publicly available at https://github.com/Zengyi-Qin/Weakly-Supervised-3D-Object-Detection.

Authors (3)
  1. Zengyi Qin (15 papers)
  2. Jinglu Wang (29 papers)
  3. Yan Lu (179 papers)
Citations (55)

Summary


The paper "Weakly Supervised 3D Object Detection from Point Clouds" introduces a novel framework called VS3D for efficiently detecting 3D objects in point cloud data with minimal reliance on annotated training data. This approach addresses a significant challenge in scene understanding by eliminating the necessity for costly and labor-intensive ground truth 3D bounding box annotations, which are typically required in conventional 3D object detectors.

Framework Overview

VS3D comprises two primary components: an unsupervised 3D object proposal module and a cross-modal knowledge transfer mechanism.

  1. Unsupervised 3D Object Proposal Module (UPM):
    • The UPM leverages normalized point cloud densities to identify regions of a point cloud that are likely to contain objects, without requiring any ground truth annotations. It generates proposals by selecting 3D anchors according to normalized point density, which compensates for the fact that raw LiDAR density falls off with distance from the sensor.
    • Normalizing the density before anchor selection addresses the spatial inconsistency of raw LiDAR data: a distant object yields far fewer raw points than a nearby one, so unnormalized density would bias proposals toward the sensor. A minimal sketch of this idea appears after this list.
  2. Cross-Modal Transfer Learning:
    • This component is a student-teacher framework. The student network, operating on point clouds, learns to mimic the predictions of a teacher network pretrained on image datasets. This lets the student classify proposals into object categories and refine their orientations without requiring annotated 3D bounding boxes.
    • Knowledge is transferred through a convolutional neural network (CNN) that exploits features learned from image datasets, so direct point cloud annotation is unnecessary. A sketch of such a distillation objective follows the UPM example below.
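To make the proposal idea concrete, here is a minimal sketch of density-normalized anchor scoring. The function names, the inverse-square model of expected point density, and the fixed threshold are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def anchor_point_count(points, center, size):
    """Count LiDAR points inside an axis-aligned 3D anchor box."""
    half = size / 2.0
    inside = np.all(np.abs(points - center) <= half, axis=1)
    return inside.sum()

def normalized_density_score(points, center, size, alpha=1.0):
    """Score an anchor by its point count divided by an expected
    density that decays with range to the sensor.

    The inverse-square decay used here is an illustrative assumption:
    it only captures the qualitative fact that LiDAR returns thin out
    with distance.
    """
    count = anchor_point_count(points, center, size)
    dist = np.linalg.norm(center)              # range, sensor at origin
    expected = alpha / max(dist, 1e-6) ** 2    # expected raw density
    return count / expected

def select_proposals(points, anchor_centers, anchor_sizes, threshold):
    """Keep anchors whose normalized density exceeds a threshold."""
    return [
        (center, size)
        for center, size in zip(anchor_centers, anchor_sizes)
        if normalized_density_score(points, center, size) > threshold
    ]
```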
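And a minimal sketch of the cross-modal distillation step, assuming a PyTorch setup in which a hypothetical `student` consumes point-cloud proposal features while a frozen, image-pretrained `teacher` scores the proposals' 2D image crops. The softened-softmax KL objective is standard knowledge distillation (Hinton et al.), not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, T=1.0):
    """KL divergence between the student's predictions on 3D proposals
    and the frozen teacher's soft labels, with temperature T."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, teacher_probs,
                    reduction="batchmean") * T * T

def train_step(student, teacher, proposal_points, proposal_crops, optimizer):
    """One hypothetical training step of the student-teacher framework."""
    with torch.no_grad():                          # teacher stays frozen
        teacher_probs = teacher(proposal_crops).softmax(dim=-1)
    student_logits = student(proposal_points)
    loss = distillation_loss(student_logits, teacher_probs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```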

Experimental Evaluation

The framework was rigorously evaluated on the KITTI dataset, a benchmark known for its challenging and diverse evaluation settings. The results show that VS3D substantially outperforms previous weakly supervised object detection methods:

  • 3D Recall and Average Precision: The paper reports significant gains in 3D recall and improvements of more than 50% in average precision over existing weakly supervised detection methods, highlighting the benefit of combining unsupervised proposal generation with cross-modal knowledge transfer. A simplified version of these metrics is sketched after this list.
  • Different Input Modalities: VS3D was evaluated with monocular images, stereo images, and LiDAR point clouds as input. It performs robustly across these modalities, indicating adaptability and practical utility in diverse real-world scenarios.
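For reference, here is a simplified sketch of recall at a 3D IoU threshold using axis-aligned boxes. The actual KITTI protocol evaluates rotated boxes with class-specific IoU thresholds, so this only illustrates the mechanics of the metric.

```python
import numpy as np

def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax) arrays.
    Simplification: KITTI scores rotated boxes."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter + 1e-9)

def recall_at_iou(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of ground-truth boxes matched by some prediction."""
    hits = sum(
        any(iou_3d_axis_aligned(p, gt) >= thresh for p in pred_boxes)
        for gt in gt_boxes
    )
    return hits / max(len(gt_boxes), 1)
```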

Implications and Future Directions

The proposed approach has substantial implications for robotics, autonomous driving, and augmented reality, where 3D object detection is critical yet often hampered by a shortage of annotated data. By reducing the dependence on extensive 3D labeling, VS3D makes it easier to deploy 3D detection systems in novel environments and offers an efficient alternative as applications of 3D vision grow.

Looking forward, VS3D paves the way for further research in unsupervised and weakly supervised learning paradigms. Promising directions include more sophisticated transfer learning mechanisms that bridge 2D and 3D representations, integration with additional sensory inputs, and self-supervised learning, all of which could improve adaptability and performance in dynamic, complex environments.

In conclusion, VS3D presents a compelling weakly supervised learning framework that successfully leverages point cloud density features and cross-modal knowledge for efficient 3D object detection. The methodology contributes significantly to minimizing annotation requirements in practical applications while maintaining high detection accuracy.