Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds (1906.01140v2)

Published 4 Jun 2019 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. Our method, called 3D-BoNet, follows the simple design philosophy of per-point multilayer perceptrons (MLPs). The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance. It consists of a backbone network followed by two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Moreover, it is remarkably computationally efficient as, unlike existing approaches, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. Extensive experiments show that our approach surpasses existing work on both ScanNet and S3DIS datasets while being approximately 10x more computationally efficient. Comprehensive ablation studies demonstrate the effectiveness of our design.

Citations (311)

Summary

  • The paper presents 3D-BoNet, a novel end-to-end approach that directly regresses unoriented 3D bounding boxes alongside point-level mask prediction.
  • It utilizes a dual-branch architecture with MLPs to process unordered point clouds efficiently, achieving approximately 10 times faster computation than prior methods.
  • Experimental results on ScanNet and S3DIS datasets demonstrate improved average precision and robustness, highlighting its potential in autonomous driving, robotics, and AR.

Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds

The paper presents 3D-BoNet, a new method for instance segmentation on 3D point clouds that directly regresses 3D bounding boxes for object instances while predicting point-level masks. This approach addresses the computational inefficiencies of existing methods, which typically require complex post-processing steps or multiple training stages. The proposed method is single-stage, anchor-free, and end-to-end trainable, and it demonstrates significant improvements in both efficiency and accuracy on benchmark datasets.

Framework Overview

3D-BoNet introduces a streamlined architectural framework in which a backbone network is followed by two parallel branches: bounding box regression and point mask prediction. This design allows direct, computationally efficient processing of point clouds without intermediate procedures such as non-maximum suppression or clustering. Because it employs multilayer perceptrons (MLPs) applied per point, the network is lightweight and well suited to the unordered, non-uniform nature of point clouds.
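The dual-branch layout can be sketched as follows. This is a minimal, illustrative NumPy sketch, not the paper's exact architecture: the layer widths, the maximum instance count `H`, and the plain ReLU MLPs are all assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, layers):
    """Apply a shared MLP (same weights for every point): linear + ReLU per layer."""
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)
    return x

def init(sizes, rng):
    """Random weights for an MLP with the given layer sizes (illustrative only)."""
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

N, H = 1024, 8                       # points per cloud, max instances (hypothetical)
points = rng.standard_normal((N, 3))

# Backbone: per-point features, max-pooled into one global feature vector
backbone = init([3, 64, 128], rng)
point_feats = mlp(points, backbone)             # (N, 128)
global_feat = point_feats.max(axis=0)           # (128,)

# Branch 1: regress H axis-aligned boxes (min_xyz, max_xyz) plus a score each
box_head = init([128, 64, H * 7], rng)
box_out = mlp(global_feat[None, :], box_head).reshape(H, 7)
boxes, scores = box_out[:, :6], box_out[:, 6]   # (H, 6) and (H,)

# Branch 2: per-point mask logits for each of the H candidate instances,
# from fused per-point and global features
fused = np.concatenate([point_feats, np.tile(global_feat, (N, 1))], axis=1)
mask_head = init([256, 64, H], rng)
masks = mlp(fused, mask_head)                   # (N, H)
```

Note how both branches consume the same backbone output: the box branch works from the pooled global feature, while the mask branch keeps per-point features so every point receives a mask value for every candidate instance.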

Core Contributions

  1. Bounding Box Regression: The network predicts axis-aligned (unoriented) bounding boxes without relying on predefined anchors or region proposals. A bounding box association layer matches predicted boxes to ground truth via the Hungarian algorithm, using a cost that combines Euclidean distance, soft Intersection-over-Union (sIoU), and cross-entropy terms, which makes learning efficient.
  2. Point Mask Prediction: Following bounding box prediction, this branch uses box-aware features, combining individual and global point features to predict point-level masks. The process is robust to point cloud sparsity and non-uniformity, and it leverages focal loss to handle the class imbalance between instance and background points.
  3. Computational Efficiency: Compared to state-of-the-art methods, 3D-BoNet is approximately 10 times more computationally efficient, processing thousands of points in milliseconds. This speed does not compromise accuracy, as evidenced by superior performance on ScanNet and S3DIS.
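The association step in point 1 can be illustrated with a toy example. The paper uses the Hungarian algorithm with a cost combining Euclidean distance, sIoU, and cross-entropy; the sketch below finds the same optimal one-to-one matching by brute force over permutations (fine for small instance counts) and uses only the Euclidean-distance term, so the cost function and box encoding here are simplifications.

```python
from itertools import permutations
import math

def assign_boxes(pred, gt):
    """Match each ground-truth box to a distinct predicted box by minimising
    total cost. Brute force over permutations; the paper's Hungarian
    algorithm computes the same optimal assignment in polynomial time."""
    def cost(p, g):
        # Euclidean distance between box parameter vectors; the paper's full
        # cost additionally includes soft-IoU and cross-entropy terms.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))

    best, best_perm = float("inf"), None
    for perm in permutations(range(len(pred)), len(gt)):
        total = sum(cost(pred[j], gt[i]) for i, j in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return list(best_perm)  # best_perm[i] = index of the pred box matched to gt[i]

# Toy example: 3 predicted boxes (min_xyz + max_xyz), 2 ground-truth boxes
pred = [(0, 0, 0, 1, 1, 1), (5, 5, 5, 6, 6, 6), (2, 2, 2, 3, 3, 3)]
gt   = [(2, 2, 2, 3, 3, 3), (0, 0, 0, 1, 1, 1)]
print(assign_boxes(pred, gt))  # → [2, 0]
```

Because the assignment is recomputed per training step, the regression loss is always measured against the best-matching ground-truth box rather than a fixed ordering.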
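The focal loss mentioned in point 2 down-weights easy, confidently classified points so the loss is not dominated by the abundant background. A minimal binary version is sketched below; the `alpha` and `gamma` values are illustrative defaults, not the paper's settings.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.75):
    """Binary focal loss for one point's predicted mask probability p against
    label y (1 = instance, 0 = background). The (1 - p_t)**gamma factor
    shrinks the contribution of easy examples; alpha re-weights the classes."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

# A confidently correct background point contributes far less loss
# than a confidently wrong foreground point.
easy = focal_loss(0.05, 0)   # correct background prediction
hard = focal_loss(0.05, 1)   # missed instance point
print(easy < hard)  # → True
```

Summed over all points and instances, this keeps the sparse instance points influential even when background points vastly outnumber them.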

Experimental Results

3D-BoNet was evaluated against existing methods across multiple object categories in the ScanNet and S3DIS datasets. It consistently achieved higher average precision (AP) and handled diverse categories without favoring specific classes. On ScanNet in particular, the framework attained a mean AP of 48.8%, surpassing competing methods that rely on extensive post-processing or additional data modalities such as RGB inputs.

Implications and Future Work

The findings have practical implications for autonomous driving, robotics, and augmented reality, where real-time processing and high accuracy of 3D scene understanding are critical. The proposed methodology's computational efficiency opens avenues for deploying such systems in resource-constrained environments.

Future research could focus on dynamic weighting strategies within the loss function to optimize the combination of criteria tailored to specific datasets. Additionally, the integration of advanced feature fusion modules could enhance the mutual optimization of semantic and instance segmentation branches, further improving performance across unseen categories and more complex environments.

In summary, 3D-BoNet presents a notable advancement in the task of 3D instance segmentation, offering an effective balance between computational efficiency and segmentation accuracy, suitable for a wide range of practical applications in machine perception.