- The paper introduces a two-stage framework that decomposes multi-instance registration into focused object localization using KPConv, attention, and DBSCAN, followed by robust pairwise matching with dual masking.
- It leverages optimal transport and SVD-based pose estimation to establish dense, reliable correspondences even under clutter and occlusion.
- Experimental results on datasets like ROBI and Scan2CAD highlight its potential in practical applications such as robotic bin picking and augmented reality.
This paper introduces the 3D Focusing-and-Matching Network (3DFMNet) (arXiv:2411.07740) for the challenging task of multi-instance point cloud registration. The core idea is to decompose the complex one-to-many registration problem (aligning one model to multiple instances in a scene) into multiple, more manageable one-to-one pairwise registration problems. This is achieved through a two-stage pipeline: first, focusing on potential object locations, and then matching the model to each localized object proposal.
The motivation behind this approach is that existing methods often struggle with accurately finding correspondences between a single model and many cluttered and occluded instances in a scene. By localizing potential objects first, the subsequent matching step only needs to consider the correspondence between the model and a smaller, focused region containing a single potential object.
Stage 1: 3D Multi-Object Focusing Module
This module aims to identify and localize potential object instances in the scene point cloud P given the model point cloud Q.
- Feature Extraction: Multi-scale features F_p and F_q are extracted from P and Q, respectively, using a KPConv encoder backbone.
- Feature Correlation: Self-attention is applied to F_p and F_q independently, followed by cross-attention between F_p and F_q to embed the relationship between the model and scene into the scene features, producing H_p.
- Object Center Prediction: MLPs predict two outputs for each point in the subsampled scene point cloud:
- An offset vector V_p pointing from the point to its instance center.
- An instance mask score Y_p indicating whether the point belongs to an object (vs. background). A geodesic distance embedding G_p is used in predicting Y_p.
- Center Estimation: Points predicted as belonging to an object (mask score > 0.5) are displaced by their predicted offset vectors (P + V_p). DBSCAN clustering is applied to these displaced points to group them into potential instance clusters. The center of each cluster is computed by averaging the points within it, yielding the predicted object centers S_p.
- Object Proposal Generation: For each predicted object center, a spherical region (proposal) is generated using a ball query operation. The radius of the sphere is set to 1.2 times the radius of the model point cloud Q. This proposal O contains points from the scene point cloud potentially belonging to an instance.
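The center-estimation step above can be sketched in a few lines. This is a minimal NumPy/scikit-learn illustration, not the paper's implementation; the 0.5 mask threshold matches the description, while the DBSCAN parameters (`eps`, `min_samples`) and the toy data are illustrative placeholders:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def estimate_centers(points, offsets, mask_scores, eps=0.05, min_samples=5):
    """Shift predicted-foreground points by their offsets (P + V_p), then
    cluster the shifted points with DBSCAN to recover instance centers."""
    fg = mask_scores > 0.5                       # keep points predicted as object
    shifted = points[fg] + offsets[fg]           # displaced points near centers
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(shifted)
    centers = [shifted[labels == k].mean(axis=0)
               for k in set(labels) if k != -1]  # -1 is DBSCAN's noise label
    return np.array(centers)

# Toy example: two instances whose offsets point exactly at their true centers.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
pts = np.concatenate([c + rng.normal(0, 0.3, (50, 3)) for c in true_centers])
offs = np.concatenate([true_centers[i] - pts[i * 50:(i + 1) * 50] for i in range(2)])
scores = np.ones(len(pts))
centers = estimate_centers(pts, offs, scores)
print(len(centers))  # 2 clusters recovered
```

With perfect offsets the shifted points collapse onto the true centers, so DBSCAN trivially finds one cluster per instance; with noisy predictions the displaced points form dense blobs instead, which is exactly the regime DBSCAN is designed for.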
Implementation Considerations for Stage 1:
- KPConv is used for efficient point cloud feature extraction on varying densities.
- Attention mechanisms help capture relationships between the model shape and potential instances in the scene.
- Predicting offsets and masks per point is a common strategy in 3D object detection/segmentation.
- DBSCAN is a density-based clustering algorithm suitable for grouping the displaced points into distinct instance centers.
- The radius for ball query needs to be set appropriately (e.g., based on the model size) to ensure proposals cover the entire potential object instance.
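To make the ball-query point concrete, proposal cropping can be sketched as follows. The 1.2 ratio comes from the paper's description; everything else (function name, toy data) is illustrative:

```python
import numpy as np

def ball_query(scene, center, model, ratio=1.2):
    """Crop a spherical proposal around a predicted center.
    The sphere radius is `ratio` times the model's bounding radius."""
    model_radius = np.linalg.norm(model - model.mean(axis=0), axis=1).max()
    dist = np.linalg.norm(scene - center, axis=1)
    return scene[dist <= ratio * model_radius]

# Toy usage: a unit-radius "model" and a scene with one instance plus clutter.
model = np.array([[1.0, 0, 0], [-1.0, 0, 0], [0, 1.0, 0], [0, -1.0, 0]])
scene = np.array([[0.1, 0.2, 0.0],    # near the center -> kept
                  [1.0, 0.0, 0.0],    # within the 1.2x radius -> kept
                  [5.0, 5.0, 5.0]])   # far-away clutter -> dropped
proposal = ball_query(scene, center=np.zeros(3), model=model)
print(proposal.shape[0])  # 2 points inside the sphere
```

The slack factor (1.2) trades off completeness against clutter: too small and occluded or poorly centered instances get truncated, too large and each proposal drags in neighboring instances that the Stage 2 masks must then suppress.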
Stage 2: 3D Dual-Masking Instance Matching Module
This module takes an object proposal O (generated from Stage 1) and the model point cloud Q and performs pairwise registration to estimate the transformation.
- Feature Extraction: A smaller KPConv encoder extracts features E_o from the object proposal O and E_q from the model Q.
- Feature Correlation: As in Stage 1, self-attention is applied to E_o and E_q, followed by cross-attention between them, yielding enhanced features Z_o for the object proposal that encode its relationship with the model.
- Instance Mask Prediction: An MLP predicts an instance mask Y_o from Z_o (and the geodesic embedding G_o), refining the mask from Stage 1 to better isolate the instance points within the proposal O.
- Overlap Mask Prediction: An MLP predicts an overlap mask Y_op from Z_o, identifying points in the proposal O that correspond to the potentially incomplete object shape (the part that overlaps with the transformed model). This is crucial for handling partial observations due to occlusion. The mask is upsampled to the original resolution of the proposal O.
- Pairwise Matching & Pose Estimation:
- Correspondences are extracted between the masked object proposal (filtered by Y_o and Y_op) and the model Q.
- The paper adopts a matching scheme similar to GeoTransformer [qin2022geometric], using an optimal transport layer and mutual top-k selection to find dense correspondences within local patches.
- Crucially, the instance and overlap masks are used to filter out background noise and non-overlapping points within these patches, leading to more reliable correspondence sets.
- A local-to-global registration method (like SVD on the filtered correspondences) is then used to estimate the rigid transformation (rotation and translation) between the object proposal and the model.
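The SVD step above is the classical weighted Kabsch solution for rigid alignment from correspondences. A minimal NumPy sketch (not the paper's code; correspondence weights would typically come from the assignment matrix):

```python
import numpy as np

def kabsch(src, tgt, weights=None):
    """Closed-form rigid alignment: find R, t minimizing the weighted
    squared error ||R @ src_i + t - tgt_i||^2 via SVD."""
    w = np.ones(len(src)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)            # weighted centroids
    mu_t = (w[:, None] * tgt).sum(axis=0)
    H = (src - mu_s).T @ np.diag(w) @ (tgt - mu_t)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t

# Toy check: recover a known rotation about z and a translation.
theta = np.pi / 6
R_gt = np.array([[np.cos(theta), -np.sin(theta), 0],
                 [np.sin(theta),  np.cos(theta), 0],
                 [0, 0, 1]])
t_gt = np.array([0.5, -1.0, 2.0])
src = np.random.default_rng(1).normal(size=(20, 3))
tgt = src @ R_gt.T + t_gt
R, t = kabsch(src, tgt)
print(np.allclose(R, R_gt), np.allclose(t, t_gt))  # True True
```

Because the solution is closed-form and differentiable almost everywhere, it slots naturally into the local-to-global scheme: per-patch transforms are estimated this way, then the one with the most inlier support is refined on all surviving correspondences.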
Implementation Considerations for Stage 2:
- The dual masking (instance and overlap) is a key novelty for handling cluttered scenes and partial observations.
- Using optimal transport for correspondence allows for differentiable end-to-end training of the matching part.
- Filtering correspondences based on learned masks is critical for robustness against outliers and missing data.
- Pose estimation from correspondences typically involves robust methods like RANSAC or directly using SVD on inliers.
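The optimal-transport idea can be sketched concretely: Sinkhorn normalization turns a feature-similarity matrix into an approximately doubly stochastic soft assignment, from which mutual best matches are read off. This is a simplified illustration; real matchers (GeoTransformer included) add a learnable "dustbin" row/column so unmatched points have somewhere to go, which is omitted here:

```python
import numpy as np

def sinkhorn(scores, n_iters=50):
    """Log-space Sinkhorn: alternately normalize rows and columns so the
    assignment matrix becomes (approximately) doubly stochastic."""
    log_p = scores.astype(float).copy()
    for _ in range(n_iters):
        log_p -= np.logaddexp.reduce(log_p, axis=1, keepdims=True)  # rows -> 1
        log_p -= np.logaddexp.reduce(log_p, axis=0, keepdims=True)  # cols -> 1
    return np.exp(log_p)

def mutual_matches(P):
    """Keep pairs (i, j) that are each other's best match (mutual top-1)."""
    row_best = P.argmax(axis=1)
    col_best = P.argmax(axis=0)
    return [(i, j) for i, j in enumerate(row_best) if col_best[j] == i]

# Toy similarity matrix with an obvious diagonal assignment.
scores = np.array([[5.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0],
                   [0.0, 0.0, 5.0]])
P = sinkhorn(scores)
print(mutual_matches(P))  # [(0, 0), (1, 1), (2, 2)]
```

The normalization is what makes the matching differentiable end-to-end: gradients flow through the soft assignment P, while the hard mutual top-k selection is only applied at inference time.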
Loss Functions
The model is trained with a combination of losses:
- Focusing Loss: Encourages learning discriminative features (L_circle), accurate offset prediction (L_reg, the L1 distance to the ground-truth center displacement), and correct offset direction (L_dir, a negative cosine similarity).
- Matching Loss: Includes L_circle for features, a negative log-likelihood term (L_nll) for the learned correspondence assignment matrix, and a binary cross-entropy + Dice loss (L_mask) for both the instance and overlap masks.
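The offset-related terms of the focusing loss can be written out directly from their definitions. A hedged NumPy sketch of L_reg and L_dir only (the circle, NLL, and mask terms, and the weights combining all terms, are omitted):

```python
import numpy as np

def offset_losses(pred_offsets, points, gt_centers):
    """L_reg: mean L1 distance between predicted and ground-truth offsets.
    L_dir: mean negative cosine similarity of the offset directions."""
    gt_offsets = gt_centers - points                      # true center displacement
    l_reg = np.abs(pred_offsets - gt_offsets).sum(axis=1).mean()
    pred_dir = pred_offsets / (np.linalg.norm(pred_offsets, axis=1, keepdims=True) + 1e-8)
    gt_dir = gt_offsets / (np.linalg.norm(gt_offsets, axis=1, keepdims=True) + 1e-8)
    l_dir = -(pred_dir * gt_dir).sum(axis=1).mean()       # cos=1 -> loss -1
    return l_reg, l_dir

# Perfect predictions give zero L1 loss and a direction loss of -1.
pts = np.array([[1.0, 0, 0], [0, 2.0, 0]])
centers = np.array([[0.0, 0, 0], [0.0, 0, 0]])
l_reg, l_dir = offset_losses(centers - pts, pts, centers)
print(round(l_reg, 6), round(l_dir, 6))  # 0.0 -1.0
```

Separating magnitude (L_reg) from direction (L_dir) is a common trick in center-voting pipelines: far-away boundary points can then still vote toward the right center even when their offset length is hard to regress exactly.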
Practical Applications
The proposed 3DFMNet is directly applicable to scenarios requiring precise alignment of known 3D models to multiple instances in a noisy, cluttered scene:
- Robotic Bin Picking: A robot needs to identify and grasp specific objects from a bin. This requires accurately estimating the pose of each instance of the target object within the bin's point cloud. ROBI dataset experiments demonstrate strong performance in such industrial settings.
- Augmented Reality/Mixed Reality: Overlaying digital models onto a real-world scene requires registering the CAD model(s) to their physical counterparts, which might appear multiple times.
- Industrial Inspection: Aligning CAD models of parts to scan data to check for defects or deviations, even when multiple parts are present and potentially occluded.
- Scene Understanding: Locating and posing known objects within a scanned environment (like in Scan2CAD dataset).
Implementation Aspects and Trade-offs
- Computational Requirements: The two-stage approach adds sequential processing time. Stage 1 processes the whole scene to find proposals, and Stage 2 processes each proposal independently. While slightly slower than some one-stage methods, it is faster than other two-stage approaches and achieves higher accuracy, especially in challenging scenarios. The use of KPConv and attention might require moderate GPU memory. The number of sampled points (e.g., 4096) impacts processing speed and accuracy (as shown in ablation studies).
- Data: Requires paired data of model point clouds and scenes containing multiple instances with ground truth poses. Datasets like Scan2CAD and ROBI are used for training and evaluation. Ground truth centers are used as supervision during training of the focusing module.
- Hyperparameters: Parameters like ball query radius (based on model size), clustering parameters (DBSCAN epsilon, min_samples), and loss weights need tuning based on the dataset characteristics.
- Limitations: The performance of the matching stage is dependent on the quality of proposals from the focusing stage. If an instance is missed or poorly localized in Stage 1, it cannot be registered in Stage 2. The paper analyzes this by showing the upper bound performance when using ground truth centers, indicating room for improvement in the focusing module's recall for highly challenging scenes (like ROBI).
In summary, 3DFMNet provides a robust and effective solution for multi-instance point cloud registration by decoupling the problem into object localization and subsequent pairwise matching with dual masking. This approach achieves state-of-the-art results on challenging benchmarks and offers a practical framework for applications involving cluttered and occluded 3D environments.