Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes
Published 6 Apr 2024 in cs.CV | (2404.04557v1)
Abstract: Multi-instance point cloud registration estimates the poses of multiple instances of a model point cloud in a scene point cloud. Extracting accurate point correspondences is central to the problem. Existing approaches usually treat the scene point cloud as a whole, overlooking the separation of instances. As a result, point features are easily polluted by points from the background or from other instances, yielding correspondences that are oblivious to instance boundaries, especially in cluttered scenes. In this work, we propose MIRETR, Multi-Instance REgistration TRansformer, a coarse-to-fine approach to the extraction of instance-aware correspondences. At the coarse level, it jointly learns instance-aware superpoint features and predicts per-instance masks. With instance masks, the influence from outside the instance being considered is minimized, so that highly reliable superpoint correspondences can be extracted. The superpoint correspondences are then extended to instance candidates at the fine level according to the instance masks. Finally, an efficient candidate selection and refinement algorithm is devised to obtain the final registrations. Extensive experiments on three public benchmarks demonstrate the efficacy of our approach. In particular, MIRETR outperforms the state of the art by 16.6 points in F1 score on the challenging ROBI benchmark. Code and models are available at https://github.com/zhiyuanYU134/MIRETR.
The paper presents MIRETR, a novel transformer-based method that learns instance-aware correspondences for robust multi-instance point cloud registration.
It employs a coarse-to-fine process with geometric encoding, cross-attention, and instance mask refinement to overcome occlusion and background contamination.
Experimental results on Scan2CAD, ROBI, and ShapeNet show significant improvements in accuracy and efficiency compared to traditional multi-model fitting approaches.
This paper introduces MIRETR (Multi-Instance REgistration TRansformer), a novel coarse-to-fine method for registering multiple instances of a model point cloud within a larger, cluttered scene point cloud. The core challenge addressed is the accurate extraction of point correspondences that are aware of individual object instances, which is crucial for reliable pose estimation, especially when instances are occluded or closely packed.
Traditional methods often treat the scene as a whole, leading to feature contamination from background or other instances, making it difficult to register heavily occluded or geometrically deficient instances. MIRETR tackles this by learning instance-aware correspondences.
Method Overview
MIRETR operates in a coarse-to-fine manner:
1. Coarse Level - Instance-aware Geometric Transformer: This module establishes correspondences between downsampled superpoints (sparse representations of local point cloud patches). It iteratively refines superpoint features and predicts instance masks to ensure features are learned primarily from within the relevant instance.
2. Fine Level - Instance Candidate Generation: Superpoint correspondences from the coarse level are expanded into full instance candidates using the predicted instance masks. Dense point correspondences are then extracted within these candidates to estimate an initial pose for each.
3. Candidate Selection and Refinement: A non-maximum suppression (NMS)-like algorithm filters out duplicate instance candidates and refines the poses of the remaining ones to produce the final registration results. This bypasses the need for traditional multi-model fitting algorithms.
1. Instance-aware Geometric Transformer (Coarse Level)
The key innovation at this stage is making superpoint features instance-aware. Instead of global attention or plain local attention (both of which can mix features from different instances), MIRETR restricts intra-point-cloud context encoding to each instance's scope.
This module consists of three main blocks, iterated $N_t$ times (e.g., $N_t = 3$):
Geometric Encoding Block: This block encodes intra-instance geometric context. For a scene superpoint $\hat{q}_i \in \hat{Q}$, its features are updated by attending to its $k$-nearest neighbors $\mathcal{N}_{\hat{Q}}$. Crucially, the attention scores $e_{i,j}$ are modified by an instance mask term $m_{i,j}^Q$:

$$e_{i,j} = \frac{(x_i W^Q)(x_j W^K + r_{i,j} W^R)^\top}{\sqrt{d_t}} + m_{i,j}^Q$$

Here, $m_{i,j}^Q = 0$ if $\hat{q}_i$ and $\hat{q}_j$ belong to the same instance, and $-\infty$ otherwise. $M^Q$ is initialized to all zeros and refined iteratively, and $r_{i,j}$ is a geometric structure embedding. For the model point cloud $\hat{P}$, the mask term is ignored.
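To make the masking concrete, here is a minimal numpy sketch of mask-modified attention weights (toy shapes, no learned projections; all names are illustrative assumptions, not the authors' code). Adding $-\infty$ before the softmax drives the weight of any neighbor outside the instance to zero:

```python
import numpy as np

def masked_attention_scores(feats, geo_emb, mask, d=4):
    """Toy sketch of the mask-modified attention in the geometric
    encoding block: scores between superpoint i and neighbor j get
    the mask term added (0 = same instance, -inf = different), so
    softmax weights for other instances vanish."""
    # feats: (n, d) superpoint features; geo_emb: (n, n, d) pairwise
    # geometric structure embeddings r_ij; mask: (n, n) of 0 / -inf.
    q = feats                          # queries (projections omitted)
    k = feats[None, :, :] + geo_emb    # keys biased by r_ij, (n, n, d)
    scores = np.einsum('id,ijd->ij', q, k) / np.sqrt(d) + mask
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

feats = np.random.rand(3, 4)
geo = np.zeros((3, 3, 4))
mask = np.zeros((3, 3))
mask[0, 2] = mask[2, 0] = -np.inf  # points 0 and 2: different instances
attn = masked_attention_scores(feats, geo, mask)
```

After the softmax, `attn[0, 2]` is exactly zero, which is the sense in which features stop being "polluted" by other instances.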
Cross-Attention Block: This block models inter-point-cloud geometric consistency, allowing features from one point cloud to be aware of the structure in the other. This is similar to standard transformer cross-attention mechanisms.
Instance Masking Block: This block refines the instance masks $M^Q$. It uses a geodesic self-attention mechanism (replacing the geometric structure embedding with a geodesic distance embedding $g_{i,j}$) on the scene superpoint features $Y^{\hat{Q}}$. An MLP then predicts a confidence score $u_{i,j}$ that neighbor $\hat{q}_j$ belongs to the same instance as $\hat{q}_i$, based on the feature discrepancy and the geodesic distance:

$$u_{i,j} = \sigma\left(\mathrm{MLP}\left(\left[y_i^{\hat{Q}} - y_j^{\hat{Q}};\; g_{i,j}\right]\right)\right)$$

The confidence matrix $U$ is thresholded (at threshold $T$) to update $M^Q$.
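The confidence prediction and thresholding step can be sketched as follows, with a toy two-layer MLP on the concatenated feature difference and geodesic embedding (the weights `w1`, `w2` and all shapes are illustrative assumptions, not the learned parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_instance_mask(y, g, w1, w2, thresh=0.5):
    """Hedged sketch of the instance masking block's prediction head:
    a tiny MLP scores, for each (i, j) neighbor pair, whether q_j
    belongs to q_i's instance from the feature difference and the
    geodesic-distance embedding g_ij; thresholding yields the mask."""
    diff = y[:, None, :] - y[None, :, :]        # (n, n, d) discrepancy
    inp = np.concatenate([diff, g], axis=-1)    # (n, n, d + dg)
    hidden = np.maximum(inp @ w1, 0.0)          # ReLU layer
    u = sigmoid(hidden @ w2).squeeze(-1)        # confidence matrix U
    mask = np.where(u > thresh, 0.0, -np.inf)   # same instance -> 0
    return u, mask

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 3))       # superpoint features
g = rng.normal(size=(4, 4, 2))    # geodesic distance embeddings
w1 = rng.normal(size=(5, 6))      # (d + dg) -> hidden
w2 = rng.normal(size=(6, 1))
u, mask = predict_instance_mask(y, g, w1, w2)
```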
After $N_t$ iterations, the top $N_c$ superpoint pairs with the highest cosine feature similarity are selected as superpoint correspondences $\mathcal{C}_{sp}$.
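This final coarse matching step reduces to ranking all model/scene superpoint pairs by cosine similarity and keeping the top few, roughly:

```python
import numpy as np

def top_superpoint_matches(feat_p, feat_q, num_corr=3):
    """Sketch of coarse matching: rank all (model, scene) superpoint
    pairs by cosine feature similarity and keep the top num_corr.
    Returns index pairs (k_p, k_q). Illustrative only."""
    fp = feat_p / np.linalg.norm(feat_p, axis=1, keepdims=True)
    fq = feat_q / np.linalg.norm(feat_q, axis=1, keepdims=True)
    sim = fp @ fq.T                               # cosine similarity
    flat = np.argsort(sim, axis=None)[::-1][:num_corr]
    return [np.unravel_index(i, sim.shape) for i in flat]

fp = np.array([[1.0, 0.0], [0.0, 1.0]])
fq = np.array([[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
matches = top_superpoint_matches(fp, fq, num_corr=2)
```

Here `matches` ranks the exactly aligned pair first, then the nearly aligned one.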
2. Instance Candidate Generation (Fine Level)
For each superpoint correspondence $(\hat{p}_k, \hat{q}_k) \in \mathcal{C}_{sp}$:
- Collect neighboring superpoints $\mathcal{N}_{\hat{P}}$ for $\hat{p}_k$ and $\mathcal{N}_{\hat{Q}}$ for $\hat{q}_k$.
- Filter $\mathcal{N}_{\hat{Q}}$ using the refined instance mask $M_k^Q$ corresponding to $\hat{q}_k$ to remove superpoints from different instances.
- The points within the local patches of all superpoints in $\mathcal{N}_{\hat{P}}$ and the filtered $\mathcal{N}_{\hat{Q}}$ form an instance candidate $I_k$.
- Extract dense point correspondences $C_k$ within $I_k$ using an optimal transport layer and mutual top-$k$ selection.
- Estimate a pose $T_k = \{R_k, t_k\}$ for $I_k$ by solving
$$\min_{R_k, t_k} \sum_{(p_i, q_i) \in C_k} \|R_k p_i + t_k - q_i\|^2$$
using weighted SVD.
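The per-candidate pose step is the classical weighted Kabsch/SVD solution; a self-contained numpy version (uniform weights by default) might look like:

```python
import numpy as np

def weighted_svd_pose(p, q, w=None):
    """Weighted Kabsch/SVD solver for min sum_i w_i ||R p_i + t - q_i||^2,
    the closed-form pose step applied to each instance candidate."""
    if w is None:
        w = np.ones(len(p))
    w = w / w.sum()
    cp = (w[:, None] * p).sum(0)               # weighted centroid of p
    cq = (w[:, None] * q).sum(0)               # weighted centroid of q
    H = (p - cp).T @ (w[:, None] * (q - cq))   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```

Feeding it correspondences generated by a known rigid transform recovers that transform up to numerical precision.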
This approach, by leveraging instance masks, allows for the extraction of correspondences covering a larger portion of an instance, leading to more accurate initial pose estimates.
3. Candidate Selection and Refinement
To handle duplicated instance candidates:
Sort candidates by their inlier ratio on the global point correspondences $C = \bigcup_k C_k$. The inlier ratio for candidate $I_k$ with pose $T_k$ is

$$n_k = \frac{1}{|C|} \sum_{(p, q) \in C} \left[\, \|R_k p + t_k - q\| < \tau_2 \,\right]$$

where $\tau_2$ is an acceptance radius and $[\cdot]$ is 1 when the condition holds and 0 otherwise.
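Computed directly, this score is just the fraction of global correspondences that the pose brings within the acceptance radius:

```python
import numpy as np

def inlier_ratio(corr_p, corr_q, R, t, tau=0.05):
    """Fraction of global correspondences (p, q) that pose (R, t)
    brings within the acceptance radius tau, i.e. the candidate
    ranking score n_k."""
    residuals = np.linalg.norm((corr_p @ R.T + t) - corr_q, axis=1)
    return float((residuals < tau).mean())

p = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
q = p.copy()
q[2] += 1.0  # one correspondence is an outlier for the identity pose
r = inlier_ratio(p, q, np.eye(3), np.zeros(3))
```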
Select the candidate with the highest $n_k$ as the anchor.
Merge all remaining candidates similar to the anchor. The similarity $s_{i,j}$ between candidates $I_i$ and $I_j$ is based on the Average Distance (ADD) between their poses $T_i, T_j$:

$$s_{i,j} = 1 - \frac{\mathrm{ADD}(T_i, T_j)}{r_{diam}}$$

where $r_{diam}$ is the diameter of the model point cloud $P$. Candidates are similar if $s_{i,j} > T_s$ (similarity threshold).
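A direct implementation of this ADD-based similarity (assuming `model_pts` samples the model point cloud and `diam` is its diameter) is straightforward:

```python
import numpy as np

def add_similarity(R1, t1, R2, t2, model_pts, diam):
    """ADD-based similarity between two candidate poses: average
    distance between the model transformed by each pose, normalized
    by the model diameter. A value near 1 marks near-duplicate poses."""
    d = np.linalg.norm((model_pts @ R1.T + t1) - (model_pts @ R2.T + t2),
                       axis=1)
    return 1.0 - d.mean() / diam

pts = np.random.rand(50, 3)
s_same = add_similarity(np.eye(3), np.zeros(3), np.eye(3), np.zeros(3),
                        pts, diam=1.0)
```

Identical poses give a similarity of exactly 1.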
When merging, combine point correspondences and recompute the pose.
The preserved pose is iteratively refined using surviving inliers.
Remove the anchor and merged candidates; repeat until no candidates remain. Registrations with too few inliers are discarded.
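The selection steps above can be sketched as a greedy NMS-style loop; this simplified version only suppresses similar candidates and omits the correspondence merging and iterative pose refinement that MIRETR also performs:

```python
import numpy as np

def select_candidates(poses, scores, sim_fn, sim_thresh=0.9, min_score=0.05):
    """Greedy NMS-style selection over instance candidates: repeatedly
    take the highest-scoring pose as an anchor, suppress candidates
    whose similarity to it exceeds sim_thresh, and continue until none
    remain. sim_fn(i, j) plays the role of s_ij; candidates below
    min_score are discarded as having too few inliers."""
    order = list(np.argsort(scores)[::-1])   # high score first
    kept = []
    while order:
        anchor = order.pop(0)
        if scores[anchor] < min_score:
            break                            # too few inliers: discard rest
        kept.append(poses[anchor])
        order = [j for j in order if sim_fn(anchor, j) <= sim_thresh]
    return kept

sim = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
scores = np.array([0.8, 0.6, 0.5])
kept = select_candidates(["A", "B", "C"], scores, lambda i, j: sim[i, j])
```

With the toy similarity matrix above, candidate "B" is suppressed as a duplicate of "A", leaving two registrations.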
This method avoids complex multi-model fitting, offering efficiency and accuracy.
Loss Functions
MIRETR is trained using three losses:
$L_{circle}$: an overlap-aware circle loss for supervising superpoint features.
$L_{nll}$: a negative log-likelihood loss for supervising dense point matching (within instance candidates).
$L_{mask}$: a mask prediction loss (BCE + Dice loss) for supervising instance mask prediction.
The total loss is $L = L_{circle} + L_{nll} + L_{mask}$.
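As an illustration of the mask term, here is a minimal BCE + Dice combination on flat confidence/label arrays (a simplified form under my own assumptions, not the paper's exact weighting):

```python
import numpy as np

def bce_dice_mask_loss(pred, target, eps=1e-7):
    """Sketch of a mask loss of the form L_mask = BCE + Dice on
    predicted instance-mask confidences; the circle and NLL terms
    would be added on top for the total loss. pred and target are
    flat arrays with values in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)
    bce = -(target * np.log(pred)
            + (1.0 - target) * np.log(1.0 - pred)).mean()
    inter = (pred * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    return bce + dice
```

A perfect prediction drives both terms to (numerically) zero, while an inverted prediction is heavily penalized by the BCE term.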
Implementation Details
Backbone: KPConv-FPN [47] is used for multi-level feature extraction. Input point clouds are voxel-grid filtered (e.g., 2.5cm for Scan2CAD, 0.15cm for ROBI). A 4-stage backbone is used.
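Voxel-grid filtering itself is simple; a minimal centroid-per-voxel version in numpy follows (libraries such as Open3D provide optimized equivalents):

```python
import numpy as np

def voxel_downsample(points, voxel):
    """Minimal voxel-grid filter of the kind used to preprocess the
    input clouds: keep one centroid per occupied voxel of edge
    length `voxel`."""
    keys = np.floor(points / voxel).astype(np.int64)   # voxel indices
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    out = np.zeros((inv.max() + 1, points.shape[1]))
    counts = np.bincount(inv).astype(float)
    for d in range(points.shape[1]):                   # per-voxel mean
        out[:, d] = np.bincount(inv, weights=points[:, d]) / counts
    return out

pts = np.array([[0.01, 0.01, 0.0],
                [0.02, 0.02, 0.0],
                [1.0, 1.0, 1.0]])
down = voxel_downsample(pts, 0.05)
```

The first two points share a 5 cm voxel and collapse to their centroid; the third survives unchanged.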
Instance-aware Geometric Transformer:
Superpoint features from the backbone (e.g., 1024-dim) are projected to 256-dim.
$N_t = 3$ transformer modules are used.
The geometric structure embedding uses parameters such as $\sigma_d = 0.2\,\mathrm{m}$ (distance) and $\sigma_a = 15^\circ$ (angle).
The geodesic embedding for instance masking uses $\sigma_{geo} = 0.1\,\mathrm{m}$.
The geodesic distance embedding $g_{i,j}$ is computed by applying sinusoidal positional encoding to the geodesic distance $G_{i,j}$ and projecting with $W^G$.
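A sketch of the sinusoidal encoding of geodesic distances follows (the learned projection $W^G$ is omitted, and the exact frequency schedule is an assumption in the style of standard transformer positional encodings):

```python
import numpy as np

def sinusoidal_distance_embedding(dist, dim=8, sigma=0.1):
    """Sketch of the geodesic distance embedding: sinusoidal
    positional encoding applied to dist / sigma. dist is an (n, n)
    matrix of geodesic distances; returns (n, n, dim)."""
    x = dist[..., None] / sigma                       # normalize by sigma_geo
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) * 2.0 / dim))
    angles = x * freqs                                # (n, n, dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

dist = np.array([[0.0, 0.3], [0.3, 0.0]])
emb = sinusoidal_distance_embedding(dist)
```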
Training:
Adam optimizer, initial learning rate $10^{-4}$, momentum 0.98, weight decay $10^{-6}$.
Experimental Results
ROBI: Industrial bin-picking scenes with cluttered, reflective objects, the most challenging of the three benchmarks. MIRETR (full pipeline) achieved an MF (mean F1) of 39.80%, significantly better than GeoTransformer+PointCLM (18.33%) and GeoTransformer+ECC (23.20%). It registered an average of 13.7 instances per scene versus 6.2 for the PointCLM-based pipeline, with especially large gains on low-overlap instances (<50%). The predicted instance masks achieved an mIoU of 69.26%.
ShapeNet: Synthetic CAD models to test generalization to novel categories. MIRETR (full pipeline) achieved MF 94.44%, surpassing GeoTransformer+PointCLM (85.71%) and GeoTransformer+ECC (87.01%).
Effectiveness of Instance Awareness: The core contribution—learning instance-aware features and masks—is crucial for performance, especially in cluttered scenes like ROBI. Ablation studies confirmed that removing instance-aware point matching or the instance masking block significantly degrades performance (e.g., on ROBI, full MIRETR MF 39.80%, w/o instance masking MF 24.80%, w/ global attention MF 10.96%).
Superiority over Multi-Model Fitting: The proposed candidate selection and refinement often outperforms combinations of state-of-the-art correspondence extractors with multi-model fitting algorithms, while also being more efficient in the pose estimation stage.
Robustness to Occlusion and Clutter: Qualitative results (Fig. 5) show MIRETR registering more instances, including heavily occluded ones with geometric deficiencies, compared to GeoTransformer.
Generalization: Strong performance on ShapeNet and ModelNet40 (testing on unseen categories) indicates good generalization capabilities.
Time Efficiency: While correspondence extraction is slightly slower than some baselines, the pose estimation stage is significantly faster (e.g., 0.10s for MIRETR full pipeline vs. 0.22s for GeoTransformer+PointCLM on ROBI).
Practical Applications
Robotic Bin Picking: Directly applicable for identifying and locating multiple identical objects in a bin for grasping. The ROBI benchmark results are particularly relevant here.
Augmented Reality: Accurately placing virtual objects corresponding to real-world instances in a scene.
Scene Understanding: Decomposing a complex 3D scene into known object instances and their poses.
Industrial Automation: Object localization for assembly, inspection, or sorting tasks where multiple instances of a part might be present.
Limitations
Rotation Invariance: KPConv, the backbone used, may struggle with large rotations, potentially limiting performance in scenarios with extreme orientation changes.
Uneven Superpoint Sampling: MIRETR might sample different superpoints on the same instance across different views or iterations, which could affect matching.
Extreme Occlusion/Clutter: The paper notes (Fig. 9) that MIRETR can still fail under extreme clutter and severe occlusion.
Conclusion
MIRETR presents a significant advancement in multi-instance point cloud registration by introducing an effective mechanism for learning instance-aware correspondences. Its coarse-to-fine architecture, particularly the Instance-aware Geometric Transformer that iteratively refines features and instance masks, allows it to robustly handle cluttered scenes and occluded objects. The method's ability to bypass traditional multi-model fitting also contributes to its efficiency and effectiveness. The extensive experiments demonstrate state-of-the-art performance on several challenging benchmarks. Future work aims to extend MIRETR to multi-modal multi-instance registration.