Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes
Published 6 Apr 2024 in cs.CV | (2404.04557v1)
Abstract: Multi-instance point cloud registration estimates the poses of multiple instances of a model point cloud in a scene point cloud. Extracting accurate point correspondences is central to the problem. Existing approaches usually treat the scene point cloud as a whole, overlooking the separation of instances. As a result, point features are easily polluted by points from the background or from other instances, yielding correspondences that are oblivious to instance boundaries, especially in cluttered scenes. In this work, we propose MIRETR, Multi-Instance REgistration TRansformer, a coarse-to-fine approach to the extraction of instance-aware correspondences. At the coarse level, it jointly learns instance-aware superpoint features and predicts per-instance masks. With instance masks, the influence from outside the instance being considered is minimized, so that highly reliable superpoint correspondences can be extracted. The superpoint correspondences are then extended to instance candidates at the fine level according to the instance masks. Finally, an efficient candidate selection and refinement algorithm is devised to obtain the final registrations. Extensive experiments on three public benchmarks demonstrate the efficacy of our approach. In particular, MIRETR outperforms the state of the art by 16.6 points in F1 score on the challenging ROBI benchmark. Code and models are available at https://github.com/zhiyuanYU134/MIRETR.
The paper presents MIRETR, a novel transformer-based method that learns instance-aware correspondences for robust multi-instance point cloud registration.
It employs a coarse-to-fine process with geometric encoding, cross-attention, and instance mask refinement to overcome occlusion and background contamination.
Experimental results on Scan2CAD, ROBI, and ShapeNet show significant improvements in accuracy and efficiency compared to traditional multi-model fitting approaches.
This paper introduces MIRETR (Multi-Instance REgistration TRansformer), a novel coarse-to-fine method for registering multiple instances of a model point cloud within a larger, cluttered scene point cloud. The core challenge addressed is the accurate extraction of point correspondences that are aware of individual object instances, which is crucial for reliable pose estimation, especially when instances are occluded or closely packed.
Traditional methods often treat the scene as a whole, leading to feature contamination from background or other instances, making it difficult to register heavily occluded or geometrically deficient instances. MIRETR tackles this by learning instance-aware correspondences.
Method Overview
MIRETR operates in a coarse-to-fine manner:
1. Coarse Level - Instance-aware Geometric Transformer: This module establishes correspondences between downsampled superpoints (sparse representations of local point cloud patches). It iteratively refines superpoint features and predicts instance masks to ensure features are learned primarily from within the relevant instance.
2. Fine Level - Instance Candidate Generation: Superpoint correspondences from the coarse level are expanded into full instance candidates using the predicted instance masks. Dense point correspondences are then extracted within these candidates to estimate an initial pose for each.
3. Candidate Selection and Refinement: A non-maximum suppression (NMS)-like algorithm filters out duplicate instance candidates and refines the poses of the remaining ones to produce the final registration results. This bypasses the need for traditional multi-model fitting algorithms.
1. Instance-aware Geometric Transformer (Coarse Level)
The key innovation at this stage is making superpoint features instance-aware. Instead of global attention or plain local attention (both of which can mix features from different instances), MIRETR restricts intra-point-cloud context encoding to each instance's scope.
This module consists of three main blocks, iterated $N_t$ times (e.g., $N_t = 3$):
Geometric Encoding Block: This block encodes intra-instance geometric context. For a scene superpoint $\hat{q}_i \in \hat{Q}$, its features are updated by attending to its $k$-nearest neighbors $\mathcal{N}_{\hat{Q}}$. Crucially, the attention scores $e_{i,j}$ are modified by an instance mask term $m_{i,j}^Q$:

$$e_{i,j} = \frac{(x_i W^Q)(x_j W^K + r_{i,j} W^R)^\top}{\sqrt{d_t}} + m_{i,j}^Q$$

Here, $m_{i,j}^Q = 0$ if $\hat{q}_i$ and $\hat{q}_j$ belong to the same instance, and $-\infty$ otherwise. $M^Q$ is initialized to all zeros and refined iteratively, and $r_{i,j}$ is a geometric structure embedding. For the model point cloud $\hat{P}$, the mask term is ignored.
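To make the masking concrete, here is a minimal numpy sketch of mask-modified attention weights (toy shapes, no learned projections; all names are illustrative assumptions, not the authors' code). Adding $-\infty$ before the softmax drives the weight of any neighbor outside the instance to zero:

```python
import numpy as np

def masked_attention_scores(feats, geo_emb, mask, d=4):
    """Toy sketch of the mask-modified attention in the geometric
    encoding block: scores between superpoint i and neighbor j get
    the mask term added (0 = same instance, -inf = different), so
    softmax weights for other instances vanish."""
    # feats: (n, d) superpoint features; geo_emb: (n, n, d) pairwise
    # geometric structure embeddings r_ij; mask: (n, n) of 0 / -inf.
    q = feats                          # queries (projections omitted)
    k = feats[None, :, :] + geo_emb    # keys biased by r_ij, (n, n, d)
    scores = np.einsum('id,ijd->ij', q, k) / np.sqrt(d) + mask
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

feats = np.random.rand(3, 4)
geo = np.zeros((3, 3, 4))
mask = np.zeros((3, 3))
mask[0, 2] = mask[2, 0] = -np.inf  # points 0 and 2: different instances
attn = masked_attention_scores(feats, geo, mask)
```

After the softmax, `attn[0, 2]` is exactly zero, which is the sense in which features stop being "polluted" by other instances.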
Cross-Attention Block: This block models inter-point-cloud geometric consistency, allowing features from one point cloud to be aware of the structure in the other. This is similar to standard transformer cross-attention mechanisms.
Instance Masking Block: This block refines the instance masks $M^Q$. It uses a geodesic self-attention mechanism (replacing the geometric structure embedding with a geodesic distance embedding $g_{i,j}$) on the scene superpoint features $Y^{\hat{Q}}$. An MLP then predicts a confidence score $u_{i,j}$ that neighbor $\hat{q}_j$ belongs to the same instance as $\hat{q}_i$, based on the feature discrepancy and the geodesic distance:

$$u_{i,j} = \sigma\left(\mathrm{MLP}\left(\left[y_i^{\hat{Q}} - y_j^{\hat{Q}};\; g_{i,j}\right]\right)\right)$$

The confidence matrix $U$ is thresholded (at threshold $T$) to update $M^Q$.
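The confidence prediction and thresholding step can be sketched as follows, with a toy two-layer MLP on the concatenated feature difference and geodesic embedding (the weights `w1`, `w2` and all shapes are illustrative assumptions, not the learned parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_instance_mask(y, g, w1, w2, thresh=0.5):
    """Hedged sketch of the instance masking block's prediction head:
    a tiny MLP scores, for each (i, j) neighbor pair, whether q_j
    belongs to q_i's instance from the feature difference and the
    geodesic-distance embedding g_ij; thresholding yields the mask."""
    diff = y[:, None, :] - y[None, :, :]        # (n, n, d) discrepancy
    inp = np.concatenate([diff, g], axis=-1)    # (n, n, d + dg)
    hidden = np.maximum(inp @ w1, 0.0)          # ReLU layer
    u = sigmoid(hidden @ w2).squeeze(-1)        # confidence matrix U
    mask = np.where(u > thresh, 0.0, -np.inf)   # same instance -> 0
    return u, mask

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 3))       # superpoint features
g = rng.normal(size=(4, 4, 2))    # geodesic distance embeddings
w1 = rng.normal(size=(5, 6))      # (d + dg) -> hidden
w2 = rng.normal(size=(6, 1))
u, mask = predict_instance_mask(y, g, w1, w2)
```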
After $N_t$ iterations, the top $N_c$ superpoint pairs with the highest cosine feature similarity are selected as superpoint correspondences $\mathcal{C}_{sp}$.
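This final coarse matching step reduces to ranking all model/scene superpoint pairs by cosine similarity and keeping the top few, roughly:

```python
import numpy as np

def top_superpoint_matches(feat_p, feat_q, num_corr=3):
    """Sketch of coarse matching: rank all (model, scene) superpoint
    pairs by cosine feature similarity and keep the top num_corr.
    Returns index pairs (k_p, k_q). Illustrative only."""
    fp = feat_p / np.linalg.norm(feat_p, axis=1, keepdims=True)
    fq = feat_q / np.linalg.norm(feat_q, axis=1, keepdims=True)
    sim = fp @ fq.T                               # cosine similarity
    flat = np.argsort(sim, axis=None)[::-1][:num_corr]
    return [np.unravel_index(i, sim.shape) for i in flat]

fp = np.array([[1.0, 0.0], [0.0, 1.0]])
fq = np.array([[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
matches = top_superpoint_matches(fp, fq, num_corr=2)
```

Here `matches` ranks the exactly aligned pair first, then the nearly aligned one.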
2. Instance Candidate Generation (Fine Level)
For each superpoint correspondence $(\hat{p}_k, \hat{q}_k) \in \mathcal{C}_{sp}$:
- Collect neighboring superpoints $\mathcal{N}_{\hat{P}}$ for $\hat{p}_k$ and $\mathcal{N}_{\hat{Q}}$ for $\hat{q}_k$.
- Filter $\mathcal{N}_{\hat{Q}}$ using the refined instance mask $M_k^Q$ corresponding to $\hat{q}_k$ to remove superpoints from different instances.
- The points within the local patches of all superpoints in $\mathcal{N}_{\hat{P}}$ and the filtered $\mathcal{N}_{\hat{Q}}$ form an instance candidate $I_k$.
- Extract dense point correspondences $C_k$ within $I_k$ using an optimal transport layer and mutual top-$k$ selection.
- Estimate a pose $T_k = \{R_k, t_k\}$ for $I_k$ by solving
$$\min_{R_k, t_k} \sum_{(p_i, q_i) \in C_k} \|R_k p_i + t_k - q_i\|^2$$
using weighted SVD.
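The per-candidate pose step is the classical weighted Kabsch/SVD solution; a self-contained numpy version (uniform weights by default) might look like:

```python
import numpy as np

def weighted_svd_pose(p, q, w=None):
    """Weighted Kabsch/SVD solver for min sum_i w_i ||R p_i + t - q_i||^2,
    the closed-form pose step applied to each instance candidate."""
    if w is None:
        w = np.ones(len(p))
    w = w / w.sum()
    cp = (w[:, None] * p).sum(0)               # weighted centroid of p
    cq = (w[:, None] * q).sum(0)               # weighted centroid of q
    H = (p - cp).T @ (w[:, None] * (q - cq))   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```

Feeding it correspondences generated by a known rigid transform recovers that transform up to numerical precision.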
This approach, by leveraging instance masks, allows for the extraction of correspondences covering a larger portion of an instance, leading to more accurate initial pose estimates.
3. Candidate Selection and Refinement
To handle duplicated instance candidates:
Sort candidates by their inlier ratio on the global point correspondences $C = \bigcup_k C_k$. The inlier ratio for candidate $I_k$ with pose $T_k$ is

$$n_k = \frac{1}{|C|} \sum_{(p, q) \in C} \left[\, \|R_k p + t_k - q\| < \tau_2 \,\right]$$

where $\tau_2$ is an acceptance radius and $[\cdot]$ is 1 when the condition holds and 0 otherwise.
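Computed directly, this score is just the fraction of global correspondences that the pose brings within the acceptance radius:

```python
import numpy as np

def inlier_ratio(corr_p, corr_q, R, t, tau=0.05):
    """Fraction of global correspondences (p, q) that pose (R, t)
    brings within the acceptance radius tau, i.e. the candidate
    ranking score n_k."""
    residuals = np.linalg.norm((corr_p @ R.T + t) - corr_q, axis=1)
    return float((residuals < tau).mean())

p = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
q = p.copy()
q[2] += 1.0  # one correspondence is an outlier for the identity pose
r = inlier_ratio(p, q, np.eye(3), np.zeros(3))
```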
Select the candidate with the highest $n_k$ as the anchor.
Merge all remaining candidates similar to the anchor. The similarity $s_{i,j}$ between candidates $I_i$ and $I_j$ is based on the Average Distance (ADD) between their poses $T_i, T_j$:

$$s_{i,j} = 1 - \frac{\mathrm{ADD}(T_i, T_j)}{r_{diam}}$$

where $r_{diam}$ is the diameter of the model point cloud $P$. Candidates are similar if $s_{i,j} > T_s$ (similarity threshold).
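A direct implementation of this ADD-based similarity (assuming `model_pts` samples the model point cloud and `diam` is its diameter) is straightforward:

```python
import numpy as np

def add_similarity(R1, t1, R2, t2, model_pts, diam):
    """ADD-based similarity between two candidate poses: average
    distance between the model transformed by each pose, normalized
    by the model diameter. A value near 1 marks near-duplicate poses."""
    d = np.linalg.norm((model_pts @ R1.T + t1) - (model_pts @ R2.T + t2),
                       axis=1)
    return 1.0 - d.mean() / diam

pts = np.random.rand(50, 3)
s_same = add_similarity(np.eye(3), np.zeros(3), np.eye(3), np.zeros(3),
                        pts, diam=1.0)
```

Identical poses give a similarity of exactly 1.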
When merging, combine point correspondences and recompute the pose.
The preserved pose is iteratively refined using surviving inliers.
Remove the anchor and merged candidates; repeat until no candidates remain. Registrations with too few inliers are discarded.
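The selection steps above can be sketched as a greedy NMS-style loop; this simplified version only suppresses similar candidates and omits the correspondence merging and iterative pose refinement that MIRETR also performs:

```python
import numpy as np

def select_candidates(poses, scores, sim_fn, sim_thresh=0.9, min_score=0.05):
    """Greedy NMS-style selection over instance candidates: repeatedly
    take the highest-scoring pose as an anchor, suppress candidates
    whose similarity to it exceeds sim_thresh, and continue until none
    remain. sim_fn(i, j) plays the role of s_ij; candidates below
    min_score are discarded as having too few inliers."""
    order = list(np.argsort(scores)[::-1])   # high score first
    kept = []
    while order:
        anchor = order.pop(0)
        if scores[anchor] < min_score:
            break                            # too few inliers: discard rest
        kept.append(poses[anchor])
        order = [j for j in order if sim_fn(anchor, j) <= sim_thresh]
    return kept

sim = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
scores = np.array([0.8, 0.6, 0.5])
kept = select_candidates(["A", "B", "C"], scores, lambda i, j: sim[i, j])
```

With the toy similarity matrix above, candidate "B" is suppressed as a duplicate of "A", leaving two registrations.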
This method avoids complex multi-model fitting, offering efficiency and accuracy.
Loss Functions
MIRETR is trained using three losses:
$L_{circle}$: an overlap-aware circle loss for supervising superpoint features.
$L_{nll}$: a negative log-likelihood loss for supervising dense point matching (within instance candidates).
$L_{mask}$: a mask prediction loss (BCE + Dice loss) for supervising instance mask prediction.
The total loss is $L = L_{circle} + L_{nll} + L_{mask}$.
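As an illustration of the mask term, here is a minimal BCE + Dice combination on flat confidence/label arrays (a simplified form under my own assumptions, not the paper's exact weighting):

```python
import numpy as np

def bce_dice_mask_loss(pred, target, eps=1e-7):
    """Sketch of a mask loss of the form L_mask = BCE + Dice on
    predicted instance-mask confidences; the circle and NLL terms
    would be added on top for the total loss. pred and target are
    flat arrays with values in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)
    bce = -(target * np.log(pred)
            + (1.0 - target) * np.log(1.0 - pred)).mean()
    inter = (pred * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    return bce + dice
```

A perfect prediction drives both terms to (numerically) zero, while an inverted prediction is heavily penalized by the BCE term.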
Implementation Details
Backbone: KPConv-FPN [47] is used for multi-level feature extraction. Input point clouds are voxel-grid filtered (e.g., 2.5cm for Scan2CAD, 0.15cm for ROBI). A 4-stage backbone is used.
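Voxel-grid filtering itself is simple; a minimal centroid-per-voxel version in numpy follows (libraries such as Open3D provide optimized equivalents):

```python
import numpy as np

def voxel_downsample(points, voxel):
    """Minimal voxel-grid filter of the kind used to preprocess the
    input clouds: keep one centroid per occupied voxel of edge
    length `voxel`."""
    keys = np.floor(points / voxel).astype(np.int64)   # voxel indices
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    out = np.zeros((inv.max() + 1, points.shape[1]))
    counts = np.bincount(inv).astype(float)
    for d in range(points.shape[1]):                   # per-voxel mean
        out[:, d] = np.bincount(inv, weights=points[:, d]) / counts
    return out

pts = np.array([[0.01, 0.01, 0.0],
                [0.02, 0.02, 0.0],
                [1.0, 1.0, 1.0]])
down = voxel_downsample(pts, 0.05)
```

The first two points share a 5 cm voxel and collapse to their centroid; the third survives unchanged.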
Instance-aware Geometric Transformer:
Superpoint features from the backbone (e.g., 1024-dim) are projected to 256-dim.
$N_t = 3$ transformer modules are used.
The geometric structure embedding uses parameters such as $\sigma_d = 0.2\,\mathrm{m}$ (distance) and $\sigma_a = 15^\circ$ (angle).
The geodesic embedding for instance masking uses $\sigma_{geo} = 0.1\,\mathrm{m}$.
The geodesic distance embedding $g_{i,j}$ is computed by applying sinusoidal positional encoding to the geodesic distance $G_{i,j}$ and projecting with $W^G$.
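A sketch of the sinusoidal encoding of geodesic distances follows (the learned projection $W^G$ is omitted, and the exact frequency schedule is an assumption in the style of standard transformer positional encodings):

```python
import numpy as np

def sinusoidal_distance_embedding(dist, dim=8, sigma=0.1):
    """Sketch of the geodesic distance embedding: sinusoidal
    positional encoding applied to dist / sigma. dist is an (n, n)
    matrix of geodesic distances; returns (n, n, dim)."""
    x = dist[..., None] / sigma                       # normalize by sigma_geo
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) * 2.0 / dim))
    angles = x * freqs                                # (n, n, dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

dist = np.array([[0.0, 0.3], [0.3, 0.0]])
emb = sinusoidal_distance_embedding(dist)
```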
Training:
Adam optimizer, initial learning rate $10^{-4}$, momentum 0.98, weight decay $10^{-6}$.
Experimental Results
ROBI: Industrial bin-picking scenes with cluttered, reflective objects, the most challenging of the three benchmarks. MIRETR (full pipeline) achieved an MF (mean F1) of 39.80%, significantly better than GeoTransformer+PointCLM (18.33%) and GeoTransformer+ECC (23.20%). It registered an average of 13.7 instances per scene versus 6.2 for the PointCLM-based pipeline, with especially large gains on low-overlap instances (<50%). The predicted instance masks achieved an mIoU of 69.26%.
ShapeNet: Synthetic CAD models to test generalization to novel categories. MIRETR (full pipeline) achieved MF 94.44%, surpassing GeoTransformer+PointCLM (85.71%) and GeoTransformer+ECC (87.01%).
Effectiveness of Instance Awareness: The core contribution—learning instance-aware features and masks—is crucial for performance, especially in cluttered scenes like ROBI. Ablation studies confirmed that removing instance-aware point matching or the instance masking block significantly degrades performance (e.g., on ROBI, full MIRETR MF 39.80%, w/o instance masking MF 24.80%, w/ global attention MF 10.96%).
Superiority over Multi-Model Fitting: The proposed candidate selection and refinement often outperforms combinations of state-of-the-art correspondence extractors with multi-model fitting algorithms, while also being more efficient in the pose estimation stage.
Robustness to Occlusion and Clutter: Qualitative results (Fig. 5) show MIRETR registering more instances, including heavily occluded ones with geometric deficiencies, compared to GeoTransformer.
Generalization: Strong performance on ShapeNet and ModelNet40 (testing on unseen categories) indicates good generalization capabilities.
Time Efficiency: While correspondence extraction is slightly slower than some baselines, the pose estimation stage is significantly faster (e.g., 0.10s for MIRETR full pipeline vs. 0.22s for GeoTransformer+PointCLM on ROBI).
Practical Applications
Robotic Bin Picking: Directly applicable for identifying and locating multiple identical objects in a bin for grasping. The ROBI benchmark results are particularly relevant here.
Augmented Reality: Accurately placing virtual objects corresponding to real-world instances in a scene.
Scene Understanding: Decomposing a complex 3D scene into known object instances and their poses.
Industrial Automation: Object localization for assembly, inspection, or sorting tasks where multiple instances of a part might be present.
Limitations
Rotation Invariance: KPConv, the backbone used, may struggle with large rotations, potentially limiting performance in scenarios with extreme orientation changes.
Uneven Superpoint Sampling: MIRETR might sample different superpoints on the same instance across different views or iterations, which could affect matching.
Extreme Occlusion/Clutter: The paper notes (Fig. 9) that MIRETR can still fail under extreme clutter and severe occlusion.
Conclusion
MIRETR presents a significant advancement in multi-instance point cloud registration by introducing an effective mechanism for learning instance-aware correspondences. Its coarse-to-fine architecture, particularly the Instance-aware Geometric Transformer that iteratively refines features and instance masks, allows it to robustly handle cluttered scenes and occluded objects. The method's ability to bypass traditional multi-model fitting also contributes to its efficiency and effectiveness. The extensive experiments demonstrate state-of-the-art performance on several challenging benchmarks. Future work aims to extend MIRETR to multi-modal multi-instance registration.