Deep Interaction Transformer (DIT)
- The paper introduces DIT, a robust Transformer-based architecture for point cloud registration that integrates global structure extraction with deep cross-attention.
- It employs a three-stage pipeline combining point cloud structure extraction, deep-narrow cross-attention with learned positional encoding, and geometric filtering that keeps only high-confidence correspondences.
- DIT achieves state-of-the-art alignment accuracy, significantly reducing errors in both clean and noisy, partial point cloud registration compared to prior methods.
The Deep Interaction Transformer (DIT) is a full Transformer-based architecture designed for robust point cloud registration. It addresses key limitations of prior approaches in feature distinctiveness, noise robustness, and outlier handling by introducing a global structure extractor, deep cross-attention with learned positional encoding, and geometry-aware correspondence filtering. DIT achieves state-of-the-art results on both clean and challenging partial/noisy point cloud registration tasks (Chen et al., 2021).
1. Architectural Overview
DIT is a three-stage pipeline that processes source and target point clouds:
- Point Cloud Structure Extractor (PSE): Models local and global geometric relations to output per-point structural features for each input cloud.
- Point Feature Transformer (PFT): A deep stack of cross-attention layers with learned positional encodings, generating interaction-enriched features for matching.
- Geometric Matching–based Correspondence Confidence Evaluation (GMCCE): Assigns geometric-consistency-based inlier confidence to each tentative correspondence, followed by a weighted Procrustes alignment.
The pipeline runs: feature-based matching → GMCCE confidence scoring → weighted Procrustes alignment.
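The final weighted Procrustes step admits a closed-form solution via SVD (the Kabsch algorithm). Below is a minimal NumPy sketch; the function name `weighted_procrustes` and its interface are illustrative, not taken from the paper's code:

```python
import numpy as np

def weighted_procrustes(X, Y, w):
    """Estimate the rigid motion (R, t) minimizing
    sum_i w_i * ||R x_i + t - y_i||^2 in closed form.

    X, Y: (N, 3) corresponding points; w: (N,) non-negative confidences.
    """
    w = w / w.sum()
    mu_x = w @ X                        # weighted centroid of source
    mu_y = w @ Y                        # weighted centroid of target
    Xc, Yc = X - mu_x, Y - mu_y
    H = (Xc * w[:, None]).T @ Yc        # weighted cross-covariance (3, 3)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_y - R @ mu_x
    return R, t
```

Correspondences with near-zero GMCCE confidence contribute almost nothing to the centroids and covariance, which is how the outlier rejection enters the alignment.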
2. Point Cloud Structure Extractor (PSE)
The PSE module captures both local and global structural context.
- Local Feature Integrator (LFI): For each point, its nearest neighbors are retrieved and their features concatenated into a local descriptor.
- Transformer Encoder: The LFI output is processed by stacked MSA (multi-head self-attention) Transformer blocks. Each block's output is summed with a residual connection, layer-normalized, and passed to the next layer.
- Feature Aggregation: The per-layer outputs are concatenated and passed through a final LN+ReLU to yield the per-point structural features.
This combination enables modeling of long-range dependencies and order-invariant geometric structures, which classical point cloud CNNs fail to capture.
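The LFI's neighbor-gathering step can be sketched as follows; the function name `local_feature_integrator`, the brute-force distance computation, and the default `k=8` are illustrative assumptions:

```python
import numpy as np

def local_feature_integrator(points, feats, k=8):
    """For each point, gather its k nearest neighbors (by Euclidean
    distance) and concatenate their features into one local descriptor.

    points: (N, 3), feats: (N, C)  ->  (N, k * C)
    """
    # Pairwise distances; O(N^2), fine for small clouds (k-d tree otherwise).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]           # each point is its own nearest neighbor
    return feats[idx].reshape(len(points), -1)   # (N, k, C) -> (N, k*C)
```

The concatenated descriptor is what the stacked self-attention blocks then refine into globally aware per-point features.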
3. Deep-Narrow Point Feature Transformer (PFT)
PFT establishes associations between the source and target feature sets via deep, stacked Transformer cross-attention:
- Learned Positional Encoding: Each point receives a position embedding generated from its coordinates by a two-layer MLP; the embedding is added to the point's structural feature before attention.
- Deep Stacked Layers: Each cross-attention layer in the deep stack applies:
- Intra-cloud self-attention
- Inter-cloud cross-attention (queries from one cloud; keys and values from the other)
- Residual + MLP
- Feature Fusion and Squeeze-and-Excitation: The final features fuse the PSE and PFT outputs; a squeeze-and-excitation (SE) module reweights feature channels to produce the matching descriptors.
- "Deep-narrow" refers to many stacked layers of moderate width, enhancing non-local feature discrimination in comparison to shallow-wide Transformer variants.
Deep, repeated cross-interaction, augmented by explicit positional encoding, sharpens otherwise indistinct features and enables direct learning of relative point displacements.
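The cross-attention step above can be sketched in NumPy. This single-head version is a simplification for illustration (function names, shapes, and the ReLU MLP for positional encoding are assumptions; the actual model uses multi-head attention inside full Transformer blocks):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pos_encoding(xyz, W1, b1, W2, b2):
    """Learned positional embedding: a two-layer MLP on raw coordinates."""
    return np.maximum(xyz @ W1 + b1, 0.0) @ W2 + b2

def cross_attention(Fx, Fy, Wq, Wk, Wv):
    """One cross-attention step: queries come from cloud X, keys and
    values from cloud Y, followed by a residual connection."""
    Q, K, V = Fx @ Wq, Fy @ Wk, Fy @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (Nx, Ny) attention weights
    return Fx + A @ V
```

Stacking many such layers, alternating self- and cross-attention, is the "deep-narrow" design: each pass lets every source point re-query the entire target cloud.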
4. Geometric Matching–based Correspondence Confidence Evaluation (GMCCE)
GMCCE filters putative correspondences by measuring geometric consistency under rigid motion:
- Triangulated Descriptor: For each tentative match between a source point and its candidate target point:
- The nearest neighbors of the source point define triangles with that point as a vertex.
- Each triangle is mapped into the target cloud via the corresponding matches, forming a paired target triangle.
- For each triangle pair, the side-length vectors of the source and target triangles are computed and compared; a rigid motion preserves side lengths, so large discrepancies indicate an unreliable match.
- The per-pair discrepancies are aggregated, via a Minkowski-style distance over the most consistent triangles, into a per-match score.
- Confidence Scoring: Each correspondence's confidence is derived from its aggregate side-length consistency, and a threshold removes very low-confidence matches.
Only high-confidence matches contribute to the weighted Procrustes estimation, robustly rejecting outliers and boosting transformation accuracy in the presence of noise and partial overlaps.
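The triangle-based consistency check can be sketched as follows. This is a simplified illustration, not the paper's exact scoring: it averages relative side-length discrepancies over triangles built from a match's nearest neighbors, whereas the paper aggregates via a Minkowski-style distance and maps scores to confidences:

```python
import numpy as np

def triangle_consistency(src, tgt, matches, i, k=6):
    """Score tentative match i by geometric consistency: triangles built
    from i's nearest neighbors in the source should have the same side
    lengths as their images in the target (rigid motion preserves distances).

    src, tgt: (N, 3); matches: index array with src[j] <-> tgt[matches[j]].
    Returns the mean relative side-length discrepancy (lower = more consistent).
    """
    d = np.linalg.norm(src - src[i], axis=1)
    nbrs = np.argsort(d)[1:k + 1]              # k nearest, excluding the point itself
    errs = []
    for a in range(len(nbrs)):
        for b in range(a + 1, len(nbrs)):
            tri_s = [i, nbrs[a], nbrs[b]]                 # source triangle vertices
            tri_t = [matches[j] for j in tri_s]           # mapped target triangle
            for p, q in [(0, 1), (0, 2), (1, 2)]:         # the three sides
                ls = np.linalg.norm(src[tri_s[p]] - src[tri_s[q]])
                lt = np.linalg.norm(tgt[tri_t[p]] - tgt[tri_t[q]])
                errs.append(abs(ls - lt) / (ls + 1e-8))
    return float(np.mean(errs))
```

A correct match under a rigid transform scores near zero; mismatched correspondences distort the mapped triangles and score high, so thresholding this score implements the outlier filter.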
5. Training Regimen and Loss Functions
DIT is trained using a composite objective:
- Transformation Loss: Penalizes deviation between the predicted and ground-truth rigid transformations.
- Cycle Consistency Loss : Encourages learned forward and inverse transforms to be consistent.
- Discrimination Loss : Feature-matching cross-entropy using geometric confidence-weighted correspondences.
The loss terms are combined with fixed weighting coefficients.
Dataset: ModelNet40 (12,311 models, 1,024 points per cloud, 80/20 train/test split). Inputs are randomly rotated and translated within fixed ranges, and subsampled to roughly 60% overlap for partial-to-partial registration. Gaussian noise is added at two levels ("low" and "high"), with prescribed clipping.
Optimization uses the Adam optimizer with a batch size of 16.
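The transformation and cycle-consistency terms can be written down directly. A minimal sketch, assuming squared-Frobenius penalties (the paper's exact formulation and weighting may differ):

```python
import numpy as np

def transformation_loss(R_pred, t_pred, R_gt, t_gt):
    """Penalize deviation from the ground-truth rigid motion: the rotation
    residual R_pred^T R_gt should be the identity, and translations equal."""
    I = np.eye(3)
    rot_err = np.sum((R_pred.T @ R_gt - I) ** 2)
    return rot_err + np.sum((t_pred - t_gt) ** 2)

def cycle_loss(R_xy, t_xy, R_yx, t_yx):
    """Composing the forward (X->Y) and inverse (Y->X) predictions should
    yield the identity motion: R_yx R_xy = I and R_yx t_xy + t_yx = 0."""
    I = np.eye(3)
    return np.sum((R_yx @ R_xy - I) ** 2) + np.sum((R_yx @ t_xy + t_yx) ** 2)
```

The cycle term needs no ground truth: it only requires that registering X to Y and Y to X produce mutually inverse transforms.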
6. Empirical Evaluation and Ablation
Quantitative Results:
- On clean data: DIT reduces rotation and translation errors to near zero.
- On low-noise, partial-to-partial data: DIT attains a rotation error of 0.014°.
- On high-noise, partial-to-partial data: a rotation error of 1.412°.

These performance levels represent 50–100× reductions in registration error versus prior methods on clean data, and 2–5× reductions on noisy, partial datasets.
Success Rate Analysis: Under fixed rotation-error and translation-error thresholds, DIT consistently achieves ≈100% success, outperforming RGM, DCP, DeepGMR, and others by margins of 5–30%.
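Under this protocol, the success rate is simply the fraction of test pairs whose rotation and translation errors both fall under their thresholds; a trivial sketch (names and thresholds here are illustrative):

```python
import numpy as np

def success_rate(rot_errs, trans_errs, rot_thresh, trans_thresh):
    """Fraction of registrations whose rotation AND translation errors
    both fall below the given thresholds."""
    ok = (np.asarray(rot_errs) < rot_thresh) & (np.asarray(trans_errs) < trans_thresh)
    return float(np.mean(ok))
```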
Ablation Studies:
| Variant | Rotation Error | Translation Error | Success Rate |
|---|---|---|---|
| Full DIT | 1.41° | 0.009 | 94.7% |
| w/o PSE | 41.84° | 0.247 | 1.2% |
| w/ DGCNN | 18.07° | 0.069 | 7.5% |
| w/o PE (pos. enc) | 18.02° | 0.091 | 55.3% |
| w/o GMCCE | 2.36° | 0.016 | 74.4% |
Key observations: PSE is essential for robustness to noise; learned positional encoding in PFT contributes approximately 40% to the success rate; GMCCE filtering adds 40–70% accuracy under challenging conditions.
Qualitative Results: DIT maintains precise alignment on clean, noisy, and partial point clouds, even with 40% of points missing or under high noise. Competing methods (DCP, DeepGMR) show notable misalignments under these conditions.
7. Comparative Performance and Core Contributions
A summary comparison for partial, low-noise and high-noise registration:
| Method | Partial-60%, Low Noise (rot. err.) | Partial-60%, High Noise (rot. err.) |
|---|---|---|
| ICP / FGR / RPM-Net | | |
| DCP | 4.43° | 12.29° |
| DeepGMR | 7.15° | 8.96° |
| RGM | 0.74° | 2.07° |
| DIT | 0.014° | 1.412° |
Principal advances:
- PSE's incorporation of global relationships mitigates feature ambiguity and provides noise resilience.
- PFT's deep-narrow cross-attention with learned position encoding establishes context-dependent, discriminative features for robust matching.
- GMCCE confidence evaluation leverages strict geometric consistency, filtering out outliers that would otherwise degrade alignment quality.
In aggregate, DIT achieves state-of-the-art performance for robust point cloud registration on both clean and heavily corrupted input, advancing beyond limitations of convolutional, shallow Transformer, and purely feature-based correspondence models (Chen et al., 2021).