
Deep Interaction Transformer (DIT)

Updated 6 February 2026
  • The paper introduces DIT, a robust Transformer-based architecture for point cloud registration that integrates global structure extraction with deep cross-attention.
  • It employs a three-stage pipeline combining point cloud structure extraction, deep-narrow cross-attention with learned positional encoding, and geometric filtering for confident correspondence.
  • DIT achieves state-of-the-art alignment accuracy, significantly reducing errors in both clean and noisy, partial point cloud registration compared to prior methods.

The Deep Interaction Transformer (DIT) is a full Transformer-based architecture designed for robust point cloud registration. It addresses key limitations of prior approaches in feature distinctiveness, noise robustness, and outlier handling by introducing a global structure extractor, deep cross-attention with learned positional encoding, and geometry-aware correspondence filtering. DIT achieves state-of-the-art results on both clean and challenging partial/noisy point cloud registration tasks (Chen et al., 2021).

1. Architectural Overview

DIT is a three-stage pipeline that processes source $X \in \mathbb{R}^{N \times 3}$ and target $Y \in \mathbb{R}^{M \times 3}$ point clouds:

  1. Point Cloud Structure Extractor (PSE): Models local and global geometric relations to output per-point features $F_X \in \mathbb{R}^{N \times d}$, $F_Y \in \mathbb{R}^{M \times d}$.
  2. Point Feature Transformer (PFT): Deep stack of cross-attention layers with learned positional encodings, generating enriched features $\Phi_X, \Phi_Y$.
  3. Geometric Matching–based Correspondence Confidence Evaluation (GMCCE): Assigns geometric-consistency-based inlier confidence to each tentative correspondence, followed by a weighted Procrustes alignment.

The pipeline: $X, Y \to \text{PSE} \to F_X, F_Y \to \text{PFT} \to \Phi_X, \Phi_Y \to$ feature-based matching $\to$ GMCCE confidence $\to$ weighted Procrustes $\to (R, t)$.
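The final step, weighted Procrustes, has a simple closed form in 2-D; the sketch below is illustrative only (DIT operates in 3-D, where the same weighted least-squares problem is solved via an SVD of the weighted cross-covariance):

```python
import math

def weighted_procrustes_2d(xs, ys, w):
    # Closed-form weighted Procrustes in 2-D: recover (theta, t) minimizing
    # sum_i w_i * ||R x_i + t - y_i||^2. The 3-D case used by DIT instead
    # takes an SVD of the weighted cross-covariance matrix.
    sw = sum(w)
    cx = [sum(wi * p[a] for wi, p in zip(w, xs)) / sw for a in range(2)]
    cy = [sum(wi * p[a] for wi, p in zip(w, ys)) / sw for a in range(2)]
    A = B = 0.0
    for wi, p, q in zip(w, xs, ys):
        px, py = p[0] - cx[0], p[1] - cx[1]
        qx, qy = q[0] - cy[0], q[1] - cy[1]
        A += wi * (px * qx + py * qy)   # weighted dot products
        B += wi * (px * qy - py * qx)   # weighted cross products
    theta = math.atan2(B, A)
    c, s = math.cos(theta), math.sin(theta)
    t = (cy[0] - (c * cx[0] - s * cx[1]), cy[1] - (s * cx[0] + c * cx[1]))
    return theta, t
```

The per-correspondence weights are exactly where the GMCCE confidences enter: low-confidence matches contribute little to the centroids and covariance sums.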

2. Point Cloud Structure Extractor (PSE)

The PSE module captures both local and global structural context.

  • Local Feature Integrator (LFI): For each point $x_i$, the $k = 20$ nearest neighbors are retrieved and their features concatenated: $T_n'(i) = \mathrm{Concat}[T_{P_n^i}^1, \ldots, T_{P_n^i}^k] \in \mathbb{R}^{k d_{in}}$.
  • Transformer Encoder: The LFI output is processed by stacked MSA (multi-head self-attention) Transformer blocks, using $h = 4$ heads and hidden dimension $d = 64$:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_K}}\right)V$$

Outputs are summed with residuals, processed by layer normalization, and stacked across $N_\ell = 3$ layers.

  • Feature Aggregation: Outputs $T_2, \ldots, T_{N_\ell+1}$ are concatenated and passed through a final LN + ReLU to yield $F_X, F_Y$.

This combination enables modeling of long-range dependencies and order-invariant geometric structures, which classical point cloud CNNs fail to capture.
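The attention operator above reduces to a few lines; here is a minimal single-head sketch on plain nested lists (real MSA adds learned $Q, K, V$ projections and $h = 4$ heads):

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Att(Q, K, V) = softmax(Q K^T / sqrt(d_K)) V, on plain nested lists.
    d_k = len(K[0])
    scores = [[sum(q[a] * k[a] for a in range(d_k)) / math.sqrt(d_k)
               for k in K] for q in Q]
    return [[sum(w[j] * V[j][a] for j in range(len(V)))
             for a in range(len(V[0]))]
            for w in (softmax(r) for r in scores)]
```

A query that aligns strongly with one key receives nearly all of that key's value row, which is how the encoder mixes information across the whole point cloud.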

3. Deep-Narrow Point Feature Transformer (PFT)

PFT establishes associations between XX and YY via deep, stacked Transformer cross-attention:

  • Learned Positional Encoding: Each point receives a position embedding generated by a two-layer MLP, $P_X = \mathrm{ReLU}(\mathrm{FC}_2(\sigma(\mathrm{FC}_1(X))))$. Encoded features: $F_X' = F_X + P_X$, $F_Y' = F_Y + P_Y$.
  • Deep Stacked Layers: Each of $L = 6$–$8$ cross-attention layers applies:
    • Intra-cloud self-attention
    • Inter-cloud cross-attention (e.g., $Y \to X$)
    • Residual + MLP
  • Feature Fusion and Squeeze-and-Excitation: Final features $\Psi_X, \Psi_Y$ fuse PSE and PFT outputs; an SE module produces $\Phi_X, \Phi_Y$.
  • "Deep-narrow" refers to many stacked layers of moderate width, enhancing non-local feature discrimination in comparison to shallow-wide Transformer variants.

Deep, repeated cross-interaction augmented by explicit positional encoding compensates for indistinct per-point features and enables direct learning of relative point displacements.
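The learned positional encoding can be sketched in plain Python. The weights below are random stand-ins for learned parameters, and the inner activation $\sigma$ is taken to be ReLU (an assumption; this summary does not specify it):

```python
import random

random.seed(0)

def linear(W, b, x):
    # y = W x + b, with the weight matrix stored as rows.
    return [sum(w_r[i] * x[i] for i in range(len(x))) + b_j
            for w_r, b_j in zip(W, b)]

def relu(v):
    return [max(0.0, u) for u in v]

# P = ReLU(FC2(sigma(FC1(x)))). Weights here are random stand-ins for
# learned ones; sigma is assumed to be ReLU.
d_in, d_hidden, d = 3, 8, 4
W1 = [[random.uniform(-1.0, 1.0) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
W2 = [[random.uniform(-1.0, 1.0) for _ in range(d_hidden)] for _ in range(d)]
b2 = [0.0] * d

def pos_encoding(x):
    return relu(linear(W2, b2, relu(linear(W1, b1, x))))

# Encoded feature: F' = F + P(x), matching F_X' = F_X + P_X above.
point = [0.3, -0.2, 0.9]
feature = [1.0, 1.0, 1.0, 1.0]
encoded = [f + p for f, p in zip(feature, pos_encoding(point))]
```

Because the embedding is a function of raw coordinates, the cross-attention layers can reason about where a point sits, not just what its feature looks like.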

4. Geometric Matching–based Correspondence Confidence Evaluation (GMCCE)

GMCCE filters putative correspondences by measuring geometric consistency under rigid motion:

  • Triangulated Descriptor: For each tentative match $(x_i, y_j)$:

    1. The $k_s = 10$ neighbors of $x_i$ define triangles $g_{src}^p$.
    2. Each triangle is mapped to $Y$ via the corresponding matches, forming $g_{tgt}^p$.
    3. For each triangle, the side-length vectors are computed and compared:

    $$L_e(g_{src}, g_{tgt}) = \frac{\sqrt{\sum_\alpha (\ell_{g\alpha}^{src} - \ell_{g\alpha}^{tgt})^2}}{\sum_\alpha (\ell_{g\alpha}^{src} + \ell_{g\alpha}^{tgt})^2}$$

    4. For each pair, $E_r(x_i, y_j)$ is the Minkowski sum over the $k_s$ smallest $L_e$ values.
  • Confidence Scoring:

$$C(x_i, y_j) = \psi\!\left(2\,\sigma(-\lambda\, E_r(x_i, y_j))\right)$$

with $\lambda = 30$, where $\psi$ thresholds out very low-confidence matches.

Only high-confidence matches contribute to the weighted Procrustes estimation, robustly rejecting outliers and boosting transformation accuracy in the presence of noise and partial overlaps.
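A sketch of the per-triangle consistency and the confidence mapping, assuming $\sigma$ is the logistic sigmoid and $\psi$ a hard cutoff at 0.5 (both assumptions; this summary does not give their exact forms):

```python
import math

def side_lengths(tri):
    # tri: three points (2-D or 3-D); returns the three side lengths.
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    p, q, r = tri
    return [dist(p, q), dist(q, r), dist(r, p)]

def l_e(tri_src, tri_tgt):
    # Side-length consistency as written in the text:
    # sqrt(sum (l_src - l_tgt)^2) / sum (l_src + l_tgt)^2.
    ls, lt = side_lengths(tri_src), side_lengths(tri_tgt)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(ls, lt)))
    den = sum((a + b) ** 2 for a, b in zip(ls, lt))
    return num / den

def confidence(e_r, lam=30.0, thresh=0.5):
    # C = psi(2 * sigma(-lambda * E_r)), with sigma read as the logistic
    # sigmoid and psi as a hard cutoff (both assumptions). A perfectly
    # consistent match (E_r = 0) then scores exactly 1.
    c = 2.0 / (1.0 + math.exp(lam * e_r))
    return c if c >= thresh else 0.0
```

Since side lengths are invariant under rigid motion, a geometrically consistent match keeps $L_e$ near zero, while a mismatch inflates it and drives the confidence toward zero.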

5. Training Regimen and Loss Functions

DIT is trained using a composite objective:

$$L = L_t + \alpha L_c + \beta L_d$$

  • Transformation Loss $L_t$: Penalizes deviation between the predicted $(R_{XY}, t_{XY})$ and ground-truth $(R_{XY}^*, t_{XY}^*)$ transformations.
  • Cycle Consistency Loss $L_c$: Encourages the learned forward and inverse transforms to be consistent.
  • Discrimination Loss $L_d$: Feature-matching cross-entropy using geometric confidence-weighted correspondences.

Typical hyperparameters: $\alpha = 1$, $\beta = 1$.
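The composite objective can be made concrete with 2-D stand-ins for the first two terms (illustrative forms only, not the paper's exact definitions; $L_d$ is left as a plain scalar):

```python
def transformation_loss(theta_pred, t_pred, theta_gt, t_gt):
    # Illustrative stand-in for L_t on a 2-D rigid transform: squared error
    # on the rotation angle plus squared error on the translation. The
    # paper's L_t is defined on the full (R, t) pair.
    return (theta_pred - theta_gt) ** 2 + sum(
        (a - b) ** 2 for a, b in zip(t_pred, t_gt))

def cycle_loss(theta_xy, theta_yx):
    # Stand-in for L_c: the forward and inverse rotations should cancel.
    return (theta_xy + theta_yx) ** 2

def composite(l_t, l_c, l_d, alpha=1.0, beta=1.0):
    # L = L_t + alpha * L_c + beta * L_d, with alpha = beta = 1 by default.
    return l_t + alpha * l_c + beta * l_d
```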

Dataset: ModelNet40 (12,311 models, 1,024 points per cloud, 80/20 split). Inputs are randomly rotated (angles $\in [0, 45^\circ]^3$), translated ($\in [-0.5, 0.5]^3$), and subsampled to 60% overlap for partial-to-partial registration. Noise is sampled from $\mathcal{N}(0, 0.001)$ ("low") and $\mathcal{N}(0, 0.01)$ ("high"), with prescribed clipping.
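The augmentation recipe can be sketched as follows; the second parameter of $\mathcal{N}(0, 0.001)$ is read as the variance (an assumption), and clipping is omitted:

```python
import math
import random

random.seed(1)

def rot(axis, a):
    # Elementary rotation matrix about the x, y, or z axis.
    c, s = math.cos(a), math.sin(a)
    if axis == 0:
        return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    if axis == 1:
        return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(R, p):
    return [sum(R[i][k] * p[k] for k in range(3)) for i in range(3)]

def augment(points, sigma2=0.001):
    # Random per-axis rotation in [0, 45 deg], translation in [-0.5, 0.5]^3,
    # plus Gaussian jitter of variance sigma2 on each coordinate.
    angles = [math.radians(random.uniform(0.0, 45.0)) for _ in range(3)]
    t = [random.uniform(-0.5, 0.5) for _ in range(3)]
    R = matmul(rot(2, angles[2]), matmul(rot(1, angles[1]), rot(0, angles[0])))
    std = math.sqrt(sigma2)
    out = []
    for p in points:
        q = apply(R, p)
        out.append([q[i] + t[i] + random.gauss(0.0, std) for i in range(3)])
    return out, R, t
```

The returned $(R, t)$ serve as the ground-truth transformation the network is trained to recover.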

Optimization uses Adam with $\mathrm{lr} = 3 \times 10^{-5}$ and batch size 16.

6. Empirical Evaluation and Ablation

Quantitative Results:

  • On clean data: DIT achieves $R_{\rm RMSE} = 2.3 \times 10^{-6}$°, $t_{\rm RMSE} = 1.7 \times 10^{-8}$.
  • On low-noise, partial-to-partial data: DIT attains $R_{\rm RMSE} = 0.014$°, $t_{\rm RMSE} = 6.7 \times 10^{-5}$.
  • On high-noise, partial-to-partial data: $R_{\rm RMSE} = 1.412$°, $t_{\rm RMSE} = 0.009$. These figures represent 50–100× reductions in registration error versus prior methods on clean data, and 2–5× reductions on noisy, partial datasets.

Success Rate Analysis: With thresholds $R_{th}$ (rotation error) and $t_{th}$ (translation error), DIT consistently achieves ≈100% success, outperforming RGM, DCP, DeepGMR, and others by margins of 5–30%.
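The success-rate criterion amounts to a joint thresholding of both error types (the concrete threshold values below are illustrative, not from the paper):

```python
def success_rate(errors, r_th, t_th):
    # errors: list of (rotation_error, translation_error) pairs; a trial
    # succeeds only if BOTH errors fall below their thresholds.
    ok = sum(1 for r, t in errors if r < r_th and t < t_th)
    return ok / len(errors)
```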

Ablation Studies:

| Variant | $R_{\rm RMSE}$ | $t_{\rm RMSE}$ | Success Rate |
|---|---|---|---|
| Full DIT | 1.41° | 0.009 | 94.7% |
| w/o PSE | 41.84° | 0.247 | 1.2% |
| w/ DGCNN | 18.07° | 0.069 | 7.5% |
| w/o PE (pos. enc.) | 18.02° | 0.091 | 55.3% |
| w/o GMCCE | 2.36° | 0.016 | 74.4% |

Key observations: PSE is essential for robustness to noise; removing the learned positional encoding in PFT costs roughly 40 percentage points of success rate; and GMCCE filtering improves accuracy by 40–70% under challenging conditions.

Qualitative Results: DIT maintains precise alignment on clean, noisy, and partial point clouds, even with 40% of points missing or noise $\sigma = 0.01$. Competing methods (DCP, DeepGMR) show notable misalignments under these conditions.

7. Comparative Performance and Core Contributions

A summary comparison for partial, low-noise and high-noise registration:

| Method | Partial-60% Low Noise $R_{\rm RMSE}$ | Partial-60% High Noise $R_{\rm RMSE}$ |
|---|---|---|
| ICP / FGR / RPM-Net | $> 9^\circ$ | $> 12^\circ$ |
| DCP | 4.43° | 12.29° |
| DeepGMR | 7.15° | 8.96° |
| RGM | 0.74° | 2.07° |
| DIT | 0.014° | 1.412° |

Principal advances:

  • PSE's incorporation of global relationships mitigates feature ambiguity and provides noise resilience.
  • PFT's deep-narrow cross-attention with learned position encoding establishes context-dependent, discriminative features for robust matching.
  • GMCCE confidence evaluation leverages strict geometric consistency, effectively filtering outliers that would otherwise degrade alignment quality.

In aggregate, DIT achieves state-of-the-art performance for robust point cloud registration on both clean and heavily corrupted input, advancing beyond limitations of convolutional, shallow Transformer, and purely feature-based correspondence models (Chen et al., 2021).

References (1)
