
Deep Interaction Transformer (DIT)

Updated 6 February 2026
  • The paper introduces DIT, a robust Transformer-based architecture for point cloud registration that integrates global structure extraction with deep cross-attention.
  • It employs a three-stage pipeline combining point cloud structure extraction, deep-narrow cross-attention with learned positional encoding, and geometric filtering for confident correspondence.
  • DIT achieves state-of-the-art alignment accuracy, significantly reducing errors in both clean and noisy, partial point cloud registration compared to prior methods.

The Deep Interaction Transformer (DIT) is a full Transformer-based architecture designed for robust point cloud registration. It addresses key limitations of prior approaches in feature distinctiveness, noise robustness, and outlier handling by introducing a global structure extractor, deep cross-attention with learned positional encoding, and geometry-aware correspondence filtering. DIT achieves state-of-the-art results on both clean and challenging partial/noisy point cloud registration tasks (Chen et al., 2021).

1. Architectural Overview

DIT is a three-stage pipeline that processes source $X \in \mathbb{R}^{N \times 3}$ and target $Y \in \mathbb{R}^{M \times 3}$ point clouds:

  1. Point Cloud Structure Extractor (PSE): Models local and global geometric relations to output per-point features $F_X \in \mathbb{R}^{N \times d}$, $F_Y \in \mathbb{R}^{M \times d}$.
  2. Point Feature Transformer (PFT): Deep stack of cross-attention layers with learned positional encodings, generating enriched features $\Phi_X, \Phi_Y$.
  3. Geometric Matching–based Correspondence Confidence Evaluation (GMCCE): Assigns geometric-consistency-based inlier confidence to each tentative correspondence, followed by a weighted Procrustes alignment.

The pipeline: $X, Y \to \text{PSE} \to F_X, F_Y \to \text{PFT} \to \Phi_X, \Phi_Y \to$ feature-based matching $\to$ GMCCE confidence $\to$ weighted Procrustes $\to (R, t)$.
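The final step, weighted Procrustes, has a simple closed form in 2-D; the sketch below is illustrative only (DIT operates in 3-D, where the same weighted least-squares problem is solved via an SVD of the weighted cross-covariance):

```python
import math

def weighted_procrustes_2d(xs, ys, w):
    # Closed-form weighted Procrustes in 2-D: recover (theta, t) minimizing
    # sum_i w_i * ||R x_i + t - y_i||^2. The 3-D case used by DIT instead
    # takes an SVD of the weighted cross-covariance matrix.
    sw = sum(w)
    cx = [sum(wi * p[a] for wi, p in zip(w, xs)) / sw for a in range(2)]
    cy = [sum(wi * p[a] for wi, p in zip(w, ys)) / sw for a in range(2)]
    A = B = 0.0
    for wi, p, q in zip(w, xs, ys):
        px, py = p[0] - cx[0], p[1] - cx[1]
        qx, qy = q[0] - cy[0], q[1] - cy[1]
        A += wi * (px * qx + py * qy)   # weighted dot products
        B += wi * (px * qy - py * qx)   # weighted cross products
    theta = math.atan2(B, A)
    c, s = math.cos(theta), math.sin(theta)
    t = (cy[0] - (c * cx[0] - s * cx[1]), cy[1] - (s * cx[0] + c * cx[1]))
    return theta, t
```

The per-correspondence weights are exactly where the GMCCE confidences enter: low-confidence matches contribute little to the centroids and covariance sums.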

2. Point Cloud Structure Extractor (PSE)

The PSE module captures both local and global structural context.

  • Local Feature Integrator (LFI): For each point $x_i$, the $k = 20$ nearest neighbors are retrieved and their features concatenated: $T_n'(i) = \mathrm{Concat}[T_{P_n^i}^1, \ldots, T_{P_n^i}^k] \in \mathbb{R}^{k d_{in}}$.
  • Transformer Encoder: The LFI output is processed by stacked MSA (multi-head self-attention) Transformer blocks, using $h = 4$ heads and hidden dimension $d = 64$:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_K}}\right)V$$

Outputs are summed with residuals, processed by layer normalization, and stacked across $N_\ell = 3$ layers.

  • Feature Aggregation: Outputs $T_2, \ldots, T_{N_\ell+1}$ are concatenated and passed through a final LN + ReLU to yield $F_X, F_Y$.

This combination enables modeling of long-range dependencies and order-invariant geometric structures, which classical point cloud CNNs fail to capture.
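The attention operator above reduces to a few lines; here is a minimal single-head sketch on plain nested lists (real MSA adds learned $Q, K, V$ projections and $h = 4$ heads):

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Att(Q, K, V) = softmax(Q K^T / sqrt(d_K)) V, on plain nested lists.
    d_k = len(K[0])
    scores = [[sum(q[a] * k[a] for a in range(d_k)) / math.sqrt(d_k)
               for k in K] for q in Q]
    return [[sum(w[j] * V[j][a] for j in range(len(V)))
             for a in range(len(V[0]))]
            for w in (softmax(r) for r in scores)]
```

A query that aligns strongly with one key receives nearly all of that key's value row, which is how the encoder mixes information across the whole point cloud.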

3. Deep-Narrow Point Feature Transformer (PFT)

PFT establishes associations between XX and YY via deep, stacked Transformer cross-attention:

  • Learned Positional Encoding: Each point receives a position embedding generated by a two-layer MLP, $P_X = \mathrm{ReLU}(\mathrm{FC}_2(\sigma(\mathrm{FC}_1(X))))$. Encoded features: $F_X' = F_X + P_X$, $F_Y' = F_Y + P_Y$.
  • Deep Stacked Layers: Each of $L = 6$–$8$ cross-attention layers applies:
    • Intra-cloud self-attention
    • Inter-cloud cross-attention (e.g., $Y \to X$)
    • Residual + MLP
  • Feature Fusion and Squeeze-and-Excitation: Final features $\Psi_X, \Psi_Y$ fuse PSE and PFT outputs; an SE module produces $\Phi_X, \Phi_Y$.
  • "Deep-narrow" refers to many stacked layers of moderate width, enhancing non-local feature discrimination in comparison to shallow-wide Transformer variants.

Deep, repeated cross-interaction augmented by explicit positional encoding compensates for indistinct per-point features and enables direct learning of relative point displacements.
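The learned positional encoding can be sketched in plain Python. The weights below are random stand-ins for learned parameters, and the inner activation $\sigma$ is taken to be ReLU (an assumption; this summary does not specify it):

```python
import random

random.seed(0)

def linear(W, b, x):
    # y = W x + b, with the weight matrix stored as rows.
    return [sum(w_r[i] * x[i] for i in range(len(x))) + b_j
            for w_r, b_j in zip(W, b)]

def relu(v):
    return [max(0.0, u) for u in v]

# P = ReLU(FC2(sigma(FC1(x)))). Weights here are random stand-ins for
# learned ones; sigma is assumed to be ReLU.
d_in, d_hidden, d = 3, 8, 4
W1 = [[random.uniform(-1.0, 1.0) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
W2 = [[random.uniform(-1.0, 1.0) for _ in range(d_hidden)] for _ in range(d)]
b2 = [0.0] * d

def pos_encoding(x):
    return relu(linear(W2, b2, relu(linear(W1, b1, x))))

# Encoded feature: F' = F + P(x), matching F_X' = F_X + P_X above.
point = [0.3, -0.2, 0.9]
feature = [1.0, 1.0, 1.0, 1.0]
encoded = [f + p for f, p in zip(feature, pos_encoding(point))]
```

Because the embedding is a function of raw coordinates, the cross-attention layers can reason about where a point sits, not just what its feature looks like.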

4. Geometric Matching–based Correspondence Confidence Evaluation (GMCCE)

GMCCE filters putative correspondences by measuring geometric consistency under rigid motion:

  • Triangulated Descriptor: For each tentative match $(x_i, y_j)$:

    1. The $k_s = 10$ neighbors of $x_i$ define triangles $g_{src}^p$.
    2. Each triangle is mapped to $Y$ via the corresponding matches, forming $g_{tgt}^p$.
    3. For each triangle, the side-length vectors are computed and compared:

    $$L_e(g_{src}, g_{tgt}) = \frac{\sqrt{\sum_\alpha (\ell_{g\alpha}^{src} - \ell_{g\alpha}^{tgt})^2}}{\sum_\alpha (\ell_{g\alpha}^{src} + \ell_{g\alpha}^{tgt})^2}$$

    4. For each pair, $E_r(x_i, y_j)$ is the Minkowski sum over the $k_s$ smallest $L_e$ values.
  • Confidence Scoring:

$$C(x_i, y_j) = \psi\!\left(2\,\sigma(-\lambda\, E_r(x_i, y_j))\right)$$

with $\lambda = 30$, where $\psi$ thresholds out very low-confidence matches.

Only high-confidence matches contribute to the weighted Procrustes estimation, robustly rejecting outliers and boosting transformation accuracy in the presence of noise and partial overlaps.
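A sketch of the per-triangle consistency and the confidence mapping, assuming $\sigma$ is the logistic sigmoid and $\psi$ a hard cutoff at 0.5 (both assumptions; this summary does not give their exact forms):

```python
import math

def side_lengths(tri):
    # tri: three points (2-D or 3-D); returns the three side lengths.
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    p, q, r = tri
    return [dist(p, q), dist(q, r), dist(r, p)]

def l_e(tri_src, tri_tgt):
    # Side-length consistency as written in the text:
    # sqrt(sum (l_src - l_tgt)^2) / sum (l_src + l_tgt)^2.
    ls, lt = side_lengths(tri_src), side_lengths(tri_tgt)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(ls, lt)))
    den = sum((a + b) ** 2 for a, b in zip(ls, lt))
    return num / den

def confidence(e_r, lam=30.0, thresh=0.5):
    # C = psi(2 * sigma(-lambda * E_r)), with sigma read as the logistic
    # sigmoid and psi as a hard cutoff (both assumptions). A perfectly
    # consistent match (E_r = 0) then scores exactly 1.
    c = 2.0 / (1.0 + math.exp(lam * e_r))
    return c if c >= thresh else 0.0
```

Since side lengths are invariant under rigid motion, a geometrically consistent match keeps $L_e$ near zero, while a mismatch inflates it and drives the confidence toward zero.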

5. Training Regimen and Loss Functions

DIT is trained using a composite objective:

$$L = L_t + \alpha L_c + \beta L_d$$

  • Transformation Loss $L_t$: Penalizes deviation between the predicted $(R_{XY}, t_{XY})$ and ground-truth $(R_{XY}^*, t_{XY}^*)$ transformations.
  • Cycle Consistency Loss $L_c$: Encourages the learned forward and inverse transforms to be consistent.
  • Discrimination Loss $L_d$: Feature-matching cross-entropy using geometric confidence-weighted correspondences.

Typical hyperparameters: $\alpha = 1$, $\beta = 1$.
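The composite objective can be made concrete with 2-D stand-ins for the first two terms (illustrative forms only, not the paper's exact definitions; $L_d$ is left as a plain scalar):

```python
def transformation_loss(theta_pred, t_pred, theta_gt, t_gt):
    # Illustrative stand-in for L_t on a 2-D rigid transform: squared error
    # on the rotation angle plus squared error on the translation. The
    # paper's L_t is defined on the full (R, t) pair.
    return (theta_pred - theta_gt) ** 2 + sum(
        (a - b) ** 2 for a, b in zip(t_pred, t_gt))

def cycle_loss(theta_xy, theta_yx):
    # Stand-in for L_c: the forward and inverse rotations should cancel.
    return (theta_xy + theta_yx) ** 2

def composite(l_t, l_c, l_d, alpha=1.0, beta=1.0):
    # L = L_t + alpha * L_c + beta * L_d, with alpha = beta = 1 by default.
    return l_t + alpha * l_c + beta * l_d
```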

Dataset: ModelNet40 (12,311 models, 1,024 points per cloud, 80/20 split). Inputs are randomly rotated (angles $\in [0, 45^\circ]^3$), translated ($\in [-0.5, 0.5]^3$), and subsampled to 60% overlap for partial-to-partial registration. Noise is sampled from $\mathcal{N}(0, 0.001)$ ("low") and $\mathcal{N}(0, 0.01)$ ("high"), with prescribed clipping.
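The augmentation recipe can be sketched as follows; the second parameter of $\mathcal{N}(0, 0.001)$ is read as the variance (an assumption), and clipping is omitted:

```python
import math
import random

random.seed(1)

def rot(axis, a):
    # Elementary rotation matrix about the x, y, or z axis.
    c, s = math.cos(a), math.sin(a)
    if axis == 0:
        return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    if axis == 1:
        return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(R, p):
    return [sum(R[i][k] * p[k] for k in range(3)) for i in range(3)]

def augment(points, sigma2=0.001):
    # Random per-axis rotation in [0, 45 deg], translation in [-0.5, 0.5]^3,
    # plus Gaussian jitter of variance sigma2 on each coordinate.
    angles = [math.radians(random.uniform(0.0, 45.0)) for _ in range(3)]
    t = [random.uniform(-0.5, 0.5) for _ in range(3)]
    R = matmul(rot(2, angles[2]), matmul(rot(1, angles[1]), rot(0, angles[0])))
    std = math.sqrt(sigma2)
    out = []
    for p in points:
        q = apply(R, p)
        out.append([q[i] + t[i] + random.gauss(0.0, std) for i in range(3)])
    return out, R, t
```

The returned $(R, t)$ serve as the ground-truth transformation the network is trained to recover.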

Optimization uses Adam with $\mathrm{lr} = 3 \times 10^{-5}$ and batch size 16.

6. Empirical Evaluation and Ablation

Quantitative Results:

  • On clean data: DIT achieves $R_{\rm RMSE} = 2.3 \times 10^{-6}$°, $t_{\rm RMSE} = 1.7 \times 10^{-8}$.
  • On low-noise, partial-to-partial data: DIT attains $R_{\rm RMSE} = 0.014$°, $t_{\rm RMSE} = 6.7 \times 10^{-5}$.
  • On high-noise, partial-to-partial data: $R_{\rm RMSE} = 1.412$°, $t_{\rm RMSE} = 0.009$. These figures represent 50–100× reductions in registration error versus prior methods on clean data, and 2–5× reductions on noisy, partial datasets.

Success Rate Analysis: With thresholds $R_{th}$ (rotation error) and $t_{th}$ (translation error), DIT consistently achieves ≈100% success, outperforming RGM, DCP, DeepGMR, and others by margins of 5–30%.
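The success-rate criterion amounts to a joint thresholding of both error types (the concrete threshold values below are illustrative, not from the paper):

```python
def success_rate(errors, r_th, t_th):
    # errors: list of (rotation_error, translation_error) pairs; a trial
    # succeeds only if BOTH errors fall below their thresholds.
    ok = sum(1 for r, t in errors if r < r_th and t < t_th)
    return ok / len(errors)
```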

Ablation Studies:

| Variant | $R_{\rm RMSE}$ | $t_{\rm RMSE}$ | Success Rate |
|---|---|---|---|
| Full DIT | 1.41° | 0.009 | 94.7% |
| w/o PSE | 41.84° | 0.247 | 1.2% |
| w/ DGCNN | 18.07° | 0.069 | 7.5% |
| w/o PE (pos. enc.) | 18.02° | 0.091 | 55.3% |
| w/o GMCCE | 2.36° | 0.016 | 74.4% |

Key observations: PSE is essential for robustness to noise; removing the learned positional encoding in PFT costs roughly 40 percentage points of success rate; and GMCCE filtering improves accuracy by 40–70% under challenging conditions.

Qualitative Results: DIT maintains precise alignment on clean, noisy, and partial point clouds, even with 40% of points missing or noise $\sigma = 0.01$. Competing methods (DCP, DeepGMR) show notable misalignments under these conditions.

7. Comparative Performance and Core Contributions

A summary comparison for partial, low-noise and high-noise registration:

| Method | Partial-60% Low Noise $R_{\rm RMSE}$ | Partial-60% High Noise $R_{\rm RMSE}$ |
|---|---|---|
| ICP / FGR / RPM-Net | $> 9^\circ$ | $> 12^\circ$ |
| DCP | 4.43° | 12.29° |
| DeepGMR | 7.15° | 8.96° |
| RGM | 0.74° | 2.07° |
| DIT | 0.014° | 1.412° |

Principal advances:

  • PSE's incorporation of global relationships mitigates feature ambiguity and provides noise resilience.
  • PFT's deep-narrow cross-attention with learned position encoding establishes context-dependent, discriminative features for robust matching.
  • GMCCE confidence evaluation leverages strict geometric consistency, effectively filtering outliers that would otherwise degrade alignment quality.

In aggregate, DIT achieves state-of-the-art performance for robust point cloud registration on both clean and heavily corrupted input, advancing beyond limitations of convolutional, shallow Transformer, and purely feature-based correspondence models (Chen et al., 2021).

References (1)
