V-DETR: 3D Detection with 3DV-RPE

Updated 1 May 2026

The paper introduces a novel 3DV-RPE module that enhances DETR by encoding eight box vertices to instill effective locality in sparse 3D attention.
It employs object-normalized box parameterization and a refined decoder pipeline, leading to significant performance gains and improved AP scores on benchmarks.
Efficient encoder choices and robust data augmentation methods enable V-DETR to outperform prior models on ScanNetV2 and SUN RGB-D while reducing inference time.

V-DETR is a 3D object detection architecture that extends the DETR (DEtection TRansformer) framework to point cloud data by introducing 3D Vertex Relative Position Encoding (3DV-RPE), object-normalized box parameterization, and an improved pipeline tailored to the unique challenges of sparse 3D domains. Its innovations address the inherent locality requirements and geometric consistency of indoor 3D scenes, achieving state-of-the-art results on benchmarks such as ScanNetV2 and SUN RGB-D (Shen et al., 2023).

1. Architectural Overview

V-DETR operates on point clouds $I \in \mathbb{R}^{N \times 6}$ containing both RGB and XYZ channels, with $N \approx 100\text{K}$ points randomly sampled per scene. The encoder can be realized by either:

PointNet plus a shallow Transformer encoder (as in 3DETR), or
Sparse-ResNet-34 with a Feature Pyramid Network (FPN), where the generative sparse decoder is replaced by simple transposed convolutions.

Post-encoding, $M \approx 4\text{K}$ point-wise features $F \in \mathbb{R}^{M \times C}$ are produced. The detector utilizes $K=1024$ learnable 3D object queries $Q \in \mathbb{R}^{K \times C}$, constructed as $Q = Q_c + Q_p$, where $Q_c$ are content features sampled from the encoder and $Q_p$ are positional queries derived via an MLP from initial query coordinates.

The decoder comprises 8 standard Transformer decoder layers. Each layer refines the prediction of 3D box parameters $b^\ell = (\theta, x, y, z, w, l, h) \in \mathbb{R}^7$ for each query, where $\theta$ denotes orientation and $(x, y, z, w, l, h)$ parametrize the center and size.

The loss function is a weighted sum of six terms:

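$$\mathcal{L} = \sum_{i=1}^{6} \lambda_i \, \mathcal{L}_i$$

where the $\lambda_i$ are scalar weights and the individual terms $\mathcal{L}_i$ draw on the generalized IoU (GIoU) loss, focal loss (FL), a Huber loss for angle regression, and cross-entropy (CE) for classification.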
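A weighted sum of loss terms of this kind can be sketched in a few lines. The example below is only an illustration in plain Python: it implements a binary focal loss and a Huber loss for single scalar predictions and combines them with made-up weights. The function names, the default `alpha`, `gamma`, and `delta` values, and the weights are assumptions for illustration, not the settings used by V-DETR, and only two of the six terms are shown.

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p of the positive class."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

def huber_loss(pred, target, delta=1.0):
    """Quadratic near zero, linear for errors larger than delta."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)

def weighted_sum(terms, weights):
    """Combine named loss terms with their scalar weights."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical values: one classification score and one angle prediction.
terms = {"focal": focal_loss(0.8, 1), "huber_angle": huber_loss(0.3, 0.1)}
weights = {"focal": 1.0, "huber_angle": 0.5}  # illustrative weights only
total = weighted_sum(terms, weights)
print(f"total loss = {total:.5f}")
```

The focal term shrinks as the prediction for the correct class becomes more confident, which is why it is commonly preferred over plain cross-entropy when most queries match no object.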

2. 3D Vertex Relative Position Encoding (3DV-RPE)

3DV-RPE is introduced to overcome the failure of DETR-style models to learn locality-appropriate inductive biases in the sparse 3D setting, where queries often attend to distant, irrelevant points.

In each decoder head at layer $\ell$, cross-attention is modulated as:

$$\mathrm{softmax}\left(Q \, \mathcal{K}^\top + R\right) \mathcal{V}$$

where $\mathcal{K}$ and $\mathcal{V}$ are the keys and values computed from the encoder features $F$, with $R$ providing relative position bias per attention head. To compute $R$ efficiently:

For each query $q$, its predicted box $b$ defines eight vertices $\{v_i\}_{i=1}^{8}$. For every encoder point $p$, the offset $p - v_i$ is rotated into the box’s canonical frame:

$$\Delta P_i = R_b \, (p - v_i)$$

where $R_b$ is the rotation matrix aligning box axes.
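The vertex and offset computation can be sketched as follows. This is a minimal illustration in plain Python, assuming that the orientation is a heading angle about the vertical (z) axis and that boxes are given in the order (orientation, center, size); the function names `box_vertices` and `canonical_offsets` are hypothetical.

```python
import math
from itertools import product

def box_vertices(theta, cx, cy, cz, w, l, h):
    """Eight corners of a box rotated by theta about the z axis (assumed up axis)."""
    c, s = math.cos(theta), math.sin(theta)
    verts = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        dx, dy, dz = sx * w, sy * l, sz * h  # corner in the box frame
        verts.append((cx + c * dx - s * dy, cy + s * dx + c * dy, cz + dz))
    return verts

def canonical_offsets(point, box):
    """Offsets from each vertex to `point`, rotated by -theta into the box frame."""
    theta = box[0]
    c, s = math.cos(theta), math.sin(theta)
    offsets = []
    for vx, vy, vz in box_vertices(*box):
        ox, oy, oz = point[0] - vx, point[1] - vy, point[2] - vz
        offsets.append((c * ox + s * oy, -s * ox + c * oy, oz))
    return offsets

# The same box at two headings: offsets from the box center to the vertices
# are identical in the canonical frame, regardless of the heading.
center = (1.0, 2.0, 3.0)
box_a = (0.0, 1.0, 2.0, 3.0, 2.0, 4.0, 6.0)
box_b = (0.7, 1.0, 2.0, 3.0, 2.0, 4.0, 6.0)
offsets_a = canonical_offsets(center, box_a)
offsets_b = canonical_offsets(center, box_b)
```

Because the offsets are expressed in the box frame, a point at the same place relative to the box produces the same eight offsets whatever the box heading is.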

Each coordinate is passed through a signed-log nonlinearity:

$$F(x) = \mathrm{sign}(x) \, \log(1 + |x|)$$
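A signed-log function of the usual form, sign(x) times log(1 + |x|), can be written with the Python standard library as below. This is a small sketch, and the function name `signed_log` is an assumption for illustration.

```python
import math

def signed_log(x):
    """sign(x) * log(1 + |x|): odd, monotonic, and compressive for large |x|."""
    return math.copysign(math.log1p(abs(x)), x)

# Large offsets are compressed while the sign and ordering are preserved.
samples = [-100.0, -1.0, 0.0, 1.0, 100.0]
encoded = [signed_log(x) for x in samples]
```

The compression keeps nearby offsets well resolved while preventing distant points from producing very large inputs to the MLP.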

A separate MLP for each vertex generates a bias $P_i$ for every attention head, and the biases are summed across vertices:

$$R = \sum_{i=1}^{8} P_i, \qquad P_i = \mathrm{MLP}_i\big(F(\Delta P_i)\big)$$
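The per-vertex summation can be sketched for a single query and point pair. The example below is a toy illustration in plain Python: the two-layer MLPs use random weights as stand-ins for learned parameters, and the hidden width, number of heads, and all function names are assumptions.

```python
import math
import random

def signed_log(x):
    return math.copysign(math.log1p(abs(x)), x)

def make_mlp(rng, hidden, heads):
    """Random weights for a 3 -> hidden -> heads MLP (stand-in for learned weights)."""
    w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(heads)]
    return w1, w2

def mlp_forward(params, vec):
    w1, w2 = params
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]  # ReLU
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def vertex_bias(mlps, offsets):
    """Sum the per-vertex MLP outputs; returns one bias value per head."""
    heads = len(mlps[0][1])
    total = [0.0] * heads
    for params, offset in zip(mlps, offsets):
        encoded = [signed_log(c) for c in offset]
        for head, value in enumerate(mlp_forward(params, encoded)):
            total[head] += value
    return total

rng = random.Random(0)
mlps = [make_mlp(rng, hidden=4, heads=2) for _ in range(8)]  # one MLP per vertex
offsets = [tuple(rng.uniform(-2, 2) for _ in range(3)) for _ in range(8)]
bias = vertex_bias(mlps, offsets)
```

In a full model the same computation would be carried out for every query and encoder point pair, giving one bias matrix per head.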

To enhance efficiency, a small lookup table is precomputed, and trilinear grid sampling is employed.
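The lookup-and-interpolate idea can be sketched in plain Python. The example below precomputes a function on a small regular 3D grid and reads it back at arbitrary points with trilinear interpolation. The grid size, the coordinate range, the stand-in function `toy_bias`, and the function names are all assumptions for illustration; in the model, the table entries would come from the vertex MLPs.

```python
def build_table(fn, size, lo, hi):
    """Evaluate fn on a size^3 grid spanning [lo, hi] along each axis."""
    step = (hi - lo) / (size - 1)
    coords = [lo + i * step for i in range(size)]
    return [[[fn(x, y, z) for z in coords] for y in coords] for x in coords]

def trilinear_sample(table, point, lo, hi):
    """Interpolate the table at a continuous 3D point (clamped to the grid)."""
    size = len(table)
    base, frac = [], []
    for c in point:
        t = (c - lo) / (hi - lo) * (size - 1)  # continuous grid index
        t = min(max(t, 0.0), size - 1.0)       # clamp to the table range
        b = min(int(t), size - 2)              # lower corner of the cell
        base.append(b)
        frac.append(t - b)
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((frac[0] if dx else 1.0 - frac[0])
                          * (frac[1] if dy else 1.0 - frac[1])
                          * (frac[2] if dz else 1.0 - frac[2]))
                value += weight * table[base[0] + dx][base[1] + dy][base[2] + dz]
    return value

def toy_bias(x, y, z):
    """Hypothetical stand-in for an expensive per-point bias computation."""
    return x + 2.0 * y - z

table = build_table(toy_bias, size=5, lo=-2.0, hi=2.0)
approx = trilinear_sample(table, (0.3, -1.2, 1.7), lo=-2.0, hi=2.0)
```

The table costs a fixed number of evaluations regardless of how many query and point pairs are later sampled, which is where the saving comes from.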

This process creates head-wise biases that foster attention concentration around the predicted box.
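The effect of an additive bias on attention weights can be seen in a toy, single-head example. The sketch below uses plain Python; the function names, the scaling by the square root of the feature dimension, and the bias values are assumptions chosen for illustration rather than details of the V-DETR implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(query, keys, bias):
    """Attention weights of one query over all keys, with an additive bias per key."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale + b
              for key, b in zip(keys, bias)]
    return softmax(logits)

query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # identical content features
no_bias = biased_attention_weights(query, keys, [0.0, 0.0, 0.0])
with_bias = biased_attention_weights(query, keys, [2.0, 0.0, -2.0])
```

With identical keys the unbiased weights are uniform, while the biased weights favor the first point: the content features alone cannot express locality here, but the position bias can.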

3. Inductive Bias, Canonicalization, and Locality

3DV-RPE instills a locality inductive bias absent in vanilla DETR, causing cross-attention weights to concentrate on the spatial vicinity of the queried box. The canonical rotation step normalizes all predicted boxes so that, for example, the “top-front” vertex consistently aligns with a fixed direction in the local coordinate system, reducing the model’s burden to handle rotational variation.

Ablation studies confirm the importance of this step: omitting the canonical rotation (i.e., computing the vertex offsets $\Delta P_i$ in the world frame) leads to a lower AP on SUN RGB-D. Using all 8 vertices yields higher AP (65.0) than using 2 (63.1) or 4 vertices (63.4). The signed-log nonlinearity outperforms alternatives by approximately 2 AP points.

Simple strategies such as masking points outside the predicted box in attention provide reasonable AP, but underperform full 3DV-RPE.
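For comparison, the hard-masking alternative can be sketched as follows. This is a hypothetical illustration in plain Python, assuming a heading angle about the z axis and boxes given as (orientation, center, size); the function names are not from the paper.

```python
import math

def inside_box(point, box):
    """True if `point` lies inside the box (theta, cx, cy, cz, w, l, h)."""
    theta, cx, cy, cz, w, l, h = box
    c, s = math.cos(theta), math.sin(theta)
    ox, oy, oz = point[0] - cx, point[1] - cy, point[2] - cz
    u, v = c * ox + s * oy, -s * ox + c * oy  # rotate into the box frame
    return abs(u) <= w / 2 and abs(v) <= l / 2 and abs(oz) <= h / 2

def hard_mask_logits(logits, points, box):
    """Set attention logits of points outside the box to -inf."""
    return [x if inside_box(p, box) else float("-inf")
            for x, p in zip(logits, points)]

# A long box turned by 90 degrees: its long side now runs along the world y axis.
box = (math.pi / 2, 0.0, 0.0, 0.0, 4.0, 1.0, 1.0)
points = [(0.0, 1.5, 0.0), (1.5, 0.0, 0.0)]
masked = hard_mask_logits([1.0, 1.0], points, box)
```

A mask of this kind is all-or-nothing: an inaccurate box estimate in an early decoder layer cuts off points that a soft, learned bias could still attend to.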

4. Data Normalization and Pipeline Enhancements

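V-DETR employs object-normalized box parameterization rather than scene-level normalization, predicting residuals relative to the previous decoder layer’s box estimate. Schematically, one form consistent with this description scales center offsets by the previous box size and expresses sizes as log-ratios:

$$\Delta x = \frac{x - x^{\ell-1}}{w^{\ell-1}}, \quad \Delta y = \frac{y - y^{\ell-1}}{l^{\ell-1}}, \quad \Delta z = \frac{z - z^{\ell-1}}{h^{\ell-1}}, \quad \Delta w = \log\frac{w}{w^{\ell-1}}, \quad \Delta l = \log\frac{l}{l^{\ell-1}}, \quad \Delta h = \log\frac{h}{h^{\ell-1}}$$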

This leverages the observation that 3D object sizes remain physically consistent across scenes.
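One way to realize residuals normalized by object size is sketched below. The exact form used by V-DETR is not reproduced here: the example assumes center offsets divided by the previous box size and sizes encoded as log-ratios, omits the orientation, and uses hypothetical function names.

```python
import math

def encode_residual(target, prev):
    """Residual of `target` relative to `prev`; both are (x, y, z, w, l, h)."""
    tx, ty, tz, tw, tl, th = target
    px, py, pz, pw, pl, ph = prev
    return ((tx - px) / pw, (ty - py) / pl, (tz - pz) / ph,
            math.log(tw / pw), math.log(tl / pl), math.log(th / ph))

def decode_residual(delta, prev):
    """Invert encode_residual: apply a predicted residual to the previous box."""
    dx, dy, dz, dw, dl, dh = delta
    px, py, pz, pw, pl, ph = prev
    return (px + dx * pw, py + dy * pl, pz + dz * ph,
            pw * math.exp(dw), pl * math.exp(dl), ph * math.exp(dh))

prev = (1.0, 2.0, 0.5, 2.0, 1.0, 1.0)     # previous layer's estimate
target = (1.2, 2.1, 0.6, 2.2, 0.9, 1.1)   # box to be regressed
delta = encode_residual(target, prev)
recovered = decode_residual(delta, prev)

# The residual is unchanged when the whole scene is rescaled.
scaled_delta = encode_residual(tuple(2 * v for v in target),
                               tuple(2 * v for v in prev))
```

Under this assumed form, the regression target depends on the error relative to the object rather than on the extent of the scene, so a given residual means the same thing in a small room and a large hall.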

Convergence is further accelerated by one-to-many matching (DINO style), where each ground-truth box is replicated multiple times in the assignment cost matrix. Standard pipeline refinements include the AdamW optimizer, cosine learning rate annealing, extensive 3D data augmentation, and processing of 100K points per scan. The encoder selection (sparse ResNet-34 + FPN) provides a gain of about 5 AP points over PointNet + Transformer.
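Replicating ground-truth boxes in the assignment can be illustrated on a tiny cost matrix. The sketch below substitutes a brute-force search over permutations for the Hungarian solver that would be used in practice, so it is only suitable for a handful of queries; the function name and the cost values are made up for illustration.

```python
from itertools import permutations

def one_to_many_match(cost, repeats):
    """cost[q][g] is the matching cost between query q and ground-truth box g.

    Each ground-truth box is replicated `repeats` times, and the cheapest
    assignment of distinct queries to the replicated targets is returned
    as a {query_index: gt_index} mapping.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    gt_ids = [g for g in range(num_gt) for _ in range(repeats)]
    best, best_cost = None, float("inf")
    for queries in permutations(range(num_queries), len(gt_ids)):
        total = sum(cost[q][g] for q, g in zip(queries, gt_ids))
        if total < best_cost:
            best, best_cost = queries, total
    return {q: g for q, g in zip(best, gt_ids)}

# Five queries, two ground-truth boxes (hypothetical costs).
cost = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3], [0.5, 0.5]]
one_to_one = one_to_many_match(cost, repeats=1)
one_to_many = one_to_many_match(cost, repeats=2)
```

With replication, each object supervises several queries per step instead of one, which is the mechanism by which such schemes provide denser training signal.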

Introducing object-normalized reparameterization adds about 4 AP points. Voxel expansion, as used in FCAF3D, is detrimental for DETR: it lowers AP by about 3 points, suggesting DETR variants perform best on raw surface voxels.

5. Quantitative Performance

V-DETR demonstrates substantial improvements over prior DETR-style and 3D detection architectures, as shown in the following results:

Dataset	Baseline (3DETR)	V-DETR (no TTA)	V-DETR (axis-flip TTA)	Competing SOTA
ScanNetV2 AP$_{25}$	65.0	77.4	77.8	CAGroup3D: 75.1
ScanNetV2 AP$_{50}$	47.0	65.0	66.0	CAGroup3D: 61.3
SUN RGB-D AP$_{25}$	59.1	-	68.0	-
SUN RGB-D AP$_{50}$	32.7	-	51.1	-

Efficiency metrics (ScanNetV2, batch=1, V100):

Method	Scenes/sec	ms/scene	GPU MB
CAGroup3D	2.1	480	1,138
V-DETR (K=1024)	4.2	240	642
Light V-DETR (K=256)	7.7	130	489
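The throughput and latency columns are two views of the same measurement, which can be checked with a few lines of arithmetic. The snippet below copies the values from the table and derives latency, speedup, and memory ratio; the variable and function names are illustrative.

```python
# Values copied from the efficiency table: (scenes/sec, ms/scene, GPU MB).
rows = {
    "CAGroup3D": (2.1, 480, 1138),
    "V-DETR (K=1024)": (4.2, 240, 642),
    "Light V-DETR (K=256)": (7.7, 130, 489),
}

def ms_per_scene(scenes_per_sec):
    """Latency in milliseconds implied by a throughput in scenes per second."""
    return 1000.0 / scenes_per_sec

for name, (sps, reported_ms, _) in rows.items():
    print(f"{name}: {ms_per_scene(sps):.0f} ms derived vs {reported_ms} ms reported")

speedup = rows["V-DETR (K=1024)"][0] / rows["CAGroup3D"][0]
memory_ratio = rows["V-DETR (K=1024)"][2] / rows["CAGroup3D"][2]
print(f"speedup over CAGroup3D: {speedup:.1f}x, memory ratio: {memory_ratio:.2f}")
```

The reported latencies agree with the throughputs to within rounding, and the full model runs about twice as fast as CAGroup3D at a little over half the memory.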

V-DETR outperforms CAGroup3D in both accuracy and efficiency, establishing new records on ScanNetV2 and SUN RGB-D. Increasing the number of queries ($K=1024$), utilizing 100K input points, and replicating each ground-truth box 4 times yields the best results.

6. Ablation Findings and Implementation Notes

Experiments highlight that:

Signed-log nonlinearity in 3DV-RPE outperforms tanh and the other alternatives tested.
Full 8-vertex encoding for relative position bias is optimal.
3DV-RPE modulation is superior to hard-masking-based attention localization.
Sparse-ResNet-34 + FPN provides clear gains over simpler encoders.
Voxel expansion degrades DETR performance in this context.
Pipeline settings (100K points, $K=1024$ queries, replication factor 4) maximize performance.

A plausible implication is that targeted relative position encodings and geometric normalization are as critical as model or data scaling for efficient 3D DETR-style detectors.

7. Significance and Contributions

V-DETR integrates a 3DV-RPE module with DETR, inducing effective locality for Transformer cross-attention in sparse 3D data and compensating for the limited dataset scale typical in 3D object detection. The framework, enhanced with object-normalized parameterization and an optimized backbone, achieves state-of-the-art accuracy and inference speed, surpassing both fully convolutional and voting-based 3D detectors on major indoor scene benchmarks (Shen et al., 2023). This suggests a generalizable approach for locality-aware attention in 3D detection with Transformers.

References (1)

V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection (2023)
