
V-DETR: 3D Detection with 3DV-RPE

Updated 1 May 2026
  • The paper introduces a novel 3DV-RPE module that enhances DETR by encoding eight box vertices to instill effective locality in sparse 3D attention.
  • It employs object-normalized box parameterization and a refined decoder pipeline, leading to significant performance gains and improved AP scores on benchmarks.
  • Efficient encoder choices and robust data augmentation methods enable V-DETR to outperform prior models on ScanNetV2 and SUN RGB-D while reducing inference time.

V-DETR is a 3D object detection architecture that extends the DETR (DEtection TRansformer) framework to point cloud data by introducing 3D Vertex Relative Position Encoding (3DV-RPE), object-normalized box parameterization, and an improved pipeline tailored to the unique challenges of sparse 3D domains. Its innovations address the inherent locality requirements and geometric consistency of indoor 3D scenes, achieving state-of-the-art results on benchmarks such as ScanNetV2 and SUN RGB-D (Shen et al., 2023).

1. Architectural Overview

V-DETR operates on point clouds $I \in \mathbb{R}^{N \times 6}$ containing both RGB and XYZ channels, with $N \approx 100\text{K}$ points randomly sampled per scene. The encoder can be realized by either a PointNet + Transformer encoder or a sparse ResNet-34 + FPN backbone, the two options compared in Section 4.

Post-encoding, $M \approx 4\text{K}$ point-wise features $F \in \mathbb{R}^{M \times C}$ are produced. The detector utilizes $K = 1024$ learnable 3D object queries $Q \in \mathbb{R}^{K \times C}$, constructed as $Q = Q_c + Q_p$, where $Q_c$ are content features sampled from the encoder and $Q_p$ are positional queries derived via an MLP from initial query coordinates.

The decoder comprises 8 standard Transformer decoder layers. Each layer refines the prediction of 3D box parameters $b^\ell = (\theta, x, y, z, w, l, h) \in \mathbb{R}^7$ for each query, where $\theta$ denotes orientation and $(x, y, z, w, l, h)$ parametrize the center and size.
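The eight vertices used later by 3DV-RPE follow directly from this 7-parameter box. A minimal sketch is given here, assuming (the text does not state the convention) that the orientation is a yaw angle about the vertical z axis and that w, l, h are full extents along the box's local x, y, z axes; `box_vertices` is a hypothetical helper name.

```python
import math

def box_vertices(box):
    """Return the 8 corner points of a box (theta, x, y, z, w, l, h).

    Assumed convention (not stated in the text): theta is a yaw angle about
    the vertical z axis, and w, l, h are full extents along local x, y, z.
    """
    theta, x, y, z, w, l, h = box
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                lx, ly, lz = sx * w, sy * l, sz * h   # corner in the box frame
                corners.append((x + c * lx - s * ly,  # rotate by yaw, then shift
                                y + s * lx + c * ly,
                                z + lz))
    return corners

# Axis-aligned example: centre (1, 2, 0.5), size 2 x 4 x 1.
verts = box_vertices((0.0, 1.0, 2.0, 0.5, 2.0, 4.0, 1.0))
```

With a different axis convention only the assignment of w and l to the local axes changes; the corner enumeration stays the same.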

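The loss function is a weighted sum of six terms:

$\mathcal{L} = \sum_{i=1}^{6} \lambda_i \, \mathcal{L}_i$

where the terms $\mathcal{L}_i$ include a generalized IoU (GIoU) loss, a focal loss (FL), a Huber loss for angle regression, and a cross-entropy (CE) loss for classification, and $\lambda_i$ are the corresponding weights.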
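To make the weighted-sum structure concrete, the sketch below implements two of the named loss types (Huber and a binary focal loss) in their textbook forms and combines them with weights. The weights, the term names, and the choice of only two terms are illustrative assumptions, not the values or the full set used by V-DETR.

```python
import math

def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one predicted probability p and a 0/1 target."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def weighted_sum(terms, weights):
    """Total loss = sum over terms of weight * value, matched by name."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical per-term values and weights, for illustration only.
terms = {"huber_angle": huber(0.3), "focal_cls": focal_loss(0.9, 1)}
weights = {"huber_angle": 1.0, "focal_cls": 2.0}
total = weighted_sum(terms, weights)
```

The focal term shrinks quickly for confident correct predictions, which is why it is commonly paired with a large number of mostly-negative queries.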

2. 3D Vertex Relative Position Encoding (3DV-RPE)

3DV-RPE is introduced to overcome the failure of DETR-style models to learn locality-appropriate inductive biases in the sparse 3D setting, where queries often attend to distant, irrelevant points.

In each decoder head $h$ at layer $\ell$, cross-attention is modulated as:

$A_h = \mathrm{softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d}} + R_h\right)$

where $Q_h$ and $K_h$ are the projected queries and keys of head $h$ and $d$ is the per-head dimension, with $R_h$ providing relative position bias per attention head. To compute $R_h$ efficiently:

  • For each query $q$, its predicted box $b_q$ defines eight vertices $v_{q,1}, \dots, v_{q,8}$. For every encoder point $p$, the offset $p - v_{q,i}$ is rotated into the box’s canonical frame:

$\Delta_{q,i}(p) = \mathrm{Rot}_q \, (p - v_{q,i})$

where $\mathrm{Rot}_q$ is the rotation matrix that aligns the world axes with the axes of box $b_q$.

  • Each coordinate is passed through a signed-log nonlinearity:

$F(\Delta) = \mathrm{sign}(\Delta) \, \log(1 + |\Delta|)$

  • A separate MLP for each vertex generates $R_{h,i}$ per head $h$, which are summed across vertices:

$R_h[q, p] = \sum_{i=1}^{8} R_{h,i}[q, p], \qquad R_{h,i}[q, p] = \mathrm{MLP}_i\big(F(\Delta_{q,i}(p))\big)_h$

  • To enhance efficiency, a small lookup table is precomputed, and trilinear grid sampling is employed.
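The lookup-table step can be illustrated with a plain trilinear interpolation routine. This is a sketch under stated assumptions: the grid size (10 per axis), the coordinate range, and the affine `slow_bias` stand-in for a per-vertex MLP are all hypothetical choices made for the example, and `build_table` and `trilinear` are not names from the paper.

```python
import math

def build_table(fn, size, lo, hi):
    """Evaluate fn on a regular size^3 grid spanning [lo, hi] on each axis."""
    step = (hi - lo) / (size - 1)
    axis = [lo + i * step for i in range(size)]
    return [[[fn(x, y, z) for z in axis] for y in axis] for x in axis]

def trilinear(table, point, lo, hi):
    """Trilinearly interpolate the table at a point, clamping to the grid."""
    size = len(table)
    idx, frac = [], []
    for coord in point:
        t = (min(max(coord, lo), hi) - lo) / (hi - lo) * (size - 1)
        i = min(int(math.floor(t)), size - 2)   # lower corner of the cell
        idx.append(i)
        frac.append(t - i)
    (i, j, k), (fx, fy, fz) = idx, frac
    value = 0.0
    for di, wx in ((0, 1 - fx), (1, fx)):
        for dj, wy in ((0, 1 - fy), (1, fy)):
            for dk, wz in ((0, 1 - fz), (1, fz)):
                value += wx * wy * wz * table[i + di][j + dj][k + dk]
    return value

# Stand-in for an expensive per-vertex MLP (hypothetical, affine on purpose).
def slow_bias(x, y, z):
    return 0.5 * x - 0.25 * y + 2.0 * z + 1.0

table = build_table(slow_bias, size=10, lo=-3.0, hi=3.0)
approx = trilinear(table, (0.7, -1.3, 2.2), lo=-3.0, hi=3.0)
```

The function is evaluated only on the grid nodes; every (query, point) pair then costs eight table reads instead of an MLP forward pass. Because the stand-in is affine, interpolation reproduces it exactly here; for a learned MLP the table is an approximation whose accuracy depends on the grid resolution.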

This process creates head-wise biases that foster attention concentration around the predicted box.
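The steps above (canonical offsets, signed-log, per-vertex bias, sum over vertices, additive bias inside the softmax) can be sketched end to end for a single head. The `vertex_bias` function is a hand-written stand-in for the learned per-vertex MLP, and the yaw-about-z rotation is an assumed convention; both are illustrative, not the authors' implementation.

```python
import math

def signed_log(x):
    """sign(x) * log(1 + |x|): compresses large offsets, keeps the sign."""
    return math.copysign(math.log1p(abs(x)), x)

def canonical_offset(point, vertex, theta):
    """Offset point - vertex expressed in the box frame (yaw theta about z)."""
    dx, dy, dz = (p - v for p, v in zip(point, vertex))
    c, s = math.cos(theta), math.sin(theta)
    return (c * dx + s * dy, -s * dx + c * dy, dz)   # rotate by -theta

def vertex_bias(offset, scale=-1.0):
    """Hypothetical stand-in for a per-vertex MLP: penalise distant points."""
    return scale * sum(abs(signed_log(o)) for o in offset)

def rpe_bias(point, vertices, theta):
    """Sum the per-vertex biases over all box vertices."""
    return sum(vertex_bias(canonical_offset(point, v, theta)) for v in vertices)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Unit cube centred at the origin, no rotation; one near and one far point.
vertices = [(sx, sy, sz) for sx in (-0.5, 0.5)
            for sy in (-0.5, 0.5) for sz in (-0.5, 0.5)]
points = [(0.0, 0.0, 0.0), (5.0, 5.0, 0.0)]
content_logits = [0.0, 0.0]                      # identical content scores
biases = [rpe_bias(p, vertices, 0.0) for p in points]
weights = softmax([c + b for c, b in zip(content_logits, biases)])
```

Even with identical content scores, the point inside the box receives almost all of the attention weight, which is the locality effect the bias is meant to produce. In the actual model the shape of the bias is learned rather than fixed.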

3. Inductive Bias, Canonicalization, and Locality

3DV-RPE instills a locality inductive bias absent in vanilla DETR, causing cross-attention weights to concentrate on the spatial vicinity of the queried box. The canonical rotation step normalizes all predicted boxes so that, for example, the “top-front” vertex consistently aligns with a fixed direction in the local coordinate system, reducing the model’s burden to handle rotational variation.
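The effect of canonicalization can be checked numerically: if the whole scene (box and point) is rotated about the vertical axis, the offset expressed in the box frame stays the same, while the world-frame offset changes. The sketch assumes a yaw-only orientation; the helper names are hypothetical.

```python
import math

def rotate_z(p, angle):
    """Rotate a 3D point about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])

def to_box_frame(offset, theta):
    """Express a world-frame offset in the frame of a box with yaw theta."""
    return rotate_z(offset, -theta)

def offset_to_vertex(point, vertex, theta):
    """Canonical (box-frame) offset from a box vertex to a point."""
    diff = tuple(a - b for a, b in zip(point, vertex))
    return to_box_frame(diff, theta)

theta, phi = 0.4, 1.1          # box yaw and an arbitrary scene rotation
vertex, point = (1.0, 2.0, 0.5), (1.8, 2.5, 0.2)

# Canonical offsets before and after rotating the whole scene by phi.
before = offset_to_vertex(point, vertex, theta)
after = offset_to_vertex(rotate_z(point, phi), rotate_z(vertex, phi), theta + phi)

# World-frame offsets for comparison: these do change with the scene rotation.
world_before = tuple(a - b for a, b in zip(point, vertex))
world_after = tuple(a - b for a, b in
                    zip(rotate_z(point, phi), rotate_z(vertex, phi)))
```

Because `before` and `after` agree, a bias computed from canonical offsets sees the same input regardless of how the object is oriented in the room.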

Ablation studies confirm the importance of this step: omitting the canonical rotation (i.e., computing the vertex offsets in the world frame) lowers AP on SUN RGB-D. Using all 8 vertices yields higher AP (65.0) than using 2 (63.1) or 4 vertices (63.4). The signed-log nonlinearity outperforms alternatives by approximately 2 AP points.

Simple strategies such as masking points outside the predicted box in attention provide reasonable AP, but underperform full 3DV-RPE.
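For contrast, the hard-masking baseline can be sketched as a point-in-box test followed by setting the logits of outside points to negative infinity. The box convention (yaw about z, full extents w, l, h along local x, y, z) is an assumption of this example, as are the function names.

```python
import math

def inside_box(point, box):
    """True if point lies inside box = (theta, x, y, z, w, l, h).

    Assumes theta is a yaw about z and w, l, h are full extents along the
    box's local x, y, z axes (a convention chosen for this sketch).
    """
    theta, x, y, z, w, l, h = box
    dx, dy, dz = point[0] - x, point[1] - y, point[2] - z
    c, s = math.cos(theta), math.sin(theta)
    lx, ly = c * dx + s * dy, -s * dx + c * dy      # world -> box frame
    return abs(lx) <= w / 2 and abs(ly) <= l / 2 and abs(dz) <= h / 2

def hard_mask_logits(logits, points, box):
    """Set logits of points outside the box to -inf (zero attention weight)."""
    return [v if inside_box(p, box) else -math.inf
            for v, p in zip(logits, points)]

box = (math.pi / 2, 0.0, 0.0, 0.0, 4.0, 1.0, 1.0)   # long axis turned onto y
points = [(0.0, 1.5, 0.0), (1.5, 0.0, 0.0)]
masked = hard_mask_logits([0.3, 0.7], points, box)
```

The mask is all-or-nothing: a point just outside an inaccurate early box estimate is discarded entirely, whereas a learned additive bias can still leave it some weight.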

4. Data Normalization and Pipeline Enhancements

V-DETR employs object-normalized box parameterization rather than scene-level normalization, predicting residuals relative to the previous decoder layer’s box estimate, so that regression targets are expressed relative to the object rather than to the extent of the scene.

This leverages the observation that 3D object sizes remain physically consistent across scenes.
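One way such a parameterization could look is sketched below. The scaling rules are assumptions of this example, not the paper's stated formula: centre offsets are taken as multiples of the previous box extents, size residuals as relative changes, and the angle is left out for brevity.

```python
def apply_residual(prev_box, residual):
    """Refine (x, y, z, w, l, h) using a residual expressed in object units.

    Assumption for this sketch: centre offsets are multiples of the previous
    box extents, and size residuals are relative changes of those extents.
    """
    x, y, z, w, l, h = prev_box
    dx, dy, dz, dw, dl, dh = residual
    return (x + dx * w, y + dy * l, z + dz * h,
            w * (1.0 + dw), l * (1.0 + dl), h * (1.0 + dh))

def encode_residual(prev_box, target_box):
    """Inverse of apply_residual: the regression target for one layer."""
    x, y, z, w, l, h = prev_box
    tx, ty, tz, tw, tl, th = target_box
    return ((tx - x) / w, (ty - y) / l, (tz - z) / h,
            tw / w - 1.0, tl / l - 1.0, th / h - 1.0)

prev = (1.0, 1.0, 0.5, 2.0, 1.0, 1.0)
target = (1.2, 0.9, 0.5, 2.2, 1.0, 0.8)
res = encode_residual(prev, target)
recovered = apply_residual(prev, res)

# The same geometry at 10x scale produces the same residual.
big_res = encode_residual(tuple(10 * v for v in prev),
                          tuple(10 * v for v in target))
```

The last two lines show the property that motivates normalizing by the object: the regression target depends on the box's own size, not on how large the surrounding scene is.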

Convergence is further accelerated by one-to-many matching (DINO style), where each ground-truth box is replicated multiple times in the assignment cost matrix. Standard pipeline refinements include the AdamW optimizer, cosine learning rate annealing, extensive 3D data augmentation, and processing of 100K points per scan. The encoder selection (sparse ResNet-34 + FPN) provides a gain of roughly 5 AP points over PointNet + Transformer.
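The replication idea behind one-to-many matching can be shown on a tiny cost matrix. In this sketch an exhaustive search stands in for the Hungarian algorithm (practical only at toy sizes), and the cost values are made up for illustration.

```python
from itertools import permutations

def replicate_columns(cost, copies):
    """Repeat every ground-truth column `copies` times (one-to-many targets)."""
    return [[c for c in row for _ in range(copies)] for row in cost]

def best_assignment(cost):
    """Brute-force min-cost matching of distinct queries (rows) to columns.

    Exhaustive search stands in for the Hungarian algorithm; only viable
    for tiny examples. Requires at least as many rows as columns.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    best, best_rows = float("inf"), None
    for rows in permutations(range(n_rows), n_cols):
        total = sum(cost[r][c] for c, r in enumerate(rows))
        if total < best:
            best, best_rows = total, rows
    return best_rows          # best_rows[c] = query matched to column c

# Hypothetical costs: 5 queries (rows) x 2 ground-truth boxes (columns).
cost = [[0.1, 0.9],
        [0.2, 0.8],
        [0.9, 0.1],
        [0.7, 0.3],
        [0.6, 0.6]]
one_to_one = best_assignment(cost)
one_to_many = best_assignment(replicate_columns(cost, copies=2))
gt_of_column = [c // 2 for c in range(4)]   # replicated column -> GT index
```

With replication, each ground-truth box supervises two queries instead of one, so more queries receive a positive training signal per step.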

Introducing object-normalized reparameterization adds roughly 4 AP points. Voxel expansion, as used in FCAF3D, is detrimental for DETR: it lowers AP by about 3 points, suggesting DETR variants perform best on raw surface voxels.

5. Quantitative Performance

V-DETR demonstrates substantial improvements over prior DETR-style and 3D detection architectures, as shown in the following results:

Dataset / metric    Baseline (3DETR)    V-DETR (no TTA)    V-DETR (axis-flip TTA)    Competing SOTA
ScanNetV2 AP25      65.0                77.4               77.8                      CAGroup3D: 75.1
ScanNetV2 AP50      47.0                65.0               66.0                      CAGroup3D: 61.3
SUN RGB-D AP25      59.1                -                  68.0                      -
SUN RGB-D AP50      32.7                -                  51.1                      -

Efficiency metrics (ScanNetV2, batch=1, V100):

Method                  Scenes/sec    ms/scene    GPU memory (MB)
CAGroup3D               2.1           480         1,138
V-DETR (K=1024)         4.2           240         642
Light V-DETR (K=256)    7.7           130         489
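The efficiency figures can be cross-checked with simple arithmetic: at batch size 1, latency in milliseconds should be close to 1000 divided by throughput, and ratios against CAGroup3D give the relative speedup and memory footprint. The numbers below are copied from the table; the variable names are illustrative.

```python
# Rows copied from the efficiency table: (scenes/sec, ms/scene, GPU MB).
table = {
    "CAGroup3D": (2.1, 480, 1138),
    "V-DETR (K=1024)": (4.2, 240, 642),
    "Light V-DETR (K=256)": (7.7, 130, 489),
}

def latency_ms(scenes_per_sec):
    """Per-scene latency implied by a throughput figure at batch size 1."""
    return 1000.0 / scenes_per_sec

base_rate, _, base_mem = table["CAGroup3D"]
summary = {}
for name, (rate, ms, mem) in table.items():
    summary[name] = {
        "implied_ms": latency_ms(rate),     # should be close to the ms column
        "speedup": rate / base_rate,        # throughput relative to CAGroup3D
        "memory_ratio": mem / base_mem,     # GPU memory relative to CAGroup3D
    }
```

The reported latencies agree with the throughput column to within rounding, and the full V-DETR configuration runs at twice the throughput of CAGroup3D while using a little over half the GPU memory.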

V-DETR outperforms CAGroup3D in both accuracy and efficiency, establishing new records on ScanNetV2 and SUN RGB-D. Increasing the number of queries ($K = 1024$), utilizing 100K input points, and replicating each ground-truth box 4 times yield the best results.

6. Ablation Findings and Implementation Notes

Experiments highlight that:

  • Signed-log nonlinearity in 3DV-RPE outperforms the alternatives tested, including tanh.
  • Full 8-vertex encoding for relative position bias is optimal.
  • 3DV-RPE modulation is superior to hard-masking-based attention localization.
  • Sparse-ResNet-34 + FPN provides clear gains over simpler encoders.
  • Voxel expansion degrades DETR performance in this context.
  • Pipeline settings (100K points, $K = 1024$ queries, replication factor 4) maximize performance.

A plausible implication is that targeted relative position encodings and geometric normalization are as critical as model or data scaling for efficient 3D DETR-style detectors.

7. Significance and Contributions

V-DETR integrates a 3DV-RPE module with DETR, inducing effective locality for Transformer cross-attention in sparse 3D data and compensating for the limited dataset scale typical in 3D object detection. The framework, enhanced with object-normalized parameterization and an optimized backbone, achieves state-of-the-art accuracy and inference speed, surpassing both fully convolutional and voting-based 3D detectors on major indoor scene benchmarks (Shen et al., 2023). This suggests a generalizable approach for locality-aware attention in 3D detection with Transformers.
