V-DETR: 3D Detection with 3DV-RPE
- The paper introduces a novel 3DV-RPE module that enhances DETR by encoding eight box vertices to instill effective locality in sparse 3D attention.
- It employs object-normalized box parameterization and a refined decoder pipeline, leading to significant performance gains and improved AP scores on benchmarks.
- Efficient encoder choices and robust data augmentation methods enable V-DETR to outperform prior models on ScanNetV2 and SUN RGB-D while reducing inference time.
V-DETR is a 3D object detection architecture that extends the DETR (DEtection TRansformer) framework to point cloud data by introducing 3D Vertex Relative Position Encoding (3DV-RPE), object-normalized box parameterization, and an improved pipeline tailored to the unique challenges of sparse 3D domains. Its innovations address the inherent locality requirements and geometric consistency of indoor 3D scenes, achieving state-of-the-art results on benchmarks such as ScanNetV2 and SUN RGB-D (Shen et al., 2023).
1. Architectural Overview
V-DETR operates on point clouds containing both RGB and XYZ channels, with points randomly sampled per scene. The encoder can be realized by either:
- PointNet plus a shallow Transformer encoder (as in 3DETR), or
- Sparse-ResNet-34 with a Feature Pyramid Network (FPN), where the generative sparse decoder is replaced by simple transposed convolutions.
After encoding, the network holds one feature vector per point. The detector uses learnable 3D object queries, each constructed from a content feature sampled from the encoder output and a positional query derived via an MLP from the query's initial coordinates.
The decoder comprises 8 standard Transformer decoder layers. Each layer refines the predicted 3D box parameters for each query: an orientation angle, together with the parameters that describe the box center and size.
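The loss function is a weighted sum of six terms:
$$\mathcal{L} = \sum_{i=1}^{6} \lambda_i \, \mathcal{L}_i$$
where the $\lambda_i$ are scalar weights and the terms $\mathcal{L}_i$ include a generalized IoU (GIoU) loss, a focal loss (FL), a Huber loss for angle regression, and a cross-entropy (CE) loss for classification.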
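As a rough illustration of how such a weighted sum is assembled, the sketch below implements scalar Huber and binary focal losses and combines named terms with weights. The term names, weight values, and the `gamma`, `alpha`, and `delta` defaults are illustrative assumptions and are not taken from the paper.

```python
import math

def huber(residual, delta=1.0):
    """Huber loss for a scalar residual: quadratic near zero, linear in the tails."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one predicted probability p and a 0/1 target."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def weighted_total(terms, weights):
    """Weighted sum over named loss terms; both dicts must share the same keys."""
    if terms.keys() != weights.keys():
        raise ValueError("loss terms and weights must have matching names")
    return sum(weights[name] * terms[name] for name in terms)

# Hypothetical per-term values and weights, for illustration only.
terms = {"giou": 0.4, "focal": focal_loss(0.9, 1), "angle_huber": huber(0.5)}
weights = {"giou": 2.0, "focal": 1.0, "angle_huber": 0.5}
total = weighted_total(terms, weights)
```

The focal term shrinks quickly for confident correct predictions, which is why it is commonly preferred over plain cross-entropy when most queries match no object.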
2. 3D Vertex Relative Position Encoding (3DV-RPE)
3DV-RPE is introduced to overcome the failure of DETR-style models to learn locality-appropriate inductive biases in the sparse 3D setting, where queries often attend to distant, irrelevant points.
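In each decoder layer, the cross-attention of every head is modulated as:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + R\right) V$$
with $R$ providing a relative position bias per attention head. To compute $R$ efficiently:
- For each query $q$, its predicted box $b_q$ defines eight vertices $v_1, \dots, v_8$. For every encoder point $p$, the offset $p - v_i$ is rotated into the box's canonical frame:
$$\Delta_i = M_q \, (p - v_i)$$
where $M_q$ is the rotation matrix that aligns the box axes with the coordinate axes.
- Each coordinate is passed through a signed-log nonlinearity:
$$F(x) = \operatorname{sign}(x)\,\log(1 + |x|)$$
- A separate MLP for each vertex generates a bias $R_i^{(h)}$ per head $h$, and these are summed across vertices:
$$R^{(h)} = \sum_{i=1}^{8} R_i^{(h)}, \qquad R_i = \mathrm{MLP}_i\big(F(\Delta_i)\big)$$
- To enhance efficiency, a small lookup table is precomputed, and trilinear grid sampling is employed.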
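The steps above can be sketched in NumPy for a single query. This is a minimal sketch under several assumptions that the text does not confirm: the box heading is a rotation about the z-axis, the per-vertex MLPs have two layers with ReLU, the logits use the usual $1/\sqrt{d}$ scaling, and the weights are random and untrained. All function names are hypothetical.

```python
import numpy as np

def rot_z(theta):
    """Rotation about the z-axis (assumed heading axis) by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def box_vertices(center, size, heading):
    """Eight corners of an oriented box, shape (8, 3), in world coordinates."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)], dtype=float)
    local = 0.5 * signs * size                  # corners in the box frame
    return local @ rot_z(heading).T + center    # rotate, then translate

def signed_log(x):
    """sign(x) * log(1 + |x|), applied elementwise."""
    return np.sign(x) * np.log1p(np.abs(x))

def canonical_offsets(points, center, size, heading):
    """Point-minus-vertex offsets expressed in the box frame, shape (8, N, 3)."""
    verts = box_vertices(center, size, heading)
    world = points[None, :, :] - verts[:, None, :]
    return world @ rot_z(heading)               # row-vector form of R^T @ offset

def init_vertex_mlps(rng, hidden, num_heads):
    """Random, untrained two-layer MLP weights, one set per vertex."""
    return [(rng.normal(size=(3, hidden)), np.zeros(hidden),
             rng.normal(size=(hidden, num_heads)), np.zeros(num_heads))
            for _ in range(8)]

def vertex_rpe_bias(points, center, size, heading, mlps):
    """Sum the per-vertex MLP outputs into a bias of shape (num_heads, N)."""
    feats = signed_log(canonical_offsets(points, center, size, heading))
    bias = 0.0
    for (w1, b1, w2, b2), f in zip(mlps, feats):
        hidden = np.maximum(f @ w1 + b1, 0.0)   # ReLU
        bias = bias + (hidden @ w2 + b2).T
    return bias

def biased_attention(q, k, v, bias):
    """Cross-attention for one query. q: (H, d); k, v: (H, N, d); bias: (H, N)."""
    logits = np.einsum("hd,hnd->hn", q, k) / np.sqrt(q.shape[-1]) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("hn,hnd->hd", weights, v), weights
```

Because the offsets are expressed in the box frame, rotating the whole scene together with the box leaves `canonical_offsets` unchanged, which is the property the canonicalization step is meant to provide.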
This process creates head-wise biases that foster attention concentration around the predicted box.
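The lookup-table shortcut can be illustrated as follows: instead of running an MLP on every point, the function is evaluated once on a coarse regular grid and then interpolated trilinearly at the actual offsets. This is a sketch only; the grid size, the coordinate extent, and the helper names are placeholders chosen for the example, not values from the paper.

```python
import numpy as np

def build_table(fn, grid_size, extent):
    """Evaluate fn on a regular grid over [-extent, extent]^3. Returns (G, G, G, C)."""
    axis = np.linspace(-extent, extent, grid_size)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    coords = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    return fn(coords).reshape(grid_size, grid_size, grid_size, -1)

def trilinear_sample(table, xyz, extent):
    """Interpolate table values at continuous coordinates xyz of shape (N, 3)."""
    g = table.shape[0]
    # Map [-extent, extent] to continuous grid indices in [0, g - 1], clamping outliers.
    idx = np.clip((xyz + extent) / (2.0 * extent) * (g - 1), 0.0, g - 1)
    lo = np.minimum(np.floor(idx).astype(int), g - 2)   # lower corner of the cell
    frac = idx - lo                                      # position inside the cell
    out = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wx = frac[:, 0] if dx else 1.0 - frac[:, 0]
                wy = frac[:, 1] if dy else 1.0 - frac[:, 1]
                wz = frac[:, 2] if dz else 1.0 - frac[:, 2]
                corner = table[lo[:, 0] + dx, lo[:, 1] + dy, lo[:, 2] + dz]
                out = out + (wx * wy * wz)[:, None] * corner
    return out
```

In a deep learning framework the same lookup would typically be delegated to a built-in grid sampling operator; the explicit loop over the eight cell corners here is only meant to make the interpolation weights visible.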
3. Inductive Bias, Canonicalization, and Locality
3DV-RPE instills a locality inductive bias absent in vanilla DETR, causing cross-attention weights to concentrate on the spatial vicinity of the queried box. The canonical rotation step normalizes all predicted boxes so that, for example, the “top-front” vertex consistently aligns with a fixed direction in the local coordinate system, reducing the model’s burden to handle rotational variation.
Ablation studies confirm the importance of this step: omitting the canonical rotation (i.e., computing the vertex offsets in the world frame) lowers AP on SUN RGB-D. Using all 8 vertices yields higher AP (65.0) than using 2 (63.1) or 4 vertices (63.4). The signed-log nonlinearity outperforms alternatives by approximately 2 AP points.
Simple strategies such as masking points outside the predicted box in attention provide reasonable AP but underperform full 3DV-RPE.
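For contrast, the hard-masking baseline mentioned above can be sketched as a point-in-box test followed by a masked softmax. This is an illustrative reading of "masking points outside the predicted box", assuming a z-axis heading; the function names are hypothetical.

```python
import numpy as np

def inside_box(points, center, size, heading):
    """Boolean mask (N,) of points that fall inside an oriented box."""
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = (points - center) @ rot          # row-vector form of rot.T @ offset
    return np.all(np.abs(local) <= 0.5 * size, axis=-1)

def masked_softmax(logits, mask):
    """Softmax over logits where masked-out entries receive exactly zero weight."""
    if not mask.any():
        raise ValueError("mask must keep at least one point")
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max()           # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum()
```

Unlike the additive bias of 3DV-RPE, this mask is all-or-nothing: points just outside a slightly misplaced box contribute nothing, and the mask itself provides no gradient signal about where the box should move.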
4. Data Normalization and Pipeline Enhancements
V-DETR employs object-normalized box parameterization rather than scene-level normalization: each decoder layer predicts box residuals relative to the previous layer's box estimate, scaled by that box's size instead of the scene extent.
This leverages the observation that 3D object sizes remain physically consistent across scenes.
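One way to make this concrete is the encode/decode pair below. It is a sketch under assumptions: center residuals are divided by the previous box size, size residuals are log-ratios (a common choice in box regression), and orientation is left out. The exact form used in the paper may differ.

```python
import numpy as np

def encode_residual(target_box, prev_box):
    """Regression targets for (center, size) relative to the previous layer's box."""
    (center, size), (prev_center, prev_size) = target_box, prev_box
    return (center - prev_center) / prev_size, np.log(size / prev_size)

def decode_residual(delta_center, delta_size, prev_box):
    """Invert encode_residual: recover (center, size) from predicted residuals."""
    prev_center, prev_size = prev_box
    return prev_center + delta_center * prev_size, prev_size * np.exp(delta_size)

# Hypothetical boxes as (center, size) pairs in metres.
prev = (np.array([1.0, 2.0, 0.5]), np.array([0.8, 1.6, 1.0]))
target = (np.array([1.1, 1.9, 0.6]), np.array([1.0, 1.5, 0.9]))
delta_center, delta_size = encode_residual(target, prev)
```

Under this form the targets depend only on the object's own scale: a 10 cm center error on a 1 m wide object yields the same residual whether the scene is a small office or a large hall.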
Convergence is further accelerated by one-to-many matching (DINO style), where each ground-truth box is replicated multiple times in the assignment cost matrix. Standard pipeline refinements include the AdamW optimizer, cosine learning rate annealing, extensive 3D data augmentation, and processing of 100K points per scan. The encoder selection (sparse ResNet-34 + FPN) provides a clear AP gain over PointNet + Transformer.
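The replication trick in one-to-many matching can be sketched in plain Python. Each ground-truth column of the query-by-box cost matrix is repeated, and a one-to-one assignment is solved on the widened matrix, so every box ends up with several matched queries. The brute-force solver below is a stand-in that only works for tiny inputs; in practice a Hungarian solver such as `scipy.optimize.linear_sum_assignment` would be used. The function names are hypothetical.

```python
from itertools import permutations

def replicate_columns(cost, repeats):
    """Repeat each ground-truth column `repeats` times, keeping copies adjacent."""
    return [[c for c in row for _ in range(repeats)] for row in cost]

def brute_force_match(cost):
    """Minimum-cost assignment of every column to a distinct row (tiny inputs only)."""
    n_rows, n_cols = len(cost), len(cost[0])
    if n_rows < n_cols:
        raise ValueError("need at least as many queries as replicated boxes")
    best = min(permutations(range(n_rows), n_cols),
               key=lambda rows: sum(cost[r][c] for c, r in enumerate(rows)))
    return list(best)

def one_to_many_match(cost, repeats):
    """Return, for each ground-truth box, the sorted list of queries matched to it."""
    rows = brute_force_match(replicate_columns(cost, repeats))
    n_gt = len(cost[0])
    return [sorted(rows[g * repeats:(g + 1) * repeats]) for g in range(n_gt)]

# Hypothetical costs: 5 queries (rows) against 2 ground-truth boxes (columns).
cost = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.7, 0.3], [0.5, 0.5]]
matches = one_to_many_match(cost, repeats=2)   # [[0, 1], [2, 3]]
```

With `repeats=2`, both boxes receive two positive queries each, which doubles the number of supervised queries per training step compared with one-to-one matching.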
Introducing object-normalized reparameterization improves AP further. Voxel expansion, as used in FCAF3D, is detrimental for DETR: it lowers AP by about 3 points, suggesting DETR variants perform best on raw surface voxels.
5. Quantitative Performance
V-DETR demonstrates substantial improvements over prior DETR-style and 3D detection architectures, as shown in the following results:
| Dataset | Baseline (3DETR) | V-DETR (no TTA) | V-DETR (axis-flip TTA) | Competing SOTA |
|---|---|---|---|---|
| ScanNetV2 AP$_{25}$ | 65.0 | 77.4 | 77.8 | CAGroup3D: 75.1 |
| ScanNetV2 AP$_{50}$ | 47.0 | 65.0 | 66.0 | CAGroup3D: 61.3 |
| SUN RGB-D AP$_{25}$ | 59.1 | - | 68.0 | - |
| SUN RGB-D AP$_{50}$ | 32.7 | - | 51.1 | - |
Efficiency metrics (ScanNetV2, batch=1, V100):
| Method | Scenes/sec | ms/scene | GPU MB |
|---|---|---|---|
| CAGroup3D | 2.1 | 480 | 1,138 |
| V-DETR (K=1024) | 4.2 | 240 | 642 |
| Light V-DETR (K=256) | 7.7 | 130 | 489 |
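The two speed columns are reciprocal views of the same measurement, which a few lines of Python can confirm from the latencies in the table. The speedup ratios are derived here from those latencies and are not separately reported figures.

```python
# Latencies in milliseconds per scene, copied from the table above.
latency_ms = {
    "CAGroup3D": 480,
    "V-DETR (K=1024)": 240,
    "Light V-DETR (K=256)": 130,
}

def scenes_per_second(ms_per_scene):
    """Convert a per-scene latency in milliseconds to throughput."""
    return 1000.0 / ms_per_scene

throughput = {name: round(scenes_per_second(ms), 1) for name, ms in latency_ms.items()}
# throughput == {"CAGroup3D": 2.1, "V-DETR (K=1024)": 4.2, "Light V-DETR (K=256)": 7.7}

# Latency ratio of CAGroup3D to each V-DETR variant.
speedup_full = latency_ms["CAGroup3D"] / latency_ms["V-DETR (K=1024)"]        # 2.0
speedup_light = latency_ms["CAGroup3D"] / latency_ms["Light V-DETR (K=256)"]
```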
V-DETR outperforms CAGroup3D in both accuracy and efficiency, establishing new records on ScanNetV2 and SUN RGB-D. Increasing the number of queries ($K$), using 100K input points, and replicating each ground-truth box 4 times yield the best results.
6. Ablation Findings and Implementation Notes
Experiments highlight that:
- Signed-log nonlinearity in 3DV-RPE outperforms tanh and the other alternatives tested.
- Full 8-vertex encoding for relative position bias is optimal.
- 3DV-RPE modulation is superior to hard-masking-based attention localization.
- Sparse-ResNet-34 + FPN provides clear gains over simpler encoders.
- Voxel expansion degrades DETR performance in this context.
- Pipeline settings (100K points, a large number of queries $K$, replication factor 4) maximize performance.
A plausible implication is that targeted relative position encodings and geometric normalization are as critical as model or data scaling for efficient 3D DETR-style detectors.
7. Significance and Contributions
V-DETR integrates a 3DV-RPE module with DETR, inducing effective locality for Transformer cross-attention in sparse 3D data and compensating for the limited dataset scale typical in 3D object detection. The framework, enhanced with object-normalized parameterization and an optimized backbone, achieves state-of-the-art accuracy and inference speed, surpassing both fully convolutional and voting-based 3D detectors on major indoor scene benchmarks (Shen et al., 2023). This suggests a generalizable approach for locality-aware attention in 3D detection with Transformers.