Vote2Cap-DETR: 3D Captioning Transformer
- The paper introduces Vote2Cap-DETR, a unified transformer-based architecture for 3D dense captioning that leverages a learnable vote query mechanism and set prediction.
- It employs a dual-clued caption head and iterative spatial refinement to improve both caption descriptiveness and localization accuracy on benchmarks like ScanRefer and Nr3D.
- Empirical evaluations demonstrate significant gains in CIDEr scores and mAP, outperforming traditional detect-then-describe pipelines in complex indoor scenes.
Vote2Cap-DETR is a fully transformer-based end-to-end framework for 3D dense captioning, designed to unify 3D object localization and region-level caption generation within a one-stage architecture. By leveraging a learnable “vote query” mechanism for spatial object anchoring, a dual-clued language head, and set prediction via DETR-style transformers, Vote2Cap-DETR and its advanced variant Vote2Cap-DETR++ demonstrate state-of-the-art performance on standard benchmarks, substantially improving over detect-then-describe pipelines in complex indoor scenes (Chen et al., 2023, Chen et al., 2023).
1. Model Architecture
The input to Vote2Cap-DETR is a raw point cloud with 3D coordinates and per-point features. PointNet++ set abstraction first downsamples it to 2,048 points. Scene encoding then uses a 3DETR-style backbone, three masked self-attention layers interleaved with local set abstraction, yielding a set of encoded scene tokens, each consisting of a 3D position and a feature vector.
Vote-query generation selects seed points $p_{seed}$ by Farthest Point Sampling (FPS) over the encoded scene, then applies an MLP to the seed features to predict spatial offsets $\Delta p_{vote}$, producing 3D anchors $p_{vq} = p_{seed} + \Delta p_{vote}$. Local PointNet++-style feature aggregation around $p_{vq}$ yields the query features $f_{vq}$.
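The seed-then-shift step can be illustrated with a small NumPy sketch. It is not the authors' implementation: the scene size, feature width, number of seeds, and the randomly initialised two-layer MLP (`predict_vote_offsets`) are all assumptions chosen for illustration, and the local feature aggregation step is omitted.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    selected = [start_index]
    # Distance from every point to its nearest selected point so far.
    min_dist = np.linalg.norm(points - points[start_index], axis=1)
    for _ in range(num_samples - 1):
        next_index = int(np.argmax(min_dist))
        selected.append(next_index)
        dist_to_new = np.linalg.norm(points - points[next_index], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)

def predict_vote_offsets(seed_features, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping seed features to 3D offsets."""
    hidden = np.maximum(seed_features @ w1 + b1, 0.0)  # ReLU
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
scene_xyz = rng.uniform(-1.0, 1.0, size=(1024, 3))   # assumed token positions
scene_feat = rng.normal(size=(1024, 32))             # assumed token features

seed_idx = farthest_point_sampling(scene_xyz, num_samples=16)
p_seed, f_seed = scene_xyz[seed_idx], scene_feat[seed_idx]

# Untrained weights: in a real model these are learned with the vote loss.
w1, b1 = rng.normal(scale=0.1, size=(32, 64)), np.zeros(64)
w2, b2 = rng.normal(scale=0.1, size=(64, 3)), np.zeros(3)

delta_p = predict_vote_offsets(f_seed, w1, b1, w2, b2)
p_vq = p_seed + delta_p  # vote-query anchors: seed position plus offset
```

With trained weights, the offsets would pull each seed toward a nearby object center; here they are random and only demonstrate the data flow.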
These (position, feature) pairs act as queries for an 8-layer transformer decoder. Each decoder layer updates the query features by attending to the encoded scene tokens, using a 3D Fourier positional encoding (Chen et al., 2023).
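As a rough illustration of what a 3D Fourier positional encoding computes, the sketch below maps each coordinate to sine and cosine features at several frequencies. The power-of-two frequency bank and the value of `num_frequencies` are assumptions; the source does not state which frequencies the model uses.

```python
import numpy as np

def fourier_position_encoding(xyz, num_frequencies=8):
    """Map (N, 3) coordinates to (N, 3 * 2 * num_frequencies) sin/cos features."""
    freqs = 2.0 ** np.arange(num_frequencies)        # assumed bank: 1, 2, 4, ...
    angles = 2.0 * np.pi * xyz[:, :, None] * freqs   # shape (N, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, 3, 2F)
    return enc.reshape(xyz.shape[0], -1)

query_xyz = np.array([[0.0, 0.0, 0.0],
                      [0.25, -0.5, 1.0]])
encoded = fourier_position_encoding(query_xyz)
```

Each query position becomes a fixed-length vector that can be added to, or concatenated with, the query features before attention.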
Two heads operate in parallel:
- Detection head: Per-query MLPs regress 3D box corners, center offsets, and semantic class scores.
- Caption head (Dual-Clued Captioner): A 2-layer transformer decoder that attends to the query’s final feature and to a local context set formed from the features of the scene tokens nearest to the query, decoding captions autoregressively (Chen et al., 2023).
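Building the local context for the caption head amounts to a nearest-neighbour gather over scene tokens. The sketch below shows one way to do it with NumPy; the function name `gather_local_context`, the value of `k`, and the toy coordinates are illustrative assumptions.

```python
import numpy as np

def gather_local_context(query_xyz, token_xyz, token_feat, k):
    """For each query, return the features of its k nearest scene tokens."""
    # Pairwise Euclidean distances, shape (num_queries, num_tokens).
    dists = np.linalg.norm(query_xyz[:, None, :] - token_xyz[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]  # closest token first
    return token_feat[nearest], nearest

token_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                      [5.0, 0.0, 0.0], [6.0, 0.0, 0.0]])
token_feat = np.arange(8, dtype=float).reshape(4, 2)
query_xyz = np.array([[0.2, 0.0, 0.0], [5.6, 0.0, 0.0]])

context, nearest = gather_local_context(query_xyz, token_xyz, token_feat, k=2)
```

The caption decoder would then cross-attend to `context` for each query alongside the query's own feature.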
2. Decoupled Localization and Captioning: Vote2Cap-DETR++
Vote2Cap-DETR++ explicitly decouples the queries used for localization from those used for captioning. At each decoder layer, the matching queries are generated from their counterparts via a learned projection.
Both sets traverse the same decoding block, but each prediction head operates only over its designated query set, with positions tied for correspondence.
To refine object spatial anchors, iterative spatial refinement updates the query positions at each decoder layer.
This strategy incrementally drifts queries toward ground-truth centers, empirically increasing mean average precision (mAP) and speeding convergence (Chen et al., 2023).
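The layer-by-layer position update can be sketched as a simple loop in which each layer adds a predicted offset to the current query position. In the sketch below the offset predictor is a stand-in that moves halfway toward a known target; in the model the offsets would be predicted from query features, so `halfway` and `target` are purely illustrative.

```python
import numpy as np

def refine_positions(p_init, offset_fns):
    """Apply one predicted offset per decoder layer: p <- p + offset(p)."""
    positions = [p_init]
    for offset_fn in offset_fns:
        positions.append(positions[-1] + offset_fn(positions[-1]))
    return positions

target = np.array([[1.0, 2.0, 0.5]])  # assumed ground-truth center

def halfway(p):
    # Stand-in predictor: real offsets are learned, not computed from the target.
    return 0.5 * (target - p)

trajectory = refine_positions(np.zeros((1, 3)), [halfway] * 3)
errors = [float(np.linalg.norm(p - target)) for p in trajectory]
```

In this toy setting the distance to the target halves at every layer, which mirrors the intended behaviour of queries approaching object centers as decoding proceeds.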
3. Spatial Information Injection for Captions
Vote2Cap-DETR++ enhances descriptiveness and localization precision by integrating spatial priors into the captioning process. An absolute position token is prepended to the caption decoder input, explicitly encoding the query's 3D location. For the local context, rank-based embeddings, learned as a function of each context token’s distance rank to the query position, are added to the context features.
This injects geometric hierarchy into linguistic generation and improves spatial phrase accuracy (Chen et al., 2023).
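A minimal sketch of rank-based embeddings follows, assuming a lookup table with one row per rank. The table `rank_table` holds fixed numbers here for readability; in a trained model it would be a learned parameter, and its shape is an assumption.

```python
import numpy as np

def add_rank_embeddings(context_feat, context_xyz, query_xyz, rank_table):
    """Add to each context token the embedding of its distance rank to the query."""
    dists = np.linalg.norm(context_xyz - query_xyz, axis=1)  # shape (k,)
    # ranks[i] is the position of token i when sorted by distance (0 = closest).
    ranks = np.argsort(np.argsort(dists))
    return context_feat + rank_table[ranks], ranks

context_xyz = np.array([[3.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
context_feat = np.zeros((3, 2))
query_xyz = np.zeros(3)
rank_table = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])  # stand-in values

enriched, ranks = add_rank_embeddings(context_feat, context_xyz, query_xyz, rank_table)
```

Because the embedding depends on rank rather than raw distance, the closest token always receives the same embedding regardless of scene scale.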
4. Training Objectives and Loss Functions
Optimization supervises three principal modules:
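- Vote Query Loss: $\mathcal{L}_{vq}$ supervises the predicted vote offsets so that each vote query moves toward the center of its object.
- Detection Set Loss: After Hungarian matching, $\mathcal{L}_{det}$ combines box-regression and classification terms over the matched query-object pairs.
- Query Refinement Loss (Vote2Cap-DETR++ only): Matches the vote query loss, but is applied recursively per decoder layer to the refined query positions.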
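The one-to-one matching that precedes the detection set loss can be illustrated on a tiny example. The sketch below uses exhaustive search over assignments as a stand-in for the Hungarian algorithm (it finds the same optimum but only scales to a handful of boxes), and its cost is the L1 distance between centers only, which is a simplifying assumption rather than the cost used in the paper.

```python
from itertools import permutations

def l1_distance(a, b):
    """Sum of absolute coordinate differences between two 3D points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_queries_to_objects(pred_centers, gt_centers):
    """Exhaustive minimum-cost one-to-one assignment (tiny inputs only)."""
    best_cost, best_assignment = float("inf"), None
    num_queries = len(pred_centers)
    for query_ids in permutations(range(num_queries), len(gt_centers)):
        cost = sum(l1_distance(pred_centers[q], gt_centers[g])
                   for g, q in enumerate(query_ids))
        if cost < best_cost:
            best_cost, best_assignment = cost, query_ids
    return best_assignment, best_cost

pred_centers = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (1.0, 1.0, 0.9)]
gt_centers = [(1.0, 1.0, 1.0), (5.0, 5.0, 4.0)]

# assignment[g] is the index of the query matched to ground-truth object g.
assignment, total_cost = match_queries_to_objects(pred_centers, gt_centers)
```

Here query 0 is left unmatched; in DETR-style training such queries are supervised toward a "no object" class.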
For captioning, two objectives are used:
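- MLE Loss: Token-level cross-entropy under teacher forcing, $\mathcal{L}_{MLE} = -\sum_{t=1}^{T} \log P(c_t \mid c_{<t})$, where $c_1, \dots, c_T$ are the ground-truth caption tokens and the conditioning on the query and its context is left implicit.
- SCST Loss (Self-Critical Sequence Training, using CIDEr reward): A policy-gradient objective that reinforces generated captions whose CIDEr reward exceeds that of a baseline caption.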
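The multi-task objective is a weighted sum, $\mathcal{L} = \beta_1 \mathcal{L}_{vq} + \beta_2 \mathcal{L}_{det} + \beta_3 \mathcal{L}_{cap}$, where $\mathcal{L}_{vq}$, $\mathcal{L}_{det}$, and $\mathcal{L}_{cap}$ are the vote query, detection, and captioning losses (plus the refinement loss as appropriate) and the weights $\beta_i$ are validation-tuned (Chen et al., 2023, Chen et al., 2023).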
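To make the self-critical term and the weighted combination concrete, here is a small sketch in plain Python. It uses the textbook SCST form, advantage times log-probability averaged over captions, which may differ in detail from the paper's exact formulation; the rewards, log-probabilities, and loss weights are placeholder numbers, not values from the source.

```python
def scst_loss(caption_log_probs, caption_rewards, baseline_reward):
    """Average of -(reward - baseline) * log_prob over generated captions."""
    terms = [-(reward - baseline_reward) * log_prob
             for log_prob, reward in zip(caption_log_probs, caption_rewards)]
    return sum(terms) / len(terms)

def total_loss(l_vq, l_det, l_cap, weights):
    """Weighted sum of the vote query, detection, and captioning losses."""
    w_vq, w_det, w_cap = weights
    return w_vq * l_vq + w_det * l_det + w_cap * l_cap

# Two generated captions: the first beats the baseline reward, the second ties it.
caption_loss = scst_loss(caption_log_probs=[-2.0, -3.0],
                         caption_rewards=[1.0, 0.5],
                         baseline_reward=0.5)

# Placeholder weights; the source only says they are tuned on validation data.
objective = total_loss(l_vq=0.2, l_det=1.5, l_cap=caption_loss,
                       weights=(2.0, 1.0, 4.0))
```

Minimising `caption_loss` raises the log-probability of the first caption, whose reward exceeds the baseline, while the second caption contributes no gradient because its advantage is zero.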
5. Datasets, Evaluation, and Empirical Performance
Vote2Cap-DETR and Vote2Cap-DETR++ are evaluated on ScanRefer and Nr3D—benchmarks for 3D region-level captioning and localization. Key metrics are:
- C@0.5: CIDEr@IoU≥0.5, measuring caption quality conditioned on sufficiently accurate region localization.
- mAP@0.5: Standard mean average precision for 3D object detection with IoU≥0.5.
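The IoU-conditioned caption metric can be sketched as follows, assuming axis-aligned boxes and one caption score and one IoU per ground-truth object. The per-caption scores below are placeholder numbers standing in for CIDEr values, and `metric_at_iou` is an illustrative helper, not an official evaluation script.

```python
def box_volume(box):
    """Volume of a box given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        intersection *= max(high - low, 0.0)
    union = box_volume(box_a) + box_volume(box_b) - intersection
    return intersection / union

def metric_at_iou(caption_scores, ious, threshold=0.5):
    """Mean caption score, counting a caption only if its box IoU >= threshold."""
    kept = [score if iou >= threshold else 0.0
            for score, iou in zip(caption_scores, ious)]
    return sum(kept) / len(kept)

gt_box = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
good_prediction = (0.0, 0.0, 0.0, 2.0, 2.0, 1.0)   # overlaps half of gt_box
poor_prediction = (1.0, 1.0, 1.0, 3.0, 3.0, 3.0)   # overlaps one corner

ious = [box_iou_3d(gt_box, good_prediction), box_iou_3d(gt_box, poor_prediction)]
score_at_half = metric_at_iou(caption_scores=[0.8, 0.6], ious=ious)
```

The second caption contributes nothing despite its nonzero score, because its box falls below the IoU threshold; this is why the metric rewards localization and description jointly.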
Summarized results:
| Method | Dataset | C@0.5 (SCST) | mAP@0.5 |
|---|---|---|---|
| VoteNet+DtD baseline | ScanRefer | ≈46% | ≈38% |
| Vote2Cap-DETR | ScanRefer | 73.8% | 52.1% |
| Vote2Cap-DETR++ | ScanRefer | 78.2% | 55.5% |
| Best prior (SCST) | Nr3D | ≈38% | – |
| Vote2Cap-DETR | Nr3D | 45.5% | – |
| Vote2Cap-DETR++ | Nr3D | 47.6% | – |
These models consistently outperform previous detect-then-describe pipelines on both localization and captioning (Chen et al., 2023, Chen et al., 2023).
6. Ablations and Analysis
Ablations confirm the contribution of core architectural elements:
- Vote Queries: Lead to a +2.98 mAP@0.5 gain over a plain 3DETR backbone, with faster convergence.
- Local Context in Captioning: Using local tokens for the captioner improves C@0.5 by +0.19 over global context.
- Set-to-Set SCST: Outperforms set-to-sentence training, with a +2.38 C@0.5 gain.
- Iterative Refinement and Decoupling (DETR++): Yield further increases in both mAP and C@0.5.
- NMS-Free: Vote2Cap-DETR is robust to removing NMS, scoring 52.8% at the IoU≥0.5 threshold with or without it (Chen et al., 2023).
7. Position in the Literature and Limitations
Vote2Cap-DETR and Vote2Cap-DETR++ are the first transformer-based, one-stage architectures for 3D dense captioning that replace multi-step, hand-crafted relation modules with set prediction and spatially aware transformer heads. This approach avoids the error accumulation typical of cascade pipelines, where duplicated and inaccurate box proposals in cluttered scenes degrade the captions generated downstream.
Qualitative analysis shows vote queries cluster near object centers, enhancing detection, while the dual-clued captioner improves attribute and relationship selection by exploiting rich 3D context. Limitations include lack of explicit modeling for rotated bounding boxes and a reliance on CIDEr-driven SCST, which may bias toward n-gram matching rather than conceptual diversity (Chen et al., 2023, Chen et al., 2023).