Vote2Cap-DETR: 3D Captioning Transformer

Updated 6 March 2026

The paper introduces Vote2Cap-DETR, a unified transformer-based architecture for 3D dense captioning that leverages a learnable vote query mechanism and set prediction.
It employs a dual-clued caption head and iterative spatial refinement to improve both caption descriptiveness and localization accuracy on benchmarks like ScanRefer and Nr3D.
Empirical evaluations demonstrate significant gains in CIDEr scores and mAP, outperforming traditional detect-then-describe pipelines in complex indoor scenes.

Vote2Cap-DETR is a fully transformer-based end-to-end framework for 3D dense captioning, designed to unify 3D object localization and region-level caption generation within a one-stage architecture. By leveraging a learnable “vote query” mechanism for spatial object anchoring, a dual-clued language head, and set prediction via DETR-style transformers, Vote2Cap-DETR and its advanced variant Vote2Cap-DETR++ demonstrate state-of-the-art performance on standard benchmarks, substantially improving over detect-then-describe pipelines in complex indoor scenes (Chen et al., 2023, Chen et al., 2023).

1. Model Architecture

Input to Vote2Cap-DETR is a raw point cloud $PC = \{(p_{in}^i, f_{in}^i)\}_{i=1}^N$ with 3D coordinates $p_{in}^i \in \mathbb{R}^3$ and per-point features $f_{in}^i \in \mathbb{R}^F$. Downsampling to 2,048 points is performed via PointNet++ set abstraction. Scene encoding uses a 3DETR-style backbone: three masked self-attention layers interleaved with local set abstraction, yielding $M = 1{,}024$ scene tokens $\{(p_{enc}^j, f_{enc}^j)\}_{j=1}^M$, where $f_{enc}^j \in \mathbb{R}^{256}$.

Vote-query generation identifies $K=256$ seed points $p_{seed}^k$ by Farthest Point Sampling (FPS) over the encoded scene, then applies an MLP $FFN_{vote}$ to seed features $f_{seed}^k$ to predict spatial offsets $\Delta p_{vote}^k$, producing 3D anchors $p_{vq}^k = p_{seed}^k + FFN_{vote}(f_{seed}^k)$. Local PointNet++-style feature aggregation around $p_{vq}^k$ yields $f_{vq}^k$.
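The seed selection and vote shift can be illustrated with a minimal, standard-library sketch. The function `toy_vote_offset` is a hypothetical stand-in for the learned $FFN_{vote}$, and the tiny point set is invented for illustration; the local feature aggregation step is omitted.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to maximise spread."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def toy_vote_offset(feature):
    """Hypothetical stand-in for FFN_vote: maps a feature to a 3D offset."""
    return [0.1 * feature[0], 0.1 * feature[1], 0.0]

def make_vote_queries(points, features, k):
    """Sample k seeds, then shift each seed by its predicted offset."""
    seeds = farthest_point_sampling(points, k)
    queries = []
    for idx in seeds:
        offset = toy_vote_offset(features[idx])
        queries.append([p + d for p, d in zip(points[idx], offset)])
    return seeds, queries

# Illustrative data only: four encoded points with 2-D features.
points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 2.0, 0.0]]
features = [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [-1.0, 1.0]]
seeds, queries = make_vote_queries(points, features, k=3)
```

In this toy scene the near-duplicate point at index 2 is never selected as a seed, which is the behaviour FPS is meant to provide before the offsets are applied.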

These (position, feature) pairs act as queries for an 8-layer transformer decoder. Each decoder layer updates query features as

$f_{q}^{(i)} = DecoderLayer_i(f_{q}^{(i-1)} + PE(p_{vq})),$

with $PE(\cdot)$ a 3D Fourier positional encoding (Chen et al., 2023).
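As a rough illustration of the positional term, the sketch below uses a deterministic sinusoidal encoding with doubling frequencies. This is an assumption made for readability: the exact Fourier encoding used in the papers (its frequencies, any random projection, and the output width) is not specified here, and `add_position` is a hypothetical helper name.

```python
import math

def fourier_pe(point, num_freqs=2):
    """Encode a 3D point as [sin, cos] pairs at doubling frequencies."""
    enc = []
    for coord in point:
        for f in range(num_freqs):
            angle = (2.0 ** f) * math.pi * coord
            enc.extend([math.sin(angle), math.cos(angle)])
    return enc  # length = 3 * 2 * num_freqs

def add_position(query_feature, point, num_freqs=2):
    """Form the decoder-layer input f_q + PE(p_vq); widths must match."""
    pe = fourier_pe(point, num_freqs)
    if len(pe) != len(query_feature):
        raise ValueError("feature and encoding dimensions differ")
    return [f + e for f, e in zip(query_feature, pe)]

pe = fourier_pe([0.0, 0.5, 1.0])
layer_input = add_position([0.0] * 12, [0.0, 0.5, 1.0])
```

Because the encoding is added rather than concatenated, its width has to equal the query feature width, which is why the helper checks the two lengths.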

Two heads operate in parallel:

Detection head: Per-query MLPs regress 3D box corners, center offsets, and semantic class scores.
Caption head (Dual-Clued Captioner): A 2-layer transformer decoder that attends to the query’s final feature $\mathcal{V}^q = f_{q}^{(L)}$, where $L = 8$ is the number of decoder layers, and to a local context set $\mathcal{V}^s$ of the $k_s$ nearest scene-token features, decoding captions autoregressively (Chen et al., 2023).
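Gathering the local context set can be sketched as a nearest-neighbour lookup around the query position. This is a minimal illustration that assumes Euclidean distance in 3D; the function name and the toy tokens are hypothetical.

```python
import math

def local_context(query_pos, token_positions, token_features, k_s):
    """Return indices and features of the k_s scene tokens closest to the query."""
    order = sorted(range(len(token_positions)),
                   key=lambda j: math.dist(query_pos, token_positions[j]))
    nearest = order[:k_s]
    return nearest, [token_features[j] for j in nearest]

# Illustrative data only: four scene tokens with 1-D features.
token_positions = [[2.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 1.0, 0.0], [3.0, 3.0, 3.0]]
token_features = [[10.0], [11.0], [12.0], [13.0]]
nearest, context = local_context([0.0, 0.0, 0.0], token_positions, token_features, k_s=2)
```

The caption decoder would then attend to `context` together with the query feature, so the description can draw on nearby geometry rather than the whole scene.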

2. Decoupled Localization and Captioning: Vote2Cap-DETR++

Vote2Cap-DETR++ introduces explicit decoupling of queries for localization ($[LOC]$) and captioning ($[CAP]$). At each decoder layer, matching $[CAP]$ queries are generated from $[LOC]$ via a learned projection:

$f_{cap}^{(i)} = W_{proj} f_{loc}^{(i)}, \quad p_{cap}^{(i)} = p_{loc}^{(i)}.$

Both sets traverse the same decoding block, but each prediction head operates only over its designated query set, with positions tied for correspondence.

To refine object spatial anchors, iterative spatial refinement updates $p_{loc}^{(i)}$ at each layer:

$p_{loc}^{(i)} = p_{loc}^{(i-1)} + FFN_{refine}(f_{loc}^{(i-1)}).$

This strategy incrementally moves queries toward ground-truth object centers, which empirically increases mean average precision (mAP) and speeds convergence (Chen et al., 2023).
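The two update rules in this section can be combined in a short loop. The sketch below is illustrative only: `toy_layer` and `toy_ffn_refine` are hypothetical stand-ins for the learned decoder layer and $FFN_{refine}$, and passing the refined position into the layer is an assumption about ordering that the text above does not settle.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def refine_queries(p_loc, f_loc, W_proj, ffn_refine, decoder_layers):
    """Update [LOC] positions layer by layer and derive tied [CAP] queries."""
    f_cap, p_cap = matvec(W_proj, f_loc), list(p_loc)
    for layer in decoder_layers:
        # p_loc^(i) = p_loc^(i-1) + FFN_refine(f_loc^(i-1))
        p_loc = [a + b for a, b in zip(p_loc, ffn_refine(f_loc))]
        f_loc = layer(f_loc, p_loc)
        f_cap = matvec(W_proj, f_loc)  # f_cap^(i) = W_proj f_loc^(i)
        p_cap = list(p_loc)            # p_cap^(i) = p_loc^(i)
    return p_loc, f_loc, p_cap, f_cap

target = [1.0, 1.0, 1.0]  # illustrative object center

def toy_layer(f_loc, p_loc):
    """Hypothetical layer: the feature encodes the remaining displacement."""
    return [t - p for t, p in zip(target, p_loc)]

def toy_ffn_refine(f_loc):
    """Hypothetical refinement head: step halfway along the encoded displacement."""
    return [0.5 * x for x in f_loc]

W_proj = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
p_loc, f_loc, p_cap, f_cap = refine_queries(
    [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], W_proj, toy_ffn_refine, [toy_layer] * 3)
```

With these toy components the query position halves its distance to the target at every layer, which mirrors the intended behaviour of small, repeated corrections rather than one large jump.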

3. Spatial Information Injection for Captions

Vote2Cap-DETR++ enhances descriptiveness and localization precision by integrating spatial priors into the captioning process. An absolute position token $PE(p_{cap}^{(L)})$ is prepended to the caption decoder input, explicitly encoding 3D location. For the local context $\{v_j^s\}$, rank-based embeddings $\mathcal{V}^s_{pos}$, learned as a function of each token’s distance rank to $p_{cap}^{(L)}$, are added to the context features. Writing $\mathcal{V}^q_{pos}$ for the absolute position token, the caption is decoded as:

$c^* = \arg\max_c P(c \mid \mathcal{V}^s, \mathcal{V}^s_{pos};\,\mathcal{V}^q, \mathcal{V}^q_{pos}).$

This injects geometric hierarchy into linguistic generation and improves spatial phrase accuracy (Chen et al., 2023).
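The rank-based embedding can be sketched as a table lookup keyed by distance rank. In the sketch below `rank_table` is a hypothetical stand-in for the learned embedding table, filled with fixed values so the effect is visible.

```python
import math

def distance_ranks(query_pos, context_positions):
    """Rank each context token by distance to the query (0 = closest)."""
    order = sorted(range(len(context_positions)),
                   key=lambda j: math.dist(query_pos, context_positions[j]))
    ranks = [0] * len(context_positions)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks

def add_rank_embeddings(context_features, ranks, rank_table):
    """Add the embedding for each token's rank to its feature vector."""
    return [[f + e for f, e in zip(feat, rank_table[r])]
            for feat, r in zip(context_features, ranks)]

# Illustrative data only: three context tokens with 2-D features.
context_positions = [[3.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
context_features = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
rank_table = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]  # one row per rank
ranks = distance_ranks([0.0, 0.0, 0.0], context_positions)
enriched = add_rank_embeddings(context_features, ranks, rank_table)
```

Using ranks instead of raw distances makes the embedding depend on the ordering of neighbours, not on the absolute scale of the scene.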

4. Training Objectives and Loss Functions

Optimization supervises three principal modules:

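Vote Query Loss: With $p_{vote}^i = p_{enc}^i + \Delta p_{vote}^i$ the shifted position of scene token $i$ and $c_j^{center}$ the center of ground-truth object $j$,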

$\mathcal{L}_{vq} = \frac1M \sum_{i=1}^M \sum_{j=1}^{N_{gt}} \|p_{vote}^i - c_j^{center}\|_1\, \mathbf{1}\{p_{enc}^i \in {\rm obj}\,j\}$

Detection Set Loss: After Hungarian matching,

$\mathcal{L}_{det} = \alpha_1 \mathcal{L}_{giou} + \alpha_2 \mathcal{L}_{cls} + \alpha_3 \mathcal{L}_{center} + \alpha_4 \mathcal{L}_{size}$

Query Refinement Loss (Vote2Cap-DETR++ only): Takes the same form as $\mathcal{L}_{vq}$, but is applied at every decoder layer to the refined positions $p_{loc}^{(i)}$.
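The localization terms above can be sketched in a few lines. Several choices here are assumptions rather than details from the papers: object membership is passed as a hypothetical list of object indices per token, the per-layer refinement terms are simply summed, and the one-to-one matching is found by brute force over permutations as a small-scale substitute for the Hungarian algorithm.

```python
from itertools import permutations

def vote_query_loss(vote_positions, gt_centers, membership):
    """L1 distance from each vote to the center of every object its token lies in.

    membership[i] lists the ground-truth object indices containing token i.
    """
    total = 0.0
    for p_vote, objs in zip(vote_positions, membership):
        for j in objs:
            total += sum(abs(a - b) for a, b in zip(p_vote, gt_centers[j]))
    return total / len(vote_positions)

def refinement_loss(per_layer_positions, gt_centers, membership):
    """Same form as the vote loss, accumulated over decoder layers (assumed sum)."""
    return sum(vote_query_loss(p, gt_centers, membership)
               for p in per_layer_positions)

def match_queries(cost):
    """Brute-force one-to-one matching; cost[q][g] pairs query q with object g."""
    num_q, num_gt = len(cost), len(cost[0])
    best = min(permutations(range(num_q), num_gt),
               key=lambda perm: sum(cost[q][g] for g, q in enumerate(perm)))
    return list(best)  # best[g] is the query assigned to object g

# Illustrative data only.
gt_centers = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
membership = [[0], [1], []]  # the third token lies in no object
votes = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
l_vq = vote_query_loss(votes, gt_centers, membership)
refined = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
l_refine = refinement_loss([votes, refined], gt_centers, membership)
assignment = match_queries([[5.0, 1.0], [2.0, 4.0], [3.0, 3.0]])
```

Brute force is only workable for a handful of queries; at the scale of 256 queries a polynomial-time assignment solver would be needed.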

For captioning, two objectives are used:

MLE Loss:

$\mathcal{L}_{cap}^{\rm MLE} = -\sum_{t=1}^T \log P(w_t \mid w_{1:t-1}, \mathcal{V})$

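SCST Loss (Self-Critical Sequence Training, using CIDEr reward), where $\hat{c}_1, \dots, \hat{c}_k$ are sampled captions, $|\hat{c}_i|$ is the length of $\hat{c}_i$, $R(\cdot)$ is the reward, and $\hat{g}$ is the baseline caption: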

$\mathcal{L}_{cap}^{\rm SCST} = -\sum_{i=1}^k (R(\hat{c}_i) - R(\hat{g})) \frac{1}{|\hat{c}_i|} \log P(\hat{c}_i \mid \mathcal{V})$
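Both caption objectives reduce to simple arithmetic once token probabilities and rewards are available. The sketch below takes them as plain numbers; $R(\hat{g})$ is treated as a scalar baseline reward (in self-critical training this is typically the reward of a greedily decoded caption), and all values are invented for illustration.

```python
import math

def mle_loss(token_probs):
    """Negative log-likelihood of the ground-truth tokens under teacher forcing."""
    return -sum(math.log(p) for p in token_probs)

def scst_loss(sample_rewards, sample_log_probs, sample_lengths, baseline_reward):
    """Length-normalised self-critical loss over k sampled captions."""
    return -sum((r - baseline_reward) * lp / n
                for r, lp, n in zip(sample_rewards, sample_log_probs, sample_lengths))

# Two ground-truth tokens predicted with probabilities 0.5 and 0.25.
l_mle = mle_loss([0.5, 0.25])

# Two sampled captions of length 4: one beats the baseline, one does not.
l_scst = scst_loss(sample_rewards=[1.0, 0.2],
                   sample_log_probs=[-2.0, -4.0],
                   sample_lengths=[4, 4],
                   baseline_reward=0.5)
```

Minimising the SCST term raises the log-probability of samples whose reward exceeds the baseline and lowers it for samples that fall short.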

The multi-task objective is

$\mathcal{L}_{\rm total} = \lambda_{loc}\, \mathcal{L}_{loc} + \lambda_{cap}\,\mathcal{L}_{cap},$

where $\mathcal{L}_{loc} = \mathcal{L}_{vq} + \mathcal{L}_{det}$ (plus refinement loss as appropriate) and $(\lambda_{loc}, \lambda_{cap})$ are validation-tuned (Chen et al., 2023, Chen et al., 2023).

5. Datasets, Evaluation, and Empirical Performance

Vote2Cap-DETR and Vote2Cap-DETR++ are evaluated on ScanRefer and Nr3D, benchmarks for 3D region-level captioning and localization. Key metrics are:

C@0.5: CIDEr@IoU≥0.5, measuring caption quality conditioned on sufficiently accurate region localization.
mAP@0.5: Standard mean average precision for 3D object detection with IoU≥0.5.
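The IoU-conditioned caption metric can be sketched as an average in which a caption contributes its score only when its box overlaps the ground truth enough. This is a simplified illustration under stated assumptions: boxes are axis-aligned, each ground-truth object is already paired with one prediction, and per-object caption scores are supplied as inputs (real CIDEr relies on corpus-level statistics that are not reproduced here).

```python
def box_volume(box):
    """Volume of a box given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes."""
    inter = 1.0
    for d in range(3):
        overlap = min(a[d + 3], b[d + 3]) - max(a[d], b[d])
        if overlap <= 0:
            return 0.0
        inter *= overlap
    return inter / (box_volume(a) + box_volume(b) - inter)

def metric_at_iou(caption_scores, pred_boxes, gt_boxes, threshold=0.5):
    """Mean caption score, zeroing captions whose box IoU is below the threshold."""
    kept = [s if box_iou_3d(p, g) >= threshold else 0.0
            for s, p, g in zip(caption_scores, pred_boxes, gt_boxes)]
    return sum(kept) / len(gt_boxes)

# Illustrative data only: one well-localized object and one miss.
gt_boxes = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
pred_boxes = [(0, 0, 0, 2, 2, 1.5), (5, 5, 5, 6, 6, 6)]
first_iou = box_iou_3d(pred_boxes[0], gt_boxes[0])
score = metric_at_iou([0.8, 0.9], pred_boxes, gt_boxes)
```

The second caption has the higher raw score but contributes nothing because its box misses the object, which is why the metric rewards joint localization and description.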

Summarized results:

Method	Dataset	C@0.5 (SCST)	mAP@0.5
VoteNet+DtD baseline	ScanRefer	≈46%	≈38%
Vote2Cap-DETR	ScanRefer	73.8%	52.1%
Vote2Cap-DETR++	ScanRefer	78.2%	55.5%
Best prior (SCST)	Nr3D	≈38%	–
Vote2Cap-DETR	Nr3D	45.5%	–
Vote2Cap-DETR++	Nr3D	47.6%	–

These models consistently outperform previous detect-then-describe pipelines on both localization and captioning (Chen et al., 2023, Chen et al., 2023).

6. Ablations and Analysis

Ablations confirm the contribution of core architectural elements:

Vote Queries: Lead to a +2.98 mAP@0.5 gain over a plain 3DETR backbone, with faster convergence.
Local Context in Captioning: Using $k_s=128$ local tokens for the captioner improves C@0.5 by +0.19 over global context.
Set-to-Set SCST: Outperforms set-to-sentence training, with a +2.38 C@0.5 gain.
Iterative Refinement and Decoupling (DETR++): Further increase both mAP and C@0.5.
NMS Free: Vote2Cap-DETR does not depend on NMS, scoring 52.8% at the 0.5 IoU threshold with or without it (Chen et al., 2023).

7. Position in the Literature and Limitations

Vote2Cap-DETR and Vote2Cap-DETR++ are the first transformer-based, one-stage architectures for 3D dense captioning that replace multi-step, hand-crafted relation modules with set prediction and spatially aware transformer heads. This approach avoids the error accumulation typical of cascade pipelines, where duplicated or inaccurate box proposals in cluttered scenes carry over into the captioning stage.

Qualitative analysis shows vote queries cluster near object centers, enhancing detection, while the dual-clued captioner improves attribute and relationship selection by exploiting rich 3D context. Limitations include lack of explicit modeling for rotated bounding boxes and a reliance on CIDEr-driven SCST, which may bias toward n-gram matching rather than conceptual diversity (Chen et al., 2023, Chen et al., 2023).


References

Chen et al. (2023). End-to-End 3D Dense Captioning with Vote2Cap-DETR.

Chen et al. (2023). Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning.
