
Vote2Cap-DETR: 3D Captioning Transformer

Updated 6 March 2026
  • The paper introduces Vote2Cap-DETR, a unified transformer-based architecture for 3D dense captioning that leverages a learnable vote query mechanism and set prediction.
  • It employs a dual-clued caption head and iterative spatial refinement to improve both caption descriptiveness and localization accuracy on benchmarks like ScanRefer and Nr3D.
  • Empirical evaluations demonstrate significant gains in CIDEr scores and mAP, outperforming traditional detect-then-describe pipelines in complex indoor scenes.

Vote2Cap-DETR is a fully transformer-based end-to-end framework for 3D dense captioning, designed to unify 3D object localization and region-level caption generation within a one-stage architecture. By leveraging a learnable “vote query” mechanism for spatial object anchoring, a dual-clued language head, and set prediction via DETR-style transformers, Vote2Cap-DETR and its advanced variant Vote2Cap-DETR++ demonstrate state-of-the-art performance on standard benchmarks, substantially improving over detect-then-describe pipelines in complex indoor scenes (Chen et al., 2023, Chen et al., 2023).

1. Model Architecture

Input to Vote2Cap-DETR is a raw point cloud $PC = \{(p_{in}^i, f_{in}^i)\}_{i=1}^N$ with 3D coordinates $p_{in}^i \in \mathbb{R}^3$ and per-point features $f_{in}^i \in \mathbb{R}^F$. Downsampling to 2,048 points is performed via PointNet++ set abstraction. Scene encoding uses a 3DETR-style backbone: three masked self-attention layers interleaved with local set abstraction, yielding $M = 1{,}024$ scene tokens $\{(p_{enc}^j, f_{enc}^j)\}_{j=1}^M$, where $f_{enc}^j \in \mathbb{R}^{256}$.

Vote-query generation identifies $K = 256$ seed points $p_{seed}^k$ by Farthest Point Sampling (FPS) over the encoded scene, then applies an MLP $FFN_{vote}$ to the seed features $f_{seed}^k$ to predict spatial offsets $\Delta p_{vote}^k$, producing 3D anchors $p_{vq}^k = p_{seed}^k + FFN_{vote}(f_{seed}^k)$. Local PointNet++-style feature aggregation around $p_{vq}^k$ yields the query features $f_{vq}^k$.
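
The seed selection and anchor shift described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: `farthest_point_sampling`, `vote_queries`, and `toy_ffn_vote` are hypothetical names, the offset predictor is a fixed linear map standing in for the learned MLP, and the toy scene uses 64 random points and 8 seeds instead of 1,024 tokens and 256 queries.

```python
import math
import random

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so that each new point is far from those already chosen."""
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest already-chosen point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def vote_queries(points, features, k, ffn_vote):
    """Compute p_vq = p_seed + FFN_vote(f_seed) for each of the k FPS seeds."""
    seeds = farthest_point_sampling(points, k)
    anchors = []
    for idx in seeds:
        offset = ffn_vote(features[idx])  # predicted offset Δp_vote
        anchors.append(tuple(c + d for c, d in zip(points[idx], offset)))
    return seeds, anchors

def toy_ffn_vote(feature):
    """Assumption: a fixed linear map standing in for the learned offset MLP."""
    return (0.1 * feature[0], 0.1 * feature[1], 0.0)

random.seed(0)
scene_points = [(random.random(), random.random(), random.random()) for _ in range(64)]
scene_feats = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(64)]
seed_idx, p_vq = vote_queries(scene_points, scene_feats, k=8, ffn_vote=toy_ffn_vote)
```

In a trained model the offsets would move each seed toward the center of the object it belongs to; here they only demonstrate the data flow from seed to anchor.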

These (position, feature) pairs act as queries for an 8-layer transformer decoder. Each decoder layer updates query features as

$$f_{q}^{(i)} = \mathrm{DecoderLayer}_i\left(f_{q}^{(i-1)} + PE(p_{vq})\right),$$

with $PE(\cdot)$ a 3D Fourier positional encoding (Chen et al., 2023).
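
One common way to build a 3D Fourier positional encoding is to project the position onto a set of frequency vectors and take sines and cosines. The sketch below assumes that form; the random Gaussian basis, the function names, and the choice of four frequencies are illustrative assumptions, and the source does not specify these details.

```python
import math
import random

def make_fourier_basis(num_freqs, dim=3, scale=1.0, seed=0):
    """Assumption: frequencies drawn once from a Gaussian and then kept fixed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(num_freqs)]

def fourier_pe(p, basis):
    """Return [sin(2*pi*B p), cos(2*pi*B p)], a vector of length 2 * num_freqs."""
    proj = [2.0 * math.pi * sum(b * x for b, x in zip(row, p)) for row in basis]
    return [math.sin(v) for v in proj] + [math.cos(v) for v in proj]

def add_pe(query_feat, p, basis):
    """Form the decoder input f + PE(p); the feature must have length 2 * num_freqs."""
    return [f + e for f, e in zip(query_feat, fourier_pe(p, basis))]

B = make_fourier_basis(num_freqs=4)
pe = fourier_pe((0.2, 0.5, 0.1), B)
decoder_input = add_pe([0.0] * 8, (0.2, 0.5, 0.1), B)
```

Each sine and cosine pair lies on the unit circle, so the encoding has constant norm regardless of where the query sits in the scene.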

Two heads operate in parallel:

  • Detection head: Per-query MLPs regress 3D box corners, center offsets, and semantic class scores.
  • Caption head (Dual-Clued Captioner): A 2-layer transformer decoder that attends to the query’s final feature $\mathcal{V}^q = f_{q}^{(L)}$ and a local context set $\mathcal{V}^s$ of the $k_s$ nearest scene-token features, decoding captions autoregressively (Chen et al., 2023).

2. Decoupled Localization and Captioning: Vote2Cap-DETR++

Vote2Cap-DETR++ introduces explicit decoupling of queries for localization ([LOC]) and captioning ([CAP]). At each decoder layer, matching [CAP] queries are generated from the [LOC] queries via a learned projection:

$$f_{cap}^{(i)} = W_{proj} f_{loc}^{(i)}, \quad p_{cap}^{(i)} = p_{loc}^{(i)}.$$

Both sets traverse the same decoding block, but each prediction head operates only over its designated query set, with positions tied for correspondence.

To refine object spatial anchors, iterative spatial refinement updates $p_{loc}^{(i)}$ at each layer:

$$p_{loc}^{(i)} = p_{loc}^{(i-1)} + FFN_{refine}(f_{loc}^{(i-1)}).$$

This strategy incrementally drifts queries toward ground-truth centers, empirically increasing mean average precision (mAP) and speeding convergence (Chen et al., 2023).
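
The refinement recurrence can be illustrated with a toy loop. The stand-ins below are assumptions chosen only to make the behavior visible: `toy_layer` writes the residual to a known target center into the feature, and `toy_ffn_refine` moves the query half of that residual. Whether each decoder layer consumes the updated or the previous position is also an assumption here.

```python
import math

def refine_positions(p_loc, f_loc, decoder_layers, ffn_refine):
    """Apply p^(i) = p^(i-1) + FFN_refine(f^(i-1)) once per decoder layer."""
    trajectory = [p_loc]
    for layer in decoder_layers:
        delta = ffn_refine(f_loc)                       # computed from f^(i-1)
        p_loc = tuple(c + d for c, d in zip(p_loc, delta))
        f_loc = layer(f_loc, p_loc)                     # produces f^(i)
        trajectory.append(p_loc)
    return p_loc, f_loc, trajectory

target = (1.0, 2.0, 0.5)  # hypothetical ground-truth object center

def toy_layer(f, p):
    """Assumption: the feature simply stores the residual to the target."""
    return tuple(t - c for t, c in zip(target, p))

def toy_ffn_refine(f):
    """Assumption: move half of the encoded residual at every step."""
    return tuple(0.5 * x for x in f)

p0 = (0.0, 0.0, 0.0)
f0 = toy_layer(None, p0)
p_final, f_final, traj = refine_positions(p0, f0, [toy_layer] * 8, toy_ffn_refine)
dists = [math.dist(p, target) for p in traj]
```

With these stand-ins the distance to the target halves at every layer, which mirrors the intended effect of nudging each query a little closer to an object center per decoding step.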

3. Spatial Information Injection for Captions

Vote2Cap-DETR++ enhances descriptiveness and localization precision by integrating spatial priors into the captioning process. An absolute position token $PE(p_{cap}^{(L)})$ is prepended to the caption decoder input, explicitly encoding 3D location. For the local context $\{v_j^s\}$, rank-based embeddings $\mathcal{V}^s_{pos}$, learned as a function of each token’s distance rank to $p_{cap}^{(L)}$, are added to the context features. Writing $\mathcal{V}^q_{pos}$ for the query’s position token, the caption is decoded as

$$c^* = \arg\max_c P(c \mid \mathcal{V}^s, \mathcal{V}^s_{pos};\,\mathcal{V}^q, \mathcal{V}^q_{pos}).$$

This injects geometric hierarchy into linguistic generation and improves spatial phrase accuracy (Chen et al., 2023).
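
A rank-based embedding of this kind can be sketched as a lookup table indexed by how close each context token is to the query. The function names, the toy positions, and the hand-written `rank_table` are hypothetical; in a trained model the table would be a learned parameter.

```python
import math

def knn_with_ranks(query_pos, token_positions, k):
    """Return [(token_index, rank)] for the k nearest tokens; rank 0 is the closest."""
    order = sorted(range(len(token_positions)),
                   key=lambda j: math.dist(query_pos, token_positions[j]))
    return [(j, rank) for rank, j in enumerate(order[:k])]

def add_rank_embeddings(token_feats, neighbours, rank_table):
    """Add the embedding for each token's distance rank to that token's feature."""
    return [[f + e for f, e in zip(token_feats[j], rank_table[rank])]
            for j, rank in neighbours]

token_positions = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (3, 3, 3)]
token_feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
rank_table = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]  # stand-in for learned embeddings

neighbours = knn_with_ranks((0.9, 0.0, 0.0), token_positions, k=3)
context = add_rank_embeddings(token_feats, neighbours, rank_table)
```

Because the embedding depends on rank rather than raw distance, it stays comparable across objects of very different sizes.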

4. Training Objectives and Loss Functions

Optimization supervises three principal modules:

  • Vote Query Loss:

$$\mathcal{L}_{vq} = \frac{1}{M} \sum_{i=1}^M \sum_{j=1}^{N_{gt}} \left\|p_{vote}^i - c_j^{center}\right\|_1 \, \mathbf{1}\{p_{enc}^i \in \mathrm{obj}\,j\}$$

  • Detection Set Loss: After Hungarian matching,

$$\mathcal{L}_{det} = \alpha_1 \mathcal{L}_{giou} + \alpha_2 \mathcal{L}_{cls} + \alpha_3 \mathcal{L}_{center} + \alpha_4 \mathcal{L}_{size}$$

  • Query Refinement Loss (Vote2Cap-DETR++ only): Has the same form as $\mathcal{L}_{vq}$, but is applied at every decoder layer to the refined positions $p_{loc}^{(i)}$.
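
The vote query loss above can be evaluated directly from its definition. The sketch below assumes that membership of a scene token in an object is decided by containment in an axis-aligned ground-truth box given as (center, half-size); the source only states the indicator, so this membership test and the function name are assumptions.

```python
def vote_query_loss(p_vote, p_enc, gt_boxes):
    """L1 distance from each shifted token to the center of every object containing it,
    averaged over all M scene tokens. gt_boxes is a list of (center, half_size)."""
    def inside(p, box):
        center, half = box
        return all(abs(a - c) <= h for a, c, h in zip(p, center, half))

    total = 0.0
    for pv, pe in zip(p_vote, p_enc):
        for box in gt_boxes:
            if inside(pe, box):  # indicator 1{p_enc in obj j}
                total += sum(abs(a - c) for a, c in zip(pv, box[0]))
    return total / len(p_enc)

p_enc = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
p_vote = [(0.5, 0.0, 0.0), (0.5, 0.5, 0.0), (5.0, 5.0, 5.0)]
boxes = [((0.5, 0.0, 0.0), (1.0, 1.0, 1.0))]
loss = vote_query_loss(p_vote, p_enc, boxes)
```

Tokens that fall outside every object, such as the third one here, contribute nothing, so background points are free to stay where they are.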

For captioning, two objectives are used:

  • MLE Loss (Maximum Likelihood Estimation):

$$\mathcal{L}_{cap}^{\rm MLE} = -\sum_{t=1}^T \log P(w_t \mid w_{1:t-1}, \mathcal{V})$$

  • SCST Loss (Self-Critical Sequence Training, using CIDEr reward):

$$\mathcal{L}_{cap}^{\rm SCST} = -\sum_{i=1}^k \left(R(\hat{c}_i) - R(\hat{g})\right) \frac{1}{|\hat{c}_i|} \log P(\hat{c}_i \mid \mathcal{V})$$
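
Both captioning objectives reduce to simple arithmetic once per-token log-probabilities and rewards are available. The following sketch uses made-up numbers and hypothetical function names; in practice the rewards would be CIDEr scores and would be treated as constants when differentiating.

```python
def mle_loss(token_logprobs):
    """Negative log-likelihood of the ground-truth caption tokens."""
    return -sum(token_logprobs)

def scst_loss(sample_rewards, greedy_reward, sample_logprobs):
    """Self-critical loss: sampled captions that beat the greedy caption are reinforced.
    sample_logprobs[i] holds the per-token log-probabilities of sampled caption i."""
    loss = 0.0
    for reward, token_lps in zip(sample_rewards, sample_logprobs):
        advantage = reward - greedy_reward  # R(c_i) - R(g)
        seq_logprob = sum(token_lps)        # log P(c_i | V)
        loss -= advantage * seq_logprob / len(token_lps)
    return loss

rewards = [0.9, 0.4]  # toy rewards for two sampled captions
greedy = 0.6          # toy reward of the greedy-decoded caption
logprobs = [[-0.1, -0.2, -0.3], [-0.5, -0.5]]
loss = scst_loss(rewards, greedy, logprobs)
```

The first sample scores above the greedy baseline, so minimizing the loss raises its probability, while the second scores below the baseline and is pushed down.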

The multi-task objective is

$$\mathcal{L}_{\rm total} = \lambda_{loc}\, \mathcal{L}_{loc} + \lambda_{cap}\,\mathcal{L}_{cap},$$

where $\mathcal{L}_{loc} = \mathcal{L}_{vq} + \mathcal{L}_{det}$ (plus refinement loss as appropriate) and $(\lambda_{loc}, \lambda_{cap})$ are validation-tuned (Chen et al., 2023, Chen et al., 2023).

5. Datasets, Evaluation, and Empirical Performance

Vote2Cap-DETR and Vote2Cap-DETR++ are evaluated on ScanRefer and Nr3D—benchmarks for 3D region-level captioning and localization. Key metrics are:

  • CIDEr@0.5: CIDEr@IoU≥0.5, measuring caption quality conditioned on sufficiently accurate region localization.
  • mAP@0.5: Standard mean average precision for 3D object detection with IoU≥0.5.
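
An IoU-gated caption metric of this kind is commonly computed by zeroing the caption score of any object whose predicted box overlaps its ground truth too little, then averaging over all ground-truth objects. The sketch below assumes that formulation, axis-aligned boxes given as (min corner, max corner), and predictions already paired with ground truth; the caption scores are placeholders rather than real CIDEr values.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes, each given as (min_corner, max_corner)."""
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[0], a[1], b[0], b[1]):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0
        inter *= overlap

    def volume(box):
        return ((box[1][0] - box[0][0]) * (box[1][1] - box[0][1])
                * (box[1][2] - box[0][2]))

    return inter / (volume(a) + volume(b) - inter)

def metric_at_iou(caption_scores, pred_boxes, gt_boxes, threshold=0.5):
    """Average caption score over ground-truth objects, counting a score only
    when the paired predicted box reaches the IoU threshold."""
    kept = [s if box_iou_3d(p, g) >= threshold else 0.0
            for s, p, g in zip(caption_scores, pred_boxes, gt_boxes)]
    return sum(kept) / len(gt_boxes)

gt = [((0, 0, 0), (2, 2, 2)), ((0, 0, 0), (2, 2, 2))]
pred = [((0, 0, 0), (2, 2, 1.5)), ((1, 1, 1), (3, 3, 3))]
scores = [0.8, 0.9]  # placeholder per-object caption scores
result = metric_at_iou(scores, pred, gt)
```

The second prediction has the better caption score but an IoU of only 1/15, so it contributes nothing, which is how the metric penalizes good descriptions attached to poorly localized boxes.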

Summarized results:

| Method | Dataset | CIDEr@0.5 (SCST) | mAP@0.5 |
|---|---|---|---|
| VoteNet+DtD baseline | ScanRefer | ≈46% | ≈38% |
| Vote2Cap-DETR | ScanRefer | 73.8% | 52.1% |
| Vote2Cap-DETR++ | ScanRefer | 78.2% | 55.5% |
| Best prior (SCST) | Nr3D | ≈38% | |
| Vote2Cap-DETR | Nr3D | 45.5% | |
| Vote2Cap-DETR++ | Nr3D | 47.6% | |

These models consistently outperform previous detect-then-describe pipelines on both localization and captioning (Chen et al., 2023, Chen et al., 2023).

6. Ablations and Analysis

Ablations confirm the contribution of core architectural elements:

7. Position in the Literature and Limitations

Vote2Cap-DETR and Vote2Cap-DETR++ are the first transformer-based, one-stage architectures for 3D dense captioning that replace multi-step, hand-crafted relation modules with set prediction and spatially aware transformer heads. This approach avoids the error accumulation typical of cascade pipelines, which suffer from duplicated and inaccurate box proposals in cluttered scenes.

Qualitative analysis shows that vote queries cluster near object centers, enhancing detection, while the dual-clued captioner improves attribute and relationship selection by exploiting rich 3D context. Limitations include the lack of explicit modeling for rotated bounding boxes and a reliance on CIDEr-driven SCST, which may bias generation toward n-gram matching rather than conceptual diversity (Chen et al., 2023, Chen et al., 2023).

References (2)
