
Multi3DRefer: 3D Visual Grounding Benchmark

Updated 2 December 2025
  • Multi3DRefer is a benchmark and dataset for generalized 3D visual grounding that supports flexible, open-set text-to-3D-object localization across zero, single, and multi-target scenarios.
  • It features 800 scenes with 11,609 object instances and 61,926 text descriptions, incorporating specialized splits for zero-target, single-target, and multi-target cases.
  • The benchmark uses F1-score evaluations at multiple IoU thresholds to drive advances in multi-modal fusion, relational reasoning, and embodied 3D scene understanding.

Multi3DRefer is a benchmark and dataset for generalized 3D visual grounding, targeting the problem of localizing zero, one, or multiple objects in a real-world scene based on a natural language description. Unlike classical 3D referring tasks, which restrict queries to one-object outputs, Multi3DRefer covers open-set, variable-cardinality text-to-3D-object grounding, thereby better reflecting requirements for embodied agents and natural interaction in complex environments. The benchmark has catalyzed substantial advancements in 3D vision-language modeling, motivating architectures and training regimes engineered for compositional, relational, and multi-instance reasoning over 3D scenes.

1. Task Definition and Dataset Construction

The Multi3DRefer benchmark formalizes the task as follows. Given a colored point cloud $P \in \mathbb{R}^{N \times (3+C)}$ (where $N$ is the number of points and $C$ encodes per-point attributes such as color or normals) and a natural language description $T$ (a sequence of $L$ tokens), the system predicts a set of axis-aligned 3D bounding boxes $\{B_i^{\text{pred}}\}_{i=1}^{K}$ in world coordinates. The cardinality $K$ is flexible: $K = 0$ (no object matches), $K = 1$ (a single object), or $K > 1$ (multiple objects).
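
To make the flexible-cardinality interface concrete, the following minimal Python sketch shows the expected inputs and outputs; the `Box3D` and `ground` names are illustrative placeholders, not part of the benchmark's released code.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Box3D:
    """Axis-aligned 3D bounding box in world coordinates (min/max corners)."""
    corner_min: np.ndarray  # shape (3,): xmin, ymin, zmin
    corner_max: np.ndarray  # shape (3,): xmax, ymax, zmax

def ground(point_cloud: np.ndarray, description: str) -> List[Box3D]:
    """Generalized 3D visual grounding interface.

    point_cloud: (N, 3 + C) array of xyz coordinates plus per-point
                 attributes such as RGB color or normals.
    description: free-form natural language query of L tokens.

    Returns K predicted boxes, where K may be 0 (no match),
    1 (single target), or >1 (multiple targets).
    """
    raise NotImplementedError  # model-specific
```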

Data construction builds on and generalizes ScanRefer, introducing new annotation protocols that mine ambiguous cases (for multi-target), create verified zero-target splits via cross-pairing and manual review, and expand the expression space through ChatGPT-based rephrasing. The final corpus comprises 800 scenes with 11,609 distinct object instances and 61,926 text descriptions, distributed as follows:

| Split | Descriptions |
| --- | --- |
| Zero-target (ZT) | 6,688 |
| Single-target (ST) | 42,060 |
| Multi-target (MT) | 13,178 |

The rephrasing expands the vocabulary from 5,067 to 7,077 unique words, and expressions average roughly 15 tokens after polishing (Zhang et al., 2023).
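
A single annotation record can be pictured as below; the field names are hypothetical, meant only to illustrate how one description maps to a variable-length set of target object IDs, and may differ from the released JSON schema.

```python
# Hypothetical Multi3DRefer-style record (field names are illustrative).
example_record = {
    "scene_id": "scene0011_00",   # ScanNet-style scene identifier
    "description": "the chairs pushed under the long table by the window",
    "object_ids": [4, 7, 12],     # [] => zero-target, one id => single-target,
                                  # several ids => multi-target
}
```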

2. Evaluation Metrics and Protocol

The benchmark defines an F1-score based metric for flexible-cardinality grounding at two IoU thresholds ($\tau = 0.25$ and $\tau = 0.5$). True positive matches are obtained by Hungarian matching between predicted and ground-truth box sets using IoU as the matching cost. Precision, recall, and F1 are then computed as usual. Special rules assign perfect scores in zero-target cases only if no prediction is made, reflecting the need for explicit abstention when queries are non-referential.
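
Below is a minimal sketch of this flexible-cardinality F1 computation, assuming axis-aligned corner-format boxes and using `scipy.optimize.linear_sum_assignment` for the Hungarian matching; it follows the protocol described above but is not the official evaluation script.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aabb_iou(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo, hi = np.maximum(a[:3], b[:3]), np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a, vol_b = np.prod(a[3:] - a[:3]), np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter + 1e-9)

def f1_at_iou(pred_boxes, gt_boxes, tau=0.5):
    """Flexible-cardinality F1 covering zero-, single-, and multi-target cases."""
    # Zero-target convention: full credit only when the model abstains.
    if len(gt_boxes) == 0:
        return 1.0 if len(pred_boxes) == 0 else 0.0
    if len(pred_boxes) == 0:
        return 0.0
    iou = np.array([[aabb_iou(p, g) for g in gt_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(-iou)      # Hungarian matching, maximize IoU
    tp = int(np.sum(iou[rows, cols] >= tau))      # matched pairs above threshold
    precision, recall = tp / len(pred_boxes), tp / len(gt_boxes)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
```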

Results are reported separately on five challenging sub-splits:

  • ZT w/o distractors: no matching objects, no same-class distractors present
  • ZT w/ distractors: no matching objects, distractors present
  • ST w/o distractors: single match, no distractors
  • ST w/ distractors: single match, with distractors
  • MT: multiple matches

This provides fine-grained insight into model robustness under ambiguity, distractor confusion, and relational compositionality (Zhang et al., 2023).

3. Baseline and Progression of Approaches

Initial baseline: The original M3DRef-CLIP baseline integrates 3D detector proposals (PointGroup), per-object online-rendered CLIP image features, and transformer fusion with a CLIP-based text encoder. A symmetric InfoNCE loss aligns sentence features with mean-pooled object features. This "expert model" yields 38.4% F1@0.5 on Multi3DRefer (predicted boxes, validation set) (Zhang et al., 2023). Ablations highlight the necessity of fusing both 3D and 2D features, as CLIP-only or 3D-only variants underperform.
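
The contrastive alignment used by M3DRef-CLIP can be illustrated with a short symmetric InfoNCE sketch; the pooling and temperature choices here are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(obj_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE between pooled object features and sentence features.

    obj_feats:  (B, D) mean-pooled object features, one row per description
    text_feats: (B, D) sentence-level text-encoder features
    The i-th object/text pair is the positive; all other rows act as negatives.
    """
    obj = F.normalize(obj_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = obj @ txt.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(obj.size(0), device=obj.device)
    loss_o2t = F.cross_entropy(logits, targets)       # object -> text direction
    loss_t2o = F.cross_entropy(logits.t(), targets)   # text -> object direction
    return 0.5 * (loss_o2t + loss_t2o)
```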

Unified architectures and LLM-based methods: The field rapidly evolved toward models with multi-modal, prompt-based, and graph/relational reasoning—driven by the limitations of dense patch-based fusion and the need for explicit object- and relation-centric modeling.

Notable further advances include:

  • Point Linguist Model (PLM): Decomposes the system into an object-centric discriminative representation (OcDR), an LLM-based cross-modal reasoning stage (LLaMA2-7B with LoRA), and a geometric reactivation decoder (GRD). OcDR produces tokens that carry object-level semantics and scene-relation cues, trained with hard-negative ("distractor") mining to encourage fine discrimination. Segmentation masks are reconstructed by mixing LLM-inferred geometry with preserved dense features. PLM achieves 42.1 mIoU on Multi3DRefer, a +6.0 mIoU improvement over SegPoint. Ablations demonstrate critical roles for both the object-centric input/output and the hard-negative supervision (Huang et al., 9 Sep 2025).
  • PQ3D: Unified segment-level grouping aligns voxels, point clouds, and rendered images. Attention-based promptable query decoders retrieve task-specific information, and universal output heads enable joint training over diverse tasks. PQ3D sets a new state of the art with 50.1% F1@0.5 (val), outperforming the prior best by +11.7 points, and confirms additive gains from multi-modal feature unification (Zhu et al., 19 May 2024).
  • Descrip3D: Augments each scene object with a natural-language relational description, generated via a vision-language model and encoded into scene tokens and prompt context. Dual-level integration (embedding fusion and prompt-level injection) enhances LLM reasoning about compositional, relational queries. Descrip3D achieves 55.1% F1@0.5, excelling at multi-object relational grounding (Xue et al., 19 Jul 2025).
  • 3DGraphLLM: Constructs a compact k-NN semantic scene graph for each 3D scene, encoding both node and edge (relation) features from multi-view 2D and 3D geometric features; these are serialized directly as LLM tokens (a minimal graph-construction sketch follows this list). 3DGraphLLM (LLaMA3-8B) achieves 58.2% F1@0.5, with semantic edges giving a 3-point gain over node-only encoding (Zemskova et al., 24 Dec 2024).
  • Robin3D: Leverages a Robust Instruction Generation (RIG) engine, which collects 344K adversarial and 508K diverse instruction samples, to fine-tune a 3D LLM. Key mechanisms include a Relation-Augmented Projector (RAP) and ID-Feature Bonding (IFB) that tie spatial context and object identity explicitly to tokens. Robin3D attains 59.7% F1@0.5, with adversarial data boosting performance by nearly 5 points (Kang et al., 30 Sep 2024).
  • Video-3D LLM: Treats each 3D scan as a coordinate-augmented video, injecting 3D position encodings into per-patch features for a pre-trained video LLM. Maximum-coverage greedy frame sampling further improves efficiency and recall. Uniform sampling yields 52.7% F1@0.5 (val set), with especially high recall on zero-target splits (Zheng et al., 30 Nov 2024).
  • MoE3D: Adopts a mixture-of-experts superpoint transformer, where each 3D superpoint token is adaptively routed to modality- or structure-specialized experts using Top-1 gating. Information-aggregation modules unify 3D geometry and language, followed by instruction-tuned LLMs. Progressive pre-training (2D→3D alignment, superpoint pre-training, then unified LoRA-based instruction tuning) supports effective transfer. MoE3D achieves 48.8% mIoU, improving by +6.1 over the previous best, 3D-LLaVA (Li et al., 27 Nov 2025).
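
As referenced in the 3DGraphLLM entry above, serializing a k-NN semantic scene graph into LLM tokens can be sketched as follows; the flat subject-relation-object ordering and the assumption that edge embeddings are projected to the node-embedding width are illustrative simplifications, not the paper's exact implementation.

```python
import numpy as np

def build_knn_edges(obj_centers, k=2):
    """Connect each object to its k nearest neighbors in 3D space.

    obj_centers: (M, 3) array of object centroids.
    Returns a list of directed (i, j) edges.
    """
    edges = []
    for i, c in enumerate(obj_centers):
        dists = np.linalg.norm(obj_centers - c, axis=1)
        dists[i] = np.inf                    # exclude self-loops
        for j in np.argsort(dists)[:k]:
            edges.append((i, int(j)))
    return edges

def serialize_graph_tokens(node_feats, edge_feats, edges):
    """Flatten the graph into a subject-relation-object token sequence.

    node_feats: (M, D) per-object embeddings (e.g., fused 2D/3D features)
    edge_feats: dict mapping (i, j) -> (D,) relation embedding, assumed
                projected to the same width D as the node embeddings
    """
    tokens = []
    for i, j in edges:
        tokens.extend([node_feats[i], edge_feats[(i, j)], node_feats[j]])
    return np.stack(tokens) if tokens else np.empty((0, node_feats.shape[1]))
```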

4. Analysis of Methodological Advances

Key advances have systematically addressed Multi3DRefer’s core challenges:

  • Object- and relation-centric modeling: PLM’s OcDR, Descrip3D’s dual-level language integration, and 3DGraphLLM’s semantic scene graphs all demonstrate that imbuing object tokens with relational, context-aware semantics significantly boosts performance in compositional multi-object queries. Explicit handling of distractors (hard-negative mining, adversarial data, semantic edges) is essential to robust grounding.
  • Flexible input/output modalities: PQ3D and Video-3D LLM permit training and inference on diverse, missing, or incomplete modalities—improving generality, practical deployment, and robustness. PQ3D’s segment-level alignment and promptable queries, and Video-3D LLM’s frame sampling, confer resilience to partial or noisy sensor input.
  • Mixture-of-experts and modularity: MoE3D’s sparse, dynamically selected experts efficiently handle the heterogeneous, structured nature of indoor 3D scenes and their high semantic diversity; ablations show that appropriate specialization and regularization (e.g., z-loss, load balancing) are essential for strong generalization (a minimal Top-1 gating sketch follows this list).
  • Scale and diversity in training: Robin3D highlights the importance of instruction diversity and adversarial sampling for capturing the full spectrum of Multi3DRefer query-types (including "no match" and ambiguous queries). Data diversity directly correlates with downstream generalization (Kang et al., 30 Sep 2024).
  • Efficient fusion: Early-stage cross-modal alignment (e.g., in PLM and PQ3D), query-centric decoding, and late-stage language-in-the-loop mask decoding (MoE3D, PLM) ensure that geometric fidelity is maintained throughout the pipeline. Maintaining “object placeholders” or ID tokens prevents geometric/semantic drift between modalities.
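
As noted in the mixture-of-experts item above, Top-1 (switch-style) routing with a load-balancing auxiliary loss can be sketched as follows; this is a generic router under standard assumptions (the z-loss on gate logits is omitted for brevity) and is not MoE3D's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Generic Top-1 mixture-of-experts layer over a set of tokens."""

    def __init__(self, dim, num_experts=4, hidden=None):
        super().__init__()
        hidden = hidden or 4 * dim
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (num_tokens, dim), e.g., superpoint tokens
        probs = F.softmax(self.gate(x), dim=-1)        # (T, E) gate probabilities
        expert_idx = probs.argmax(dim=-1)              # Top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Scale by the gate probability so routing stays differentiable.
                out[mask] = probs[mask, e:e + 1] * expert(x[mask])
        # Load-balancing auxiliary loss: fraction of tokens routed to each
        # expert times its mean gate probability, scaled by the expert count.
        frac = torch.stack([(expert_idx == e).float().mean()
                            for e in range(len(self.experts))])
        aux_loss = (frac * probs.mean(dim=0)).sum() * len(self.experts)
        return out, aux_loss
```

In practice the auxiliary loss is added to the task loss with a small weight so that tokens spread across experts rather than collapsing onto one.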

5. Quantitative Results and Benchmark State-of-the-Art

Below is a summary of leading models on Multi3DRefer (official validation split), presenting the main figures of merit for F1@0.25 and F1@0.5.

| Model | F1@0.25 | F1@0.5 | mIoU | Reference |
| --- | --- | --- | --- | --- |
| M3DRef-CLIP | 42.8 | 38.4 | – | (Zhang et al., 2023) |
| PQ3D (unified) | – | 50.1 | – | (Zhu et al., 19 May 2024) |
| Video-3D LLM | 58.0 | 52.7 | – | (Zheng et al., 30 Nov 2024) |
| Descrip3D | 59.4 | 55.1 | – | (Xue et al., 19 Jul 2025) |
| 3DGraphLLM | 63.0 | 58.2 | – | (Zemskova et al., 24 Dec 2024) |
| Robin3D | 64.9 | 59.7 | – | (Kang et al., 30 Sep 2024) |
| MoE3D | – | – | 48.8 | (Li et al., 27 Nov 2025) |
| PLM | – | – | 42.1 | (Huang et al., 9 Sep 2025) |

Note: PLM and MoE3D primarily report mIoU (mean intersection-over-union) rather than F1 scores over predicted bounding boxes; the table follows the reporting conventions of each source.

Improvements of 6–11 points (F1 or mIoU) have been achieved within 18 months, reflecting substantial advances in open-set, multi-object language grounding in 3D.

6. Limitations, Failure Modes, and Directions

Common weaknesses across state-of-the-art approaches include:

  • Multi-object relational ambiguity: Even models with explicit relational context (e.g., 3DGraphLLM, Descrip3D) may miss objects when the cardinality or set structure is complex, or relations are occluded or unobserved (Zemskova et al., 24 Dec 2024, Xue et al., 19 Jul 2025).
  • Zero-target and distractor confusion: Performance often drops when the scene contains semantically similar distractors, or when the "no match" case must be robustly detected under vague language (Zhang et al., 2023, Zhu et al., 19 May 2024).
  • Token/window bottlenecks: Scene graph or prompt-based architectures can incur token or memory overheads, especially for scenes with >100 objects or higher-order neighbor connections (Zemskova et al., 24 Dec 2024).
  • Expert specialization and routing: In MoE3D, over-fragmentation or poorly regulated routing (e.g., without proper z-loss) can degrade performance. Expert count and placement are critical hyperparameters (Li et al., 27 Nov 2025).
  • Data resource dependency: Methods leveraging large-scale, diverse or adversarial instruction sets (Robin3D) are sensitive to data quality, coverage, and the anchoring of synthetic negative samples (Kang et al., 30 Sep 2024).

A plausible implication is that further progress will require (i) scalable, memory-efficient object and relation encoding strategies, (ii) robust open-vocabulary and attribute-level grounding, and (iii) dynamic, context-adaptive reasoning mechanisms.

7. Impact and Research Significance

Multi3DRefer has established itself as the de facto standard for evaluating generalized 3D language grounding under unconstrained cardinality. Its challenging splits and detailed metric reporting have driven innovation spanning explicit object-relation modeling, unified modality fusion, mixture-of-experts design, and instruction-driven learning strategies. The benchmark catalyzed methods that unify multi-task 3D vision-language understanding—enabling advances across segmentation, QA, captioning, and more. Its design and ongoing evolution continue to foster research at the intersection of scene understanding, embodied AI, and multimodal reasoning (Zhang et al., 2023, Huang et al., 9 Sep 2025, Zhu et al., 19 May 2024, Zheng et al., 30 Nov 2024, Kang et al., 30 Sep 2024, Zemskova et al., 24 Dec 2024, Xue et al., 19 Jul 2025, Li et al., 27 Nov 2025).
