ScanRefer Benchmark: 3D Visual Grounding

Updated 16 April 2026

ScanRefer is a benchmark that evaluates single-object 3D visual grounding by linking free-form language to specific objects in richly labeled indoor RGB-D scans.
It uses rigorous metrics such as IoU, Acc@t, and segmentation F1 to assess performance, addressing challenges in both unique and multiple object references.
Advanced methods like panoptic matching and vision-language fusion demonstrate improved accuracy, yet the benchmark reveals ongoing challenges in complex spatial reasoning.

The ScanRefer benchmark is a standardized evaluation suite for 3D object localization via natural language in RGB-D scans, central to the 3D visual grounding (3DVG) and referring expression comprehension (REC) research communities. Constructed atop ScanNet’s richly labeled 3D reconstructions of indoor scenes, it uniquely correlates raw, free-form language descriptions to specific object instances in real-world 3D data, driving multimodal research at the intersection of geometry, vision, and language.

1. Dataset Composition and Linguistic Characteristics

ScanRefer comprises 51,583 human-authored referring expressions describing 11,046 object instances across 800 unique indoor scenes sampled from ScanNet’s RGB-D corpus (Chen et al., 2019). The scenes are diverse (living rooms, offices, etc.) and contain a broad set of objects: the original taxonomy defines 18 coarse categories (chair, table, sofa, window, etc.), with “chair” being the most frequent. Each object is referenced on average by 4.67 distinct descriptions, with a vocabulary size of ≈4,200 and mean utterance length of ≈20 tokens.

Descriptions are highly compositional: spatial relations are present in 98.7% of queries, color in 74.7%, shape in 64.9%, and size in 14.2%. Comparative (672) and superlative (2,734) cues frequently appear. The tail of the dataset contains especially challenging cases where geometric, visual, and discourse-level cues must be reasoned jointly.

The dataset is split strictly by scenes: 562 training, 141 validation, and 97 test scenes, corresponding respectively to 7,875, 2,068, and 1,103 unique objects, and 36,665, 9,508, and 5,410 descriptions (Chen et al., 2019, Li et al., 2024, Zhang et al., 2023).
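The split figures can be checked against the headline totals with a few lines of arithmetic. The sketch below only re-adds the numbers quoted in the text; the dictionary layout and variable names are illustrative choices, not part of any official toolkit.

```python
# Split statistics as quoted above: (scenes, unique objects, descriptions).
splits = {
    "train": (562, 7_875, 36_665),
    "val": (141, 2_068, 9_508),
    "test": (97, 1_103, 5_410),
}

total_scenes = sum(s[0] for s in splits.values())
total_objects = sum(s[1] for s in splits.values())
total_descriptions = sum(s[2] for s in splits.values())

# Average number of descriptions written per object instance.
descriptions_per_object = total_descriptions / total_objects

print(total_scenes, total_objects, total_descriptions)
print(round(descriptions_per_object, 2))
```

The sums reproduce the 800 scenes, 11,046 objects, and 51,583 descriptions stated in Section 1, and the ratio rounds to the quoted 4.67 descriptions per object.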

2. Task Definition and Annotation Protocol

The primary task in ScanRefer is single-object 3D visual grounding: given a 3D point cloud $P=\{p_i\}_{i=1,\ldots,N_P}$ with features $f'_i$ (XYZ, RGB, normals, multi-view image features, height) and a free-form language description $d$ (a token sequence mapped to 300-d GloVe vectors and encoded by a GRU), predict an axis-aligned 3D bounding box $B_{\text{pred}}$ with center and size parameters $(c_x, c_y, c_z, r_x, r_y, r_z)$ corresponding to the described object (Chen et al., 2019, Li et al., 2024).

Annotations were crowdsourced: annotators were shown objects taken from ScanNet’s existing instance segmentations and wrote natural language descriptions that unambiguously reference each object among its distractors (“the floral chair next to the three-seater couch”). The split between “unique” (one instance of the class per scene) and “multiple” (several same-class distractors) is explicit and used for stratified evaluation (Zhang et al., 2023, Xu et al., 2024).
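The “unique”/“multiple” stratification can be derived from per-scene class counts. The following is a minimal sketch assuming each scene is available as a mapping from object id to class name; the function name `label_subsets` and the toy scene are hypothetical and not taken from the benchmark code.

```python
from collections import Counter


def label_subsets(scene_objects):
    """Map each object id to 'unique' or 'multiple'.

    scene_objects: dict of {object_id: class_name} for a single scene.
    An object is 'unique' when no other object in the scene shares its class.
    """
    class_counts = Counter(scene_objects.values())
    return {
        obj_id: "unique" if class_counts[cls] == 1 else "multiple"
        for obj_id, cls in scene_objects.items()
    }


# Toy scene: three chairs, one sofa, one table.
toy_scene = {0: "chair", 1: "chair", 2: "sofa", 3: "table", 4: "chair"}
subsets = label_subsets(toy_scene)
print(subsets)
```

Here the sofa and table fall into the “unique” subset, while every chair is “multiple” because the other chairs act as same-class distractors.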

3. Evaluation Metrics and Protocol

Performance is measured using intersection-over-union ($\mathrm{IoU}$) between axis-aligned 3D boxes:

$\mathrm{IoU}(B_{\text{pred}}, B_{\text{gt}}) = \frac{\mathrm{vol}(B_{\text{pred}}\cap B_{\text{gt}})}{\mathrm{vol}(B_{\text{pred}}\cup B_{\text{gt}})}$
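For axis-aligned boxes the intersection volume factorizes into per-axis overlaps, so the IoU can be computed directly from the box parameters. The sketch below assumes boxes are given as $(c_x, c_y, c_z, r_x, r_y, r_z)$ with $r$ interpreted as the full edge length along each axis; that interpretation, and the function name `aabb_iou`, are assumptions made for illustration.

```python
def aabb_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, rx, ry, rz).

    Assumes rx, ry, rz are full edge lengths (not half-extents).
    """
    intersection = 1.0
    for axis in range(3):
        a_min = box_a[axis] - box_a[axis + 3] / 2
        a_max = box_a[axis] + box_a[axis + 3] / 2
        b_min = box_b[axis] - box_b[axis + 3] / 2
        b_max = box_b[axis] + box_b[axis + 3] / 2
        overlap = min(a_max, b_max) - max(a_min, b_min)
        if overlap <= 0:
            return 0.0  # boxes are disjoint (or only touch) along this axis
        intersection *= overlap

    vol_a = box_a[3] * box_a[4] * box_a[5]
    vol_b = box_b[3] * box_b[4] * box_b[5]
    union = vol_a + vol_b - intersection
    return intersection / union


# Two 2x2x2 cubes offset by 1 along x: intersection 4, union 12.
box_pred = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
box_gt = (1.0, 0.0, 0.0, 2.0, 2.0, 2.0)
print(aabb_iou(box_pred, box_gt))
```

In the toy case the cubes share half of their volume, giving an IoU of one third, which would pass a 0.25 threshold but fail a 0.5 threshold.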

Top-$1$ accuracy at threshold $t$ ($\mathrm{Acc}@t$) is the proportion of samples whose predicted box satisfies $\mathrm{IoU} \geq t$, evaluated at $t \in \{0.25, 0.5\}$ (Chen et al., 2019, Zhang et al., 9 Mar 2026, Li et al., 2024). For segmentation settings (3DRES), mask-level $\mathrm{Acc}@0.25$ and $\mathrm{Acc}@0.5$ are also employed (Tan et al., 18 Mar 2026).
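Reading Acc@t as the fraction of samples whose top-1 prediction reaches an IoU of at least t with the ground truth, the metric reduces to a thresholded mean. This is a minimal sketch: the function name `acc_at`, the toy IoU values, and the use of a non-strict comparison at the threshold are assumptions.

```python
def acc_at(ious, threshold):
    """Fraction of samples whose IoU meets or exceeds the threshold."""
    if not ious:
        raise ValueError("need at least one sample")
    return sum(iou >= threshold for iou in ious) / len(ious)


# One IoU per (scene, description) sample, e.g. from the top-scoring box.
toy_ious = [0.80, 0.45, 0.30, 0.10]
acc_025 = acc_at(toy_ious, 0.25)
acc_050 = acc_at(toy_ious, 0.50)
print(acc_025, acc_050)
```

On the toy values, three of four samples clear 0.25 but only one clears 0.5, which mirrors the typical gap between the two thresholds in reported results.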

For multi-object extensions (Zhang et al., 2023), bipartite matching (Hungarian algorithm) aligns predicted and ground-truth boxes, allowing calculation of Precision, Recall, and $\mathrm{F1}$ at various IoU thresholds. Special handling ensures correct scoring of “zero-target” queries.
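The matching-based score can be sketched for a single query as follows. To stay dependency-free, this example replaces the Hungarian algorithm with a brute-force search over assignments, which yields the same optimal matching for the handful of boxes in a typical query but does not scale. The function name `f1_at`, the objective (maximizing summed IoU), and the zero-target convention (score 1 only when nothing is predicted) are assumptions rather than the benchmark's reference implementation.

```python
from itertools import permutations


def f1_at(iou_matrix, num_gt, threshold):
    """F1 for one query; iou_matrix[p][g] is the IoU of prediction p and GT g."""
    num_pred = len(iou_matrix)
    if num_pred == 0 or num_gt == 0:
        # Zero-target handling: correct only if both sides are empty.
        return 1.0 if num_pred == num_gt else 0.0

    # Enumerate one-to-one assignments of the smaller set into the larger one.
    if num_pred <= num_gt:
        candidates = (
            [(p, g) for p, g in enumerate(perm)]
            for perm in permutations(range(num_gt), num_pred)
        )
    else:
        candidates = (
            [(p, g) for g, p in enumerate(perm)]
            for perm in permutations(range(num_pred), num_gt)
        )
    best = max(candidates, key=lambda pairs: sum(iou_matrix[p][g] for p, g in pairs))

    true_positives = sum(iou_matrix[p][g] >= threshold for p, g in best)
    if true_positives == 0:
        return 0.0
    precision = true_positives / num_pred
    recall = true_positives / num_gt
    return 2 * precision * recall / (precision + recall)


# Three predictions against two ground-truth boxes.
toy_matrix = [[0.70, 0.10], [0.20, 0.60], [0.05, 0.30]]
f1_toy = f1_at(toy_matrix, num_gt=2, threshold=0.5)
print(f1_toy)
```

In the toy case both ground-truth boxes are recovered but one prediction is spurious, so precision is 2/3, recall is 1, and F1 is 0.8. For larger box sets an assignment solver would be the practical choice.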

4. Baseline and Advanced Methodologies

4.1 Classical Baselines

The canonical baseline (Chen et al., 2019) uses a VoteNet-style detector to generate a bounded set of candidate bounding boxes from $P$, extracts a 128-d feature for each proposal, fuses it with the 256-d language embedding via an MLP, and scores proposals with a linear softmax head, selecting the top-scoring one. The end-to-end loss combines localization, detection, and auxiliary language-to-object classification terms, each with an empirically chosen weight.
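The fuse-then-score step can be illustrated with a toy forward pass. This is a sketch under stated assumptions, not the authors' implementation: the weights are random and untrained, the hidden width (`HIDDEN_DIM`) and the number of proposals are arbitrary, and only the 128-d and 256-d feature sizes come from the text.

```python
import math
import random

PROPOSAL_DIM = 128   # per-proposal feature size (from the text)
LANGUAGE_DIM = 256   # language embedding size (from the text)
HIDDEN_DIM = 16      # assumption: hidden width of the toy fusion MLP
NUM_PROPOSALS = 8    # assumption: toy number of candidate boxes

rng = random.Random(0)


def rand_vec(n, scale=1.0):
    return [rng.gauss(0.0, scale) for _ in range(n)]


# Untrained toy parameters: one hidden layer followed by a linear scoring head.
W1 = [rand_vec(PROPOSAL_DIM + LANGUAGE_DIM, 0.05) for _ in range(HIDDEN_DIM)]
w2 = rand_vec(HIDDEN_DIM, 0.05)


def score_proposal(proposal_feat, language_emb):
    """Concatenate the two modalities, apply a ReLU layer, return a scalar score."""
    fused = proposal_feat + language_emb  # list concatenation -> 384-d vector
    hidden = [max(0.0, sum(w * x for w, x in zip(row, fused))) for row in W1]
    return sum(w * h for w, h in zip(w2, hidden))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


language_emb = rand_vec(LANGUAGE_DIM)
proposals = [rand_vec(PROPOSAL_DIM) for _ in range(NUM_PROPOSALS)]
probs = softmax([score_proposal(p, language_emb) for p in proposals])
best_index = max(range(len(probs)), key=probs.__getitem__)
print(best_index, probs[best_index])
```

The selected index identifies which candidate box would be returned as the grounding result; in a trained model the softmax output would be supervised by the localization loss.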

Ablations over the voting, proposal, and fusion components show limited performance with semantic labels alone (21.88%), rising with richer geometry (normals, multi-view image features) and full descriptions to as much as 27.40% Acc@0.5 on val.

4.2 Classical and Panoptic Matching Advances

InstanceRefer (Yuan et al., 2021) replaces proposal-centric matching with panoptic segmentation, predicting candidate instance point clouds using PointGroup. Holistic matching aggregates attributes (AP), instance-to-instance relations (RP), and global localization (GLP), and uses deep co-attention, achieving 40.23%/32.93% (val) and 44.27%/35.80% (test) at Acc@0.25/0.5, a substantial improvement over the original baselines at the stricter threshold.

Look Around and Refer (LAR) (Bakr et al., 2022) distills 2D semantics by synthetically rendering multiple views per proposal, encoding these with a Tiny-ConvNeXt, and aligning them via a visual transformer. A hybrid five-term loss (including object class, referring, correspondence, and language class terms) guides joint 2D–3D multimodal learning, yielding 54.6% referring accuracy in the bounding-box setting, 1.0 point above the previous state of the art.

4.3 Vision-Language and Zero-Shot Grounding

SeeGround (Li et al., 2024) and VLM-Grounder (Xu et al., 2024) exploit pretrained 2D vision-language models (VLMs) for 3DVG, eschewing direct 3D supervision. SeeGround employs a Perspective Adaptation Module (PAM) for query-aligned rendering and a Fusion Alignment Module (FAM) to inject explicit object markers into rendered images, feeding paired images and text to a VLM to score the relevant box. This achieves 44.1% Acc@0.25, outperforming the prior zero-shot state of the art (ZSVG3D, 36.4%), with ablations confirming the complementary value of PAM and FAM.

VLM-Grounder dynamically stitches multi-view RGB-D images and uses a feedback/ensemble strategy for projecting grounded 2D masks back to 3D bounding boxes, obtaining 51.6% Acc@0.25, approaching the best fully supervised approaches (Xu et al., 2024).

Open-vocabulary and open-world generalization have been driven by non-3D-supervised models such as UniGround (Zhang et al., 9 Mar 2026), which achieves 46.1%/34.1% Acc@0.25/0.5 in a purely training-free, prompting-oriented scheme.

4.4 Unified and Dual-Task Grounding

PC-CrossDiff (Tan et al., 18 Mar 2026) unifies 3D bounding and segmentation via point-level and cluster-level bidirectional differential attention, coupled with a multi-task harmonized loss. On the ScanRefer test set, it achieves 58.47%/47.89% (REC, @0.25/@0.50) and 60.41%/52.52% (RES, @0.25/@0.50), with especially strong results on challenging implicit and multiple subsets. Differential attention yields notable improvements over vanilla cross-attention.

5. Extensions and Analysis: Multi-Target, Segmentation, Implicit Cues

Extensions such as Multi3DRefer (Zhang et al., 2023) broaden ScanRefer with zero-, single- and multi-target queries, attribute-type annotations (spatial, color, texture, shape), and a new evaluation metric (F1@t) for multi-grounding. Rescored/rewritten language with increased syntactic diversity enhances the challenge. Contrastive multimodal CLIP-based methods yield +6.8 points over geometry-only baselines in [email protected], and F1 improvements in the multi-target setting.

Segmentation-based benchmarks (ScanRefer 3DRES; Tan et al., 18 Mar 2026) deploy per-point binary mask tasks in parallel with bounding-box localization, revealing complementary difficulties and solution pathways.

Difficulties in parsing implicit cues (queries without explicit spatial prepositions) are systematically evaluated, with PC-CrossDiff reporting +10.16% absolute gain over previous best methods on [email protected].

6. Key Challenges, Open Problems, and Best Practices

Oracle baselines (Chen et al., 2019) show that knowledge of the semantic class yields near-perfect accuracy only for “unique” queries; performance collapses with multiple distractors. Even with access to ground-truth boxes, establishing the mapping from language to object is far from solved (~73.5% Acc@0.5 on “unique,” ~32% on “multiple”).

Error analysis identifies failure modes: suboptimal 3D proposals for thin/flat objects (pictures, sinks), ambiguity in complex spatial discourse (“third from the wall,” “shorter than…”), and occlusion. Best practices established include using full multi-sentence descriptions, enriching point features with normals and multi-view semantics, and fusing proposal and language representations with attention (Chen et al., 2019, Yuan et al., 2021). Future benchmarks are expected to prioritize improved spatial reasoning, zero-shot class grounding, and fully unified 2D–3D segmentation/localization.

7. Comparative Performance and Ongoing Evolution

Reported performance on ScanRefer has risen from the original baseline (41.2% Acc@0.25, 27.4% Acc@0.5 (Chen et al., 2019)) to state-of-the-art supervised methods (MCLN 57.2%/45.7%, TSP3D 56.5%/46.7% (Zhang et al., 9 Mar 2026)) and dual-task attention architectures (PC-CrossDiff 58.47%/47.89% (Tan et al., 18 Mar 2026)). Intermediate points include panoptic, instance-centric matching (InstanceRefer 40.2%/32.9% (Yuan et al., 2021)) and CLIP-based hybrid models (LAR, 54.6% (Bakr et al., 2022)). The strongest zero-shot VLM-based models reach 51.6%/32.8% (VLM-Grounder) and 44.1%/39.4% (SeeGround) (Xu et al., 2024, Li et al., 2024).

A tabular summary of selected methods’ overall val-set accuracy is given below:

Method	Supervision	Acc@0.25 (%)	Acc@0.5 (%)
ScanRefer (2020)	Full	41.2	27.4
InstanceRefer	Full	40.2	32.9
MCLN	Full	57.2	45.7
PC-CrossDiff	Full	58.5	47.9
VLM-Grounder	Zero-shot	51.6	32.8
SeeGround	Zero-shot	44.1	39.4
UniGround	Zero-shot/OW	46.1	34.1

The persistent gap between the “multiple” and “unique” splits, together with best box-grounding scores that remain below 60%, reflects the intrinsic difficulty of holistic 3DVG under unconstrained natural language and complex indoor geometry (Zhang et al., 9 Mar 2026, Li et al., 2024, Tan et al., 18 Mar 2026, Yuan et al., 2021).

The ScanRefer benchmark thus defines the canonical problem and dataset for single-object 3D visual grounding via language, providing both a standard for comparative analysis and an evolving platform for new algorithmic developments in geometry–language–vision integration. Its enduring challenge has motivated substantial progress in proposal mechanisms, cross-modal fusion, panoptic reasoning, and open-set/compositional generalization.