ScanRefer: 3D Object Localization Dataset
- ScanRefer is a large-scale dataset enabling direct grounding of free-form natural language expressions to localize objects within reconstructed 3D indoor scenes.
- It comprises 1,613 RGB-D scans with over 11,000 annotated objects and 51,583 detailed descriptions, offering a rich benchmark for multi-modal scene understanding.
- The dataset supports advancements in embodied AI, robotics, and AR through rigorous evaluation metrics and protocols for spatial language grounding.
ScanRefer is a large-scale dataset designed for the task of 3D object localization in indoor RGB-D scans using free-form natural language descriptions. It provides the foundational benchmark for research in grounding complex referring expressions directly in full 3D reconstructed environments, enabling rigorous investigation of multi-modal scene understanding, embodied AI, and language-based object localization (Chen et al., 2019).
1. Motivation and Task Definition
The core motivation behind ScanRefer is to enable the localization of arbitrary objects within fully reconstructed 3D scenes using only natural language queries as guidance. Unlike prior efforts restricted to 2D images or those relying on projecting language references onto 3D via intermediate steps, ScanRefer targets direct linguistic grounding within 3D point clouds. The benchmark addresses practical application demands in robotics, human–computer interaction, and AR by requiring models to ground free-form referring expressions—capturing reference via spatial layout, appearance, and semantic context—against complex environments (Chen et al., 2019, Li et al., 2024).
The formal task input is a pair (P, D), where P is the point cloud of an indoor scene and D is a natural language description referring to a unique target object. The output is a tight, axis-aligned 3D bounding box predicted to match the described instance.
2. Data Composition and Structure
ScanNet Foundation
ScanRefer is constructed atop ScanNet, which provides 1,613 RGB-D scans of 806 distinct indoor environments. After filtering out scenes lacking describable objects, 800 scenes are included in the final dataset (Chen et al., 2019).
Object and Expression Statistics
- Object Instances: 11,046 objects across 265 ScanNet object classes, subsequently aggregated into 18 common semantic categories (e.g., chair, table, cabinet, plus an “others” class).
- Natural Language Descriptions: 51,583 free-form, multi-sentence descriptions, averaging 4.67 descriptions per object instance. Descriptions average 20.27 words each, yielding a vocabulary of 4,197 unique tokens.
Dataset Splits
The scenes are partitioned into non-overlapping train/val/test splits for comparative benchmarking:
| Split | Scenes | Objects | Descriptions |
|---|---|---|---|
| Train | 562 | 7,875 | 36,665 |
| Validation | 141 | 2,068 | 9,508 |
| Test | 97 | 1,103 | 5,410 |
All splits are disjoint at the scene level. Each referred object is unambiguously annotated; multiple instances per class per scene may appear, necessitating nuanced referential disambiguation (Chen et al., 2019).
File Organization and Schema
The dataset follows a rigorous directory structure:
```
ScanRefer/
├── pointclouds/
│   ├── sceneXXXX_XX.ply   # reconstructed point cloud
│   ├── sceneXXXX_XX.npz   # per-point features (RGB, normals, height, 128D multiview)
├── annotations/
│   ├── train.json
│   ├── val.json
│   └── test.json
└── objects/
    └── object_bboxes.json # ground-truth boxes
```
An example annotation record:
```json
{
  "scene_id": "scene0101_00",
  "object_id": 17,
  "class_name": "nightstand",
  "bbox": [1.25, 0.30, 2.10, 1.75, 0.60, 2.40],
  "description": "This is a short nightstand next to the bed. It has two drawers and is dark brown.",
  "tokens": ["This","is","a","short","nightstand","next","to","the","bed",".","It","has","two","drawers","and","is","dark","brown","."],
  "split": "train"
}
```
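Given the record layout above, loading an annotation file and indexing its entries by scene might look like the following sketch (the list-of-records file format is assumed from the schema shown; adjust for the actual release):

```python
import json
from collections import defaultdict

def load_annotations(path):
    """Load a ScanRefer-style annotation file and index its records by scene_id."""
    with open(path) as f:
        records = json.load(f)  # assumed: a JSON list of annotation dicts
    by_scene = defaultdict(list)
    for rec in records:
        by_scene[rec["scene_id"]].append(rec)
    return by_scene

# An in-memory record matching the schema above, for illustration:
record = {
    "scene_id": "scene0101_00",
    "object_id": 17,
    "class_name": "nightstand",
    "bbox": [1.25, 0.30, 2.10, 1.75, 0.60, 2.40],
    "description": "This is a short nightstand next to the bed.",
    "tokens": ["This", "is", "a", "short", "nightstand"],
    "split": "train",
}
```

Indexing by scene is the natural access pattern here, since evaluation iterates over one scene's point cloud while scoring all of its descriptions.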
3. Annotation Protocol and Linguistic Phenomena
Description Collection
Descriptions are obtained via a web-based 3D UI deployed to English-speaking crowd workers. Target objects are visually highlighted, with all other objects faded for context. Annotators observe the full 3D mesh as well as several 2D perspectives showing color and texture. Each object is described by five distinct workers, instructed to use at least two full sentences emphasizing physical appearance (color, shape, material) and spatial relations to other scene elements (Chen et al., 2019).
Verification and Filtering
Post-collection, trained student annotators perform a verification phase:
- Presented with a scene and a referring expression, the verifier selects all matching objects.
- Descriptions matching zero or more than one object are discarded.
- Spelling and obvious wording mistakes are corrected; additional manual filtering yields a final set with ~2,823 invalid descriptions removed and ~2,129 lightly edited (Chen et al., 2019).
Expression Types
The dataset exhibits a rich diversity of reference:
- Spatial terms: 98.7% of descriptions
- Color attributes: 74.7%
- Shape attributes: 64.9%
- Explicit size terms: 14.2%
- Comparatives: 672 instances (e.g., “smaller”, “bigger”)
- Superlatives: 2,734 (e.g., “largest”, “rightmost”)
- Ordinal spatial: frequent (“second chair from the door”)
This distribution directly enables fine-grained disambiguation and complex linguistic grounding (Chen et al., 2019).
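Statistics of this kind can be approximated by keyword matching over the tokenized descriptions. The sketch below illustrates the idea; the term lists are illustrative stand-ins, not the authors' exact lexicons:

```python
# Illustrative (incomplete) keyword lexicons for each expression type.
SPATIAL = {"next", "left", "right", "above", "below", "between", "near",
           "behind", "front", "beside"}
COLOR = {"brown", "black", "white", "red", "blue", "green", "gray", "dark"}
COMPARATIVE = {"smaller", "bigger", "larger", "closer"}
SUPERLATIVE = {"largest", "smallest", "rightmost", "leftmost", "closest"}

def tag_description(tokens):
    """Return which coarse expression types appear in one tokenized description."""
    toks = {t.lower() for t in tokens}
    return {
        "spatial": bool(toks & SPATIAL),
        "color": bool(toks & COLOR),
        "comparative": bool(toks & COMPARATIVE),
        "superlative": bool(toks & SUPERLATIVE),
    }
```

Aggregating these flags over all 51,583 descriptions yields per-type frequencies comparable to the percentages listed above.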
4. Benchmarking, Metrics, and Baselines
Localization Metrics
The principal evaluation metrics are:
- 3D Intersection-over-Union (IoU): volumetric overlap ratio between the predicted box B_pred and the ground-truth box B_gt, IoU = vol(B_pred ∩ B_gt) / vol(B_pred ∪ B_gt).
- Accuracy at k-IoU ([email protected]): fraction of queries for which the top-1 predicted bounding box exceeds IoU threshold k with the ground truth. Common thresholds: k = 0.25 and k = 0.5.
- Recall@5: Proportion of samples for which any of the top 5 predictions exceed a 0.5 IoU threshold (Liu et al., 2021, Chen et al., 2019, Li et al., 2024).
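For axis-aligned boxes, these metrics reduce to a few lines of code. The sketch below assumes the [xmin, ymin, zmin, xmax, ymax, zmax] encoding (other releases may use center + extent instead):

```python
def iou_3d(a, b):
    """Volumetric IoU of two axis-aligned boxes [xmin, ymin, zmin, xmax, ymax, zmax]."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:          # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    union = vol(a) + vol(b) - inter
    return inter / union

def accuracy_at_k(preds, gts, k=0.5):
    """Fraction of top-1 predictions whose IoU with the ground truth reaches threshold k."""
    hits = sum(iou_3d(p, g) >= k for p, g in zip(preds, gts))
    return hits / len(gts)
```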
Baseline Performance
On the official validation split, core baseline results are summarized as:
| Model | [email protected] | [email protected] |
|---|---|---|
| OracleCatRand | 29.99% | 29.76% |
| OracleRefer | 40.63% | 40.06% |
| VoteNetRand | 10.00% | 5.28% |
| VoteNetBest | 55.10% | 54.33% |
| SCRC + backprojection | 18.70% | 6.45% |
| One-stage + backprojection | 20.38% | 9.04% |
| Ours (end-to-end, ScanRefer) | 41.19% | 27.40% |
Subsequent works such as 3DVG-Transformer, InstanceRefer, BUTD-DETR, and EDA have advanced the fully supervised state of the art to over 54% [email protected] and 42% [email protected]. Zero-shot methods such as SeeGround leverage 2D vision–LLMs to reach 44.1%/39.4% on [email protected]/0.5, outperforming prior zero-shot approaches by 7.7 percentage points at [email protected] (Li et al., 2024).
Standard Processing Pipeline
Practitioners typically:
- Downsample to 40,000 scene points.
- Project multi-view features or use raw RGB, normals, and height.
- Encode the referring expression via pretrained GloVe embeddings and a GRU (256D).
- Detect 3D proposals (e.g., M=256 via VoteNet).
- Fuse proposal features with language embedding via MLP and score with softmax.
- Apply 3D NMS and select top-1.
- Evaluate against ground-truth using designated metrics (Chen et al., 2019).
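The fusion-and-scoring step of this pipeline can be sketched schematically as follows. Shapes follow the numbers above (M = 256 proposals, a 256-D GRU language embedding); the weights are random stand-ins, not trained parameters, and the 128-D proposal feature size is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

M, D_PROP, D_LANG = 256, 128, 256     # VoteNet proposals, assumed feature dim, GRU hidden size

proposal_feats = rng.normal(size=(M, D_PROP))  # one feature vector per 3D proposal
lang_embed = rng.normal(size=(D_LANG,))        # final GRU hidden state for the description

# Fuse: concatenate the language embedding onto every proposal feature,
# then score each fused vector with a small MLP (random weights here).
fused = np.concatenate(
    [proposal_feats, np.tile(lang_embed, (M, 1))], axis=1)  # (M, D_PROP + D_LANG)
W1 = rng.normal(size=(D_PROP + D_LANG, 128))
W2 = rng.normal(size=(128, 1))
hidden = np.maximum(fused @ W1, 0.0)           # ReLU
logits = (hidden @ W2).squeeze(-1)             # (M,) one score per proposal

# Softmax over proposals; after 3D NMS, the top-scoring surviving box is the prediction.
scores = np.exp(logits - logits.max())
scores /= scores.sum()
best = int(scores.argmax())
```

In a trained model the MLP weights are learned jointly with the detector and language encoder; this sketch only shows the tensor bookkeeping of the fusion step.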
5. Extensions and Related Developments
Multi3DRefer
Multi3DRefer generalizes ScanRefer, addressing descriptions that refer to zero, single, or multiple target objects, thus reflecting more realistic and ambiguous reference scenarios. It encompasses 61,926 descriptions over the same 800 scenes and 11,609 objects, with:
- 6,688 zero-target descriptions
- 42,060 single-target
- 13,178 multiple-target (average 2 to 5.8 targets per prompt, max 32)
An F1-matching metric (using the Hungarian algorithm over pairwise IoU) is adopted for multi-object evaluation. Multi3DRefer also enriches vocabulary (unique word types: 7,077) and introduces controlled paraphrasing using ChatGPT for linguistic diversity (Zhang et al., 2023).
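The F1-matching protocol can be sketched as follows, pairing predictions with ground truths via SciPy's Hungarian solver over pairwise IoU. This is an illustration of the protocol under the min/max box encoding, not the official evaluation code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_3d(a, b):
    """Volumetric IoU of axis-aligned boxes [xmin, ymin, zmin, xmax, ymax, zmax]."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    if np.any(hi <= lo):
        return 0.0
    inter = np.prod(hi - lo)
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter)

def f1_at_iou(pred_boxes, gt_boxes, thresh=0.5):
    """F1 score after Hungarian matching of predicted to ground-truth boxes."""
    if len(pred_boxes) == 0 or len(gt_boxes) == 0:
        return 0.0
    pred_boxes = np.asarray(pred_boxes, dtype=float)
    gt_boxes = np.asarray(gt_boxes, dtype=float)
    ious = np.array([[iou_3d(p, g) for g in gt_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(-ious)   # negate: solver minimizes cost
    tp = int(np.sum(ious[rows, cols] >= thresh))
    if tp == 0:
        return 0.0
    precision = tp / len(pred_boxes)
    recall = tp / len(gt_boxes)
    return 2 * precision * recall / (precision + recall)
```

The zero-target case falls out naturally: an empty prediction set against an empty ground-truth set is typically scored as a perfect match by convention, which the official protocol handles separately from the box-matching path shown here.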
SeeGround and Zero-Shot Grounding
SeeGround exemplifies zero-shot approaches on ScanRefer, leveraging 2D vision–LLMs by representing 3D scenes as hybrid sets of rendered images combined with 3D-aware text prompts. This method achieves 39.4% [email protected] without any ScanRefer training signal (Li et al., 2024).
2D-3D Projected Tasks
Methods such as “Refer-it-in-RGBD” extract single-view RGB-D images from ScanRefer, enabling 2D-based visual grounding over projections, though such approaches are generally surpassed by direct 3D frameworks for disambiguation and localization performance (Liu et al., 2021).
6. Applications and Impact
ScanRefer is the canonical testbed for:
- Embodied AI research: language-guided robotic manipulation and navigation in human environments
- 3D visual grounding: disambiguating references to objects described verbally within real-world rooms
- Multi-modal understanding: aligning geometric, spatial, and linguistic information without explicit ontological constraints
- Benchmarking contrastive learning and transformer-based models for spatial language understanding
By providing diverse (in spatial relations, appearance, and semantic context) and richly annotated language–3D pairings, ScanRefer has enabled progress in open-vocabulary vision, improved 3D language grounding algorithms, and fostered the development of architectures supporting multi-reference and ambiguity-tolerant reasoning (Chen et al., 2019, Li et al., 2024, Zhang et al., 2023).