
ScanRefer: 3D Object Localization Dataset

Updated 5 February 2026
  • ScanRefer is a large-scale dataset enabling direct grounding of free-form natural language expressions to localize objects within reconstructed 3D indoor scenes.
  • It comprises 1,613 RGB-D scans with over 11,000 annotated objects and 51,583 detailed descriptions, offering a rich benchmark for multi-modal scene understanding.
  • The dataset supports advancements in embodied AI, robotics, and AR through rigorous evaluation metrics and protocols for spatial language grounding.

ScanRefer is a large-scale dataset designed for the task of 3D object localization in indoor RGB-D scans using free-form natural language descriptions. It provides the foundational benchmark for research in grounding complex referring expressions directly in full 3D reconstructed environments, enabling rigorous investigation of multi-modal scene understanding, embodied AI, and language-based object localization (Chen et al., 2019).

1. Motivation and Task Definition

The core motivation behind ScanRefer is to enable the localization of arbitrary objects within fully reconstructed 3D scenes using only natural language queries as guidance. Unlike prior efforts restricted to 2D images or those relying on projecting language references onto 3D via intermediate steps, ScanRefer targets direct linguistic grounding within 3D point clouds. The benchmark addresses practical application demands in robotics, human–computer interaction, and AR by requiring models to ground free-form referring expressions—capturing reference via spatial layout, appearance, and semantic context—against complex environments (Chen et al., 2019, Li et al., 2024).

The formal task input is a pair (P, Q), where P is the point cloud of an indoor scene and Q is a natural language description that designates a unique target object. The output is a tight, axis-aligned 3D bounding box predicted to match the described instance.
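The task interface can be sketched with illustrative Python types. The field names and shapes below are assumptions for exposition, not the dataset's official API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GroundingQuery:
    """One ScanRefer query: a scene point cloud plus a referring expression."""
    points: np.ndarray   # P: (N, 6) array of XYZRGB scene points
    description: str     # Q: free-form referring expression

@dataclass
class GroundingResult:
    """Predicted axis-aligned 3D bounding box for the referred object."""
    bbox_min: np.ndarray  # (3,) minimum corner (x, y, z)
    bbox_max: np.ndarray  # (3,) maximum corner (x, y, z)

# Hypothetical query over a downsampled scene:
query = GroundingQuery(
    points=np.zeros((40000, 6), dtype=np.float32),
    description="This is a short nightstand next to the bed.",
)
```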

2. Data Composition and Structure

ScanNet Foundation

ScanRefer is constructed atop ScanNet, which provides 1,613 RGB-D scans of 806 distinct indoor environments. After filtering out scenes lacking describable objects, 800 scenes are included in the final dataset (Chen et al., 2019).

Object and Expression Statistics

  • Object Instances: 11,046 objects across 265 ScanNet object classes, subsequently aggregated into 18 common semantic categories (e.g., chair, table, cabinet, plus an “others” class).
  • Natural Language Descriptions: 51,583 free-form, multi-sentence descriptions, averaging 4.67 descriptions per object instance. Descriptions average 20.27 words each, yielding a vocabulary of 4,197 unique tokens.

Dataset Splits

The scenes are partitioned into non-overlapping train/val/test splits for comparative benchmarking:

Split         Scenes   Objects   Descriptions
Train            562     7,875         36,665
Validation       141     2,068          9,508
Test              97     1,103          5,410

All splits are disjoint at the scene level. Each referred object is unambiguously annotated; multiple instances per class per scene may appear, necessitating nuanced referential disambiguation (Chen et al., 2019).
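As a quick arithmetic check, the per-split counts above sum exactly to the dataset-wide figures quoted earlier (800 scenes, 11,046 objects, 51,583 descriptions):

```python
# Per-split counts from the ScanRefer paper: (scenes, objects, descriptions).
splits = {
    "train":      (562, 7875, 36665),
    "validation": (141, 2068,  9508),
    "test":       ( 97, 1103,  5410),
}

totals = [sum(col) for col in zip(*splits.values())]
print(totals)  # [800, 11046, 51583]

# Average descriptions per object instance, as stated above.
print(round(51583 / 11046, 2))  # 4.67
```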

File Organization and Schema

The dataset follows a rigorous directory structure:

ScanRefer/
  ├── pointclouds/
  │    ├── sceneXXXX_XX.ply    # reconstructed point cloud
  │    ├── sceneXXXX_XX.npz    # per-point features (RGB, normals, height, 128D multiview)
  ├── annotations/
  │    ├── train.json
  │    ├── val.json
  │    └── test.json
  └── objects/
       └── object_bboxes.json  # ground-truth boxes
Annotation entries are stored in JSON with the following schema:
{
  "scene_id": "scene0101_00",
  "object_id": 17,
  "class_name": "nightstand",
  "bbox": [1.25,0.30,2.10, 1.75,0.60,2.40],
  "description": "This is a short nightstand next to the bed. It has two drawers and is dark brown.",
  "tokens": ["This","is","a","short","nightstand","next","to","the","bed",".","It","has","two","drawers","and","is","dark","brown","."],
  "split": "train"
}
Point cloud data (.ply) contain vertices with XYZRGB; per-point features are stored in .npz format for efficiency (Chen et al., 2019).
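A minimal loader for the annotation files, assuming the JSON schema and directory layout shown above (the file paths and `.npz` keys are illustrative, not an official API):

```python
import json
import numpy as np

def load_annotations(path):
    """Load a ScanRefer split file (list of entries as in the schema above)
    and index the entries by scene_id."""
    with open(path) as f:
        entries = json.load(f)
    by_scene = {}
    for entry in entries:
        by_scene.setdefault(entry["scene_id"], []).append(entry)
    return by_scene

def load_point_features(npz_path):
    """Load per-point features stored in .npz form; the key names inside
    the archive are dataset-release specific."""
    data = np.load(npz_path)
    return {key: data[key] for key in data.files}

# Hypothetical usage against the layout shown above:
# anns = load_annotations("ScanRefer/annotations/train.json")
# feats = load_point_features("ScanRefer/pointclouds/scene0101_00.npz")
```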

3. Annotation Protocol and Linguistic Phenomena

Description Collection

Descriptions are obtained via a web-based 3D UI deployed to English-speaking crowd workers. Target objects are visually highlighted, with all other objects faded for context. Annotators observe the full 3D mesh as well as several 2D perspectives showing color and texture. Each object is described by five distinct workers, instructed to use at least two full sentences emphasizing physical appearance (color, shape, material) and spatial relations to other scene elements (Chen et al., 2019).

Verification and Filtering

Post-collection, trained student annotators perform a verification phase:

  • Presented with a scene and a referring expression, the verifier selects all matching objects.
  • Descriptions matching zero or more than one object are discarded.
  • Spelling and obvious wording mistakes are corrected; additional manual filtering yields a final set with ~2,823 invalid descriptions removed and ~2,129 lightly edited (Chen et al., 2019).

Expression Types

The dataset exhibits a rich diversity of reference:

  • Spatial terms: 98.7% of descriptions
  • Color attributes: 74.7%
  • Shape attributes: 64.9%
  • Explicit size terms: 14.2%
  • Comparatives: 672 instances (e.g., “smaller”, “bigger”)
  • Superlatives: 2,734 (e.g., “largest”, “rightmost”)
  • Ordinal spatial: frequent (“second chair from the door”)

This distribution directly enables fine-grained disambiguation and complex linguistic grounding (Chen et al., 2019).

4. Benchmarking, Metrics, and Baselines

Localization Metrics

The principal evaluation metrics are:

  • 3D Intersection-over-Union (IoU):

IoU = Vol(B_pred ∩ B_gt) / Vol(B_pred ∪ B_gt)

  • Accuracy at IoU threshold k (Acc@k):

Fraction of queries for which the top-1 predicted bounding box exceeds IoU threshold k with the ground truth. Common thresholds are k = 0.25 and k = 0.5.
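Both metrics are straightforward for axis-aligned boxes. A minimal sketch, assuming the box format (xmin, ymin, zmin, xmax, ymax, zmax) used in the annotation schema above:

```python
import numpy as np

def box_iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU. Boxes: (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    lo = np.maximum(a[:3], b[:3])              # intersection min corner
    hi = np.minimum(a[3:], b[3:])              # intersection max corner
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_k(pred_boxes, gt_boxes, k=0.25):
    """Fraction of queries whose top-1 box reaches IoU >= k with ground truth."""
    hits = [box_iou_3d(p, g) >= k for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)
```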

Baseline Performance

On the official validation split, core baseline results are summarized as:

Model                          Acc@0.25   Acc@0.5
OracleCatRand                    29.99%    29.76%
OracleRefer                      40.63%    40.06%
VoteNetRand                      10.00%     5.28%
VoteNetBest                      55.10%    54.33%
SCRC + backprojection            18.70%     6.45%
One-stage + backprojection       20.38%     9.04%
ScanRefer (end-to-end)           41.19%    27.40%

Subsequent fully supervised methods such as 3DVG-Transformer, InstanceRefer, BUTD-DETR, and EDA have advanced the state of the art to over 54% Acc@0.25 and 42% Acc@0.5. Zero-shot methods such as SeeGround leverage 2D vision–language models to reach 44.1% Acc@0.25 and 39.4% Acc@0.5, outperforming prior zero-shot approaches by 7.7 percentage points at Acc@0.25 (Li et al., 2024).

Standard Processing Pipeline

Practitioners typically:

  1. Downsample to 40,000 scene points.
  2. Project multi-view features or use raw RGB, normals, and height.
  3. Encode the referring expression via pretrained GloVe embeddings and a GRU (256D).
  4. Detect 3D proposals (e.g., M=256 via VoteNet).
  5. Fuse proposal features with language embedding via MLP and score with softmax.
  6. Apply 3D NMS and select top-1.
  7. Evaluate against ground-truth using designated metrics (Chen et al., 2019).
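Steps 4–6 above can be sketched as follows, with random arrays standing in for the VoteNet proposal features and the GRU language embedding (the dimensions are the ones quoted above; the MLP weights are placeholders, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for upstream components named in the pipeline above:
M, D_prop, D_lang = 256, 128, 256
proposal_feats = rng.normal(size=(M, D_prop))   # e.g. VoteNet proposal features
lang_feat = rng.normal(size=(D_lang,))          # e.g. final GRU hidden state

# Step 5: fuse each proposal with the language embedding, score via an MLP.
W1 = rng.normal(size=(D_prop + D_lang, 128)) * 0.01
W2 = rng.normal(size=(128, 1)) * 0.01

fused = np.concatenate(
    [proposal_feats, np.broadcast_to(lang_feat, (M, D_lang))], axis=1
)
hidden = np.maximum(fused @ W1, 0.0)            # ReLU
logits = (hidden @ W2).squeeze(-1)              # (M,) score per proposal

# Softmax over proposals, then pick the top-1 box (step 6, after NMS).
scores = np.exp(logits - logits.max())
scores /= scores.sum()
best_proposal = int(np.argmax(scores))
```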

5. Extensions and Related Benchmarks

Multi3DRefer

Multi3DRefer generalizes ScanRefer, addressing descriptions that refer to zero, single, or multiple target objects, thus reflecting more realistic and ambiguous reference scenarios. It encompasses 61,926 descriptions over the same 800 scenes and 11,609 objects, with:

  • 6,688 zero-target descriptions
  • 42,060 single-target
  • 13,178 multiple-target (average 2 to 5.8 targets per prompt, max 32)

An F1-matching metric (using the Hungarian algorithm over pairwise IoU) is adopted for multi-object evaluation. Multi3DRefer also enriches vocabulary (unique word types: 7,077) and introduces controlled paraphrasing using ChatGPT for linguistic diversity (Zhang et al., 2023).
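A sketch of such an F1 metric using Hungarian matching over pairwise IoU (a simplified reading of the Multi3DRefer protocol, not its reference implementation; box format as in the schema above):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def _aabb_iou(p, g):
    """Axis-aligned 3D IoU for boxes given as (min_xyz, max_xyz)."""
    lo = np.maximum(p[:3], g[:3])
    hi = np.minimum(p[3:], g[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    union = np.prod(p[3:] - p[:3]) + np.prod(g[3:] - g[:3]) - inter
    return inter / union if union > 0 else 0.0

def f1_at_iou(pred_boxes, gt_boxes, thresh=0.25):
    """F1 over a one-to-one Hungarian matching of predictions to targets."""
    preds = [np.asarray(b, float) for b in pred_boxes]
    gts = [np.asarray(b, float) for b in gt_boxes]
    if not preds and not gts:
        return 1.0          # correctly predicting "no target"
    if not preds or not gts:
        return 0.0
    iou = np.array([[_aabb_iou(p, g) for g in gts] for p in preds])
    rows, cols = linear_sum_assignment(-iou)    # maximize total matched IoU
    tp = int((iou[rows, cols] >= thresh).sum())
    precision, recall = tp / len(preds), tp / len(gts)
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```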

SeeGround and Zero-Shot Grounding

SeeGround exemplifies zero-shot approaches on ScanRefer, representing 3D scenes as hybrid sets of rendered images combined with 3D-aware text prompts fed to a 2D vision–language model. This method achieves 39.4% Acc@0.5 without any ScanRefer training signal (Li et al., 2024).

2D-3D Projected Tasks

Methods such as “Refer-it-in-RGBD” extract single-view RGB-D images from ScanRefer, enabling 2D-based visual grounding over projections, though such approaches are generally surpassed by direct 3D frameworks for disambiguation and localization performance (Liu et al., 2021).

6. Applications and Impact

ScanRefer is the canonical testbed for:

  • Embodied AI research: language-guided robotic manipulation and navigation in human environments
  • 3D visual grounding: disambiguating references to objects described verbally within real-world rooms
  • Multi-modal understanding: aligning geometric, spatial, and linguistic information without explicit ontological constraints
  • Benchmarking contrastive learning and transformer-based models for spatial language understanding

By providing diverse (in spatial relations, appearance, and semantic context) and richly annotated language–3D pairings, ScanRefer has enabled progress in open-vocabulary vision, improved 3D language grounding algorithms, and fostered the development of architectures supporting multi-reference and ambiguity-tolerant reasoning (Chen et al., 2019, Li et al., 2024, Zhang et al., 2023).
