ScanRefer: 3D Object Localization Dataset
- ScanRefer is a large-scale dataset enabling direct grounding of free-form natural language expressions to localize objects within reconstructed 3D indoor scenes.
- It comprises 1,613 RGB-D scans with over 11,000 annotated objects and 51,583 detailed descriptions, offering a rich benchmark for multi-modal scene understanding.
- The dataset supports advancements in embodied AI, robotics, and AR through rigorous evaluation metrics and protocols for spatial language grounding.
ScanRefer is a large-scale dataset designed for the task of 3D object localization in indoor RGB-D scans using free-form natural language descriptions. It provides the foundational benchmark for research in grounding complex referring expressions directly in full 3D reconstructed environments, enabling rigorous investigation of multi-modal scene understanding, embodied AI, and language-based object localization (Chen et al., 2019).
1. Motivation and Task Definition
The core motivation behind ScanRefer is to enable the localization of arbitrary objects within fully reconstructed 3D scenes using only natural language queries as guidance. Unlike prior efforts restricted to 2D images or those relying on projecting language references onto 3D via intermediate steps, ScanRefer targets direct linguistic grounding within 3D point clouds. The benchmark addresses practical application demands in robotics, human–computer interaction, and AR by requiring models to ground free-form referring expressions—capturing reference via spatial layout, appearance, and semantic context—against complex environments (Chen et al., 2019, Li et al., 2024).
The formal task input is a pair (P, D), where P is the point cloud of an indoor scene and D is a natural language description referring to a unique target object. The output is a tight, axis-aligned 3D bounding box predicted to match the described instance.
2. Data Composition and Structure
ScanNet Foundation
ScanRefer is constructed atop ScanNet, which provides 1,613 RGB-D scans of 806 distinct indoor environments. After filtering out scenes lacking describable objects, 800 scenes are included in the final dataset (Chen et al., 2019).
Object and Expression Statistics
- Object Instances: 11,046 objects across 265 ScanNet object classes, subsequently aggregated into 18 common semantic categories (e.g., chair, table, cabinet, plus an “others” class).
- Natural Language Descriptions: 51,583 free-form, multi-sentence descriptions, averaging 4.67 descriptions per object instance. Descriptions average 20.27 words each, yielding a vocabulary of 4,197 unique tokens.
Dataset Splits
The scenes are partitioned into non-overlapping train/val/test splits for comparative benchmarking:
| Split | Scenes | Objects | Descriptions |
|---|---|---|---|
| Train | 562 | 7,875 | 36,665 |
| Validation | 141 | 2,068 | 9,508 |
| Test | 97 | 1,103 | 5,410 |
All splits are disjoint at the scene level. Each referred object is unambiguously annotated; multiple instances per class per scene may appear, necessitating nuanced referential disambiguation (Chen et al., 2019).
File Organization and Schema
The dataset follows a rigorous directory structure:
```
ScanRefer/
├── pointclouds/
│   ├── sceneXXXX_XX.ply   # reconstructed point cloud
│   ├── sceneXXXX_XX.npz   # per-point features (RGB, normals, height, 128D multiview)
├── annotations/
│   ├── train.json
│   ├── val.json
│   └── test.json
└── objects/
    └── object_bboxes.json # ground-truth boxes
```
An example annotation record:
```json
{
  "scene_id": "scene0101_00",
  "object_id": 17,
  "class_name": "nightstand",
  "bbox": [1.25, 0.30, 2.10, 1.75, 0.60, 2.40],
  "description": "This is a short nightstand next to the bed. It has two drawers and is dark brown.",
  "tokens": ["This","is","a","short","nightstand","next","to","the","bed",".","It","has","two","drawers","and","is","dark","brown","."],
  "split": "train"
}
```
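Given the record layout above, loading an annotation file and indexing its entries by scene might look like the following sketch (the list-of-records file format is assumed from the schema shown; adjust for the actual release):

```python
import json
from collections import defaultdict

def load_annotations(path):
    """Load a ScanRefer-style annotation file and index its records by scene_id."""
    with open(path) as f:
        records = json.load(f)  # assumed: a JSON list of annotation dicts
    by_scene = defaultdict(list)
    for rec in records:
        by_scene[rec["scene_id"]].append(rec)
    return by_scene

# An in-memory record matching the schema above, for illustration:
record = {
    "scene_id": "scene0101_00",
    "object_id": 17,
    "class_name": "nightstand",
    "bbox": [1.25, 0.30, 2.10, 1.75, 0.60, 2.40],
    "description": "This is a short nightstand next to the bed.",
    "tokens": ["This", "is", "a", "short", "nightstand"],
    "split": "train",
}
```

Indexing by scene is the natural access pattern here, since evaluation iterates over one scene's point cloud while scoring all of its descriptions.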
3. Annotation Protocol and Linguistic Phenomena
Description Collection
Descriptions are obtained via a web-based 3D UI deployed to English-speaking crowd workers. Target objects are visually highlighted, with all other objects faded for context. Annotators observe the full 3D mesh as well as several 2D perspectives showing color and texture. Each object is described by five distinct workers, instructed to use at least two full sentences emphasizing physical appearance (color, shape, material) and spatial relations to other scene elements (Chen et al., 2019).
Verification and Filtering
Post-collection, trained student annotators perform a verification phase:
- Presented with a scene and a referring expression, the verifier selects all matching objects.
- Descriptions matching zero or more than one object are discarded.
- Spelling and obvious wording mistakes are corrected; additional manual filtering yields a final set with ~2,823 invalid descriptions removed and ~2,129 lightly edited (Chen et al., 2019).
Expression Types
The dataset exhibits a rich diversity of reference:
- Spatial terms: 98.7% of descriptions
- Color attributes: 74.7%
- Shape attributes: 64.9%
- Explicit size terms: 14.2%
- Comparatives: 672 instances (e.g., “smaller”, “bigger”)
- Superlatives: 2,734 (e.g., “largest”, “rightmost”)
- Ordinal spatial: frequent (“second chair from the door”)
This distribution directly enables fine-grained disambiguation and complex linguistic grounding (Chen et al., 2019).
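Statistics of this kind can be approximated by keyword matching over the tokenized descriptions. The sketch below illustrates the idea; the term lists are illustrative stand-ins, not the authors' exact lexicons:

```python
# Illustrative (incomplete) keyword lexicons for each expression type.
SPATIAL = {"next", "left", "right", "above", "below", "between", "near",
           "behind", "front", "beside"}
COLOR = {"brown", "black", "white", "red", "blue", "green", "gray", "dark"}
COMPARATIVE = {"smaller", "bigger", "larger", "closer"}
SUPERLATIVE = {"largest", "smallest", "rightmost", "leftmost", "closest"}

def tag_description(tokens):
    """Return which coarse expression types appear in one tokenized description."""
    toks = {t.lower() for t in tokens}
    return {
        "spatial": bool(toks & SPATIAL),
        "color": bool(toks & COLOR),
        "comparative": bool(toks & COMPARATIVE),
        "superlative": bool(toks & SUPERLATIVE),
    }
```

Aggregating these flags over all 51,583 descriptions yields per-type frequencies comparable to the percentages listed above.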
4. Benchmarking, Metrics, and Baselines
Localization Metrics
The principal evaluation metrics are:
- 3D Intersection-over-Union (IoU): volumetric overlap ratio between the predicted box B_pred and the ground-truth box B_gt, IoU = vol(B_pred ∩ B_gt) / vol(B_pred ∪ B_gt).
- Accuracy at k-IoU ([email protected]): fraction of queries for which the top-1 predicted bounding box exceeds IoU threshold k with the ground truth. Common thresholds: k = 0.25 and k = 0.5.
- Recall@5: Proportion of samples for which any of the top 5 predictions exceed a 0.5 IoU threshold (Liu et al., 2021, Chen et al., 2019, Li et al., 2024).
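For axis-aligned boxes, these metrics reduce to a few lines of code. The sketch below assumes the [xmin, ymin, zmin, xmax, ymax, zmax] encoding (other releases may use center + extent instead):

```python
def iou_3d(a, b):
    """Volumetric IoU of two axis-aligned boxes [xmin, ymin, zmin, xmax, ymax, zmax]."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:          # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol = lambda box: (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    union = vol(a) + vol(b) - inter
    return inter / union

def accuracy_at_k(preds, gts, k=0.5):
    """Fraction of top-1 predictions whose IoU with the ground truth reaches threshold k."""
    hits = sum(iou_3d(p, g) >= k for p, g in zip(preds, gts))
    return hits / len(gts)
```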
Baseline Performance
On the official validation split, core baseline results are summarized as:
| Model | [email protected] | [email protected] |
|---|---|---|
| OracleCatRand | 29.99% | 29.76% |
| OracleRefer | 40.63% | 40.06% |
| VoteNetRand | 10.00% | 5.28% |
| VoteNetBest | 55.10% | 54.33% |
| SCRC + backprojection | 18.70% | 6.45% |
| One-stage + backprojection | 20.38% | 9.04% |
| Ours (end-to-end, ScanRefer) | 41.19% | 27.40% |
Subsequent works such as 3DVG-Transformer, InstanceRefer, BUTD-DETR, and EDA have advanced the fully supervised state of the art to over 54% [email protected] and 42% [email protected]. Zero-shot methods such as SeeGround leverage 2D vision–LLMs to reach 44.1%/39.4% on [email protected]/0.5, outperforming prior zero-shot approaches by 7.7 percentage points at [email protected] (Li et al., 2024).
Standard Processing Pipeline
Practitioners typically:
- Downsample to 40,000 scene points.
- Project multi-view features or use raw RGB, normals, and height.
- Encode the referring expression via pretrained GloVe embeddings and a GRU (256D).
- Detect 3D proposals (e.g., M=256 via VoteNet).
- Fuse proposal features with language embedding via MLP and score with softmax.
- Apply 3D NMS and select top-1.
- Evaluate against ground-truth using designated metrics (Chen et al., 2019).
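The fusion-and-scoring step of this pipeline can be sketched schematically as follows. Shapes follow the numbers above (M = 256 proposals, a 256-D GRU language embedding); the weights are random stand-ins, not trained parameters, and the 128-D proposal feature size is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

M, D_PROP, D_LANG = 256, 128, 256     # VoteNet proposals, assumed feature dim, GRU hidden size

proposal_feats = rng.normal(size=(M, D_PROP))  # one feature vector per 3D proposal
lang_embed = rng.normal(size=(D_LANG,))        # final GRU hidden state for the description

# Fuse: concatenate the language embedding onto every proposal feature,
# then score each fused vector with a small MLP (random weights here).
fused = np.concatenate(
    [proposal_feats, np.tile(lang_embed, (M, 1))], axis=1)  # (M, D_PROP + D_LANG)
W1 = rng.normal(size=(D_PROP + D_LANG, 128))
W2 = rng.normal(size=(128, 1))
hidden = np.maximum(fused @ W1, 0.0)           # ReLU
logits = (hidden @ W2).squeeze(-1)             # (M,) one score per proposal

# Softmax over proposals; after 3D NMS, the top-scoring surviving box is the prediction.
scores = np.exp(logits - logits.max())
scores /= scores.sum()
best = int(scores.argmax())
```

In a trained model the MLP weights are learned jointly with the detector and language encoder; this sketch only shows the tensor bookkeeping of the fusion step.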
5. Extensions and Related Developments
Multi3DRefer
Multi3DRefer generalizes ScanRefer, addressing descriptions that refer to zero, single, or multiple target objects, thus reflecting more realistic and ambiguous reference scenarios. It encompasses 61,926 descriptions over the same 800 scenes and 11,609 objects, with:
- 6,688 zero-target descriptions
- 42,060 single-target
- 13,178 multiple-target (average 2 to 5.8 targets per prompt, max 32)
An F1-matching metric (using the Hungarian algorithm over pairwise IoU) is adopted for multi-object evaluation. Multi3DRefer also enriches vocabulary (unique word types: 7,077) and introduces controlled paraphrasing using ChatGPT for linguistic diversity (Zhang et al., 2023).
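The F1-matching protocol can be sketched as follows, pairing predictions with ground truths via SciPy's Hungarian solver over pairwise IoU. This is an illustration of the protocol under the min/max box encoding, not the official evaluation code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_3d(a, b):
    """Volumetric IoU of axis-aligned boxes [xmin, ymin, zmin, xmax, ymax, zmax]."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    if np.any(hi <= lo):
        return 0.0
    inter = np.prod(hi - lo)
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter)

def f1_at_iou(pred_boxes, gt_boxes, thresh=0.5):
    """F1 score after Hungarian matching of predicted to ground-truth boxes."""
    if len(pred_boxes) == 0 or len(gt_boxes) == 0:
        return 0.0
    pred_boxes = np.asarray(pred_boxes, dtype=float)
    gt_boxes = np.asarray(gt_boxes, dtype=float)
    ious = np.array([[iou_3d(p, g) for g in gt_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(-ious)   # negate: solver minimizes cost
    tp = int(np.sum(ious[rows, cols] >= thresh))
    if tp == 0:
        return 0.0
    precision = tp / len(pred_boxes)
    recall = tp / len(gt_boxes)
    return 2 * precision * recall / (precision + recall)
```

The zero-target case falls out naturally: an empty prediction set against an empty ground-truth set is typically scored as a perfect match by convention, which the official protocol handles separately from the box-matching path shown here.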
SeeGround and Zero-Shot Grounding
SeeGround exemplifies zero-shot approaches on ScanRefer, leveraging 2D vision–LLMs by representing 3D scenes as hybrid sets of rendered images combined with 3D-aware text prompts. This method achieves 39.4% [email protected] without any ScanRefer training signal (Li et al., 2024).
2D-3D Projected Tasks
Methods such as “Refer-it-in-RGBD” extract single-view RGB-D images from ScanRefer, enabling 2D-based visual grounding over projections, though such approaches are generally surpassed by direct 3D frameworks for disambiguation and localization performance (Liu et al., 2021).
6. Applications and Impact
ScanRefer is the canonical testbed for:
- Embodied AI research: language-guided robotic manipulation and navigation in human environments
- 3D visual grounding: disambiguating references to objects described verbally within real-world rooms
- Multi-modal understanding: aligning geometric, spatial, and linguistic information without explicit ontological constraints
- Benchmarking contrastive learning and transformer-based models for spatial language understanding
By providing diverse (in spatial relations, appearance, and semantic context) and richly annotated language–3D pairings, ScanRefer has enabled progress in open-vocabulary vision, improved 3D language grounding algorithms, and fostered the development of architectures supporting multi-reference and ambiguity-tolerant reasoning (Chen et al., 2019, Li et al., 2024, Zhang et al., 2023).