SpaceSpan: Unified 3D Spatial Intelligence Data

Updated 31 May 2026

SpaceSpan is a large-scale heterogeneous dataset integrating 3D proxy representations, annotated semantics, and language tasks for spatial reasoning.
It employs semantic-aware clustering and proxy generation to compress raw visual-geometric data into efficient tokens for unified VLM training.
The dataset supports diverse tasks including 3D QA, dense captioning, and grounding, achieving state-of-the-art performance with fewer tokens.

SpaceSpan is a large-scale, heterogeneous dataset developed to enable efficient and unified training of Vision-Language Models (VLMs) for 3D spatial intelligence. Its design supports the learning of “compressed” 3D world representations, bridging the gap between scene geometry, semantic understanding, and language-based reasoning. Unlike prior datasets that rely either on costly 2D-3D correspondence objectives or on inefficient raw point cloud priors, SpaceSpan offers a unified corpus associating 3D proxy representations, annotated semantics, and a wide set of spatially grounded linguistic tasks (Jiang et al., 8 May 2026).

1. Motivation and Objectives

SpaceSpan was created to support VLMs in achieving effective 3D spatial reasoning by providing:

A unified training corpus aligned with both vision and language modalities.
A mechanism to learn from compact 3D proxy tokens, as opposed to extended 2D sequences or large uncompressed geometric representations.
Diverse and heterogeneous data spanning numerous 3D scene understanding tasks, including question answering, grounding, and dense captioning.

Previous methodologies for 3D-VLM training were limited either by their dependence on computationally intensive feature matching (correspondence-based approaches) or by token inefficiency (fixed geometric priors), which hurt scalability and generalization.

2. Data Collection and Composition

SpaceSpan consists of 318,000 scene–task pairs, divided as follows:

Source Benchmark	Example Count	Tasks Covered
ScanQA, SQA3D, Scan2Cap, ScanRefer, Multi3DRefer	155,000	QA, captioning, grounding
MMScan, SR-91K (SpaceR)	163,000	Hierarchical relations/QAs

Each example is derived from a uniformly sampled video clip (N = 32 RGB frames, ≈1 fps, 512×512 resolution) of indoor reconstructions, predominantly from ScanNet and ARKitScenes (~2,000 distinct rooms). Each scene provides up to 100 identifier tags, and the annotations span 213 object categories.
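As an illustration of the uniform sampling described above, the following minimal sketch picks 32 evenly spaced frame indices from a clip. The function name `uniform_frame_indices`, the inclusion of both endpoints, and the rounding rule are assumptions made for illustration; the source only states that 32 frames are sampled uniformly.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int = 32) -> List[int]:
    """Return num_samples indices spread evenly over [0, total_frames - 1].

    Hypothetical helper: endpoint handling and rounding are assumptions.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    # Evenly spaced positions including both endpoints.
    # Clips shorter than num_samples will repeat some indices.
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


# Example: a 900-frame clip reduced to N = 32 frames.
indices = uniform_frame_indices(total_frames=900, num_samples=32)
```

Under this scheme a clip shorter than 32 frames would contain repeated frames; a real pipeline might pad or skip such clips instead.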

The raw per-example data includes:

32 RGB frames at 512×512 resolution.
3D point maps $P_i \in \mathbb{R}^{H \times W \times 3}$ from the VGGT geometry predictor.
Pixel-wise semantic masks $M_i \in \mathbb{Z}^{H \times W}$ from the SAM2 segmenter.
Task-specific CAD-aligned 3D bounding boxes for grounding.

3. Annotation Schema and Verification

Each data point is annotated with:

3D proxy positions and feature vectors (computed as outlined in Section 5).
Semantic labels $m_j$ for each visual patch (213 categories).
Question–answer pairs, dense captions, or referring expressions, depending on the source benchmark.
Grounding boxes (axis-aligned in 3D) for relevant tasks.
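The annotation fields listed above could be held in a record like the one sketched below. This is not the dataset's actual file format: the class names `Proxy` and `SpaceSpanExample`, the field names, the task tags, and the choice to store one semantic label per proxy are all hypothetical. Only the 213-category label range and the axis-aligned box come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUM_CATEGORIES = 213                    # object categories reported for SpaceSpan
TASKS = {"qa", "caption", "grounding"}  # hypothetical task tags

Vec3 = Tuple[float, float, float]


@dataclass
class Proxy:
    position: Vec3        # 3D proxy position
    feature: List[float]  # proxy feature vector
    label: int            # semantic category id (assumed to be 0..212)


@dataclass
class SpaceSpanExample:
    scene_id: str
    task: str
    proxies: List[Proxy]
    text_input: str       # question, caption prompt, or referring expression
    text_target: str      # answer or caption (may be empty for grounding)
    box: Optional[Tuple[Vec3, Vec3]] = None  # axis-aligned (min corner, max corner)

    def validate(self) -> None:
        """Raise ValueError if the record breaks the assumed schema."""
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if any(not 0 <= p.label < NUM_CATEGORIES for p in self.proxies):
            raise ValueError("semantic label out of range")
        if self.task == "grounding":
            if self.box is None:
                raise ValueError("grounding examples need a 3D box")
            lo, hi = self.box
            if any(a > b for a, b in zip(lo, hi)):
                raise ValueError("box min corner exceeds max corner")


example = SpaceSpanExample(
    scene_id="scene_demo",
    task="grounding",
    proxies=[Proxy(position=(0.5, 1.0, 0.4), feature=[0.1, 0.2], label=3)],
    text_input="the chair next to the window",
    text_target="",
    box=((0.2, 0.8, 0.0), (0.8, 1.2, 0.9)),
)
example.validate()
```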

Labeling pipelines leverage automated tools with high agreement rates:

SAM2 segmentation masks yield ≥95% alignment with reference categories, verified on a held-out ScanNet subset.
3D point maps $P_i$ are scale-checked against ScanNet depth maps (>98% pass rate).
QA and grounding annotations are taken directly from the original datasets; MMScan’s 115,000 new relational questions are double-checked for spatial validity.

4. Dataset Scale, Distribution, and Statistics

The SpaceSpan dataset is partitioned as follows: 90% for training (286,200 examples), 5% validation (15,900), and 5% testing (15,900). Scene-level statistics include:

Average proxies per scene after clustering: $K \approx 700$ (min 450, max 1,000).
Task distribution: 3D QA (~100,000 pairs), dense captions (~20,000), visual grounding (~50,000), object–object relations (115,000).
Category occurrence: head categories (e.g., chairs, tables) appear in ~60% of scenes, while rare categories (e.g., pepper shakers) in ~2%.
Geometric variability: object distances (0.2–5 m), full $360^\circ$ angle coverage, room sizes (4–100 m$^2$).
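The split sizes and totals quoted in this section and in Section 2 can be checked with a few lines of arithmetic. The sketch below is only a consistency check on the reported figures; the dictionary keys and the helper `split_sizes` are names chosen for illustration.

```python
TOTAL_EXAMPLES = 318_000
SPLIT_PERCENT = {"train": 90, "val": 5, "test": 5}

# Counts per source group as given in the composition table (Section 2).
SOURCE_COUNTS = {"scan_benchmarks": 155_000, "mmscan_sr91k": 163_000}

# Approximate per-task counts from the statistics list above.
APPROX_TASK_COUNTS = {
    "3d_qa": 100_000,
    "dense_captioning": 20_000,
    "visual_grounding": 50_000,
    "object_relations": 115_000,
}


def split_sizes(total: int, percents: dict) -> dict:
    """Split `total` by integer percentages that must add up to 100."""
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in percents.items()}


sizes = split_sizes(TOTAL_EXAMPLES, SPLIT_PERCENT)
approx_task_total = sum(APPROX_TASK_COUNTS.values())

print(sizes)
print(sum(SOURCE_COUNTS.values()), approx_task_total)
```

The split sizes and the two source groups both add up to 318,000. The approximate task counts add up to 285,000, so they are best read as rounded figures rather than an exact partition of the corpus.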

5. Semantic-Aware Clustering and Proxy Generation

SpaceSpan employs a semantic-aware clustering pipeline to transform raw visual-geometric data into compact, sequence-serializable 3D proxies. The steps are:

Feature Extraction:
- Semantic encoder $G_{sem}$ (Qwen2.5-VL) generates feature maps $F_i \in \mathbb{R}^{28 \times 28 \times 1024}$ per frame.
- Geometric predictor $G_{geom}$ yields dense point maps $P_i \in \mathbb{R}^{H \times W \times 3}$; segmentation masks $M_i \in \mathbb{Z}^{H \times W}$ from SAM2 provide pixel-level categories.
Patchify and Grouping:
- Patch-flattening produces per-patch feature tokens, pooled 3D points, and semantic labels $m_j$.
- Semantic grouping collects the patches that share a label into one group per category.
Proxy Clustering:
- Each group receives its own allocation of proxies. K-means or KNN is performed on the spatial coordinates of the group's pooled 3D points, forming clusters.
- For each cluster, the mean feature and mean position are computed.
Position Embeddings and Serialization:
- 3D RoPE is applied to the vertical coordinate and learned Fourier embeddings to the remaining coordinates.
- Proxies are serialized by BFS over their adjacency graph, yielding a compact feature sequence.
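The grouping, clustering, and serialization steps above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the Proxy3D implementation: the proportional proxy allocation per group, the deterministic k-means initialization, the use of a k-nearest-neighbor graph for the BFS, and all function names are hypothetical. Feature extraction and position embeddings are omitted.

```python
import math
from collections import defaultdict, deque


def mean_vec(vectors):
    """Component-wise mean of equal-length vectors, returned as a tuple."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means on 3D points; returns one cluster index per point."""
    k = min(k, len(points))
    # Assumed deterministic init: evenly spaced picks from the input order.
    centers = [tuple(points[(i * len(points)) // k]) for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(tuple(p), centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean_vec(members)
    return assign


def build_proxies(features, positions, labels, budget):
    """Group patches by semantic label, cluster each group, average each cluster."""
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)
    proxies = []
    for lab, idxs in sorted(groups.items()):
        # Assumption: proxies are allocated in proportion to group size (min 1).
        k = max(1, round(budget * len(idxs) / len(labels)))
        assign = kmeans([positions[i] for i in idxs], k)
        clusters = defaultdict(list)
        for i, a in zip(idxs, assign):
            clusters[a].append(i)
        for members in clusters.values():
            proxies.append({
                "label": lab,
                "position": mean_vec([positions[i] for i in members]),
                "feature": mean_vec([features[i] for i in members]),
            })
    return proxies


def bfs_order(proxies, k=2):
    """Serialize proxies by BFS over an assumed k-nearest-neighbor graph."""
    n = len(proxies)
    pos = [p["position"] for p in proxies]
    nbrs = [sorted((j for j in range(n) if j != i),
                   key=lambda j: math.dist(pos[i], pos[j]))[:k] for i in range(n)]
    order, seen = [], set()
    for start in range(n):  # restart so disconnected components are covered
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            i = queue.popleft()
            order.append(i)
            for j in nbrs[i]:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
    return order


# Toy scene: six patches, two semantic categories, a budget of three proxies.
features = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (0.5, 0.5), (0.5, 0.5)]
positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (2.0, 0.0, 0.0),
             (2.1, 0.0, 0.0), (5.0, 0.0, 1.0), (5.2, 0.0, 1.0)]
labels = [0, 0, 0, 0, 1, 1]

proxies = build_proxies(features, positions, labels, budget=3)
sequence = [proxies[i]["feature"] for i in bfs_order(proxies)]
```

Clustering within a semantic group, rather than over the whole scene, keeps each proxy tied to a single category, which is the property the semantic-aware pipeline is described as preserving.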

6. Integration in Multi-Stage Training

SpaceSpan provides the aligned vision-language supervision for four-stage progressive training:

Identifier + Semantic Alignment: Synthetic “identifier” images and “semantic” icons are rendered for each object, features extracted, and used in identification tasks with cross-entropy losses.
Coordinate Alignment: Tasks include explicit spatial localization of objects with 3D RoPE, penalizing localization errors by cross-entropy and L2 regression.
Object-Object Relation Learning: Model is trained to encode and reason about spatial relations (e.g., left/right) between object pairs using binary cross-entropy.
Full 3D Proxy Training: The complete Proxy3D tokens are used to finetune the model on all downstream tasks including spatial QA, grounding, and captioning.

Training durations on 8×A6000 GPUs are: Stages 1–2 (~2 hours each), Stage 3 (3 hours), Stage 4 (55 hours).
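To make the Coordinate Alignment objective concrete, the sketch below combines a cross-entropy term with an L2 regression term on predicted 3D coordinates. How the two terms are weighted, and what exactly the cross-entropy is computed over, is not stated in the source; the `reg_weight` parameter, the function names, and the toy inputs are assumptions.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def l2_loss(pred, gt):
    """Squared Euclidean distance between predicted and reference coordinates."""
    return sum((p - g) ** 2 for p, g in zip(pred, gt))


def coordinate_alignment_loss(logits, target, pred_xyz, gt_xyz, reg_weight=1.0):
    """Assumed combination: classification term plus weighted regression term."""
    return cross_entropy(logits, target) + reg_weight * l2_loss(pred_xyz, gt_xyz)


# Toy example: three candidate objects, the second one is correct.
loss = coordinate_alignment_loss(
    logits=[0.2, 2.0, -1.0],
    target=1,
    pred_xyz=(1.0, 2.0, 0.5),
    gt_xyz=(1.0, 2.0, 1.0),
)
```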

7. Benchmark Results and Significance

Proxy3D models trained with SpaceSpan reach competitive or state-of-the-art performance on major 3D-VLM benchmarks while using only ~700 proxy tokens per scene, substantially fewer than the 3,000–8,000 tokens used by prior approaches:

Task-Benchmark	SpaceSpan/Proxy3D Result	Comparison (3DRS, GPT4Scene)
3D QA (ScanQA/SQA3D)	EM ↑93.6; ↑57.5%	104.8; 60.6%
Visual Grounding (ScanRefer)	Acc 54.1%; F₁ 57.5%	56.1%; 59.8%
Dense Captioning (Scan2Cap)	CIDEr 73.3; BLEU-4 34.7	86.1; 41.6
VSI-Bench (spatial reason)	47.0% (2nd open-source)	Human: 79.2%
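The token budgets quoted at the start of this section imply a sequence that is roughly 4.3× to 11.4× shorter than in prior approaches. The sketch below works this out and also gives a rough estimate of the saving in pairwise attention cost, assuming standard self-attention whose cost grows with the square of the sequence length. That quadratic assumption and the function names are not from the source.

```python
PROXY_TOKENS = 700                 # approximate proxies per scene
PRIOR_TOKEN_RANGE = (3_000, 8_000)  # token counts quoted for prior approaches


def reduction_factor(baseline_tokens: int, proxy_tokens: int = PROXY_TOKENS) -> float:
    """How many times longer the baseline sequence is than the proxy sequence."""
    return baseline_tokens / proxy_tokens


def attention_cost_ratio(baseline_tokens: int, proxy_tokens: int = PROXY_TOKENS) -> float:
    """Rough pairwise-attention cost ratio, assuming cost ~ (sequence length)^2."""
    return reduction_factor(baseline_tokens, proxy_tokens) ** 2


for n in PRIOR_TOKEN_RANGE:
    print(f"{n} tokens: {reduction_factor(n):.1f}x longer, "
          f"~{attention_cost_ratio(n):.0f}x pairwise attention cost")
```

These ratios cover only the visual token sequence; text tokens and the cost of producing the proxies are outside this estimate.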

Notably, Proxy3D+SpaceSpan achieves substantial improvements in object counting and size estimation (+18 and +23 percentage points over Qwen2.5-VL). The dataset's unified support for proxy representation, semantic symbols, identifier embeddings, a broad range of spatial linguistic tasks, and object–object relations distinguishes it from previous resources, providing an integrated foundation for compact and efficient VLM-driven 3D scene understanding (Jiang et al., 8 May 2026).

References (1)

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment (2026)
