SpaceSpan: Unified 3D Spatial Intelligence Data
- SpaceSpan is a large-scale heterogeneous dataset integrating 3D proxy representations, annotated semantics, and language tasks for spatial reasoning.
- It employs semantic-aware clustering and proxy generation to compress raw visual-geometric data into efficient tokens for unified VLM training.
- The dataset supports diverse tasks including 3D QA, dense captioning, and grounding, achieving state-of-the-art performance with fewer tokens.
SpaceSpan is a large-scale, heterogeneous dataset developed to enable efficient and unified training of Vision-Language Models (VLMs) for 3D spatial intelligence. Its design supports the learning of “compressed” 3D world representations, bridging the gap between scene geometry, semantic understanding, and language-based reasoning. Unlike prior datasets that rely on either costly 2D-3D correspondence objectives or inefficient raw point cloud priors, SpaceSpan offers a unified corpus that associates 3D proxy representations, annotated semantics, and a wide set of spatially grounded linguistic tasks (Jiang et al., 8 May 2026).
1. Motivation and Objectives
SpaceSpan was created to support VLMs in achieving effective 3D spatial reasoning by providing:
- A unified training corpus aligned with both vision and language modalities.
- A mechanism to learn from compact 3D proxy tokens, as opposed to extended 2D sequences or large uncompressed geometric representations.
- Diverse and heterogeneous data spanning numerous 3D scene understanding tasks, including question answering, grounding, and dense captioning.
Previous approaches to 3D-VLM training depended either on computationally intensive feature matching (correspondence-based objectives) or on token-inefficient fixed geometric priors, which limited their scalability and generalization.
2. Data Collection and Composition
SpaceSpan consists of 318,000 scene–task pairs, divided as follows:
| Source Benchmark | Example Count | Tasks Covered |
|---|---|---|
| ScanQA, SQA3D, Scan2Cap, ScanRefer, Multi3DRefer | 155,000 | QA, captioning, grounding |
| MMScan, SR-91K (SpaceR) | 163,000 | Hierarchical relations/QAs |
Each example is derived from a uniformly sampled video clip (N = 32 RGB frames, ≈1 fps, 512×512 resolution) of indoor reconstructions, predominantly from ScanNet and ARKitScenes (~2,000 distinct rooms). Every scene provides up to 100 identifier tags, and the label set spans 213 object categories.
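As a minimal sketch of the uniform sampling described above, the helper below picks N = 32 evenly spaced frame indices from a longer video. The function name and the bin-midpoint rule are illustrative assumptions; the source only states that clips are uniformly sampled.

```python
from typing import List


def uniform_frame_indices(num_video_frames: int, num_samples: int = 32) -> List[int]:
    """Pick num_samples indices spread evenly over a video.

    Assumption: the video is split into equal bins and the midpoint of each
    bin is taken. The source does not specify the exact sampling rule.
    """
    if num_video_frames <= 0 or num_samples <= 0:
        raise ValueError("both counts must be positive")
    if num_video_frames <= num_samples:
        # Short video: keep every frame rather than repeating indices.
        return list(range(num_video_frames))
    step = num_video_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]


if __name__ == "__main__":
    print(uniform_frame_indices(320)[:4])  # prints [5, 15, 25, 35]
```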
The raw per-example data includes:
- 32 RGB frames at 512×512 resolution.
- 3D point maps from the VGGT geometry predictor.
- Pixel-wise semantic masks from the SAM2 segmenter.
- Task-specific CAD-aligned 3D bounding boxes for grounding.
3. Annotation Schema and Verification
Each data point is annotated with:
- 3D proxy positions and feature vectors (computed as outlined in Section 5).
- Semantic labels for each visual patch (213 categories).
- Question–answer pairs, dense captions, or referring expressions, depending on the focus of the source benchmark.
- Grounding boxes (axis-aligned in 3D) for relevant tasks.
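Because the grounding boxes are axis-aligned in 3D, overlap between a predicted and an annotated box can be computed per axis. The sketch below shows one way to do this; the `(xmin, ymin, zmin, xmax, ymax, zmax)` layout and the function name are assumptions for illustration, not a format stated by the source.

```python
from typing import Sequence


def aabb_iou_3d(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """IoU of two axis-aligned 3D boxes.

    Assumed layout: (xmin, ymin, zmin, xmax, ymax, zmax).
    """
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= hi - lo

    def volume(box: Sequence[float]) -> float:
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - inter
    return inter / union


if __name__ == "__main__":
    # Two 2x2x2 boxes overlapping in a 1x1x1 corner: IoU = 1 / (8 + 8 - 1).
    print(aabb_iou_3d((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3)))
```

Grounding benchmarks such as ScanRefer typically count a prediction as correct when this IoU exceeds a fixed threshold.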
Labeling pipelines leverage automated tools with high agreement rates:
- SAM2 segmentation masks yield ≥95% alignment with reference categories, verified on a held-out ScanNet subset.
- 3D point maps are scale-checked against ScanNet depth maps (>98% pass rate).
- QA and grounding annotations are directly from the original datasets; MMScan’s 115,000 new relational questions are double-checked for spatial validity.
4. Dataset Scale, Distribution, and Statistics
The SpaceSpan dataset is partitioned as follows: 90% for training (286,200 examples), 5% validation (15,900), and 5% testing (15,900). Scene-level statistics include:
- Average proxies per scene after clustering: ~700 (min 450, max 1,000).
- Task distribution: 3D QA (~100,000 pairs), dense captions (~20,000), visual grounding (~50,000), object–object relations (115,000).
- Category occurrence: head categories (e.g., chairs, tables) appear in ~60% of scenes, while rare categories (e.g., pepper shakers) in ~2%.
- Geometric variability: object distances (0.2–5 m), full angle coverage, room sizes (4–100 m²).
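The split sizes quoted above follow directly from the 318,000 total. A short check, using integer percentages to avoid floating-point rounding (variable names are illustrative):

```python
# Totals from the composition table: 155,000 + 163,000 scene-task pairs.
source_counts = {"classic_benchmarks": 155_000, "mmscan_sr91k": 163_000}
total = sum(source_counts.values())

# 90 / 5 / 5 partition, expressed in whole percent.
split_percent = {"train": 90, "val": 5, "test": 5}
split_sizes = {name: total * pct // 100 for name, pct in split_percent.items()}

print(total)        # prints 318000
print(split_sizes)  # prints {'train': 286200, 'val': 15900, 'test': 15900}
```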
5. Semantic-Aware Clustering and Proxy Generation
SpaceSpan employs a semantic-aware clustering pipeline to transform raw visual-geometric data into compact, sequence-serializable 3D proxies. The steps are:
- Feature Extraction:
  - Semantic encoder (Qwen2.5-VL) generates feature maps per frame.
  - Geometric predictor yields dense point maps $P_t$; segmentation masks $M_t$ from SAM2 provide pixel-level categories.
- Patchify and Grouping:
  - Patch-flattening produces tokens $f_i$, pooled 3D points $p_i$, and semantic labels $s_i$.
  - Semantic grouping $G_c = \{\, i : s_i = c \,\}$ for each category $c$.
- Proxy Clustering:
  - Each group $G_c$ receives $K_c$ proxies. K-means or KNN is performed on spatial coordinates $\{p_i\}_{i \in G_c}$, forming clusters $C_{c,k}$.
  - For each cluster, mean feature $\bar{f}_{c,k}$ and mean position $\bar{p}_{c,k}$ are computed.
- Position Embeddings and Serialization:
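The grouping and clustering steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses basic Lloyd's k-means, and the per-group proxy budget (`patches_per_proxy`) is a hypothetical rule, since the source does not say how the number of proxies per group is chosen. Position embeddings and serialization are not covered.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence

Vec = List[float]


def _mean(vectors: Sequence[Sequence[float]]) -> Vec:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_assign(points: Sequence[Sequence[float]], k: int,
                  iters: int = 10, seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means on 3D points; returns a cluster id per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(list(points), k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: _sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = _mean(members)
    return assign


def build_proxies(features: Sequence[Vec], positions: Sequence[Vec],
                  labels: Sequence[str], patches_per_proxy: int = 4) -> List[Dict]:
    """Group patches by semantic label, cluster each group in 3D space,
    then average features and positions within each cluster."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)

    proxies = []
    for label, idx in sorted(groups.items()):
        # Hypothetical budget: about one proxy per `patches_per_proxy` patches.
        k = max(1, min(len(idx), round(len(idx) / patches_per_proxy)))
        assign = kmeans_assign([positions[i] for i in idx], k)
        for c in range(k):
            members = [i for i, a in zip(idx, assign) if a == c]
            if not members:
                continue
            proxies.append({
                "label": label,
                "feature": _mean([features[i] for i in members]),
                "position": _mean([positions[i] for i in members]),
            })
    return proxies
```

Clustering within a semantic group, rather than over the whole scene, keeps each proxy tied to a single category, so averaging features never mixes, say, chair and table patches.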
6. Integration in Multi-Stage Training
SpaceSpan provides the aligned vision-language supervision for four-stage progressive training:
- Identifier + Semantic Alignment: Synthetic “identifier” images and “semantic” icons are rendered for each object, features extracted, and used in identification tasks with cross-entropy losses.
- Coordinate Alignment: Tasks include explicit spatial localization of objects with 3D RoPE, penalizing localization errors by cross-entropy and L2 regression.
- Object-Object Relation Learning: Model is trained to encode and reason about spatial relations (e.g., left/right) between object pairs using binary cross-entropy.
- Full 3D Proxy Training: The complete set of Proxy3D tokens is used to fine-tune the model on all downstream tasks, including spatial QA, grounding, and captioning.
Training durations on 8×A6000 GPUs are: Stages 1–2 (~2 hours each), Stage 3 (3 hours), Stage 4 (55 hours).
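To put these durations in perspective, the snippet below totals the approximate per-stage hours and converts them to GPU-hours. The stage keys are illustrative labels, and the hour values are the rounded figures quoted above.

```python
# Approximate wall-clock hours per stage on 8 GPUs (rounded figures from the text).
stage_hours = {
    "stage1_identifier_semantic_alignment": 2,
    "stage2_coordinate_alignment": 2,
    "stage3_relation_learning": 3,
    "stage4_full_proxy_training": 55,
}
NUM_GPUS = 8

total_hours = sum(stage_hours.values())
gpu_hours = total_hours * NUM_GPUS
stage4_share = stage_hours["stage4_full_proxy_training"] / total_hours

print(total_hours)            # prints 62
print(gpu_hours)              # prints 496
print(f"{stage4_share:.1%}")  # prints 88.7%
```

The three alignment stages together account for roughly a ninth of the total budget; nearly all compute goes to the final full-proxy stage.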
7. Benchmark Results and Significance
Proxy3D models trained with SpaceSpan reach competitive or state-of-the-art performance on major 3D-VLM benchmarks while using only ~700 proxy tokens per scene, substantially fewer than the 3,000–8,000 tokens used by prior approaches:
| Task-Benchmark | SpaceSpan/Proxy3D Result | Comparison (3DRS, GPT4Scene) |
|---|---|---|
| 3D QA (ScanQA/SQA3D) | EM 93.6 (ScanQA); 57.5% (SQA3D) | 104.8; 60.6% |
| Visual Grounding (ScanRefer) | Acc 54.1%; F₁ 57.5% | 56.1%; 59.8% |
| Dense Captioning (Scan2Cap) | CIDEr 73.3; BLEU-4 34.7 | 86.1; 41.6 |
| VSI-Bench (spatial reasoning) | 47.0% (2nd open-source) | Human: 79.2% |
Notably, Proxy3D+SpaceSpan achieves substantial improvements in object counting and size estimation (+18 and +23 percentage points over Qwen2.5-VL). The dataset's unified support for proxy representation, semantic symbols, identifier embeddings, a broad range of spatial linguistic tasks, and object–object relations distinguishes it from previous resources, providing an integrated foundation for compact and efficient VLM-driven 3D scene understanding (Jiang et al., 8 May 2026).