
SpaceSpan: Unified 3D Spatial Intelligence Data

Updated 31 May 2026
  • SpaceSpan is a large-scale heterogeneous dataset integrating 3D proxy representations, annotated semantics, and language tasks for spatial reasoning.
  • It employs semantic-aware clustering and proxy generation to compress raw visual-geometric data into efficient tokens for unified VLM training.
  • The dataset supports diverse tasks including 3D QA, dense captioning, and grounding, achieving state-of-the-art performance with fewer tokens.

SpaceSpan is a large-scale, heterogeneous dataset developed to enable efficient and unified training of Vision-Language Models (VLMs) for 3D spatial intelligence. Its design supports the learning of “compressed” 3D world representations, bridging the gap between scene geometry, semantic understanding, and language-based reasoning. Unlike prior datasets that rely either on costly 2D-3D correspondence objectives or on inefficient raw point cloud priors, SpaceSpan offers a unified corpus that associates 3D proxy representations, annotated semantics, and a wide set of spatially grounded linguistic tasks (Jiang et al., 8 May 2026).

1. Motivation and Objectives

SpaceSpan was created to support VLMs in achieving effective 3D spatial reasoning by providing:

  • A unified training corpus aligned with both vision and language modalities.
  • A mechanism to learn from compact 3D proxy tokens, as opposed to extended 2D sequences or large uncompressed geometric representations.
  • Diverse and heterogeneous data spanning numerous 3D scene understanding tasks, including question answering, grounding, and dense captioning.

Previous approaches to 3D-VLM training were limited either by computationally intensive feature matching (correspondence-based methods) or by token inefficiency (fixed geometric priors), both of which hurt scalability and generalization.

2. Data Collection and Composition

SpaceSpan consists of 318,000 scene–task pairs, divided as follows:

| Source Benchmark | Example Count | Tasks Covered |
|---|---|---|
| ScanQA, SQA3D, Scan2Cap, ScanRefer, Multi3DRefer | 155,000 | QA, captioning, grounding |
| MMScan, SR-91K (SpaceR) | 163,000 | Hierarchical relations/QAs |

Each example is derived from a uniformly sampled video clip (N = 32 RGB frames, ≈1 fps, 512×512 resolution) of an indoor reconstruction, predominantly from ScanNet and ARKitScenes (~2,000 distinct rooms). Each scene carries up to 100 identifier tags, and annotations span 213 object categories.

The raw per-example data includes:

  • 32 RGB frames at 512×512 resolution.
  • 3D point maps $P_i \in \mathbb{R}^{H \times W \times 3}$ from the VGGT geometry predictor.
  • Pixel-wise semantic masks $M_i \in \mathbb{Z}^{H \times W}$ from the SAM2 segmenter.
  • Task-specific CAD-aligned 3D bounding boxes for grounding.
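
The per-example fields listed above can be pictured as a small typed record. The following is a minimal sketch, not the dataset's actual storage format: the class name `RawExample`, the field names, the dtypes, and the six-value box layout are all hypothetical, and it assumes the point maps and masks share the frame resolution.

```python
from dataclasses import dataclass

import numpy as np

NUM_FRAMES = 32   # N = 32 frames per clip (from the text)
FRAME_SIZE = 512  # 512x512 resolution (from the text)


@dataclass
class RawExample:
    """Hypothetical container for one raw example; names are illustrative."""
    frames: np.ndarray      # (N, H, W, 3) RGB frames
    point_maps: np.ndarray  # (N, H, W, 3) per-pixel 3D points P_i
    masks: np.ndarray       # (N, H, W) integer semantic masks M_i
    boxes: np.ndarray       # (B, 6) 3D boxes; the 6-value layout is an assumption


def validate(example: RawExample) -> None:
    """Raise ValueError if the arrays do not have mutually consistent shapes."""
    n, h, w, channels = example.frames.shape
    if n != NUM_FRAMES or channels != 3:
        raise ValueError("frames must have shape (32, H, W, 3)")
    if example.point_maps.shape != (n, h, w, 3):
        raise ValueError("point maps must match the frame grid with 3 coordinates")
    if example.masks.shape != (n, h, w):
        raise ValueError("masks must match the frame grid")
    if not np.issubdtype(example.masks.dtype, np.integer):
        raise ValueError("masks must hold integer category ids")
    if example.boxes.ndim != 2 or example.boxes.shape[1] != 6:
        raise ValueError("boxes must have shape (B, 6)")


def make_dummy_example(size: int = FRAME_SIZE, num_boxes: int = 2) -> RawExample:
    """Build a zero-filled example, useful for testing a data loader."""
    return RawExample(
        frames=np.zeros((NUM_FRAMES, size, size, 3), dtype=np.uint8),
        point_maps=np.zeros((NUM_FRAMES, size, size, 3), dtype=np.float32),
        masks=np.zeros((NUM_FRAMES, size, size), dtype=np.int32),
        boxes=np.zeros((num_boxes, 6), dtype=np.float32),
    )
```

Calling `validate(make_dummy_example(size=64))` passes silently, while an example whose masks have a different spatial size raises `ValueError`.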

3. Annotation Schema and Verification

Each data point is annotated with:

  1. 3D proxy positions and feature vectors (computed as outlined in Section 5).
  2. Semantic labels $m_j$ for each visual patch (213 categories).
  3. Question–answer pairs, dense captions, or referring expressions, depending on the focus of the source benchmark.
  4. Grounding boxes (axis-aligned in 3D) for relevant tasks.

Labeling pipelines leverage automated tools with high agreement rates:

  • SAM2 segmentation masks yield ≥95% alignment with reference categories, verified on a held-out ScanNet subset.
  • 3D point maps $P_i$ are scale-checked against ScanNet depth maps (>98% pass rate).
  • QA and grounding annotations are taken directly from the original datasets; MMScan’s 115,000 new relational questions are double-checked for spatial validity.

4. Dataset Scale, Distribution, and Statistics

The SpaceSpan dataset is partitioned as follows: 90% for training (286,200 examples), 5% validation (15,900), and 5% testing (15,900). Scene-level statistics include:

  • Average proxies per scene after clustering: $K \approx 700$ (min 450, max 1,000).
  • Task distribution: 3D QA (~100,000 pairs), dense captions (~20,000), visual grounding (~50,000), object–object relations (115,000).
  • Category occurrence: head categories (e.g., chairs, tables) appear in ~60% of scenes, while rare categories (e.g., pepper shakers) in ~2%.
  • Geometric variability: object distances (0.2 m–5 m), full $360^\circ$ angle coverage, room sizes (4–100 m$^2$).
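
The split sizes quoted at the start of this section follow directly from the 318,000-example total. A short arithmetic check, in which the function and variable names are illustrative only:

```python
TOTAL_EXAMPLES = 318_000
SPLIT_PERCENT = {"train": 90, "val": 5, "test": 5}


def split_sizes(total: int, percents: dict) -> dict:
    """Return the number of examples per split for integer percentages."""
    if sum(percents.values()) != 100:
        raise ValueError("split percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in percents.items()}


sizes = split_sizes(TOTAL_EXAMPLES, SPLIT_PERCENT)
print(sizes)  # {'train': 286200, 'val': 15900, 'test': 15900}
```

The three sizes sum back to 318,000, so no example is left unassigned.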

5. Semantic-Aware Clustering and Proxy Generation

SpaceSpan employs a semantic-aware clustering pipeline to transform raw visual-geometric data into compact, sequence-serializable 3D proxies. The steps are:

  • Feature Extraction:
    • Semantic encoder $G_{sem}$ (Qwen2.5-VL) generates feature maps $F_i \in \mathbb{R}^{28 \times 28 \times 1024}$ per frame.
    • Geometric predictor $G_{geom}$ yields dense point maps $P_i \in \mathbb{R}^{H \times W \times 3}$; segmentation masks $M_i \in \mathbb{Z}^{H \times W}$ from SAM2 provide pixel-level categories.
  • Patchify and Grouping:
    • Patch-flattening produces per-patch feature tokens, pooled 3D points, and semantic labels $m_j$.
    • Patches are grouped semantically, with one group for each category.
  • Proxy Clustering:
    • Each group receives its own allocation of proxies. K-means or KNN clustering is performed on the spatial coordinates of the group's patches, forming clusters.
    • For each cluster, the mean feature and mean position are computed.
  • Position Embeddings and Serialization:
    • 3D RoPE is applied to the vertical coordinate, together with learned Fourier embeddings.
    • Proxies are serialized by breadth-first search (BFS) over their adjacency graph, yielding a compact feature sequence.
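
The grouping, clustering, and serialization steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the fixed per-group budget `proxies_per_group`, the plain Lloyd's k-means, and the k-nearest-neighbor adjacency graph used for BFS are all choices made here for the example, and every function name is hypothetical.

```python
from collections import deque

import numpy as np


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on (n, 3) points; returns a cluster index per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    assign = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = points[assign == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return assign


def build_proxies(features, points, labels, proxies_per_group=2):
    """Group patches by label, cluster each group spatially, then average."""
    proxy_feats, proxy_pos = [], []
    for category in np.unique(labels):
        idx = np.flatnonzero(labels == category)
        k = min(proxies_per_group, len(idx))  # assumed fixed budget per group
        assign = kmeans(points[idx], k)
        for c in range(k):
            members = idx[assign == c]
            if len(members) == 0:
                continue
            proxy_feats.append(features[members].mean(axis=0))  # mean feature
            proxy_pos.append(points[members].mean(axis=0))      # mean position
    return np.stack(proxy_feats), np.stack(proxy_pos)


def bfs_order(positions, num_neighbors=3, start=0):
    """Order proxies by BFS over a k-nearest-neighbor graph (graph type assumed)."""
    n = len(positions)
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
    neighbors = np.argsort(dists, axis=1)[:, 1:num_neighbors + 1]
    order, seen = [], set()
    for root in [start] + list(range(n)):  # restart to reach unvisited proxies
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in neighbors[node]:
                nb = int(nb)
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return order


# Toy usage with random data: 60 patches, 8-dim features, 3 categories.
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 8))
points = rng.uniform(0.0, 5.0, size=(60, 3))
labels = rng.integers(0, 3, size=60)
feats, pos = build_proxies(features, points, labels)
sequence = feats[bfs_order(pos)]  # serialized proxy feature sequence
```

Clustering within each semantic group keeps a proxy from averaging features across different object categories, and the BFS ordering places spatially adjacent proxies near each other in the resulting sequence.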

6. Integration in Multi-Stage Training

SpaceSpan provides the aligned vision-language supervision for four-stage progressive training:

  1. Identifier + Semantic Alignment: Synthetic “identifier” images and “semantic” icons are rendered for each object, features extracted, and used in identification tasks with cross-entropy losses.
  2. Coordinate Alignment: Tasks include explicit spatial localization of objects with 3D RoPE, penalizing localization errors by cross-entropy and L2 regression.
  3. Object-Object Relation Learning: Model is trained to encode and reason about spatial relations (e.g., left/right) between object pairs using binary cross-entropy.
  4. Full 3D Proxy Training: The complete Proxy3D tokens are used to finetune the model on all downstream tasks including spatial QA, grounding, and captioning.

Training durations on 8×A6000 GPUs are: Stages 1–2 (~2 hours each), Stage 3 (3 hours), Stage 4 (55 hours).
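
The four stages and their reported durations can be written down as a small schedule to see where the compute goes. This is a sketch only: the `Stage` class and the stage identifiers are hypothetical, the durations are the approximate figures quoted above, and the loss for Stage 4 is left empty because the text does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str      # illustrative identifier, not an official stage name
    losses: tuple  # loss terms as described in the text
    hours: float   # approximate wall-clock hours on 8xA6000


SCHEDULE = (
    Stage("identifier_semantic_alignment", ("cross_entropy",), 2.0),
    Stage("coordinate_alignment", ("cross_entropy", "l2_regression"), 2.0),
    Stage("object_relation_learning", ("binary_cross_entropy",), 3.0),
    Stage("full_proxy_training", (), 55.0),  # loss not stated in the text
)
NUM_GPUS = 8

total_hours = sum(stage.hours for stage in SCHEDULE)
gpu_hours = total_hours * NUM_GPUS
final_stage_share = SCHEDULE[-1].hours / total_hours

print(f"total wall-clock hours: {total_hours}")
print(f"total GPU-hours: {gpu_hours}")
print(f"share spent in final stage: {final_stage_share:.1%}")
```

Under these figures the run takes about 62 wall-clock hours (roughly 496 GPU-hours), with close to 89% of it spent in the final stage.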

7. Benchmark Results and Significance

Proxy3D models trained with SpaceSpan reach competitive or state-of-the-art performance on major 3D-VLM benchmarks while using only ~700 proxy tokens per scene, substantially fewer than the 3,000–8,000 tokens used by prior approaches:

| Task-Benchmark | SpaceSpan/Proxy3D Result | Comparison (3DRS, GPT4Scene) |
|---|---|---|
| 3D QA (ScanQA/SQA3D) | EM ↑93.6; ↑57.5% | 104.8; 60.6% |
| Visual Grounding (ScanRefer) | Acc 54.1%; F₁ 57.5% | 56.1%; 59.8% |
| Dense Captioning (Scan2Cap) | CIDEr 73.3; BLEU-4 34.7 | 86.1; 41.6 |
| VSI-Bench (spatial reasoning) | 47.0% (2nd open-source) | Human: 79.2% |
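
To put the token counts in perspective, the snippet below computes how much shorter a ~700-token scene is than the 3,000–8,000 tokens cited for prior approaches. The quadratic figure is an assumption added here: it applies only if the model uses standard full self-attention, whose cost grows with the square of sequence length, and it is not a measurement reported for Proxy3D.

```python
PROXY_TOKENS = 700                  # tokens per scene, from the text
PRIOR_TOKEN_RANGE = (3_000, 8_000)  # prior approaches, from the text


def reduction(prior_tokens: int, proxy_tokens: int = PROXY_TOKENS):
    """Return (sequence-length ratio, ratio squared) for one comparison."""
    length_ratio = prior_tokens / proxy_tokens
    return length_ratio, length_ratio ** 2


for prior in PRIOR_TOKEN_RANGE:
    length_ratio, attention_ratio = reduction(prior)
    print(f"{prior} tokens: {length_ratio:.1f}x longer, "
          f"{attention_ratio:.1f}x attention cost if quadratic")
```

The sequence is between about 4.3 and 11.4 times shorter, which is the sense in which the proxies are "compact".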

Notably, Proxy3D+SpaceSpan achieves substantial improvements in object counting and size estimation (+18 and +23 percentage points over Qwen2.5-VL). The dataset's unified support for proxy representation, semantic symbols, identifier embeddings, a broad range of spatial linguistic tasks, and object–object relations distinguishes it from previous resources, providing an integrated foundation for compact and efficient VLM-driven 3D scene understanding (Jiang et al., 8 May 2026).
