
SpatialQA Dataset: Depth-Aware Spatial Reasoning

Updated 19 November 2025
  • SpatialQA is a multi-level dataset combining diverse RGB-D imagery with detailed QA pairs to evaluate depth perception and spatial reasoning in vision–language models.
  • It integrates data from multiple sources using both sensor and estimated depths, with automated prompting and manual verification to ensure high-quality annotations.
  • The dataset categorizes queries into absolute depth, relative depth, and complex spatial relationships, enabling precise benchmarking of model spatial competence.

SpatialQA is a multi-level dataset designed for training and evaluating vision–language models (VLMs) on depth understanding and spatial reasoning, with a special emphasis on depth-aware question answering. Developed as part of the SpatialBot framework, it seeks to benchmark fine-grained spatial comprehension—from pixel-accurate depth queries to higher-order 3D spatial relationships—by combining diverse RGB–depth data sources with systematically generated QA pairs. SpatialQA is distinct within the landscape of spatial QA datasets for its integration of multi-source RGB–D imagery, depth-driven QA annotation, and a hierarchical question taxonomy, facilitating granular assessment of VLMs’ spatial competence (Cai et al., 19 Jun 2024).

1. Composition and Data Sources

SpatialQA aggregates RGB–D images and QA pairs from seven primary sources:

  • Bunny_695k (COCO + Visual Genome [VG]) with ≈ 695,000 images (MDE-estimated depth)
  • VG/COCO subset: 20,000 images (ZoeDepth)
  • KITTI: 1,750 images (filtered, ZoeDepth)
  • NYU Depth v2: 1,500 images (sensor depth)
  • RT-X: 7,500 images (sensor or ZoeDepth)
  • SA-1B: 15,000 images (ZoeDepth)
  • 2D-3D-S: 2,900 images (sensor depth)

This yields approximately 743,000–750,000 unique RGB–D images. Each image is paired with a channel-aligned depth map, stored either as a single-channel uint24 PNG (millimeter-quantized, covering ranges up to 131 m) or as a three-channel uint8 PNG (channel bases of 2⁰, 2⁵, and 2¹⁰ mm). All images are provided at CLIP-compatible resolutions (384×384 or 336×336 px).
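
As an illustration of this packing, the following is a minimal sketch of encoding and decoding the three-channel uint8 depth format, assuming the channels carry coefficients for bases of 2⁰, 2⁵, and 2¹⁰ mm as described above; the exact channel order and rounding used in the released files may differ.

import numpy as np

# Sketch of the three-channel uint8 depth packing described above (assumed scheme).
BASES_MM = np.array([2**0, 2**5, 2**10], dtype=np.uint32)  # per-channel weight in mm

def decode_depth_mm(depth_png: np.ndarray) -> np.ndarray:
    """Recover an HxW millimetre depth map from an HxWx3 uint8 array."""
    return (depth_png.astype(np.uint32) * BASES_MM).sum(axis=-1)

def encode_depth_mm(depth_mm: np.ndarray) -> np.ndarray:
    """Pack an HxW millimetre depth map into three uint8 channels (greedy, high base first)."""
    remaining = depth_mm.astype(np.uint32)
    channels = []
    for base in BASES_MM[::-1]:                       # 2^10, then 2^5, then 2^0
        coeff = np.clip(remaining // base, 0, 255)
        channels.append(coeff.astype(np.uint8))
        remaining = remaining - coeff * base
    return np.stack(channels[::-1], axis=-1)          # channel order back to (2^0, 2^5, 2^10)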

Correspondingly, the dataset provides ≈ 740,000–750,000 QA pairs. Each image carries 1–3 QA turns targeting different spatial reasoning levels, combining questions inherited from Bunny_695k with new, depth-focused ones. The data are partitioned with a 90/5/5 percent train/validation/test split, although the primary publication does not detail an explicit held-out test set for SpatialQA.

2. Taxonomy of Spatial Questions

SpatialQA’s QA pairs are categorized into three levels, each designed to elicit different reasoning capabilities:

  1. Level 1: Absolute Depth
    • Queries require pixel- or object-centered depth value estimation from the underlying depth map.
    • Example: “What is the depth at pixel (200,150)?” → “Depth(1.23 m)”
  2. Level 2: Relative Depth / Proximity
    • Questions target comparative judgments (e.g., between two objects), requiring Boolean or categorical responses.
    • Example: “Is the red mug or the blue pitcher closer to the camera?” → “The red mug is closer.”
  3. Level 3: 3D Spatial Relationships and Counting
    • These capture higher-order object relationships and cardinality constraints, such as “Has A touched B?” or counting the number of objects satisfying a predicate.
    • Example: “Has the green block reached the red block?” → “No – there is still a 5 cm gap.”

The annotation process leverages both automated prompting (GPT-4o) and manual verification, combining depth-map analysis, object-level segmentation (using SAM or bounding-box+center cues), and spatial language templates. Percentile statistics within object masks are used to describe object depth (e.g., the 5th percentile as “min,” the 95th percentile as “max,” plus the mean), as sketched below.
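
A minimal sketch of this percentile-based depth description, assuming an HxW depth map in millimetres and a boolean object mask (function name and the invalid-pixel handling are illustrative):

import numpy as np

def object_depth_stats(depth_mm: np.ndarray, mask: np.ndarray) -> dict:
    """Summarise one object's depth (in metres) from a boolean mask over the depth map."""
    values = depth_mm[mask & (depth_mm > 0)]          # drop invalid zero-depth pixels
    if values.size == 0:
        return {}
    return {
        "min_m": float(np.percentile(values, 5)) / 1000.0,   # robust "min"
        "max_m": float(np.percentile(values, 95)) / 1000.0,  # robust "max"
        "mean_m": float(values.mean()) / 1000.0,
    }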

3. Annotation and QA Generation

SpatialQA’s generation pipeline comprises:

  • Depth channel population: Sensor depth is used when available; otherwise, ZoeDepth MDE is employed.
  • Segmentation: For objects, a stable mask is generated using SAM. If this fails, a fallback to center-pixel depth is applied (see the sketch after this list).
  • Prompted QA Synthesis: For ≈ 50,000 images, depth-related prompts are issued to GPT-4o, with templates covering low-level metric reading, proximity comparison, and robot scene understanding.
  • Human-in-the-loop QA and filtering: Manual annotation and spot checks remove repetitive or ambiguous examples. In particular, the RT-X subset features manual bounding-box and gripper annotations for robotic grasping scenarios, and all VG/COCO entries are verified against RGB content.
  • Quality assurance: No explicit inter-annotator agreement figure is reported; quality is instead enforced through manual spot checks and SAM confidence thresholds.
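
The mask-or-fallback step above can be illustrated as follows; the confidence threshold, helper names, and the use of the median as the mask-based estimate are assumptions rather than the released pipeline:

import numpy as np

STABILITY_THRESHOLD = 0.9  # hypothetical cut-off for accepting a SAM mask

def query_object_depth(depth_mm, mask, mask_score, bbox):
    """Return a single depth estimate in metres, preferring the mask over the bbox centre."""
    if mask is not None and mask_score >= STABILITY_THRESHOLD and mask.any():
        values = depth_mm[mask & (depth_mm > 0)]
        if values.size:
            return float(np.median(values)) / 1000.0  # mask-based estimate
    x0, y0, x1, y1 = bbox                             # fallback: centre pixel of the bounding box
    cy, cx = (y0 + y1) // 2, (x0 + x1) // 2
    return float(depth_mm[cy, cx]) / 1000.0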

A JSONL schema is employed per image, encapsulating RGB and depth paths alongside all QA turns.

4. Evaluation Metrics and Benchmarking Protocols

Two key classes of metrics are defined:

  • Depth Estimation:

    • Mean Absolute Error (MAE):

      \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert d_i^{\mathrm{gt}} - d_i^{\mathrm{pred}} \rvert

    • Depth Accuracy @10%: Proportion of depth queries whose predictions fall within ±10% of ground truth.

  • QA Accuracy:

    • Top-1 Accuracy: Fraction of relative/spatial QA where the model's best answer matches ground truth:

      \mathrm{Acc} = \frac{\#\ \text{correct answers}}{\text{total questions}} \times 100\%

    • On the SpatialBench suite (120 images), accuracy is further decomposed into “Depth,” “Position,” “Counting,” “Reaching,” and “Size” categories.

SpatialQA’s evaluation emphasizes alignment with precise depth and spatial reference, challenging VLMs to anchor language understanding directly in metric 3D structure.
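
A minimal sketch of these two metric classes, assuming depth predictions and ground truths arrive as paired lists in metres and QA answers as pre-normalised strings (all function names are illustrative):

def depth_metrics(d_gt, d_pred, tol=0.10):
    """Return (MAE, depth accuracy @ tol) over paired depth queries, values in metres."""
    n = len(d_gt)
    mae = sum(abs(g - p) for g, p in zip(d_gt, d_pred)) / n
    acc_at_tol = sum(abs(g - p) <= tol * g for g, p in zip(d_gt, d_pred)) / n
    return mae, acc_at_tol

def top1_accuracy(gt_answers, pred_answers):
    """Percentage of QA items whose predicted answer matches ground truth after normalisation."""
    correct = sum(g.strip().lower() == p.strip().lower()
                  for g, p in zip(gt_answers, pred_answers))
    return 100.0 * correct / len(gt_answers)

# Example: depth_metrics([1.20, 2.50], [1.23, 2.10]) -> (≈0.215, 0.5)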

5. Data Formats, Accessibility, and Usage

  • Image and Depth: Each instance comprises an RGB image (.jpeg/.png) and a depth map stored as a single-channel uint24 or three-channel uint8 PNG.
  • QA Storage: Each image’s QA is stored in a .jsonl format.
  • Schema Example:

{
  "image_id": "VG_0000123456",
  "rgb_path": "images/VG/VG_0000123456.jpg",
  "depth_path": "depths/VG/VG_0000123456_d.png",
  "qa": [
    {"question": "What is the depth at pixel (200,150)?", "answer": "1.23 m", "level": "absolute"},
    {"question": "Which object is closer: the red mug or the blue pitcher?", "answer": "the red mug", "level": "relative"},
    {"question": "Has the red mug touched the white plate?", "answer": "No, it is still 3 cm above.", "level": "spatial"}
  ]
}
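
A minimal loader sketch for records in this schema is shown below; the file name, directory layout, and the assumed three-channel depth decoding are illustrative rather than prescribed.

import json
from pathlib import Path

import numpy as np
from PIL import Image

def load_spatialqa(jsonl_path, root="."):
    """Yield (image_id, rgb, depth_mm, qa_turns) for each record in a SpatialQA-style JSONL file."""
    root = Path(root)
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            rgb = np.asarray(Image.open(root / rec["rgb_path"]).convert("RGB"))
            depth_png = np.asarray(Image.open(root / rec["depth_path"]))
            if depth_png.ndim == 3:                   # assumed three-channel uint8 packing
                bases = np.array([1, 32, 1024], dtype=np.uint32)
                depth_mm = (depth_png.astype(np.uint32) * bases).sum(axis=-1)
            else:                                     # assumed single-channel millimetre depth
                depth_mm = depth_png.astype(np.uint32)
            yield rec["image_id"], rgb, depth_mm, rec["qa"]

# Usage with a hypothetical file name:
# for image_id, rgb, depth_mm, qa in load_spatialqa("spatialqa_train.jsonl"):
#     print(image_id, depth_mm.shape, len(qa))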

All data and code are released under a CC-BY-NC 4.0 license via Hugging Face and the BAAI-DCAI GitHub repository. The dataset suite includes associated variants for embodiment (SpatialQA-E) and a dedicated evaluation suite (SpatialBench) (Cai et al., 19 Jun 2024).

6. Comparative Context and Position in the SpatialQA Ecosystem

SpatialQA distinguishes itself from other spatial QA datasets in several dimensions:

  • Multi-source, depth-augmented RGB data: Integrates MDE and sensor-derived depths from a broad range of domains (indoor, driving, robotic, synthetic).
  • Hierarchical QA taxonomy: Explicit structuring of queries from absolute metric estimation through relational and high-level comprehension.
  • Focus on depth-grounded spatial language: Unlike object-centric or semantic QA datasets (e.g., ScanQA (Azuma et al., 2021), Space3D-Bench (Szymanska et al., 29 Aug 2024)), SpatialQA directly supervises depth map grounding.
  • Direct evaluation of depth understanding: Challenging VLMs to go beyond 2D semantics to make metric claims about 3D structure.

These attributes facilitate targeted diagnostics for spatial reasoning modules in general VLMs and embodied AI architectures.

7. Limitations and Future Directions

Limitations of SpatialQA include:

  • The lack of explicit inter-annotator agreement reporting limits quantification of annotation noise.
  • The absence of highly free-form, real-user queries constrains assessment of generalization to open-world spatial reasoning.
  • Depth estimation relies on MDE where sensors are unavailable, introducing possible model-driven bias into the data.
  • The taxonomy encompasses only three coarse levels; future work could introduce finer subcategories or spatio-temporal reasoning (as in STRIDE-QA (Ishihara et al., 14 Aug 2025)).

Prospective extensions involve integrating ground-truth sensor suites for 3D structure, expanding dialogue-like QA, and enriching the open-endedness and spatial variety of queries and environments.


SpatialQA plays a pivotal role in benchmarking and improving depth-aware spatial reasoning in vision–language models, providing a densely annotated, multi-level corpus with clear evaluation protocols and open-access availability (Cai et al., 19 Jun 2024).
