SpatialLadder-26k: Progressive Spatial Reasoning
- The dataset provides a hierarchical curriculum for spatial reasoning, from basic object localization to complex spatiotemporal tasks, enabling models trained on it to reach state-of-the-art accuracy.
- It is constructed via a three-stage pipeline—raw data collection, 3D-to-2D unification, and QA generation—that ensures high annotation quality and internal consistency.
- SpatialLadder-26k underpins applications in robotics, autonomous navigation, and AR/VR, while also setting the stage for expansions into outdoor and new modality domains.
SpatialLadder-26k is a systematic, multimodal dataset designed to support progressive spatial reasoning in vision-language models (VLMs). Consisting of 26,610 question–answer samples, it provides a hierarchical curriculum ranging from foundational object localization to complex spatiotemporal reasoning. The dataset spans single-image, multi-view, and video modalities, and underpins the three-stage training paradigm for the SpatialLadder model, which achieves state-of-the-art accuracy on both in-domain and out-of-domain spatial reasoning benchmarks (Li et al., 9 Oct 2025).
1. Composition and Task Taxonomy
SpatialLadder-26k is organized into four modules, reflecting an ascending ladder of spatial reasoning complexity:
- Object Localization (5,929 samples): Single-image modality. The task consists of predicting object labels and 2D bounding boxes in response to a spatial query, providing perceptual grounding.
- Single-Image Spatial Reasoning (5,929 samples): Single-image tasks include absolute distance, object size, relative distance, and relative direction. These are posed as either numerical or multiple-choice questions, targeting foundational static spatial understanding.
- Multi-View Spatial Reasoning (5,752 samples): Eight-view panoramas from the same scene enable object counting, absolute/relative distance, object size, and relative direction tasks, demanding cross-view integration and implicit 3D reasoning.
- Video Spatial Reasoning (9,000 samples): Video clips (1–4 min, 24 fps) from SR-91k support the full spectrum of spatial reasoning—object counting, absolute distance, object size, relative distance, relative direction, room size, and appearance order (numerical/multiple-choice)—integrating temporal dynamics.
Together, these modules provide a comprehensive curriculum from raw perception to advanced spatiotemporal analytics.
Task Category Overview
| Category | Modality | Example Task Types |
|---|---|---|
| Localization | Single-image | 2D bounding-box grounding |
| Single-Image SR | Single-image | Distances, relative positions, sizes |
| Multi-View SR | Image panoramas | Counting, distances, directions |
| Video SR | Video | Motion, appearance order, room scale |
2. Dataset Construction Pipeline
SpatialLadder-26k was assembled via a standardized, three-stage methodology to guarantee coverage and annotation quality:
Stage A: Raw Data Collection
- Utilizes ScanNet 3D reconstructions for image and multi-view sets, with associated object metadata (3D bounding boxes, semantic labels, and positions).
- Incorporates 9,000 video samples from SR-91k, providing temporally coherent indoor scenes.
Stage B: 3D-to-2D Unification and Filtering
- Projects 3D bounding boxes to acquire 2D spatial representations.
- Extracts key object metadata: absolute 3D location, image-plane position, visibility ratio, and physical size.
- Implements filtering heuristics:
- Scene-diversity cap, object-type cap to prevent overfitting.
- Minimum visibility threshold (≥ 40%).
- Exclusion of wall, floor, ceiling categories to focus on human-scale objects.
- Uniqueness constraint ensures unambiguous object identity.
- These heuristics remove noisy or ambiguous candidates, retaining ≈10% of the initial samples (the projection and filtering steps are sketched below).
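Concretely, Stage B amounts to projecting 3D boxes into each view and applying the listed predicates. The following is a minimal sketch under stated assumptions: the metadata field names, cap values, and projection convention are illustrative; only the 40% visibility threshold and the excluded wall/floor/ceiling categories come from the pipeline description.

```python
import numpy as np
from dataclasses import dataclass

def project_to_image(point_3d: np.ndarray, K: np.ndarray, world_to_cam: np.ndarray) -> np.ndarray:
    """Project a 3D world point into pixel coordinates given 3x3 intrinsics K
    and a 4x4 world-to-camera extrinsic matrix."""
    p_cam = (world_to_cam @ np.append(point_3d, 1.0))[:3]
    uv = K @ p_cam
    return uv[:2] / uv[2]

@dataclass
class Candidate:
    """Unified per-object metadata (field names are illustrative)."""
    scene_id: str
    label: str          # semantic category from ScanNet
    visibility: float   # visible fraction of the projected 2D box

EXCLUDED_LABELS = {"wall", "floor", "ceiling"}  # keep human-scale objects only
MIN_VISIBILITY = 0.40                           # visibility threshold from the pipeline

def keep(c: Candidate, scene_counts: dict, label_counts: dict, labels_in_view: set,
         scene_cap: int = 100, label_cap: int = 500) -> bool:
    """Filtering predicate; the cap values are illustrative placeholders."""
    if c.label in EXCLUDED_LABELS:
        return False                                   # exclude structural categories
    if c.visibility < MIN_VISIBILITY:
        return False                                   # minimum visibility threshold
    if scene_counts.get(c.scene_id, 0) >= scene_cap:
        return False                                   # scene-diversity cap
    if label_counts.get(c.label, 0) >= label_cap:
        return False                                   # object-type cap
    if c.label in labels_in_view:
        return False                                   # uniqueness constraint: unambiguous identity
    return True
```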
Stage C: QA Generation via Templates and Computed Labels
- Adapts VSI-Bench templates to all seven spatial dimensions.
- Answers are generated from the unified metadata (these computations are sketched after this list), including:
- Absolute distance: Euclidean distance between 3D object centers, $d = \lVert \mathbf{p}_A - \mathbf{p}_B \rVert_2$.
- Object size: maximal side length of the 3D bounding box.
- Relative distance: candidate objects must differ in distance by at least a minimum ratio to ensure discriminability.
- Relative direction: computes the angle of the reference-to-target displacement, $\theta = \operatorname{atan2}(\Delta y, \Delta x)$, which is discretized into spatial quadrants.
- Counting: enumerates unique instances within views or time frames.
- Room size: derived from 3D mesh extents.
- Appearance order: determined by the temporal index of frame entry.
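These answer computations reduce to a few geometric operations on the unified metadata. The sketch below illustrates them under stated assumptions: object centers and box extents are NumPy arrays, and the quadrant labels, the minimum distance ratio, and the exact discretization are illustrative rather than taken from the paper.

```python
import numpy as np

def absolute_distance(center_a: np.ndarray, center_b: np.ndarray) -> float:
    """Euclidean distance between two 3D object centers."""
    return float(np.linalg.norm(center_a - center_b))

def object_size(box_extent: np.ndarray) -> float:
    """Maximal side length of an axis-aligned 3D bounding box extent (dx, dy, dz)."""
    return float(box_extent.max())

def discriminable(d1: float, d2: float, min_ratio: float = 1.2) -> bool:
    """Relative-distance questions keep only candidate pairs whose distances
    differ by at least a minimum ratio (the 1.2 value is an assumption)."""
    return max(d1, d2) / max(min(d1, d2), 1e-6) >= min_ratio

def relative_direction(reference: np.ndarray, target: np.ndarray) -> str:
    """Planar angle of the reference-to-target displacement, discretized into
    four quadrants (quadrant labels are illustrative)."""
    dx, dy = target[0] - reference[0], target[1] - reference[1]
    theta = np.degrees(np.arctan2(dy, dx)) % 360.0
    return ["front-right", "front-left", "back-left", "back-right"][int(theta // 90) % 4]

def appearance_order(first_frame: dict) -> list:
    """Order object labels by the temporal index of the frame where each first appears."""
    return sorted(first_frame, key=first_frame.get)
```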
3. Annotation Protocols and Quality Control
Each entry in SpatialLadder-26k includes:
- The corresponding visual input (image, multi-view, video).
- An instantiated question template.
- Automatically computed ground-truth answer (numerical or categorical).
- For localization, precise 2D bounding boxes in JSON format.
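For concreteness, a single localization entry might look like the following. The field names, path, and coordinate convention are assumptions for illustration, not the dataset's published schema.

```python
# Hypothetical structure of one localization sample (all fields are illustrative).
sample = {
    "modality": "single-image",
    "image": "scene0000_00/frame_001200.jpg",   # placeholder path
    "question": "Locate the armchair closest to the window.",
    "answer": {
        "label": "armchair",
        "bbox_2d": [412, 188, 655, 430],        # [x_min, y_min, x_max, y_max] in pixels
    },
}
```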
Spatial relationships are annotated rigorously with 3D geometry and projected into the target modality. Automated consistency scripts ensure metadata-answer fidelity:
```python
# Automated consistency check (pseudocode): recompute each answer from the
# scene metadata and discard samples whose stored answer disagrees.
for scene, question, answer in dataset:
    if compute_answer_from_metadata(scene, question) != answer:
        discard_sample(scene, question)
```
4. Dataset Statistics and Benchmarking
SpatialLadder-26k is primarily intended as a training corpus without explicit public train/val/test splits; it is subdivided into four modules by task type. Sample sizes are:
| Module | Sample Count |
|---|---|
| Object Localization | 5,929 |
| Single-Image SR | 5,929 |
| Multi-View SR | 5,752 |
| Video SR | 9,000 |
| Total | 26,610 |
Videos are uniformly distributed in length (1–4 min, avg. ≈2.5 min); images and multi-view sets cover the seven spatial dimensions. Downstream spatial reasoning benchmarks (SPBench-SI, SPBench-MV) use held-out ScanNet scenes with zero overlap with the training scenes, supporting a clean evaluation protocol.
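A zero-overlap split of this kind can be verified mechanically; the following is a minimal sketch (the function and variable names are assumptions for illustration).

```python
def assert_zero_scene_overlap(train_scene_ids: set, benchmark_scene_ids: set) -> None:
    """Held-out evaluation requires that no ScanNet scene appears in both splits."""
    overlap = train_scene_ids & benchmark_scene_ids
    if overlap:
        raise ValueError(f"Scene overlap between training and benchmark: {sorted(overlap)[:5]}")
```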
5. Integration with Progressive Training and Model Performance
SpatialLadder-26k is integral to the three-stage progressive training methodology:
- Stage 1: Perceptual Grounding – Supervised fine-tuning with 5,929 localization samples using cross-entropy loss over JSON token sequences.
- Stage 2: Spatial Understanding – Supervised fine-tuning with all spatial reasoning samples (single-image, multi-view, video); cross-entropy loss for both multiple-choice and numerical QA.
- Stage 3: Complex Reasoning – Reinforcement learning via GRPO and chain-of-thought prompting, using all SpatialLadder-26k samples and 1,255 cold-start examples. Rewards comprise formatting and accuracy components (binary for MCQ, graduated for numerical).
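The Stage 3 reward combines a formatting term with an accuracy term. The sketch below is a minimal, assumed implementation: the response tags, the linear tolerance for the graduated numerical reward, and the reward weights are illustrative, not the paper's exact values.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows an assumed chain-of-thought/answer template, else 0.0."""
    return 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S) else 0.0

def accuracy_reward(pred, target, question_type: str) -> float:
    """Binary reward for multiple-choice answers; graduated reward for numerical
    answers based on relative error (the linear decay is an assumption)."""
    if question_type == "multiple_choice":
        return 1.0 if str(pred).strip().upper() == str(target).strip().upper() else 0.0
    rel_err = abs(float(pred) - float(target)) / max(abs(float(target)), 1e-6)
    return max(0.0, 1.0 - rel_err)

def total_reward(response: str, pred, target, question_type: str,
                 w_format: float = 0.1, w_acc: float = 0.9) -> float:
    """Weighted sum of formatting and accuracy components (weights are placeholders)."""
    return w_format * format_reward(response) + w_acc * accuracy_reward(pred, target, question_type)
```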
Model attention is analyzed using:
- Visual Attention IoU: $\mathrm{IoU}_{\text{attn}} = \dfrac{\sum_{i \in \mathcal{B}} a_i}{\sum_{i} a_i}$, where $a_i$ is the attention weight on visual token $i$ and $\mathcal{B}$ indexes tokens inside ground-truth bounding boxes.
- Attention Entropy: $H = -\sum_{i} a_i \log a_i$, computed over the normalized visual attention distribution; lower values indicate more concentrated attention.
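Both diagnostics can be computed directly from a model's visual-attention weights. The following is a minimal sketch assuming a vector of attention weights over visual tokens and a boolean mask marking tokens inside the ground-truth boxes; the attention-mass form of the IoU metric mirrors the description above and is an interpretation, not necessarily the paper's exact formula.

```python
import numpy as np

def visual_attention_iou(attn: np.ndarray, in_box: np.ndarray) -> float:
    """Fraction of visual attention mass on tokens inside ground-truth boxes.
    `attn`: attention weights over visual tokens; `in_box`: boolean mask of box tokens."""
    attn = attn / max(attn.sum(), 1e-12)          # normalize defensively
    return float(attn[in_box].sum())

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy of the normalized visual attention distribution."""
    p = attn / max(attn.sum(), 1e-12)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```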
SpatialLadder-3B, trained with this curriculum, achieves 62.3% overall in-domain accuracy, improving on the base model by 23.4 pp and exceeding GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Out-of-domain performance rises by 7.2 pp, demonstrating generalization (Li et al., 9 Oct 2025). Attention metrics confirm task-oriented model focus: Visual Attention IoU increases by 12% and attention entropy decreases by 9%.
6. Applications, Limitations, and Prospective Extensions
SpatialLadder-26k’s multimodal curriculum supports varied downstream tasks including:
- Robotic navigation and manipulation (indoor environments)
- Autonomous driving (object detection, spatial estimation)
- Augmented/virtual reality (room-scale spatial anchoring)
- 3D scene understanding and reconstruction
Current limitations include:
- Scope: The 26,610 samples, although highly curated, are modest compared to larger web datasets, potentially limiting coverage and scalability.
- Domain and Modality: Focus is on indoor ScanNet scenes, introducing bias, and omitting modalities such as LiDAR, point clouds, or outdoor imagery.
Planned extensions involve scaling to outdoor 3D scenes, integrating new modalities (e.g., LiDAR, ultrasound), and broadening the scenario diversity. Adaptive curricula that reweight task sampling relative to proficiency are identified as promising advancements.
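One way such an adaptive curriculum could be realized is to reweight task sampling inversely to current per-task proficiency. The sketch below is purely illustrative: the weighting rule, the temperature, and the proficiency estimates are assumptions, not a method described in the paper.

```python
import random

def curriculum_weights(accuracy_by_task: dict, temperature: float = 1.0) -> dict:
    """Assign higher sampling weight to tasks the model currently does worse on."""
    raw = {task: max(1.0 - acc, 1e-3) ** temperature for task, acc in accuracy_by_task.items()}
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}

# Hypothetical usage with illustrative proficiency estimates per module.
proficiency = {"object_localization": 0.85, "single_image_sr": 0.70,
               "multi_view_sr": 0.55, "video_sr": 0.45}
weights = curriculum_weights(proficiency)
next_task = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
```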
The SpatialLadder-26k dataset constitutes a structured, high-quality multimodal resource, enabling robust spatial reasoning and offering a foundation for future research and development in spatially aware VLMs (Li et al., 9 Oct 2025).