SpaceVista-1M: All-Scale Spatial Reasoning
- SpaceVista-1M is a large-scale, video-based spatial reasoning dataset that covers diverse scales from millimeters to kilometers.
- It uses a specialist-driven automated pipeline integrating depth estimation, tracking, and detection to generate 1M QA pairs from 38K scenes.
- The dataset underpins training for models like SpaceVista-7B, improving spatial understanding through scale-aware experts and reinforcement learning.
SpaceVista-1M is a large-scale, video-based spatial reasoning dataset introduced in the SpaceVista framework to endow multimodal LLMs with “all-scale” visual spatial reasoning, from millimeter-level tabletop manipulation to drone-view and city-scale reasoning (Sun et al., 10 Oct 2025). It comprises roughly 1 million QA pairs over about 38,000 video scenes, spans 19 task types, and was created to address two limitations of earlier spatial reasoning resources: heavy reliance on indoor 3D scans and labor-intensive manual annotations, and the absence of effective all-scale scene modeling, which tends to produce overfitting to individual scenes. The dataset is designed primarily for training and reinforcement learning of spatial MLLMs rather than for evaluation; evaluation is delegated to a separate manually grounded benchmark, SpaceVista-Bench (Sun et al., 10 Oct 2025).
1. Scope, motivation, and design objective
SpaceVista-1M was created in response to persistent scale limitation, indoor bias, annotation cost, and narrow task coverage in prior spatial reasoning datasets. Earlier resources were mostly built from indoor 3D scan data such as ScanNet or ScanNet++ and concentrated on room-scale scenes, roughly 1–30 m, with weak coverage of tiny / manipulation-scale scenes, large outdoor / street / city / drone scales, and drone or aerial viewpoints. They also relied on labor-intensive 3D annotations and were often restricted to a small number of task families such as counting, simple distance estimation, and basic relative-position judgments (Sun et al., 10 Oct 2025).
The stated goal of SpaceVista-1M is therefore not merely dataset scaling in the conventional sense, but the construction of a holistic, all-scale, spatial reasoning resource. In the reported formulation, the dataset covers real-world videos from tiny object scale (mm) up to drone-view and urban scale (≈km), provides structured 2D and 3D spatial supervision such as depth, camera poses, and object masks or boxes, includes 19 spatial tasks, and is built through a task-specific, specialist-driven automated pipeline. A plausible implication is that the dataset is intended as a substrate for learning spatial abstractions that are stable across radically different metric regimes rather than only across semantic categories.
2. Scale composition and scene domains
The dataset contains about 38K scenes from diverse sources and a reported total of approximately 1,014K QA pairs. The paper frames the corpus as spanning 5 spatial scales, while the detailed scale statistics report QA counts for six categories: Tiny Tabletop, Tabletop, Indoor, Wild Indoor, Outdoor, and Drone-View. The corresponding physical ranges extend from the millimeter level to roughly the kilometer level, effectively covering six orders of magnitude (Sun et al., 10 Oct 2025).
| Scale | QA pairs | Physical range |
|---|---|---|
| Tiny Tabletop | 79K | |
| Tabletop | 242K | |
| Indoor | 162.5K | |
| Wild Indoor | 213.3K | |
| Outdoor | 284.3K | |
| Drone-View | 33.1K | |
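As a quick consistency check, the per-scale counts in the table can be summed and converted to shares of the corpus. The short Python sketch below uses only the figures listed above; the variable names are illustrative. The counts add up to about 1,014.2K, which matches the reported total, and the millimeter-to-kilometer span corresponds to six orders of magnitude.

```python
import math

# QA counts per scale, in thousands, copied from the table above.
qa_counts_k = {
    "Tiny Tabletop": 79.0,
    "Tabletop": 242.0,
    "Indoor": 162.5,
    "Wild Indoor": 213.3,
    "Outdoor": 284.3,
    "Drone-View": 33.1,
}

total_k = sum(qa_counts_k.values())
shares = {scale: count / total_k for scale, count in qa_counts_k.items()}

# Millimeters (1e-3 m) to kilometers (1e3 m), expressed in orders of magnitude.
orders_of_magnitude = math.log10(1_000.0 / 0.001)

for scale, count in qa_counts_k.items():
    print(f"{scale:<14} {count:>6.1f}K  {shares[scale]:6.1%}")
print(f"Total: {total_k:.1f}K QA pairs")
print(f"mm to km spans about {orders_of_magnitude:.0f} orders of magnitude")
```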
These scales are associated with distinct scene types and application domains. Tiny tabletop scenes, sourced from uCO3D and self-collected videos, include screws, tiny mechanical parts, and other very small objects, often in 360° object-centric videos, with applications in precision manufacturing, fine manipulation, and potentially surgery or medical scenarios in the future. Tabletop scenes, from WildRGB-D, SMOT, and self-collected videos, depict kitchens, desks, beds, or floors with cluttered everyday objects and are directly linked to robotic manipulation, grasp planning, and rearrangement. Indoor scenes are structured and scan-based, using sources such as ScanNet, ScanNet++, SpaceR, and SPAR, and are tied to indoor navigation and embodied agents. Wild Indoor scenes derive from the indoor subset of DL3DV and emphasize more complex geometry, reflective or translucent surfaces, and public indoor environments. Outdoor scenes cover streets, parks, landmarks, plazas, and courtyards and are motivated by autonomous driving, pedestrian navigation, and urban understanding. Drone-View scenes, drawn from the drone subset of DL3DV, provide aerial perspectives for drone navigation, aerial mapping, and remote-sensing-style tasks (Sun et al., 10 Oct 2025).
On the video side, raw source data covers over 100 million frames, while rich annotations are computed on about 10M selected frames. The reported video resolution ranges from 480p to 2.7K, usually at 24–30 fps. Training uses up to 32 frames per video at 128×28×28 tokenized resolution, while evaluation often uses higher resolution. These details matter because the dataset is explicitly video-centric: temporal continuity is part of the supervision regime rather than an auxiliary convenience.
3. Task taxonomy and supervision schema
SpaceVista-1M is organized around 19 spatial reasoning tasks, grouped into cross-scale and scale-specific categories (Sun et al., 10 Oct 2025).
General cross-scale tasks include Position Comparison, Size Comparison, Existence Estimation, Object Counting, Rotation Estimation, Relative Distance, Absolute Distance, Object Size, Route Plan, Appearance Order, Depth Estimation, View Change Inference (Camera Moving), Object Matching, and Spatial Relation. Collectively, these tasks probe symbolic spatial relations, metric estimation, temporal geometry, identity persistence, ego-motion reasoning, and 3D structural relations such as support, stacking, hanging, adhesion, encircling, and plug-in.
Scale-specific tasks add global or application-specific reasoning. Room Size is specific to indoor scenes and requires estimation of room area or volume. Navigation is specific to outdoor scenes and extends route planning across streets and parks under multiple viewpoints. Area Estimation is specific to drone-view scenes and targets world-area estimation of roofs or fields from aerial perspective. Obstacles Location and Manipulation Planning are tabletop-specific and are formulated around collision-aware motion planning; for the latter, planning is computed via RRT (Rapidly-exploring Random Tree) in 3D and then discretized into linguistic actions (Sun et al., 10 Oct 2025).
The annotation schema is correspondingly heterogeneous. Each QA entry includes a video clip, optional visual grounding in the form of keyframe masks, bounding boxes, or points, and a question that may be template-generated or GPT-generated. The paper reports about 3,000 templates to diversify structured prompts and reduce language shortcuts. Answers are provided in three formats: free-form natural language, multiple choice, and regression. Free-form answers are used for SFT with chain-of-thought rationales; multiple-choice and regression targets are used for RL and accuracy evaluation. CoT rationales are generated with Qwen2.5-VL-72B and related LLMs and then filtered for consistency. For RL, the rationales are removed and replaced at training time by structured anchors such as <semantic>, <scale>, and <answer>. A notable property of the corpus is that each “semantic unit” is counted only once in the 1M QA total, producing a QA/Scene ratio ~25; this suggests a design that emphasizes scene diversity rather than exhaustive question proliferation on a small set of reconstructed environments.
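The schema described above can be pictured as a small record type. The sketch below is illustrative only: the class name, field names, and format labels are assumptions rather than the released file format. It shows how a record might be routed to RL training by keeping only multiple-choice and regression targets and dropping the chain-of-thought rationale, as the text describes.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Anchor tags as named in the text above; how they are inserted into prompts is not specified here.
RL_ANCHORS = ("<semantic>", "<scale>", "<answer>")
# Answer formats the text says are used for RL (label strings are hypothetical).
RL_FORMATS = {"multiple_choice", "regression"}


@dataclass(frozen=True)
class QARecord:
    """Hypothetical in-memory view of one QA entry."""
    video_path: str
    question: str
    answer: str
    answer_format: str  # "free_form", "multiple_choice", or "regression"
    task_type: str
    scale_type: str
    rationale: Optional[str] = None  # chain-of-thought text, used for SFT only


def to_rl_record(record: QARecord) -> Optional[QARecord]:
    """Return a rationale-free copy for RL, or None if the format is not an RL target."""
    if record.answer_format not in RL_FORMATS:
        return None
    return replace(record, rationale=None)


example = QARecord(
    video_path="videos/example.mp4",  # placeholder path
    question="How far apart are the mug and the keyboard, in meters?",
    answer="0.35",
    answer_format="regression",
    task_type="Absolute Distance",
    scale_type="Tabletop",
    rationale="The mug sits near the left edge of the desk ...",
)
rl_example = to_rl_record(example)
```

Keeping the rationale as an optional field lets one record type serve both the SFT and RL stages.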
4. Specialist-driven automated construction pipeline
SpaceVista-1M is built with an automated, task-specific pipeline that integrates specialist models for depth estimation, geometry, detection, segmentation, tracking, and language generation (Sun et al., 10 Oct 2025). Source selection prioritizes video datasets with known camera intrinsics/extrinsics and/or precomputed 3D models, including uCO3D, WildRGB-D, SMOT, ScanNet, ScanNet++, DL3DV, and self-collected videos. A key design choice is to avoid estimating camera poses from scratch where possible, thereby reducing geometric drift.
Metric and temporally consistent depth are central. The reported depth stack uses Metric3Dv2, UniDepthV2, and Video-Depth-Anything. Video-consistent depth is obtained by minimizing an objective defined over the per-frame metric depth, the Video-Depth-Anything depth, and the temporal gradient. This objective enforces temporal smoothness while retaining metric grounding.
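The exact objective is not reproduced in this text, so the following NumPy sketch shows only one plausible form and should not be read as the paper's equation. It assumes a squared-error data term that keeps the refined depth close to the per-frame metric depth, plus a term that matches frame-to-frame depth changes to those of the video depth model. The function name, the weighting parameter `smooth_weight`, and the finite-difference temporal gradient are all assumptions.

```python
import numpy as np


def video_consistent_depth_loss(depth, metric_depth, video_depth, smooth_weight=1.0):
    """Illustrative objective over depth stacks of shape (T, H, W).

    Assumed form (not taken from the paper):
        mean((D - D_metric)^2) + smooth_weight * mean((dD/dt - dD_video/dt)^2)
    """
    # Data term: stay close to the per-frame metric depth (metric grounding).
    data_term = np.mean((depth - metric_depth) ** 2)
    # Temporal term: frame-to-frame changes should follow the video depth model.
    temporal_term = np.mean(
        (np.diff(depth, axis=0) - np.diff(video_depth, axis=0)) ** 2
    )
    return data_term + smooth_weight * temporal_term


rng = np.random.default_rng(0)
video = rng.uniform(1.0, 5.0, size=(4, 3, 3))    # temporally consistent depth
metric = video + 0.5                             # same motion, metric offset
flickering = metric + rng.normal(0.0, 0.1, size=metric.shape)

loss_consistent = video_consistent_depth_loss(metric, metric, video)
loss_flickering = video_consistent_depth_loss(flickering, metric, video)
```

Under these assumptions, a depth stack that agrees with the metric depth and follows the video model's temporal changes has near-zero loss, while per-frame flicker is penalized by both terms.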
Semantic detection and tracking are handled by DINO-X and Grounding DINO for open-vocabulary detection, and SAM 2 in combination with Grounding DINO through Grounded-SAM2 for instance masks and tracking. The resulting per-frame annotations include object categories, bounding boxes, masks, temporal IDs, depth, and camera parameters. With depth maps and camera calibration, pixels are projected into a canonical camera space, defined as a 3D Cartesian system centered at the camera optical center. For each segmented instance, the pipeline computes a 3D point cloud, applies PCA to estimate principal axes, size, and centroid, and derives distances between centroids; room and area estimates are obtained by analyzing planar structures. For distance and size tasks, only objects with masks of at least 20×20 pixels are retained.
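The geometric steps above (back-projection into camera space, PCA for axes and extent, centroid distances, and the 20×20 pixel mask filter) can be sketched with NumPy. This is a minimal illustration under a pinhole camera model. The function names are hypothetical, and reading the size filter as a check on the mask's bounding extent is an assumption.

```python
import numpy as np


def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels into camera space (origin at the optical center)."""
    v, u = np.nonzero(mask)              # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)   # shape (N, 3)


def object_geometry(points):
    """Centroid, principal axes (as columns), and extent along each axis via PCA."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]    # largest variance first
    axes = eigvecs[:, order]
    projected = centered @ axes
    extent = projected.max(axis=0) - projected.min(axis=0)
    return centroid, axes, extent


def mask_large_enough(mask, min_side=20):
    """Assumed reading of the 20x20 px rule: bounding extent of the mask."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return False
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return bool(height >= min_side and width >= min_side)


def centroid_distance(points_a, points_b):
    """Euclidean distance between two instance centroids."""
    return float(np.linalg.norm(points_a.mean(axis=0) - points_b.mean(axis=0)))


# Synthetic example: a flat 40x20 px patch on a plane 2 m from the camera.
depth = np.full((60, 80), 2.0)
mask = np.zeros((60, 80), dtype=bool)
mask[10:30, 20:60] = True
points = backproject(depth, mask, fx=100.0, fy=100.0, cx=40.0, cy=30.0)
centroid, axes, extent = object_geometry(points)
```

Noisy depth inside a mask inflates the PCA extent, which is one reason the pipeline filters out small masks before deriving sizes and distances.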
Representative task workflows are explicitly specified. In outdoor Counting, open-vocabulary detection is run with a confidence threshold (for example, conf ≥ 0.3), detections are projected into 3D and tracked across frames, and instances are filtered by box size (≥ 32 px), object count (for example, 2–10), and track consistency (≥ 10 frames). Tabletop counting uses Grounding DINO boxes scoring above 0.4 and SAM2 propagation with IoU and center-distance thresholds of 0.4 and 32 px. Distance and size tasks use back-projected depth inside object masks and PCA-derived object extent. Planning, Navigation, and Manipulation use 3D representations and bounding boxes as obstacles and compute collision-free paths through RRT in world coordinates before converting them to discrete language actions through an LLM. Spatial Relation is derived from candidate object pairs selected by proximity and vertical arrangement, followed by an LLM decision over predefined relation types using 3D features and semantics under few-shot CoT prompting. Depth targets are taken directly from metric depth in canonical space, and camera motion is computed from extrinsics as the translation direction between frames.
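The outdoor counting filters can be illustrated with a small pure-Python sketch. The thresholds come from the text, but the data structure and the way they combine are assumptions. In particular, the sketch treats "box size" as the shorter box side and "track consistency" as the number of frames passing both per-frame checks.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Track:
    """Hypothetical per-object track: one entry per frame where it was detected."""
    category: str
    confidences: List[float]  # detection confidence per frame
    box_sizes: List[float]    # shorter box side in pixels per frame


def keep_track(track, min_conf=0.3, min_box_px=32, min_frames=10):
    """Keep a track if enough frames pass the confidence and box-size checks."""
    valid_frames = sum(
        1
        for conf, size in zip(track.confidences, track.box_sizes)
        if conf >= min_conf and size >= min_box_px
    )
    return valid_frames >= min_frames


def countable_categories(tracks, min_count=2, max_count=10) -> Dict[str, int]:
    """Count surviving instances per category and keep counts within the target range."""
    counts: Dict[str, int] = {}
    for track in tracks:
        if keep_track(track):
            counts[track.category] = counts.get(track.category, 0) + 1
    return {cat: n for cat, n in counts.items() if min_count <= n <= max_count}


tracks = [
    Track("car", [0.9] * 12, [40.0] * 12),
    Track("car", [0.8] * 12, [48.0] * 12),
    Track("car", [0.2] * 12, [40.0] * 12),   # rejected: low confidence
    Track("bench", [0.9] * 12, [40.0] * 12), # only one instance: below min_count
    Track("tree", [0.9] * 5, [40.0] * 5),    # rejected: too few frames
]
counting_targets = countable_categories(tracks)
```

Restricting counts to a small range keeps counting questions answerable from a short clip. Both range bounds are left as parameters because the text gives 2–10 only as an example.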
5. Reliability, noise, and benchmark separation
The dataset is explicitly not treated as a reliable evaluation source because its annotations are largely generated by specialist models and automated procedures (Sun et al., 10 Oct 2025). Two issues are emphasized. The first is specialist model noise: metric depth, 3D reconstruction, detection, and tracking remain imperfect, especially at extreme scales such as tiny objects and drone view, and in scenes with reflective or transparent surfaces. The second is knowledge conflict in all-scale training: similar visual patterns can correspond to very different physical scales, such as a toy car and a real car, or a toy room and a full-sized room. In the paper’s description, naïvely mixing all scales leads to scale-confused representations, visible as a wider spread of the ratio between predicted answer and ground truth away from the ideal value of 1.
Several mitigation strategies are reported. The pipeline uses cross-checking with multiple depth models and the video-consistent depth objective, filtering by bounding-box size, track length, and CoT consistency, and human verification for a subset of the training QA through MTurk workers. Per-task human validation accuracy against pipeline answers is reported as mostly high, around 80–95%, while tasks such as route plan and navigation are harder, around 50–60%, indicating ambiguity or disagreement even for human annotators. The intended use is therefore learning human-like perception, not high-precision metrology.
Evaluation is separated into SpaceVista-Bench, a distinct benchmark of about 3,000 QA pairs across about 500 video scenes, with benchmark labels reported to be approximately 99% accurate. It covers tiny tabletop, tabletop, indoor, and outdoor settings and is built through self-recorded videos of around 50 objects of precisely measured sizes, retrieval of authoritative measurements such as Wikipedia entries and architectural plans, and manual annotation by human experts for non-metric tasks. The authors state that there is no scene overlap between training and benchmark data. SpaceVista-1M should therefore not be mistaken for the definitive test set for all-scale spatial reasoning; the paper explicitly rejects that usage.
6. Use in training SpaceVista-7B and empirical outcomes
SpaceVista-1M is the core training resource for SpaceVista-7B, a specialized spatial reasoning MLLM built on Qwen2.5-VL-7B (Sun et al., 10 Oct 2025). The model combines the Qwen2.5-VL video encoder, which produces semantic tokens, with a DINOv3 auxiliary encoder that provides dense, semantics-agnostic features intended to capture depth, normals, and other geometric patterns. After alignment to the same spatial resolution and feature dimension, the two feature streams are fused by cross-attention. This setup is explicitly tied to SpaceVista-1M’s rich spatial supervision and is meant to enhance geometry-aware perception beyond semantics.
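The cross-attention fusion can be sketched generically in NumPy. The text does not state which stream supplies the queries, so this sketch assumes the semantic tokens act as queries and the geometric features as keys and values, with a residual connection. It uses a single head and random projection matrices purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention_fuse(semantic, geometric, w_q, w_k, w_v):
    """Single-head cross-attention: semantic tokens attend to geometric features.

    semantic:  (N, d) tokens from the video encoder (assumed query side)
    geometric: (M, d) features from the auxiliary encoder (assumed key/value side)
    """
    q = semantic @ w_q
    k = geometric @ w_k
    v = geometric @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    fused = semantic + attn @ v                              # residual fusion
    return fused, attn


rng = np.random.default_rng(0)
dim = 16
semantic_tokens = rng.normal(size=(5, dim))
geometric_feats = rng.normal(size=(7, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

fused_tokens, attention = cross_attention_fuse(
    semantic_tokens, geometric_feats, w_q, w_k, w_v
)
```

With semantic tokens as queries, the fused sequence keeps the length the language model already expects, which is one reason this direction is a natural assumption.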
To mitigate cross-scale conflict, the model introduces LoRA-like scale experts on top of the LLM. Each projection layer contains a base weight $W_0$ and a weighted sum of rank-$r$ LoRA modules:

$$W = W_0 + \sum_{i} g_i \, B_i A_i,$$

where $B_i A_i$ is the low-rank update of the $i$-th scale expert and the routing weights $g_i$ are produced by a scale router, described as an MLP + softmax conditioned on the input and potentially on scale metadata. Experts are trained per scale, and only about 0.5% of total parameters per expert are trainable. The dataset’s explicit scale labels are therefore not incidental metadata; they are used directly to maintain independence of scale-specific knowledge.
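A minimal NumPy sketch of such a layer follows: a frozen base weight plus a router-weighted sum of low-rank expert updates. The class name, initialization scales, and mean-pooled single linear router are simplifying assumptions. The text describes the router as an MLP with softmax, and the zero initialization of the up-projection follows common LoRA practice rather than anything stated here.

```python
import numpy as np


class ScaleExpertLinear:
    """Base projection plus a routed mixture of low-rank (LoRA-style) experts."""

    def __init__(self, d_in, d_out, num_experts, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.w0 = rng.normal(0.0, 0.02, (d_out, d_in))            # frozen base weight
        self.a = rng.normal(0.0, 0.02, (num_experts, rank, d_in)) # down-projections
        self.b = np.zeros((num_experts, d_out, rank))             # up-projections
        self.router = rng.normal(0.0, 0.02, (num_experts, d_in))  # linear router (assumed)

    def routing_weights(self, x):
        """Softmax over experts, conditioned on the mean-pooled input tokens."""
        logits = self.router @ x.mean(axis=0)
        logits = logits - logits.max()
        weights = np.exp(logits)
        return weights / weights.sum()

    def effective_weight(self, x):
        """Base weight plus the routed sum of expert updates B_i @ A_i."""
        g = self.routing_weights(x)
        delta = np.einsum("e,eor,eri->oi", g, self.b, self.a)
        return self.w0 + delta

    def __call__(self, x):
        return x @ self.effective_weight(x).T


layer = ScaleExpertLinear(d_in=8, d_out=4, num_experts=5, rank=2)
tokens = np.random.default_rng(1).normal(size=(3, 8))
output = layer(tokens)
```

Because the up-projections start at zero, the layer initially reproduces the base model exactly, and each expert adds only `rank * (d_in + d_out)` trainable parameters.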
Training proceeds in stages: SFT with CoT on SpaceVista-1M, activation of the scale router for further expert specialization, and RL training (GRPO-style) on the multiple-choice and regression subset. Reward design uses three anchors (<semantics>, <scale>, and <answer>), with a semantic reward from cosine similarity, a scale reward from log-scale discrepancy, and a task-specific answer reward. The updated correctness reward is defined so that later anchors contribute less when earlier anchors fail. SpaceVista-1M is essential here because it supplies the per-sample scale label, semantic metadata, and a large number of MC and regression QAs.
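The exact reward formula is not reproduced in this text. The sketch below shows one way to obtain the stated behavior, where later anchors count for less when earlier anchors fail: each anchor's reward is gated multiplicatively by the rewards before it. The gating form, the linear decay of the scale reward over one order of magnitude, and all function names are assumptions rather than the paper's definition.

```python
import math


def semantic_reward(pred_vec, ref_vec):
    """Cosine similarity between embedding vectors, clipped to [0, 1]."""
    dot = sum(p * r for p, r in zip(pred_vec, ref_vec))
    norm = math.sqrt(sum(p * p for p in pred_vec)) * math.sqrt(sum(r * r for r in ref_vec))
    return max(0.0, dot / norm) if norm > 0 else 0.0


def scale_reward(pred_m, true_m):
    """1 at an exact match, decaying linearly to 0 at one order of magnitude off."""
    if pred_m <= 0 or true_m <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(math.log10(pred_m / true_m)))


def cascaded_reward(r_semantic, r_scale, r_answer):
    """Assumed gating: each anchor is multiplied by the rewards of earlier anchors."""
    gated_scale = r_semantic * r_scale
    gated_answer = gated_scale * r_answer
    return (r_semantic + gated_scale + gated_answer) / 3.0


all_correct = cascaded_reward(1.0, 1.0, 1.0)
wrong_scale = cascaded_reward(1.0, 0.0, 1.0)     # answer credit is lost with the scale
wrong_semantics = cascaded_reward(0.0, 1.0, 1.0)
```

Under this gating, an answer that happens to be numerically right earns no credit if the predicted scale is wrong, which discourages lucky guesses.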
Reported evaluation covers MMSI-Bench, SPAR-Bench, VSI-Bench, STI-Bench, and SpaceVista-Bench. The paper gives the following scores:

| Model | MMSI-Bench | SPAR-Bench | VSI-Bench | STI-Bench | SpaceVista-Bench |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 31.7 | 33.1 | 32.7 | 32.1 | 28.9 |
| Qwen2.5-VL-7B w/ SpaceVista-1M | 27.3 | 36.9 | 42.0 | 35.0 | 29.5 |
| SpaceVista-7B (SFT) | 29.1 | 38.1 | 46.3 | 35.9 | 34.5 |
| SpaceVista-7B w/ RL | 32.3 | 41.6 | 48.6 | 38.2 | 36.7 |

The degradation on MMSI-Bench after naïve finetuning and the subsequent gains from scale-aware experts and RL are presented as evidence that raw all-scale data mixing is suboptimal, whereas scale-aware use of SpaceVista-1M improves generalization.
7. Position within the dataset landscape, release, and limitations
Relative to prior spatial QA datasets, SpaceVista-1M is distinguished by its combination of scale breadth, scene diversity, task diversity, and video-based grounding (Sun et al., 10 Oct 2025). The comparison table in the paper reports the following figures:

| Dataset | QA pairs | Video scenes | QA per scene |
|---|---|---|---|
| SpaceR | 191K | 1.2K | 159 |
| SPAR-7M | 7M | 4.5K | 1,556 |
| Spatial-MLLM | 120K | 1.5K | 83 |
| InternSpatial | 2.5M | 5.5K | 455 |
| SpaceVista-1M | 1M | 38K | 25 |

The lower QA/Scene ratio (25) is interpreted in the source material as evidence of many distinct scenes with relatively few questions per scene, mitigating overfitting to a small number of scanned apartments or rooms.
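The QA-per-scene ratios can be recomputed from the rounded headline counts quoted above. The short Python sketch below does so. Small deviations from the reported ratios (for example, for Spatial-MLLM and SpaceVista-1M) are expected because the inputs here are rounded, and the unrounded counts are not given in this text.

```python
# (QA pairs, video scenes), both in thousands, as quoted from the comparison above.
datasets_k = {
    "SpaceR": (191.0, 1.2),
    "SPAR-7M": (7000.0, 4.5),
    "Spatial-MLLM": (120.0, 1.5),
    "InternSpatial": (2500.0, 5.5),
    "SpaceVista-1M": (1000.0, 38.0),
}

qa_per_scene = {
    name: qa_k / scenes_k for name, (qa_k, scenes_k) in datasets_k.items()
}

# Sort from fewest to most questions per scene.
for name, ratio in sorted(qa_per_scene.items(), key=lambda item: item[1]):
    print(f"{name:<14} {ratio:8.1f} QA pairs per scene")
```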
The dataset, model, and benchmark are stated to be released through the project page https://peiwensun2000.github.io/mm2km/. Licensing is reported as Apache License 2.0 and CC BY 4.0, aligned with source datasets. The data format includes videos or paths to videos, camera intrinsics and extrinsics, depth and normals, object detections with boxes and categories, instance masks and IDs, and QA JSON containing question, answer, rationale, task type, and scale type. Intended usage includes training spatial MLLMs for all-scale spatial reasoning, robotics, autonomous-driving-like perception, drone aerial understanding, and reinforcement learning with structured rewards.
The stated limitations are substantial. Specialist-generated labels are noisy and may align better with human perceptual correctness than with precise physical measurement. Certain scales remain underrepresented, including true micrometer-level scenes and extremely large, kilometer-plus satellite imagery. Some application domains, including medical, industrial inspection, and long-range remote sensing, are missing or only lightly represented. Depth and reconstruction remain imperfect under occlusion and on reflective surfaces, and some QA instances may encode these biases. The authors accordingly recommend using SpaceVista-1M primarily for training and model development, while reserving SpaceVista-Bench and other benchmarks for rigorous evaluation.