SpaceVista-Bench: All-Scale Visual Reasoning

Updated 4 July 2026

SpaceVista-Bench is a hand-crafted, video-based benchmark designed to evaluate all-scale visual spatial reasoning across tiny, tabletop, indoor, and outdoor scenarios.
It employs a hybrid methodology combining manual recording, authoritative retrieval, and human annotation to ensure physically grounded and reliable evaluation.
The benchmark highlights cross-scale generalization and exposes model limitations beyond indoor-centric priors, offering actionable insights for multimodal model improvements.

SpaceVista-Bench is a hand-crafted, video-based benchmark for all-scale visual spatial reasoning introduced alongside the SpaceVista framework in “SpaceVista: All-Scale Visual Spatial Reasoning from mm to km” (Sun et al., 10 Oct 2025). It is defined as the evaluation counterpart to SpaceVista-1M: whereas SpaceVista-1M is a large automatically curated training dataset, SpaceVista-Bench is a physically grounded benchmark built through manual recording, authoritative retrieval, and human annotation to provide more reliable evaluation. Its purpose is to measure whether multimodal models can reason across widely different physical scales and environments rather than overfit to indoor-centric spatial priors, with benchmark scenarios reported as Tiny Tabletop, Tabletop, Indoor, and Outdoor (Sun et al., 10 Oct 2025).

1. Benchmark identity and scope

SpaceVista-Bench is presented as an all-scale benchmark for visual spatial reasoning spanning scenarios that range from tiny measured objects to outdoor scenes, within the broader project framing of “mm to km” (Sun et al., 10 Oct 2025). The paper distinguishes it sharply from the associated training corpus. SpaceVista-1M is generated through a specialist-driven automated pipeline over public video datasets, while SpaceVista-Bench is reserved for evaluation because, in the authors’ formulation, specialist-generated data may be useful for supervision but is “not reliable for evaluation” (Sun et al., 10 Oct 2025).

The benchmark is video-based and uses regression and multiple-choice answer formats (Sun et al., 10 Oct 2025). It is intended to probe visual spatial reasoning under physical-world constraints, including metric and relational understanding across different spatial scales. The benchmark leaderboard is organized into four scenario buckets: Tiny Tabletop, Tabletop, Indoor, and Outdoor (Sun et al., 10 Oct 2025). This structure differentiates SpaceVista-Bench from spatial benchmarks that remain predominantly indoor, single-scale, or image-only.

A central design goal is cross-scale generalization. The paper argues that existing work has made progress on indoor scenes but still struggles with broader applications such as robotics and autonomous driving, partly because of “heavy reliance on indoor 3D scans” and the absence of effective all-scale scene modeling (Sun et al., 10 Oct 2025). SpaceVista-Bench operationalizes this critique by evaluating models on a physically diverse benchmark rather than on an indoor-only testbed.

2. Construction methodology and data sources

SpaceVista-Bench is built through manual recording, retrieval from authoritative sources, and human annotation rather than through the automated specialist pipeline used for SpaceVista-1M (Sun et al., 10 Oct 2025). The paper describes two benchmark-construction components.

The first component is the measurement-related portion. For this part, the authors collect approximately 500 videos across diverse scenes using self-recorded, measured videos; existing videos enhanced by retrieving authoritative public information; and human annotation for other spatial tasks. This portion covers tiny, tabletop, and outdoor settings. For indoor evaluation, they instead select suitable data from ScanNet-based datasets such as VSI-Bench and SPAR-Bench and construct scale-focused questions on top (Sun et al., 10 Oct 2025).

The second component is the non-measurement portion, for which the collected data are manually annotated to produce additional spatial reasoning QA pairs (Sun et al., 10 Oct 2025). In total, the benchmark contains more than 3,000 QA pairs over 500 unique video scenes, with a reported QA-per-scene ratio of 6 (Sun et al., 10 Oct 2025).

The self-collected component is described in more detail. For tiny and tabletop scenes, the authors capture and annotate videos of over 50 objects of different sizes using GoPro 11, iPhone 15, and Vivo X70. They systematically vary object arrangements, distances, lighting conditions, and backgrounds, producing over 200 videos and 1,000 QA pairs. The self-collected objects span nearly 50 categories and include sizes from 0.4 m to 3 mm, including transparent and reflective objects (Sun et al., 10 Oct 2025). This measured-data component is intended to ground benchmark answers in direct physical measurement rather than inferred estimates.

For indoor and outdoor scenes where direct measurement is less straightforward, the benchmark incorporates authoritative retrieval. The paper states that the authors identify landmarks or scene elements and retrieve statistics from sources such as Wikipedia, architectural drawings, and official design documents (Sun et al., 10 Oct 2025). This suggests a hybrid epistemic strategy: direct measurement where feasible, authoritative external reference where necessary, and human annotation for the remaining spatial tasks.

3. Annotation reliability and evaluation integrity

The paper frames SpaceVista-Bench as a reliability-oriented benchmark. It reports “99% accuracy across 500 unique video scenes” for benchmark annotations and repeatedly contrasts this with the lower trustworthiness of specialist-model-generated labels for evaluation (Sun et al., 10 Oct 2025). The benchmark is described as “fully human-annotated” and “strictly adher[ing] to physical world measurements and perceptions” (Sun et al., 10 Oct 2025).

The construction pipeline includes several integrity safeguards. Most importantly, the authors state that they remove from the training set any scene that appears in the benchmark or in the other evaluated benchmarks, to prevent leakage and to support a fair assessment of generalization (Sun et al., 10 Oct 2025). This strict scene-level separation is one of the benchmark’s key methodological claims.
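The scene-level separation described above can be illustrated with a minimal sketch. The paper does not publish its filtering code, so the `scene_id` field, the function name, and all identifiers below are hypothetical; the sketch only shows the general idea of dropping any training sample whose scene also appears in an evaluation set.

```python
def remove_held_out_scenes(train_samples, held_out_scene_ids):
    """Drop every training sample whose scene also appears in an evaluation set."""
    return [s for s in train_samples if s["scene_id"] not in held_out_scene_ids]

# Hypothetical scene identifiers, for illustration only.
benchmark_scenes = {"bench_tabletop_007", "bench_outdoor_021"}
other_eval_scenes = {"scannet_scene0011"}
held_out = benchmark_scenes | other_eval_scenes

train_samples = [
    {"scene_id": "train_indoor_001", "question": "How wide is the desk?"},
    {"scene_id": "scannet_scene0011", "question": "How far apart are the chairs?"},
    {"scene_id": "bench_outdoor_021", "question": "How tall is the tower?"},
]

clean_train = remove_held_out_scenes(train_samples, held_out)
print([s["scene_id"] for s in clean_train])
```

Filtering by scene rather than by individual question is the stricter choice: it removes every QA pair attached to a held-out video, including pairs whose wording differs from the benchmark questions.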

The paper also distinguishes benchmark evaluation from training-data quality control. It states that perceptual correctness is used in training-data filtering, whereas benchmark evaluation follows strict correctness (Sun et al., 10 Oct 2025). This distinction matters because the broader SpaceVista project uses specialist models and human perceptual validation to build a large training corpus, but the benchmark itself is meant to be materially more exacting.

What the paper does not provide is also noteworthy. It does not report inter-annotator agreement statistics, a per-task benchmark count table, or explicit split details internal to SpaceVista-Bench itself (Sun et al., 10 Oct 2025). This suggests that the benchmark’s reliability claim rests primarily on construction protocol and the stated 99% annotation accuracy, rather than on a formal agreement study.

4. Benchmark composition and task coverage

The benchmark contains more than 3,000 QA pairs over 500 unique video scenes and uses video as the primary modality (Sun et al., 10 Oct 2025). Its reported answer formats are regression and multiple-choice (Sun et al., 10 Oct 2025). The four reported scenario subsets are Tiny Tabletop, Tabletop, Indoor, and Outdoor (Sun et al., 10 Oct 2025).

The paper does not provide a benchmark-only table with exact task counts, but SpaceVista-Bench is clearly derived from the broader SpaceVista task system, which spans 19 task types in SpaceVista-1M (Sun et al., 10 Oct 2025). It therefore cannot be assumed that all 19 task types are benchmarked, or in what proportions. What can be stated is that the benchmark includes both measurement-related and non-measurement spatial QA (Sun et al., 10 Oct 2025).

The broader project’s task family includes position comparison, size comparison, existence estimation, rotation estimation, relative and absolute distance, object counting, object size, route planning, appearance order, depth estimation, view-change inference, object matching, spatial relation, room size, area estimation, and manipulation-related tabletop tasks (Sun et al., 10 Oct 2025). This suggests that SpaceVista-Bench is intended to sample a broad spectrum of spatial reasoning behaviors rather than a single subskill. A plausible implication is that the benchmark is designed less as a narrow leaderboard for one task and more as a stress test for spatial competence across scales and environments.

The following table summarizes benchmark properties explicitly stated in the paper.

Aspect	SpaceVista-Bench
Primary modality	Video
QA count	More than 3,000
Unique video scenes	500
Answer formats	Regression, multiple-choice
Scenario subsets reported	Tiny Tabletop, Tabletop, Indoor, Outdoor
Annotation basis	Manual recording, authoritative retrieval, human annotation
Reported annotation accuracy	99%

5. Evaluation protocol and reported results

SpaceVista-Bench is one of five benchmarks evaluated in the paper’s experimental section (Sun et al., 10 Oct 2025). It is scored with a percentage-style aggregate metric, but the paper does not specify the exact numeric tolerance or scoring rule for regression questions (Sun et al., 10 Oct 2025). It does state that evaluation follows the configuration used in the official Qwen2.5-VL demo, with $\text{top\_p} = 0.001$ and $\text{temperature} = 0.01$, and that baselines are evaluated at the same resolution and FPS on the benchmark leaderboard (Sun et al., 10 Oct 2025).
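To see why such a small nucleus threshold makes decoding close to deterministic, consider a textbook top-p (nucleus) filter. This is a generic sketch, not the Qwen2.5-VL implementation; the function name and the toy probability vector are assumptions made for illustration.

```python
def nucleus_keep(probs, top_p):
    """Return indices of the smallest set of most-probable tokens
    whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

# Toy next-token distribution over four tokens (hypothetical values).
probs = [0.05, 0.60, 0.25, 0.10]

print(nucleus_keep(probs, 0.001))  # threshold quoted in the text
print(nucleus_keep(probs, 0.9))    # a looser threshold, for contrast
```

With a threshold of 0.001, the single most likely token already exceeds the cumulative mass required, so only that token survives filtering. Under this reading, the reported settings behave much like greedy decoding, which reduces run-to-run variance when models are compared.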

The main benchmark result reported in the paper is that SpaceVista-7B with reinforcement learning achieves an overall SpaceVista-Bench score of 36.7, outperforming both major proprietary and open-source baselines (Sun et al., 10 Oct 2025). In the comparison across five spatial reasoning benchmarks, GPT-5 reaches 33.7 and Gemini-2.5-Pro reaches 33.8 on SpaceVista-Bench, while large open-source models such as InternVL3.5-38B and Qwen2.5-VL-72B score 30.7 and 31.1 respectively (Sun et al., 10 Oct 2025).

The dedicated SpaceVista-Bench leaderboard reports subset performance across Tiny Tabletop, Tabletop, Indoor, Outdoor, and Overall (Sun et al., 10 Oct 2025). SpaceVista-7B with RL is reported as achieving 33.4 on Tiny Tabletop, 37.1 on Tabletop, 42.2 on Indoor, 34.1 on Outdoor, and 36.7 overall (Sun et al., 10 Oct 2025). The paper emphasizes that its main advantage is not necessarily winning every subset outright, but attaining comparatively high scores across all scenarios and the highest overall score (Sun et al., 10 Oct 2025).
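A quick arithmetic check on the subset scores quoted above shows that their unweighted mean equals the reported overall score. The paper does not state its aggregation rule in the text summarized here, so this is only an observation consistent with macro-averaging across scenarios, not a confirmed description of the metric.

```python
# Subset scores as quoted in the text.
subset_scores = {
    "Tiny Tabletop": 33.4,
    "Tabletop": 37.1,
    "Indoor": 42.2,
    "Outdoor": 34.1,
}

# Unweighted (macro) mean over the four scenario subsets.
macro_mean = sum(subset_scores.values()) / len(subset_scores)
print(round(macro_mean, 1))
```

The sum of the four scores is 146.8, and dividing by four gives 36.7, which matches the overall figure.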

A concise selection of reported overall results is given below.

Model	SpaceVista-Bench
GPT-5	33.7
Gemini-2.5-Pro	33.8
InternVL3.5-38B	30.7
Qwen2.5-VL-72B	31.1
Qwen2.5-VL-7B	28.9
Qwen2.5-VL-7B w/ SpaceVista-1M	29.5
SpaceVista-7B	34.5
SpaceVista-7B w/ RL	36.7

These results are used by the authors to support the claim that their scale-aware modeling and progressive training paradigm yield superior all-scale spatial reasoning (Sun et al., 10 Oct 2025). The relatively small gain from simply fine-tuning Qwen2.5-VL-7B on SpaceVista-1M, from 28.9 to 29.5, is contrasted with the larger jump produced by the full SpaceVista-7B design, suggesting that data scale alone is not sufficient (Sun et al., 10 Oct 2025).
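The size of each step can be made explicit by computing gains relative to the Qwen2.5-VL-7B row of the table above. The scores are those reported in the text; the dictionary layout and variable names are illustrative assumptions.

```python
# Overall SpaceVista-Bench scores as listed in the table.
overall = {
    "Qwen2.5-VL-7B": 28.9,
    "Qwen2.5-VL-7B w/ SpaceVista-1M": 29.5,
    "SpaceVista-7B": 34.5,
    "SpaceVista-7B w/ RL": 36.7,
}

base = overall["Qwen2.5-VL-7B"]

# Gain in points over the base model, rounded to one decimal place.
gains = {
    name: round(score - base, 1)
    for name, score in overall.items()
    if name != "Qwen2.5-VL-7B"
}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

Fine-tuning on the data alone adds 0.6 points, the full SpaceVista-7B design adds 5.6, and reinforcement learning brings the total gain to 7.8. The gap between the first two figures is what the authors point to when arguing that data scale alone is insufficient.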

6. Ablations and what the benchmark reveals

SpaceVista-Bench is central to the paper’s ablation studies. In a module ablation using Qwen2.5-VL-3B, a vanilla baseline scores 31.0 on SpaceVista-Bench, while adding scale modeling raises it to 34.8, and adding scale plus semantic anchors raises it further to 35.4 (Sun et al., 10 Oct 2025). This is presented as evidence that scale-aware reasoning components materially help on the benchmark.

In a modality ablation, adding VGGT features raises performance only slightly from 31.0 to 31.4, whereas DINOv3 raises it to 32.1; combining VGGT and DINOv3 gives 31.7 (Sun et al., 10 Oct 2025). The paper interprets this as evidence that dense self-supervised features are especially useful for all-scale spatial reasoning. A plausible implication is that SpaceVista-Bench rewards rich geometric and appearance-sensitive cues beyond ordinary semantic embeddings.

An ablation over the number of experts also uses SpaceVista-Bench. With no experts, the model scores 31.0; with one expert, still 31.0; with two experts, 32.7; and with four experts, 32.9 (Sun et al., 10 Oct 2025). This supports the paper’s claim that cross-scale knowledge conflict is real and that a single undifferentiated expert does not solve it.

A further ablation comparing training strategies reports 29.5 for full-parameter fine-tuning, 29.4 for vanilla LoRA, 32.5 for a model-wise LoRA-like expert, and 33.0 for a layer-wise LoRA-like expert on SpaceVista-Bench (Sun et al., 10 Oct 2025). The paper uses this to argue that multi-expert scale-aware routing is not reducible to standard low-rank adaptation.

The benchmark also underpins the paper’s broader diagnosis of current model failure modes. One recurring claim is that models rely too heavily on memorized object-size priors rather than context-sensitive reasoning (Sun et al., 10 Oct 2025). The appendix’s “Reasoning vs Memorizing” discussion is used to argue that SpaceVista-Bench helps expose this weakness, particularly by including seen objects at varying scales and unseen objects (Sun et al., 10 Oct 2025). This suggests that benchmark difficulty does not stem only from domain transfer, but also from conflicts between prior semantic expectations and actual physical scale cues.

7. Relation to the broader spatial-benchmark landscape

SpaceVista-Bench occupies a distinctive position relative to neighboring spatial benchmarks. Unlike SpaceSense-Bench, which targets close-range robotic spacecraft perception with RGB, depth, LiDAR, part-level semantics, and pose labels for non-cooperative spacecraft (Wu et al., 10 Mar 2026), SpaceVista-Bench is not spacecraft-specific and instead targets all-scale visual spatial reasoning across tiny-tabletop, tabletop, indoor, and outdoor video scenes (Sun et al., 10 Oct 2025). The two benchmarks address different operational domains: SpaceSense-Bench is specialized for autonomous space operations, whereas SpaceVista-Bench is broader and more heterogeneous in physical scale.

Unlike OVO-S-Bench, which evaluates streaming spatial intelligence over egocentric video prefixes with explicit query timestamps and evidence intervals (Li et al., 2 Jun 2026), SpaceVista-Bench is described primarily as a video-based benchmark for all-scale reasoning, without an explicitly formalized streaming-prefix protocol in the text (Sun et al., 10 Oct 2025). This suggests that the two benchmarks differ in their notion of temporal access: OVO-S-Bench is built around causal streaming, while SpaceVista-Bench is framed around all-scale scene reasoning.

Unlike CVSBench, which focuses on cross-view spatial reasoning between satellite and street imagery through VQA, grounding, and viewpoint identification (Liu et al., 21 Jun 2026), SpaceVista-Bench is not presented as a cross-view satellite–street benchmark. Its emphasis is not heterogeneous viewpoint transfer but scale diversity and physically grounded measurement-oriented reasoning (Sun et al., 10 Oct 2025).

Unlike E3VS-Bench, which isolates viewpoint-dependent active perception in 3D Gaussian Splatting scenes under unrestricted 5-DoF control (Sakamoto et al., 20 Apr 2026), SpaceVista-Bench is not formulated as an embodied active-search benchmark. The paper does not describe action spaces, navigation policies, or answerable-viewpoint annotations for SpaceVista-Bench (Sun et al., 10 Oct 2025). This suggests a different evaluation regime: all-scale reasoning from video scenes rather than active 3D exploration.

A plausible editorial shorthand is that SpaceVista-Bench sits at the intersection of scale diversity and physically grounded annotation. It is broader in scene scale than many indoor-centric spatial benchmarks, but less explicitly agentic than streaming or embodied benchmarks such as OVO-S-Bench (Li et al., 2 Jun 2026) or E3VS-Bench (Sakamoto et al., 20 Apr 2026).

8. Limitations and unresolved details

The paper is explicit about some limitations and silent on others. Most notably, it does not provide exact per-task benchmark counts, exact per-scenario sample counts, internal split definitions, or inter-annotator agreement statistics for SpaceVista-Bench (Sun et al., 10 Oct 2025). It also does not give a formal benchmark-wide scoring equation clarifying how regression answers are judged or normalized relative to multiple-choice answers (Sun et al., 10 Oct 2025).

These omissions matter for reproducibility and fine-grained interpretation. For example, because regression tolerance is unspecified, direct comparison of scores across benchmarks or across implementations may be less transparent than in benchmarks with fully formalized metrics. Likewise, because scenario-level counts are not reported, it is difficult to infer whether overall performance is dominated by indoor questions or more evenly balanced across Tiny Tabletop, Tabletop, Indoor, and Outdoor.
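Both sensitivities can be illustrated with a small sketch. The relative-tolerance rule, the tolerance values, the predictions, and the per-scenario question counts below are all hypothetical, since the paper specifies none of them; only the four subset scores are taken from the reported leaderboard figures.

```python
def regression_accuracy(preds, targets, tol):
    """Percentage of predictions within a relative tolerance of the target.
    The rule itself is an assumption, not the paper's metric."""
    hits = sum(abs(p - t) <= tol * abs(t) for p, t in zip(preds, targets))
    return 100.0 * hits / len(targets)

# Hypothetical ground-truth lengths (metres) and model predictions.
targets = [0.003, 0.40, 2.5, 120.0]
preds = [0.0036, 0.43, 2.0, 150.0]

strict = regression_accuracy(preds, targets, tol=0.10)
lenient = regression_accuracy(preds, targets, tol=0.30)
print(strict, lenient)

def weighted_overall(scores, counts):
    """Question-weighted (micro) average of per-scenario scores."""
    total = sum(counts.values())
    return sum(scores[k] * counts[k] for k in scores) / total

# Reported subset scores, paired with hypothetical question counts.
scores = {"Tiny Tabletop": 33.4, "Tabletop": 37.1, "Indoor": 42.2, "Outdoor": 34.1}
indoor_heavy = {"Tiny Tabletop": 500, "Tabletop": 500, "Indoor": 1500, "Outdoor": 500}

micro = weighted_overall(scores, indoor_heavy)
print(round(micro, 1))
```

In this toy case the same four predictions score 25.0 under a 10% tolerance and 100.0 under a 30% tolerance, and an indoor-heavy weighting would lift the overall figure to about 38.5. Neither number describes the benchmark itself; they only show how much an unspecified tolerance or an unreported scenario balance could move a headline score.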

Another limitation is that the benchmark’s role is primarily evaluative. The paper does not present SpaceVista-Bench as a fully formalized standalone dataset paper with a separate methodological appendix devoted solely to benchmark protocol; instead, it is embedded in the larger SpaceVista framework (Sun et al., 10 Oct 2025). This suggests that further documentation in released code or dataset artifacts may be necessary to fully operationalize the benchmark.

At the same time, the benchmark’s construction philosophy is unusually clear. It is meant to counter two common problems in contemporary benchmark design: unreliable automated labels and scene-scale narrowness (Sun et al., 10 Oct 2025). Even where formal specification is incomplete, the benchmark’s conceptual contribution is precise: it proposes that spatial evaluation should be anchored to the physical world, span multiple scales, and avoid over-reliance on indoor-scene biases.

9. Significance

SpaceVista-Bench’s main significance lies in its attempt to make all-scale spatial reasoning a first-class evaluation target. The paper’s broader claim is that spatial intelligence in multimodal models has been constrained by indoor-scene dominance and inadequate scale modeling (Sun et al., 10 Oct 2025). SpaceVista-Bench embodies the evaluation response to that claim: it uses physically grounded videos, direct measurement, authoritative retrieval, and human annotation to test whether models can generalize across tiny objects, tabletop arrangements, indoor scenes, and outdoor environments (Sun et al., 10 Oct 2025).

The benchmark also supports a methodological argument about evaluation itself. The authors contend that specialist models can inject useful domain knowledge into training data but should not be trusted as evaluation authorities (Sun et al., 10 Oct 2025). SpaceVista-Bench is the concrete implementation of that argument. In that sense, its importance is not only empirical but epistemic: it proposes a stricter standard for spatial benchmark construction in which answers are aligned to measured or authoritative physical-world information.

This suggests a broader implication for the spatial-reasoning literature. As benchmarks diversify into streaming settings (Li et al., 2 Jun 2026), cross-view geospatial settings (Liu et al., 21 Jun 2026), active free-viewpoint exploration (Sakamoto et al., 20 Apr 2026), and spacecraft-centric robotic perception (Wu et al., 10 Mar 2026), SpaceVista-Bench contributes a complementary axis: scale generalization under physically grounded evaluation (Sun et al., 10 Oct 2025). Its reported results indicate that even strong proprietary and open-source models remain far from robust all-scale spatial competence, and that balanced performance across scales is harder than strong performance on any single subset (Sun et al., 10 Oct 2025).