SpatialScore: Unified Spatial Benchmark
- SpatialScore is a unified evaluation framework with diverse metrics targeting both 2D and 3D spatial reasoning in multimodal systems, emphasizing precise geometric computation.
- It integrates a large-scale benchmark drawn from 12 datasets, covering tasks from counting and object localization to depth reasoning and 3D reconstruction.
- SpatialScore serves as both a performance metric and a reward model, enhancing evaluation in visual programming, text-to-image generation, and embodied AI applications.
SpatialScore is a unified evaluation framework and metric family designed to rigorously measure fine-grained spatial understanding in multimodal vision-language systems. It has recently emerged as a de facto standard for assessing 2D and 3D spatial reasoning, geometry perception, and spatial alignment across multimodal LLMs (MLLMs), visual programming agents, and text-to-image generative models. Several variants exist, with the term “SpatialScore” used for (I) a comprehensive spatial reasoning benchmark suite for MLLMs and agents (Wu et al., 22 May 2025), (II) a scalar metric for programmatic evaluation in visual programming settings (Wu et al., 24 Dec 2025), and (III) a dedicated reward model and metric for spatial understanding in image generation (Tang et al., 27 Feb 2026).
1. Conceptual Foundations and Motivation
SpatialScore was introduced to address persistent shortcomings in MLLMs’ spatial reasoning capabilities and to provide a unified, rigorous, and diverse benchmark that spans the challenging spectrum from basic 2D relations to advanced 3D geometry and multi-step spatial tasks (Wu et al., 22 May 2025). Existing benchmarks were limited in scope or failed to probe metric or compositional understanding, focusing mainly on semantic or superficial visual relations. In contrast, SpatialScore emphasizes precise geometric computation, including camera pose estimation, metric depth reasoning, object-to-object spatial relations, homographies, and object property attribution.
A central motivation is to drive progress in domains such as robotics, embodied AI, and autonomous navigation, where spatial perception is foundational for real-world decision making (Wu et al., 22 May 2025).
2. Benchmark Construction and Task Taxonomy
SpatialScore comprises a large-scale, multi-format spatial reasoning dataset, constructed by integrating samples from 12 public datasets, notably including the VGBench synthetic 3D-geometry QA suite (Wu et al., 22 May 2025). The full benchmark contains 28,000 question–answer pairs, balancing single-image, multi-image, and video-based challenges.
The tasks are organized into eight broad classes:
- Counting: Predicting integer object counts from images or video;
- Object Localization: 2D/3D bounding box or mask outputs (“where is...?”);
- 3D Positional Relations: Relative spatial judgments (“A is closer than B”);
- Depth and Distance Reasoning: Metric depth estimation and object-to-object distance queries, typically in meters;
- Object Properties: Area ratios, real-world size, and orientation angle extraction;
- Camera and Image Transformation: Homography estimation (3×3 matrix or image warp), and camera pose determination (intrinsic/extrinsic parameters);
- Point/Object Tracking: Correspondence tracking across frames or videos;
- 3D Reconstruction and Dynamic Tasks: Identifying correct target views or transformation results (Wu et al., 22 May 2025).
A curated subset, “SpatialScore-Hard” (1,400 samples), emphasizes cases with subtle geometric distractors, precise metric demands, and multi-step or video-based reasoning, systematically chosen based on broad MLLM failure and manually verified correctness (Wu et al., 22 May 2025).
3. Metric Definition and Evaluation Protocol
General Benchmark Scoring
For each question–answer pair, a binary correctness indicator is computed:
- Judgment/multiple-choice: 1 if prediction matches the ground truth, else 0.
- Simple counts: exact match required.
- Numeric tasks (depth, distance, size): tolerance-based, accepting predictions that fall within a specified range around the ground truth (Wu et al., 22 May 2025).
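The SpatialScore is then:
$$\text{SpatialScore} = \frac{1}{N} \sum_{i=1}^{N} c_i$$
where $c_i \in \{0, 1\}$ is the correctness indicator for case $i$ and $N$ is the number of cases.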
Visual Programming and TVP Protocol
In visual programming settings (as with Transductive Visual Programming, TVP), a more stringent protocol is adopted (Wu et al., 24 Dec 2025):
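- Binary (yes/no, multiple-choice): score $1$ if the prediction exactly matches the ground truth; $0$ otherwise.
- Numeric open-ended: score $1$ if the prediction falls within a fixed relative-error tolerance of the ground truth; $0$ otherwise.
- Aggregate: the mean of the per-query scores, reported as a percentage.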
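The scoring rules in this section can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's reference implementation: the function names, the case-insensitive string normalization, and the `rel_tol` parameter are assumptions, and the tolerance value used in the usage example is purely illustrative.

```python
from typing import Sequence, Union

Answer = Union[str, float]


def score_item(pred: Answer, truth: Answer, rel_tol: float) -> int:
    """Return 1 for a correct prediction and 0 otherwise."""
    if isinstance(truth, str):
        # Yes/no and multiple-choice answers: exact match.
        # Case and whitespace normalization is an assumption of this sketch.
        return int(str(pred).strip().lower() == truth.strip().lower())
    # Numeric open-ended answers: relative error must not exceed rel_tol.
    try:
        value = float(pred)
    except (TypeError, ValueError):
        return 0  # an unparseable prediction counts as wrong
    if truth == 0:
        return int(value == 0)  # relative error is undefined at zero
    return int(abs(value - truth) / abs(truth) <= rel_tol)


def spatial_score(preds: Sequence[Answer], truths: Sequence[Answer],
                  rel_tol: float) -> float:
    """Mean per-query score, expressed as a percentage."""
    if not truths or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and equal in length")
    scores = [score_item(p, t, rel_tol) for p, t in zip(preds, truths)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    preds = ["Yes", "B", 2.05, 7.0]
    truths = ["yes", "C", 2.0, 5.0]
    # rel_tol=0.1 is an illustrative value, not taken from the cited papers.
    print(spatial_score(preds, truths, rel_tol=0.1))
```

Keeping the per-item rule separate from the aggregation makes it easy to swap in a different tolerance policy per task type without touching the averaging step.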
SpatialScore as Reward Model in Image Generation
In text-to-image scenarios, SpatialScore is implemented as a scorer or reward model trained on human-verified preference pairs (Tang et al., 27 Feb 2026). Here, for each image–prompt pair, a scalar reward is computed via a neural head atop a VLM backbone, facilitating automated ranking and reinforcement learning optimization for spatial alignment in synthesized images.
4. SpatialScore-Hard: Composition and Fidelity
The SpatialScore-Hard collection targets systematic weaknesses in current models (Wu et al., 22 May 2025, Wu et al., 24 Dec 2025). For MLLMs and agents, this subset contains between 256 and 1,400 samples drawn from benchmarks such as 3DSR-Bench, SpatialSense, and VG-Bench. Queries are divided into four canonical categories:
- Object Properties (color, material, simple counts)
- Object Localization (e.g., “furthest left”)
- Depth & Distance Estimation (e.g., “distance from camera”)
- 3D Positional Relations (e.g., “tallest cabinet (real-world units)”)
Hard queries emphasize multi-step geometric and arithmetic reasoning, such as pixel-to-real-world conversions, chained ratios or volumes, object occlusion handling, and tie-breaking among visually similar items.
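As an illustration of the pixel-to-real-world conversions mentioned above, the following sketch uses the standard pinhole-camera relation (metric extent = pixel extent × depth / focal length in pixels). It assumes the object is roughly fronto-parallel and lies at a single depth; the function name and all numbers are hypothetical and do not come from the benchmark.

```python
def pixel_to_metric(pixel_extent: float, depth_m: float, focal_px: float) -> float:
    """Convert an image-space extent (pixels) to meters under a pinhole model."""
    if depth_m <= 0 or focal_px <= 0:
        raise ValueError("depth and focal length must be positive")
    return pixel_extent * depth_m / focal_px


if __name__ == "__main__":
    focal_px = 600.0  # hypothetical focal length in pixels
    # Cabinet A looks smaller in the image but is farther away.
    height_a = pixel_to_metric(300.0, depth_m=4.0, focal_px=focal_px)
    # Cabinet B looks larger in the image but is closer.
    height_b = pixel_to_metric(450.0, depth_m=2.0, focal_px=focal_px)
    taller = "A" if height_a > height_b else "B"
    print(height_a, height_b, taller)
```

The example shows why such queries need more than 2D appearance: the cabinet that spans fewer pixels turns out to be the taller one once depth is taken into account.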
The evaluation protocol for SpatialScore-Hard enforces zero-shot transfer—no test-set–specific adaptation—and requires models to use only preexisting libraries or experience, e.g., TVP’s Example and Tool Libraries learned on prior tasks (Wu et al., 24 Dec 2025).
5. System Architectures Leveraging SpatialScore
SpatialAgent
SpatialAgent is a multi-agent architecture integrating nine specialized expert tools for geometry, each addressing a specific sub-task: detection, localization, masking, optical flow, feature matching, homography, geometric property estimation, depth estimation, and orientation extraction (Wu et al., 22 May 2025). It supports both Plan-Execute and ReAct frameworks, facilitating flexible problem decomposition and stepwise reasoning. The Plan-Execute paradigm involves explicit planning followed by sequential execution, while ReAct allows for interleaved observation and execution with dynamic memory integration.
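The difference between the two control flows can be sketched as follows. This is a toy illustration under stated assumptions, not SpatialAgent's code: the tool registry, the stubbed tool outputs, and the hand-written policy are all hypothetical, and in a real system an MLLM would produce the plan or choose each action.

```python
from typing import Callable, Dict, List

State = dict

# Hypothetical stand-ins for expert tools; real tools would run vision models.
TOOLS: Dict[str, Callable[[State], State]] = {
    "detect": lambda s: {**s, "boxes": {"chair": (40, 80, 120, 200),
                                        "table": (200, 60, 380, 220)}},
    "depth": lambda s: {**s, "depth_m": dict(zip(s["boxes"], (2.5, 4.0)))},
}


def plan_execute(plan: List[str], state: State) -> State:
    """Plan-Execute: the full tool sequence is fixed up front, then run in order."""
    for tool_name in plan:
        state = TOOLS[tool_name](state)
    return state


def react(policy: Callable[[State], str], state: State, max_steps: int = 5) -> State:
    """ReAct: pick the next tool only after observing the current state."""
    for _ in range(max_steps):
        action = policy(state)
        if action == "finish":
            break
        state = TOOLS[action](state)
    return state


def simple_policy(state: State) -> str:
    """Toy policy standing in for the model's step-by-step reasoning."""
    if "boxes" not in state:
        return "detect"
    if "depth_m" not in state:
        return "depth"
    return "finish"


def closer_object(state: State) -> str:
    """Answer 'which object is closer?' from the accumulated tool outputs."""
    return min(state["depth_m"], key=state["depth_m"].get)


if __name__ == "__main__":
    print(closer_object(plan_execute(["detect", "depth"], {})))
    print(closer_object(react(simple_policy, {})))
```

Both loops reach the same answer here; they differ in when decisions are made. Plan-Execute commits to a sequence before seeing any tool output, while ReAct can change course after each observation.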
SpatialAgent and similar architectures use SpatialScore as the principal metric for gauging end-to-end spatial reasoning and for quantitative ablation of tool contributions and planning/interleaving strategies.
Reward Modeling in Image Generation
SpatialScore, as a reward model, is trained on the SpatialReward-Dataset, which consists of 80,000 preference pairs of “perfect” vs. “perturbed” scene images for spatial relation accuracy (Tang et al., 27 Feb 2026). A vision-language architecture (Qwen2.5-VL-7B) produces a reward score, with training via Bradley–Terry pairwise ranking loss, and a single sample from a learned Gaussian serves as the reward output. This enables integration in online RL, driving fine-tuning for improved spatial alignment in generated images via policy-gradient objectives such as Group Relative Policy Optimization (GRPO).
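The three ingredients named above (a Gaussian reward output, a Bradley–Terry pairwise loss, and group-relative advantages as used in GRPO) can be sketched with the standard library alone. This is a minimal sketch using textbook formulations; the function names are hypothetical, and how the cited work combines the sampled reward with the loss during training is not specified here, so that coupling is left out.

```python
import math
import random
from typing import List


def sample_reward(mean: float, log_std: float, rng: random.Random) -> float:
    """Draw one reward from N(mean, exp(log_std)^2), as a Gaussian head might."""
    return rng.gauss(mean, math.exp(log_std))


def bradley_terry_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood that the preferred image wins:
    -log(sigmoid(r_preferred - r_rejected))."""
    margin = r_preferred - r_rejected
    # softplus(-margin), arranged to stay numerically stable for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical head outputs for a "perfect" and a "perturbed" image.
    r_good = sample_reward(mean=1.5, log_std=-2.0, rng=rng)
    r_bad = sample_reward(mean=-0.5, log_std=-2.0, rng=rng)
    print(bradley_terry_loss(r_good, r_bad))
    # Rewards for a group of images generated from the same prompt.
    print(group_relative_advantages([0.2, 0.9, 0.4, 0.7]))
```

Standardizing within a group means the policy is updated according to how each image ranks against its siblings for the same prompt, which removes the need for a separate learned value baseline.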
6. Comparative Results and Analytical Insights
Baseline Model Performance
On the full benchmark, state-of-the-art MLLMs (1B–78B) achieve 40–60% accuracy, with the largest models (e.g., InternVL3-78B) reaching ~60.2%. On SpatialScore-Hard, accuracy drops below 30% for all standalone models. In particular, 3D geometry tasks (pose, homography, 3D reconstruction) remain difficult, with accuracy below 50% (Wu et al., 22 May 2025).
TVP, evaluated under zero-shot transfer, outperforms GPT-4o by roughly 10 points absolute on SpatialScore-Hard (52.3% vs. 42.6% overall, about 23% relative), with a consistent lead across all query categories (e.g., 68% in Object Properties, 54% in Localization, 44% in Depth & Distance, 59% in 3D Relations) (Wu et al., 24 Dec 2025). Table 1 summarizes selected results:
| Method | 3DSR-Bench | SpatialSense | VG-Bench | Overall SpatialScore (%) |
|---|---|---|---|---|
| TVP (zero-shot) | 52.9 | 59.2 | 43.8 | 52.3 |
| GPT-4o | 52.1 | 46.5 | 20.3 | 42.6 |
| VADAR | 24.8 | 40.8 | 39.1 | 32.8 |
Reward Model Results
As a reward model, SpatialScore (7B) attains 0.958 pairwise accuracy on held-out human-validated spatial preference pairs, outperforming leading proprietary models (GPT-5: 0.890; Gemini-2.5 Pro: 0.951) (Tang et al., 27 Feb 2026). When employed for RL-based fine-tuning, it enables in-domain SpatialScore improvements from 2.18 to 7.81 and boosts the Relation-Spatial dimension in DPG-Bench from 0.871 to 0.932 for text-to-image models.
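Pairwise accuracy, the metric quoted above, is the fraction of preference pairs in which the reward model scores the preferred image higher than the rejected one. A minimal sketch follows; the function name is hypothetical, and counting ties as errors is an assumption of this example rather than a documented convention.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (preferred_score, rejected_score) pairs ranked correctly."""
    if not pairs:
        raise ValueError("at least one preference pair is required")
    # A tie is treated as incorrect (assumption of this sketch).
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward scores for four preference pairs.
    scores = [(0.9, 0.1), (0.4, 0.6), (0.7, 0.7), (1.2, 0.3)]
    print(pairwise_accuracy(scores))
```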
Analytical Insights
Observed failure modes include tool misuse, noisy geometric outputs, ambiguity in unit interpretation, and persistent gaps in compositional spatial arithmetic. The findings highlight the necessity of explicit geometric computation and modular reasoning for precision tasks that conventional MLLMs and VLMs inadequately address.
TVP’s superior performance on hard queries is attributed to transductive abstraction (clustering and abstracting proven sub-solutions), rigorous validation (each tool tested for ≥85% correctness), and tool merging (broadening abstraction coverage). This approach yields high-frequency, reusable routines for complex spatial queries, resulting in significant absolute performance gains (Wu et al., 24 Dec 2025).
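The validation step can be pictured as a simple admission gate: a candidate tool enters the library only if it passes at least 85% of its test cases. The sketch below is an assumption-laden illustration, not TVP's implementation. The function names, the test-case format, and the choice to count exceptions as failures are all hypothetical; only the 85% threshold comes from the text above.

```python
from typing import Any, Callable, Sequence, Tuple

# A test case pairs positional arguments with the expected result.
Case = Tuple[Tuple[Any, ...], Any]


def pass_rate(tool: Callable[..., Any], cases: Sequence[Case]) -> float:
    """Fraction of test cases on which the tool returns the expected value."""
    passed = 0
    for args, expected in cases:
        try:
            if tool(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


def admit_tool(tool: Callable[..., Any], cases: Sequence[Case],
               threshold: float = 0.85) -> bool:
    """Admit the tool to the library only if it meets the correctness threshold."""
    return bool(cases) and pass_rate(tool, cases) >= threshold


if __name__ == "__main__":
    def size_ratio(a: float, b: float) -> float:
        """Hypothetical candidate tool: ratio of two measured sizes."""
        return a / b

    cases = [((4, 2), 2.0), ((9, 3), 3.0), ((1, 4), 0.25), ((5, 0), None)]
    # The division-by-zero case fails, so the tool passes 3 of 4 cases.
    print(admit_tool(size_ratio, cases))
```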
7. Extensions, Limitations, and Future Challenges
Key limitations include the static nature of the image-centric datasets for both benchmarking and reward modeling, the reliance on human-verified or synthetic data for challenging spatial relations, and possible underrepresentation of rare spatial constructs (Tang et al., 27 Feb 2026, Wu et al., 22 May 2025). Extending SpatialScore to handle dynamic scenes, temporal reasoning in video, and raw 3D inputs remains an open challenge.
For reward modeling, additional calibration—such as advanced variance control or alternative sampling strategies—may enhance stability in edge cases. Further, future directions involve seamless integration of semantic and geometric reasoning pipelines, minimizing dependency on external toolchains, and extending compositional reasoning to continually expanding domains.