SpatialScore-Hard Collection
- SpatialScore-Hard Collection is a curated suite that isolates the most challenging multimodal spatial reasoning cases by identifying failure modes across top vision-language models.
- It contains 1,400 validated samples balanced across eight spatial categories, curated via a hybrid algorithmic and expert human verification process.
- Evaluation protocols use metrics such as accuracy, numeric tolerance, and mean absolute error to provide a rigorous stress test for advanced 3D spatial understanding in AI systems.
SpatialScore-Hard Collection is a rigorously curated suite designed to characterize the frontier of multimodal spatial reasoning in vision-language models (VLMs) and multimodal LLMs (MLLMs). Distinct from generic spatial benchmarks, SpatialScore-Hard aggregates the most challenging samples—those that elude both open-source and commercial models—across diverse spatial tasks, with validated ground truth and category balance. Its construction, evaluation protocols, and experimental impact on state-of-the-art agents mark it as a central resource for diagnosing and advancing 3D spatial perception in modern AI systems (Wu et al., 22 May 2025, Wu et al., 24 Dec 2025).
1. Purpose and Formal Definition
SpatialScore-Hard isolates samples representing genuine failure cases for leading MLLMs in 3D spatial understanding. Traditional benchmarks such as SpatialScore (28,000 samples) encompass a wide range of difficulty, including many items solvable by current models. SpatialScore-Hard’s central mandate is to spotlight edge cases that resist solution by almost all models—making it an ultimate stress test for spatial reasoning, especially in vision-based tasks involving geometric complexity.
Formally, let $\mathcal{M}$ denote a voting pool of $|\mathcal{M}| = 20$ models (parameter scale 1B–78B) and $\mathcal{M}_L \subset \mathcal{M}$ a subset of $|\mathcal{M}_L| = 4$ "large" models ($\geq$32B parameters). For each candidate sample $x$, define the failure counts
$$f(x) = \sum_{m \in \mathcal{M}} \mathbb{1}\big[m \text{ fails on } x\big], \qquad f_L(x) = \sum_{m \in \mathcal{M}_L} \mathbb{1}\big[m \text{ fails on } x\big].$$
A sample is placed in SpatialScore-Hard if $f(x) \geq 16$ (at least 16/20 models fail) and $f_L(x) \geq 2$ (at least 2/4 large models fail).
Human annotators then verify label correctness and ensure balanced coverage across task categories. After filtering for ambiguity and category oversampling, the final collection size is 1,400 samples (Wu et al., 22 May 2025).
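A minimal sketch of this failure-voting criterion, assuming per-sample correctness records for each model in the pool (function and variable names here are illustrative, not taken from the released toolkit):

```python
from typing import Dict, List

# Thresholds from the selection rule above.
MIN_TOTAL_FAILURES = 16   # at least 16 of the 20 pooled models fail
MIN_LARGE_FAILURES = 2    # at least 2 of the 4 large (>=32B) models fail

def is_hard_sample(correct_by_model: Dict[str, bool], large_models: List[str]) -> bool:
    """Return True if a sample meets the SpatialScore-Hard voting criterion.

    correct_by_model maps each model in the voting pool to whether it answered
    the sample correctly; large_models lists the >=32B subset of that pool.
    """
    failures = sum(1 for ok in correct_by_model.values() if not ok)
    large_failures = sum(1 for m in large_models if not correct_by_model[m])
    return failures >= MIN_TOTAL_FAILURES and large_failures >= MIN_LARGE_FAILURES
```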
2. Dataset Composition and Category Structure
SpatialScore-Hard contains 1,400 validated samples covering eight super-categories:
- Counting
- Object Localization (2D/3D bounding box grounding)
- 3D Positional Relations (e.g., above/behind relationships)
- Depth & Distance Estimation
- Object Properties (size, orientation)
- Camera & Image Transformation (homography, pose estimation)
- Point/Object Tracking
- Others (e.g., route planning in video)
Modalities are distributed as:
- Single-image: ~600
- Multi-image (pairs/sequences): ~400
- Video: ~400
Three QA formats are supported:
- Judgment (yes/no): ~250
- Multiple-choice: ~800
- Open-ended (numeric, free-form): ~350
The data is drawn from 12 constituent sources: VGBench, SpatialSense, SpatialBench, QSpatialBench, CV-Bench, 3DSRBench, VSI-Bench, MMIU, BLINK, MMVP, RealWorldQA, and CA-1M/ScanNet subsets. Per-category representation is balanced for diversity (Wu et al., 22 May 2025, Wu et al., 24 Dec 2025).
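For concreteness, one entry can be pictured as a record along the following lines; the field names are hypothetical, chosen only to mirror the category, modality, and QA-format structure described above rather than the released schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HardSample:
    """Hypothetical record layout for one SpatialScore-Hard item."""
    sample_id: str
    source_dataset: str                  # e.g. "VGBench", "3DSRBench", ...
    category: str                        # one of the eight super-categories
    modality: str                        # "single-image" | "multi-image" | "video"
    qa_format: str                       # "judgment" | "multiple-choice" | "open-ended"
    question: str
    media_paths: List[str] = field(default_factory=list)
    choices: Optional[List[str]] = None  # present only for multiple-choice items
    answer: str = ""                     # expert-validated ground truth
    numeric_tolerance: Optional[float] = None  # for open-ended numeric answers
```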
3. Selection and Curation Process
SpatialScore-Hard employs a hybrid algorithmic-human methodology:
- Difficulty Screening:
- Compute $f(x)$ and $f_L(x)$ for every sample in the full SpatialScore benchmark (28,000 samples).
- Select samples with $f(x) \geq 16$ and $f_L(x) \geq 2$ (4,300 initial candidates).
- Expert Verification:
- Human experts validate ground truth and discard ambiguous/low-quality items.
- Category rebalance eliminates distortion from oversampled failure types.
- Finalization:
- Retain 1,400 samples, each meeting strict error and diversity criteria.
The process ensures that the remaining samples expose authentic weaknesses in models' spatial understanding rather than annotation flaws or dataset artifacts (Wu et al., 22 May 2025).
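Under the definitions above, the pipeline can be sketched as follows; the expert-review step is only a placeholder predicate, and all names are illustrative rather than taken from the authors' code:

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List

def curate_hard_set(
    samples: List[dict],
    is_hard: Callable[[dict], bool],               # failure-voting predicate (Section 1)
    passes_expert_review: Callable[[dict], bool],  # stand-in for human verification
    per_category_cap: int,
    seed: int = 0,
) -> List[dict]:
    """Screen by model failures, filter ambiguous items, then rebalance categories."""
    rng = random.Random(seed)

    # 1. Difficulty screening via the failure-voting criterion.
    candidates = [s for s in samples if is_hard(s)]

    # 2. Expert verification of ground truth and question clarity.
    verified = [s for s in candidates if passes_expert_review(s)]

    # 3. Category rebalance: cap over-represented failure types.
    by_category: Dict[str, List[dict]] = defaultdict(list)
    for s in verified:
        by_category[s["category"]].append(s)

    final: List[dict] = []
    for items in by_category.values():
        rng.shuffle(items)
        final.extend(items[:per_category_cap])
    return final
```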
4. Evaluation Protocols and Benchmarks
SpatialScore-Hard supports rigorous evaluation across answer types:
- Accuracy (Judgment & Multiple-choice): exact-match rate, $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$.
- Numeric Tolerance (Distance/Size, open-ended): a numeric prediction $\hat{y}$ is scored correct if its relative error $|\hat{y} - y|/|y|$ falls within a protocol-specified tolerance; TVP applies its own tolerance rule for its reduced subset.
- Mean Absolute Error (Pose-angle): $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |\hat{\theta}_i - \theta_i|$.
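A minimal sketch of these scoring rules, with the tolerance threshold left as a parameter because the exact value is protocol-specific (names are illustrative):

```python
from typing import Sequence

def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Exact-match accuracy for judgment and multiple-choice answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def within_tolerance(pred: float, gold: float, rel_tol: float) -> bool:
    """Numeric-tolerance check: relative error within rel_tol (protocol-specific)."""
    return abs(pred - gold) <= rel_tol * abs(gold)

def mean_absolute_error(preds: Sequence[float], golds: Sequence[float]) -> float:
    """Mean absolute error, e.g. for pose-angle predictions in degrees."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)
```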
Benchmark results on the full collection:
| Method | Overall Accuracy (%) |
|---|---|
| InternVL3-78B | 21.79 |
| GPT-4o (API) | 30.57 |
| SpatialAgent (Intern-PE) | 46.08 |
| SpatialAgent (Qwen-ReAct) | 30.29 |
On the reduced 256-question subset (TVP study):
| Method | 3DSR-B (%) | SpatialSense (%) | VG-Bench (%) | Overall (%) |
|---|---|---|---|---|
| GPT-4o | 52.1 | 46.5 | 20.3 | 42.6 |
| TVP (zero-shot) | 52.9 | 59.2 | 43.8 | 52.3 |
TVP’s transductive tools deliver a +9.7 percentage point overall improvement over GPT-4o (52.3% vs. 42.6%), with major per-category gains and particularly strong zero-shot transfer on difficult metric estimation and 3D relation queries (Wu et al., 22 May 2025, Wu et al., 24 Dec 2025).
5. Task Examples and Difficulty Analysis
Representative SpatialScore-Hard samples include:
- Homography Matrix Matching: Identify which warped image matches a provided homography matrix $H$, requiring keypoint correspondence estimation and mental inversion of $H$.
- Depth and Distance Estimation: Infer metric distances between partially occluded objects without explicit calibration, necessitating contextual inference and multi-step reasoning.
- Relative Camera Pose: Distinguish between candidate rotation-translation $(R, t)$ pairs for the inter-view transformation, necessitating decomposition of optical flow and translation into an axis-angle parameterization.
A key property is that all such examples are answered incorrectly by at least 16 of the 20 models in the voting pool, including at least two of the four large ($\geq$32B-parameter) models. This isolates high-impact failure modes for subsequent model improvements (Wu et al., 22 May 2025).
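To make the relative-camera-pose example concrete, the snippet below converts a rotation matrix into its axis-angle parameterization with SciPy; it illustrates only this final conversion step, assuming the inter-view rotation has already been estimated:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Example inter-view rotation: 30 degrees about the camera's y-axis
# (a stand-in for a rotation recovered from two views).
R = Rotation.from_euler("y", 30, degrees=True).as_matrix()

# Axis-angle (rotation-vector) form: direction = rotation axis, norm = angle.
rotvec = Rotation.from_matrix(R).as_rotvec()
angle_deg = np.degrees(np.linalg.norm(rotvec))
axis = rotvec / np.linalg.norm(rotvec)

print(f"rotation of {angle_deg:.1f} deg about axis {np.round(axis, 3)}")
```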
6. Impact, Generalization, and Experimental Insights
SpatialScore-Hard distinguishes itself from the full SpatialScore benchmark in both scale and function:
- Scope: SpatialScore includes 28,000 samples, ranging from trivial recognition to medium-complexity geometric reasoning. SpatialScore-Hard is restricted to 1,400 cases presenting authentic and persistent difficulty for current models.
- Role: SpatialScore tracks overall progress and model scaling trends. SpatialScore-Hard is used to stress-test and drive innovation in architectural, tool-based, or compositional enhancements to spatial reasoning.
TVP (Transductive Visual Programming) demonstrates substantial zero-shot generalization to SpatialScore-Hard, achieving state-of-the-art performance via experience-grounded tool libraries trained solely on Omni3D-Bench. Abstracted tools (such as compute_objects_size_ratio and find_largest_by_3d_metric) account for much of the accuracy and program simplification relative to traditional inductive tool pipelines. Programs relying solely on learned abstractions represent 36.3% of TVP’s solutions and confer complexity reductions and accuracy gains (Wu et al., 24 Dec 2025).
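As an illustration of what such an abstracted tool might look like, the sketch below implements a size-ratio tool over estimated 3D bounding boxes; it is a hypothetical reading of the tool's name, not the implementation from the TVP library:

```python
from typing import Sequence

def box_volume(dims: Sequence[float]) -> float:
    """Volume of an axis-aligned 3D box given (width, height, depth) in meters."""
    w, h, d = dims
    return w * h * d

def compute_objects_size_ratio(dims_a: Sequence[float], dims_b: Sequence[float]) -> float:
    """Hypothetical abstraction: ratio of object A's 3D volume to object B's.

    In a TVP-style program the box dimensions would come from upstream
    detection and depth tools; here they are passed in directly.
    """
    return box_volume(dims_a) / box_volume(dims_b)

# Example: a 2.0 x 1.0 x 0.5 m sofa vs. a 0.4 x 0.4 x 0.4 m stool -> ~15.6x larger.
print(compute_objects_size_ratio((2.0, 1.0, 0.5), (0.4, 0.4, 0.4)))
```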
Failure cases typically arise in queries demanding entirely novel spatial abstractions, e.g., angle computations outside learned patterns, which force a fallback to generic tools and reduce accuracy.
7. Research Significance and Directions
SpatialScore-Hard acts as a focal resource for benchmark-driven diagnosis and advancement in multimodal spatial understanding. By concentrating on persistent model blind spots rather than generic tasks, it provides a challenging substrate for evaluating VLM compositionality, geometric reasoning, and tool-based learning agents. Its adoption by advanced frameworks such as SpatialAgent and TVP suggests that future work will increasingly emphasize experience-driven abstraction and targeted error minimization strategies. A plausible implication is that continual tool evolution and hybrid validation pipelines will be essential for reaching human-level 3D spatial reasoning in AI systems (Wu et al., 22 May 2025, Wu et al., 24 Dec 2025).