
What'sUp Benchmark for Spatial Reasoning

Updated 3 October 2025
  • What'sUp Benchmark is a dataset that enhances spatial reasoning evaluation with object-level bounding boxes, segmentation masks, and depth maps.
  • It employs multi-faceted evaluation protocols, including multiple-choice and template-based generation prompting with CircularEval to minimize biases.
  • Benchmark results reveal that multimodal large language models outperform classic VLMs, highlighting scaling benefits and persistent challenges in spatial relation reasoning.

The What'sUp Benchmark is a dataset and evaluation framework designed for rigorous analysis of spatial relationship understanding in vision-language models (VLMs) and multimodal large language models (MLLMs). It targets the ability of models to recognize, localize, and reason about spatial relations between objects depicted in images, capabilities fundamental to grounded visual reasoning. The original What'sUp dataset has been substantially extended to support detailed, grounded assessment by incorporating object-level bounding boxes, segmentation masks, and depth maps, making it possible to decouple recognition, localization, and relational reasoning.

1. Dataset Extension and Grounded Annotations

The extension of the What'sUp dataset, detailed in (Rajabi et al., 19 Jun 2024), introduces three new layers of annotation specifically for grounded spatial reasoning:

  • Bounding box coordinates for each captioned object, supporting precise evaluation of localization.
  • Segmentation masks derived from segmentation models (e.g., SAM), providing pixel-level delineation of object extents.
  • Depth maps generated by monocular depth estimation (e.g., ZoeDepth), enabling spatial relationship evaluation in three-dimensional space.

This annotation scheme allows the benchmark to distinguish between model failures in recognition (detecting the correct object) versus localization (identifying spatial extent), and to move beyond image-text matching toward evaluations rooted in geometric and relational correctness.
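
To make the annotation scheme concrete, the following minimal sketch shows how one grounded example might be represented; the class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GroundedObject:
    """One captioned object with grounded annotations (field names are illustrative)."""
    name: str                                 # object category named in the caption, e.g. "mug"
    bbox: tuple[float, float, float, float]   # normalized (x1, y1, x2, y2) bounding box
    mask: np.ndarray                          # boolean segmentation mask, shape (H, W)

@dataclass
class GroundedExample:
    """One image-caption pair with annotations for both related objects."""
    image_path: str
    caption: str               # e.g. "The mug is on the table"
    relation: str              # spatial predicate, e.g. "on"
    subject: GroundedObject
    obj: GroundedObject
    depth: np.ndarray          # monocular depth map for the whole image, shape (H, W)
```

Keeping boxes, masks, and depth in one record is what allows a failure to be attributed to recognition, localization, or relational reasoning separately.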

2. Evaluation Methodology and Metrics

The What'sUp Benchmark implements a multi-faceted evaluation protocol to assess both spatial reasoning and the quality of object grounding:

  • Prompting strategies:
    • Multiple-Choice (MC) Prompting: Models select from discrete options encoding spatial relations. However, MC prompts are shown to be susceptible to systematic answer-position biases, particularly for smaller models.
    • Template-based Generation (TG) Prompting: Models fill a structured template with spatial predicate(s), limiting verbosity and reducing bias.
  • CircularEval: To address permutation sensitivity, four variants of each prompt (with different answer orderings) are evaluated. A model is credited with a correct answer only if it produces the correct output for all permutations, minimizing the confounding effect of answer ordering (a sketch of this scoring rule appears at the end of this section).
  • Localization evaluation: Models predict normalized bounding box coordinates for the subject and object. The predicted and ground truth bounding boxes are compared using the Intersection over Union (IoU) metric:

$$\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$

A threshold of IoU ≥ 0.5 is applied; only predictions meeting this criterion are considered correct.
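
A minimal sketch of this localization criterion, assuming normalized (x1, y1, x2, y2) boxes; the function names are illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union for two normalized (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_correct(pred_box, gt_box, threshold=0.5):
    """A predicted box counts as correct only if IoU meets the 0.5 threshold."""
    return iou(pred_box, gt_box) >= threshold
```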

This combined approach ensures a comprehensive diagnosis of model strengths and weaknesses involving both semantic (relation comprehension) and geometric (localization accuracy) aspects.
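
The CircularEval scoring rule can be summarized in a short sketch; the rotation over four orderings follows the description above, while the model interface (`ask_model`) is a placeholder assumption rather than an actual API.

```python
def rotate(options, k):
    """Rotate the answer options by k positions (one of the four orderings)."""
    return options[k:] + options[:k]

def circular_eval(ask_model, question, options, correct_answer):
    """Credit the model only if it answers correctly under all four answer orderings.

    `ask_model(question, options)` stands in for the actual inference call and is
    expected to return the chosen option string.
    """
    for k in range(4):  # four orderings of the answer list
        ordered = rotate(list(options), k)
        if ask_model(question, ordered) != correct_answer:
            return False  # a single wrong ordering invalidates the instance
    return True
```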

3. Model Classes, Parameterization, and Training Regimes

The benchmark evaluates 27 models, stratified into three primary MLLM classes, and compares them with previously tested VLMs:

  • VLMs: CLIP, BLIP, XVLM; classic vision-language pre-training.
  • Small MLLMs (~7B parameters): LLaVA-1.5-Vicuna 7B, LLaVA-NeXT-Mistral 7B.
  • Large MLLMs (up to 110B parameters): LLaVA-NeXT-Qwen1.5.

Key factors affecting performance include:

  • Parameter count: Models span from 7B to 110B parameters, revealing scaling trends.
  • Training and instruction tuning: Some models undergo finetuning via contrastive losses (image-text matching); others use generative instruction tuning.
  • Visual input resolution: Variations inherited from pre-training yield differences in fine-grained spatial reasoning and object localization, notably in scenarios involving small or occluded objects.

4. Benchmarking Results and Scaling Effects

The comprehensive performance analysis reveals:

  • MLLMs outperform VLMs: Generative MLLMs (notably LLaVA variants) substantially exceed VLMs in spatial reasoning tasks. For example, LLaMA-3-LLaVA-NeXT-8B records 86.1% accuracy, compared to 60.4% for XVLM-COCO.
  • Positive scaling trend: Higher parameter counts and resolution correlate with improved accuracy in both spatial relation prediction and localization (IoU), though evidence of saturation effects is observed in the largest models (e.g., Qwen1.5-110B), where grounding continues to improve but relational reasoning plateaus.
  • Prompting effects: Structured TG prompting with CircularEval mitigates positional biases inherent to MC formats, providing more reliable estimates of true reasoning capability.
  • Failure modes: Despite improvements, model performance on spatial prepositions involving depth reasoning (such as "in front of" and "behind") remains sub-optimal, especially when grounding small or ambiguous objects.
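
One way depth-dependent relations such as "in front of" and "behind" can be checked against the extended annotations is to compare object depths under their segmentation masks; the median aggregation below is an illustrative choice, not a procedure specified by the benchmark.

```python
import numpy as np

def is_in_front_of(depth_map, mask_subject, mask_object):
    """Return True if the subject is closer to the camera than the object.

    Assumes smaller depth values mean closer, and that the boolean masks are
    aligned with the depth map. Median depth is used to reduce the influence
    of noisy or occluded pixels (an illustrative choice).
    """
    subject_depth = np.median(depth_map[mask_subject])
    object_depth = np.median(depth_map[mask_object])
    return subject_depth < object_depth
```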

5. Technical Protocols and Key Metrics

The benchmark specifies several technical conventions central to its analyses:

  • Intersection over Union (IoU) for localization:

$$\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$

Normalized coordinates are used. A prediction is correct if IoU ≥ 0.5.

  • CircularEval: Each instance must yield correct outputs across all four answer orderings.
  • Prompting schemes: Both MC and TG are standardized; TG is optionally augmented with depth information (depth-augmented prompting, DAP) to disambiguate complex spatial relations.
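
To make the prompting conventions concrete, the sketch below assembles a TG-style prompt and optionally appends per-object depth cues for DAP; the template wording and the depth encoding are illustrative assumptions, not the benchmark's exact prompts.

```python
def build_tg_prompt(subject, obj, candidate_relations, depths=None):
    """Build a template-based generation (TG) prompt for a spatial relation.

    `depths` is an optional mapping from object name to an estimated depth value
    (e.g., median depth under its mask), used for depth-augmented prompting (DAP).
    The template wording here is illustrative, not the benchmark's.
    """
    prompt = (
        f"Complete the sentence with one of: {', '.join(candidate_relations)}.\n"
        f"The {subject} is ___ the {obj}."
    )
    if depths is not None:  # depth-augmented prompting (DAP)
        prompt += (
            f"\nEstimated depths (smaller means closer): "
            f"{subject}: {depths[subject]:.2f}, {obj}: {depths[obj]:.2f}."
        )
    return prompt

# Example usage with hypothetical objects and relations:
print(build_tg_prompt("mug", "table", ["on", "under", "left of", "right of"],
                      depths={"mug": 1.12, "table": 1.20}))
```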

6. Research Implications and Future Directions

Several open avenues and methodological clarifications arise from this benchmark:

  • Annotation refinement: The use of automated tools (e.g., GroundingDINO for boxes, SAM for masks, ZoeDepth for depth) is noted, but human-level annotation is suggested as a future step to validate and possibly correct automated outputs.
  • Architectural innovation: Addressing spatial reasoning deficits for prepositions (especially those encoding depth) remains a key challenge; model architecture changes or novel training paradigms may be required.
  • Prompting evolution: Exploration of hybrid prompting, including DAP, may help better handle multifaceted spatial relations.
  • Plateau investigation: Understanding why model scaling ceases to yield commensurate increases in reasoning accuracy at extreme scales is identified as a critical question.
  • Benchmarking granularity: The extended dataset, now also referred to as GSR-Bench, enables the design of future benchmarks that explicitly decouple recognition from localization, allowing targeted diagnostics.

7. Contributions to the Evaluation of Spatial Reasoning

The extension of the What'sUp Benchmark, with its detailed annotations and robust multi-pronged evaluation protocol, establishes a state-of-the-art standard for the grounded assessment of spatial reasoning in both classical VLMs and modern MLLMs. It has empirically demonstrated that scaling and architecture critically impact reasoning and localization, and it provides technical infrastructure for nuanced investigation of model behavior on challenging spatial tasks. These advances inform ongoing research into improving multi-modal understanding and pave the way for developing models approaching human-level spatial cognition (Rajabi et al., 19 Jun 2024).
