Visual Spatial Reasoning Dataset
- Visual Spatial Reasoning (VSR) Dataset is a benchmark suite that assesses AI systems’ ability to interpret and reason about spatial relations in images and 3D environments.
- It employs diverse annotation techniques like controlled templates, segmentation masks, and contrastive pairing to capture positional, directional, and metric information.
- Key evaluation metrics such as accuracy, mIoU, and success-weighted edit distance highlight the gap between human performance and current AI models in spatial cognition.
Visual Spatial Reasoning (VSR) Dataset refers to a suite of benchmark resources and methodologies that systematically evaluate the capacity of artificial vision-language systems to interpret, infer, and reason about spatial relations in images, video, or 3D environments. Spatial reasoning in this context encompasses the understanding of positional, directional, topological, metric, and frame-of-reference information within visual scenes, often paired with natural-language descriptions. VSR datasets are pivotal both for the general study of visual cognition in AI and for the design, training, and assessment of vision-language models (VLMs) and multimodal large language models (MLLMs).
1. Foundational Concepts and Taxonomy
Visual spatial reasoning comprises several cognitive sub-abilities:
- Spatial relations: Understanding predicates such as left/right, above/below, near/far, within, touching, etc. These may span topological, projective, directional, metric, and proximity categories, and involve both egocentric (viewer-based) and allocentric (object- or environment-based) frames of reference (Liu et al., 2022).
- Orientation and navigation: Comprehending object and agent orientation, planning or describing movement, and interpreting egocentric (relative to the agent) versus map-based perspectives (Chen et al., 2018, Stogiannidis et al., 25 Mar 2025).
- Mental rotation and spatial visualization: Inferring the equivalence of objects across rotations or transformations and mentally simulating folding, unfolding, or assembly processes (Stogiannidis et al., 25 Mar 2025, Wang et al., 10 Jul 2025, Zhang et al., 29 Sep 2025).
- Quantitative spatial tasks: Estimating distances, areas, sizes, and route plans in metric space (Yu et al., 23 Sep 2025).
Datasets in this domain vary in whether they provide real-world images, synthetic renderings, 2D projections, or 3D scans, and in the granularity of annotation (binary T/F, multiple-choice, coordinate prediction, segmentation masks, or free-form spatial language).
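For concreteness, the same underlying scene can be annotated at several of these granularities. The records below are a minimal illustrative sketch (the field names and values are assumptions, not the exact schema of any released dataset):

```python
# Binary true/false judgment (VSR-style templated statement)
tf_item = {"image": "000000123456.jpg",
           "caption": "The mug is to the left of the laptop.",
           "label": True}

# Multiple-choice spatial question
mc_item = {"question": "Where is the mug relative to the laptop?",
           "choices": ["left of", "right of", "behind", "in front of"],
           "answer": "left of"}

# Metric / quantitative estimation
metric_item = {"query": "How far apart are the mug and the laptop?",
               "answer_metres": 0.4}
```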
2. Representative Datasets and Benchmarks
Key Datasets
| Dataset Name | Core Focus | Distinguishing Features |
|---|---|---|
| VSR (Liu et al., 2022) | 2D spatial relations in natural images | 10k+ text–image pairs, 66 types, true/false format |
| Touchdown (Chen et al., 2018) | Navigation and SDR in urban environments | Real Google Street View, navigation+target location, linguistically rich |
| Jigsaw-Puzzles (2505.20728) | Spatial structure, multi-step reasoning | 1,100 real images, 5 tasks from missing piece to order generation |
| SpinBench (Zhang et al., 29 Sep 2025) | Perspective taking and rotation | Diagnostic sub-tasks, focus on viewpoint transformation |
| InternSpatial (Deng et al., 23 Jun 2025) | Large-scale spatial QA pairs | 12M QA pairs, 19 instruction formats, multi-view supervision |
| SURPRISE3D (Huang et al., 10 Jul 2025) | 3D spatial segmentation reasoning | 200k+ language–3D mask, no object names, diverse queries |
Additional large-scale or diagnostic datasets such as VSI-100k (Liao et al., 1 Apr 2025), ViCA-322K (Feng, 18 May 2025), SIBench (Yu et al., 23 Sep 2025), and SpatialViz-Bench (Wang et al., 10 Jul 2025) further extend coverage to video, 3D, and controlled synthetic domains.
Annotation Strategies
Annotation procedures typically involve:
- Controlled templates: e.g., “The [Object1] is [Relation] [Object2]” with fixed spatial predicates (Liu et al., 2022); a sketch combining templates with contrastive pairing appears after this list.
- Contrastive pairing: Each statement is matched with a true and control image, controlling for object presence and relation.
- Reference frame encoding: Allowing for intrinsic (object-centric) and relative (observer-centric) interpretations, sometimes requiring pre-training for reference frame detection (Liu et al., 2022).
- Spatial masks/segmentation: In 3D/scene-graph-based datasets, ground-truth 3D masks corresponding to spatial queries (Huang et al., 10 Jul 2025).
- Numeric and geometric annotations: For metric tasks, precise coordinates, distances, or 3D geometry attributes are exposed, often generated from 3D scans or synthetic scenes (Liao et al., 1 Apr 2025, Feng, 18 May 2025, Deng et al., 23 Jun 2025).
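The controlled-template and contrastive-pairing strategies above can be illustrated with a short sketch. The function names, relation list, and record fields are illustrative assumptions, not the datasets' actual annotation tooling:

```python
import random

# Illustrative subset of spatial predicates used in controlled templates
SPATIAL_RELATIONS = ["left of", "right of", "above", "below", "in front of", "behind"]

def make_statement(obj1: str, relation: str, obj2: str) -> str:
    """Fill the controlled template 'The [Object1] is [Relation] the [Object2]'."""
    return f"The {obj1} is {relation} the {obj2}."

def make_contrastive_pair(obj1, obj2, relation, true_image_id, control_image_id):
    """Pair one statement with an image where it holds (label True) and a
    control image showing the same two objects but violating the relation."""
    caption = make_statement(obj1, relation, obj2)
    return [
        {"image": true_image_id,    "caption": caption, "relation": relation, "label": True},
        {"image": control_image_id, "caption": caption, "relation": relation, "label": False},
    ]

pair = make_contrastive_pair("cat", "sofa", random.choice(SPATIAL_RELATIONS),
                             "img_0001.jpg", "img_0002.jpg")
```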
3. Task Types and Evaluation Metrics
Core Task Families
| Task Category | Representative Datasets | Evaluation Format |
|---|---|---|
| Spatial relation recognition | VSR, InternSpatial | Binary / multiclass |
| Navigation & planning | Touchdown, iVISPAR | Action sequence / path cost |
| Spatial description resolution | Touchdown (SDR), SURPRISE3D | Pixel/coordinate prediction, segmentation |
| Perspective taking/mental rotation | SpinBench, SpatialViz-Bench | Multiple choice / accuracy |
| Video-based spatial reasoning | ViCA-322K, VSI-100k | QA, numerical estimation |
Metrics vary by task but include:
- Accuracy: For binary or multiclass relation judgment.
- F1 score: Especially for segmentation/grounding tasks.
- Success-weighted Edit Distance (SED): For comparing predicted and reference action sequences in navigation (Chen et al., 2018).
- Mean Intersection over Union (mIoU): For segmentation-based evaluations (Huang et al., 10 Jul 2025).
- Mean Relative Accuracy (MRA): For numerical estimation, defined as $\mathrm{MRA} = \frac{1}{|C|} \sum_{\theta \in C} \mathbf{1}\!\left[\frac{|\hat{y} - y|}{y} < 1 - \theta\right]$, where $\hat{y}$ is the predicted value, $y$ is the ground truth, and $C$ is a set of confidence thresholds (Yu et al., 23 Sep 2025); a worked sketch follows this list.
- Cohen’s kappa ($\kappa$): To adjust for chance agreement in multiple-choice and consistency tests (Zhang et al., 29 Sep 2025).
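As a concrete illustration, the sketch below computes MRA as defined above, together with a single-mask IoU of the kind that mIoU averages. The threshold set $C = \{0.50, 0.55, \dots, 0.95\}$ and the function names are assumptions made for illustration:

```python
import numpy as np

def mean_relative_accuracy(y_pred, y_true, thresholds=None):
    """MRA: average, over confidence thresholds theta in C, of the indicator
    that the relative error |y_hat - y| / y falls below 1 - theta."""
    if thresholds is None:
        thresholds = np.arange(0.50, 1.00, 0.05)  # assumed C = {0.50, ..., 0.95}
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    rel_err = np.abs(y_pred - y_true) / np.maximum(np.abs(y_true), 1e-8)
    hits = rel_err[None, :] < (1.0 - thresholds[:, None])  # thresholds x examples
    return float(hits.mean())

def binary_mask_iou(pred_mask, gt_mask):
    """Intersection over union for one predicted/ground-truth binary mask;
    mIoU averages this quantity over all evaluated queries or classes."""
    pred_mask, gt_mask = np.asarray(pred_mask, bool), np.asarray(gt_mask, bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)

# Example: metric distance estimates (metres) and a toy 2x2 mask
print(mean_relative_accuracy([2.1, 4.8, 10.0], [2.0, 5.0, 7.0]))  # ~0.70
print(binary_mask_iou([[1, 0], [1, 1]], [[1, 1], [0, 1]]))        # 0.5
```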
4. Notable Modeling Approaches and Frameworks
A variety of modeling strategies have been employed, often motivated by the complexities of spatial reasoning:
- Multimodal Transformers: Vision-language transformers (e.g., ViLT, LXMERT) with enhancements such as explicit 3D coordinate, depth, or edge map embeddings and multi-task learning (spatial feature reconstruction) (Islam et al., 3 Oct 2025).
- Knowledge distillation: Teacher–student training with privileged spatial masks generated from probabilistic soft logic rules or attention modules (Aditya et al., 2018).
- Scene-graph and chain-of-thought reasoning: Dynamic graph representations and iterative, physically-grounded reasoning, particularly for embodied and long-horizon tasks (Zhang et al., 14 Mar 2025).
- Text-only grounding: Verbalizing bounding boxes as discretized location tokens, then training LMs to “read” spatial configuration from text only; performance is improved via pretraining on large synthetic data (Azkune et al., 20 Mar 2024). A sketch of this verbalization appears after this list.
- Model fusion and data augmentation: Combining outputs from multiple vision encoders (CLIP, DINOv2, SAM, SigLIP), as well as controlled visual variation and synthetic data scaling via diffusion models (Xie et al., 24 Dec 2024).
- Reinforcement learning: Group Relative Policy Optimization (GRPO) for spatial question answering, with explicit KL regularization to avoid policy collapse (Liao et al., 1 Apr 2025).
- Geometry surrogate training: Using large-scale geometry datasets (Euclid30K) and reward-based finetuning to endow spatial priors (Lian et al., 29 Sep 2025).
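The text-only grounding strategy can be sketched as follows; the token format, bin count, and prompt layout here are illustrative assumptions rather than the exact scheme of Azkune et al. (20 Mar 2024):

```python
def box_to_location_tokens(box, image_size, num_bins=32):
    """Discretize a pixel bounding box (x_min, y_min, x_max, y_max) into
    location tokens so a text-only language model can 'read' where the
    object sits in the image."""
    width, height = image_size
    def bin_of(value, extent):
        return min(int(value / extent * num_bins), num_bins - 1)
    x0, y0, x1, y1 = box
    return (f"<loc_{bin_of(x0, width)}> <loc_{bin_of(y0, height)}> "
            f"<loc_{bin_of(x1, width)}> <loc_{bin_of(y1, height)}>")

# Verbalize two detected objects and pose a purely textual spatial question
cat  = box_to_location_tokens((40, 120, 210, 300), (640, 480))
sofa = box_to_location_tokens((300, 100, 620, 420), (640, 480))
prompt = f"cat: {cat}\nsofa: {sofa}\nIs the cat to the left of the sofa?"
```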
5. Empirical Findings and Persistent Model Limitations
Across diverse VSR datasets and benchmarks, several robust empirical patterns emerge:
- Substantial human–model performance gap: For example, on the VSR dataset, human accuracy >95% while top models achieve 70–74% (Liu et al., 2022, Azkune et al., 20 Mar 2024). On SpinBench and Jigsaw-Puzzles, leading VLMs trail humans by 10–20%+ (2505.20728, Zhang et al., 29 Sep 2025).
- Orientation and frame-of-reference reasoning remains difficult: Even with large training sets, models underperform on relations requiring orientation or perspective transformations (“facing,” “left of”—intrinsic vs. relative frames) (Liu et al., 2022, Zhang et al., 29 Sep 2025).
- Scaling training data and encoder modalities helps but shows diminishing returns: Model accuracy saturates, and bias toward language patterns or co-occurrence is observed; models may prefer “yes” answers regardless of visual detail (Xie et al., 24 Dec 2024).
- Superficial or shortcut reasoning in 3D: Datasets such as SURPRISE3D intentionally avoid object names in queries, revealing a sharp drop in model performance, as current models often anchor reasoning on associative linguistic cues instead of geometric inference (Huang et al., 10 Jul 2025).
- Improvements via auxiliary spatial supervision: Incorporating explicit geometric priors (depth, 3D, edge maps, spatial tokens, or geometry-based training) consistently boosts performance, but advanced relations (rotation, multi-object layout, spatial planning) remain unsolved (Islam et al., 3 Oct 2025, Lian et al., 29 Sep 2025).
6. Innovations, Open Problems, and Future Directions
Several recent developments have addressed VSR’s challenging aspects:
- Automated dataset synthesis and augmentation: Coupling large-scale synthetic scene/text generation with focused annotation strategies (e.g., controlled distractor selection, reference frame labelling, template-based question generation) (Deng et al., 23 Jun 2025, Wang et al., 10 Jul 2025).
- Fine-grained spatial task decomposition: Diagnostic protocols such as SpinBench enable targeted assessment of core capacities (e.g., perspective-taking, translation, dynamic rotation), revealing where VLMs are weak (Zhang et al., 29 Sep 2025).
- 3D spatial segmentation and masking: SURPRISE3D and related resources address the need for spatially grounded, name-agnostic spatial queries, crucial for authentic spatial inference in robotics (Huang et al., 10 Jul 2025).
- Geometry-informed curriculum learning: Surrogate training on geometry problems (Euclid30K) enables zero-shot transfer of deductive reasoning and spatial priors, supporting accuracy gains on diverse VSR benchmarks (Lian et al., 29 Sep 2025).
Yet, several open challenges persist:
- Bridging perception and higher-level reasoning: There is a pronounced gap between models’ competence in object recognition (seeing) and their ability to perform abstract, compositional spatial reasoning (thinking) (Yu et al., 23 Sep 2025).
- Dynamic and temporal-spatial tasks: Models are particularly troubled by long-horizon reasoning (planning in interactive environments, handling video temporal dynamics) (Mayer et al., 5 Feb 2025, Feng, 18 May 2025).
- Spatial imagination and generalization: Most models lack the ability to “imagine” unobserved viewpoints or generalize spatial relations to unseen objects or scenarios (Yu et al., 23 Sep 2025).
- Unified spatiotemporal frameworks: Moving beyond static 2D input toward spatiotemporally continuous, four-dimensional representations is identified as a future research imperative (Yu et al., 23 Sep 2025).
7. Summary Table: Key VSR Dataset Characteristics
| Dataset/Benchmark | Visual Modality | # Tasks / Pairs | Key Spatial Capabilities | Notable Challenges |
|---|---|---|---|---|
| VSR (Liu et al., 2022) | 2D natural | >10,000 | 66 relations, T/F, ref.frames | Orientation, zero-shot generalization |
| Touchdown (Chen et al., 2018) | 360° city | 9,326 tasks | Egocentric/allocentric, navigation & location | Long, complex language; real urban data |
| SURPRISE3D (Huang et al., 10 Jul 2025) | 3D segmentation | 200k+ | Name-agnostic, relative/absolute/query | 3D complexity, explicit spatial queries |
| InternSpatial (Deng et al., 23 Jun 2025) | 2D+3D | 12M pairs | Multi-view, rotation, 19 formats | Large scale, modality diversity |
| Jigsaw-Puzzles (2505.20728) | 2D real | 1,100 images | Multi-step, open-ended, structure | Severe gap in order generation |
| SpinBench (Zhang et al., 29 Sep 2025) | Synthetic | 51 sub-tasks | Rotation, translation, perspective | Egocentric bias, brittle representations |
| SIBench (Yu et al., 23 Sep 2025) | Multi | ~20 datasets | Perception, understanding, planning | 3D, numeric estimation, spatial imagination |
References
- (Chen et al., 2018) Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
- (Aditya et al., 2018) Spatial Knowledge Distillation to aid Visual Reasoning
- (Liu et al., 2022) Visual Spatial Reasoning
- (Azkune et al., 20 Mar 2024) Grounding Spatial Relations in Text-Only Language Models
- (Meng et al., 19 Jul 2024) I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction
- (Xie et al., 24 Dec 2024) Expand VSR Benchmark for VLLM to Expertize in Spatial Rules
- (Mayer et al., 5 Feb 2025) iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs
- (Zhang et al., 14 Mar 2025) EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks
- (Stogiannidis et al., 25 Mar 2025) Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
- (Liao et al., 1 Apr 2025) Improved Visual-Spatial Reasoning via R1-Zero-Like Training
- (Feng, 18 May 2025) Visuospatial Cognitive Assistant
- (2505.20728) Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
- (Deng et al., 23 Jun 2025) InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
- (Wang et al., 10 Jul 2025) SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs
- (Huang et al., 10 Jul 2025) SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes
- (Liang et al., 28 Jul 2025) Enhancing Spatial Reasoning through Visual and Textual Thinking
- (Yu et al., 23 Sep 2025) How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
- (Lian et al., 29 Sep 2025) Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
- (Zhang et al., 29 Sep 2025) SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
- (Islam et al., 3 Oct 2025) Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning
These resources collectively define the state-of-the-art in the evaluation of visual spatial reasoning, offering rigorous, multidimensional testbeds critical for the further development of spatially intelligent machine perception.