
VSI-Bench: Vision-Language Spatial Reasoning

Updated 19 November 2025
  • VSI-Bench is a standardized evaluation suite assessing vision-language models' ability to perform spatial reasoning using egocentric video data across tasks like object counting and route planning.
  • It categorizes tasks into numerical (e.g., distance or size estimation measured by mean-relative accuracy) and classification (e.g., relative direction with exact-match accuracy) sub-tasks.
  • Geometry-centric fine-tuning with datasets like Euclid30K leads to notable performance gains, underscoring the importance of fundamental spatial priors for model performance.

Visual Spatial Intelligence Benchmark (VSI-Bench) is a comprehensive evaluation suite for vision-language models (VLMs) centered on physical spatial reasoning, relational judgment, and navigation in 3D visual environments. VSI-Bench focuses specifically on assessing models’ ability to parse, conceptualize, and operate on egocentric video data through a set of objectively scored, spatially grounded question-answering (QA) tasks. By standardizing datasets, metrics, and sub-task divisions, VSI-Bench enables detailed, quantitative comparison of spatial reasoning capacity across a wide range of multimodal models and approaches (Lian et al., 29 Sep 2025).

1. Dataset Structure and Task Taxonomy

VSI-Bench comprises approximately 5,130 video question–answer pairs, each constructed from short egocentric video sequences sourced from three major real-world 3D scan datasets: ARKitScenes, ScanNet, and ScanNet++. Each sample consists of a 32-frame video, with frames uniformly sampled from walkthroughs spanning varied trajectories and environments, together with a natural language question targeting fine-grained spatial properties, relations, or events within the scene.
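For concreteness, a single benchmark sample can be represented along the following lines. This is a hypothetical Python sketch; the field names are illustrative and not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class VSIBenchSample:
    """Illustrative record for one VSI-Bench QA pair; field names are hypothetical."""
    video_id: str                        # source scan from ARKitScenes, ScanNet, or ScanNet++
    frames: List[str]                    # paths to the 32 uniformly sampled frames
    question: str                        # natural-language spatial question
    sub_task: str                        # one of the eight sub-tasks, e.g. "object_counting"
    answer: Union[float, str]            # numeric ground truth or multiple-choice letter
    choices: Optional[List[str]] = None  # options, present only for multiple-choice items
```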

The overall QA tasks are partitioned into eight sub-tasks, organized into two broad classes:

  • Numerical Sub-Tasks (regression, mean-relative-accuracy as metric):
  1. Object counting
  2. Absolute distance estimation
  3. Object-size estimation
  4. Room-size estimation
  • Multiple-Choice Sub-Tasks (classification, exact-match accuracy as metric):
  1. Relative distance (“which object is closer?”)
  2. Relative direction (“which direction is X from Y?”)
  3. Route planning (“which path?”)
  4. Appearance order (“which object was seen first/last?”)

This sub-task partition covers both metric spatial computation and relational reasoning, as well as pseudo-temporal aspects via appearance order.
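The partition also determines which metric applies to each sub-task. A minimal illustrative mapping (the identifiers are mine, not the benchmark's):

```python
# Sub-task -> (task type, metric); identifiers are illustrative, not official.
VSI_BENCH_TASKS = {
    "object_counting":    ("numerical", "mean_relative_accuracy"),
    "absolute_distance":  ("numerical", "mean_relative_accuracy"),
    "object_size":        ("numerical", "mean_relative_accuracy"),
    "room_size":          ("numerical", "mean_relative_accuracy"),
    "relative_distance":  ("multiple_choice", "exact_match"),
    "relative_direction": ("multiple_choice", "exact_match"),
    "route_planning":     ("multiple_choice", "exact_match"),
    "appearance_order":   ("multiple_choice", "exact_match"),
}
```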

2. Evaluation Metrics and Protocols

Numerical sub-tasks use mean relative accuracy (MRA), computed across a set of confidence thresholds $\mathcal{C} = \{0.5, 0.55, \ldots, 0.95\}$ to quantify the proximity between model prediction and ground-truth answer.
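In the standard VSI-Bench formulation, with ground-truth value $y$ and model prediction $\hat{y}$:

$$\mathrm{MRA} = \frac{1}{|\mathcal{C}|} \sum_{\theta \in \mathcal{C}} \mathbf{1}\!\left[\frac{|\hat{y} - y|}{y} < 1 - \theta\right]$$

A prediction within 5% relative error thus satisfies all ten thresholds and scores 1.0, while one within 50% scores at least 0.1.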

For classification sub-tasks, exact-match accuracy is reported:

$$\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[a_i = \hat{a}_i],$$

where $a_i$ is the true answer and $\hat{a}_i$ is the model’s prediction.
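A compact reference implementation of both metrics, assuming the MRA formulation above (a sketch, not the official evaluation code):

```python
import numpy as np

THRESHOLDS = np.linspace(0.5, 0.95, 10)  # C = {0.50, 0.55, ..., 0.95}

def mean_relative_accuracy(y_pred: float, y_true: float) -> float:
    """Fraction of thresholds theta for which relative error is below 1 - theta."""
    rel_err = abs(y_pred - y_true) / abs(y_true)
    return float(np.mean(rel_err < (1.0 - THRESHOLDS)))

def exact_match_accuracy(preds, answers) -> float:
    """Share of multiple-choice predictions matching the ground truth exactly."""
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)

# A size estimate off by 20% passes the 6 loosest of the 10 thresholds:
print(mean_relative_accuracy(120.0, 100.0))          # 0.6
print(exact_match_accuracy(["A", "C"], ["A", "B"]))  # 0.5
```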

Zero-shot evaluation is employed for most benchmarked systems, where models receive only the video frames and the associated question as input. The standard evaluation pipeline uses the lmms-eval toolkit with temperature 0 and a maximum output length of 1024 tokens per QA instance, and employs prompts with explicit chain-of-thought markup (<think> tags for reasoning, \boxed{} for the final answer) to elicit stepwise spatial inference (Lian et al., 29 Sep 2025).
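A schematic of such a decoding configuration and prompt template (illustrative only; this is not the actual lmms-eval configuration):

```python
# Illustrative decoding settings and CoT prompt template; hypothetical,
# not taken from the lmms-eval configuration files.
GENERATION_KWARGS = {
    "temperature": 0.0,      # greedy decoding for reproducibility
    "max_new_tokens": 1024,  # cap on output length per QA instance
}

PROMPT_TEMPLATE = (
    "You are given 32 frames from an egocentric video of an indoor scene.\n"
    "{question}\n"
    "Reason step by step inside <think> ... </think> tags, then give the "
    "final answer inside \\boxed{{}}."
)

prompt = PROMPT_TEMPLATE.format(question="How many chairs are in the room?")
```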

3. Baseline Model Performance

VSI-Bench enables direct comparison of proprietary, open-source, and specialized spatially-tuned VLMs:

| Model (family/size) | Zero-Shot VSI-Bench Accuracy (%) |
| --- | --- |
| GPT-4o (OpenAI, proprietary) | 34.0 |
| Gemini-1.5 Pro | 48.8 |
| Gemini-2.0 Flash | 45.4 |
| LongVA-7B | 29.2 |
| Qwen2.5VL-3B (open-source) | 29.2 |
| Qwen2.5VL-7B | 24.8 |
| RoboBrain2.0-7B | 43.0 |
| RoboBrain2.0-32B | 43.1 |
| VILA-1.5-40B | 31.2 |
| LLaVA-OneVision-72B | 40.2 |
| LLaVA-Video-72B | 40.9 |
| Spatial-MLLM-4B (prior SOTA) | 48.4 |
| M2-Reasoning-7B | 42.3 |

Results vary substantially: proprietary or specially fine-tuned models attain nearly 49% accuracy, while most general-purpose VLMs remain under 40%. Spatially tuned systems are comparatively strong on the route-planning and spatial-relation sub-tasks, whereas the appearance-order (pseudo-temporal) sub-task improves less (Lian et al., 29 Sep 2025).

4. Impact of Geometry-Centric Surrogate Task Fine-Tuning

Fine-tuning with geometry-focused surrogate tasks, specifically via the Euclid30K dataset (≈30,000 multimodal geometry problems with symbolic and numeric answers), leads to statistically robust gains. Models such as RoboBrain2.0-Euclid-7B achieve 49.6% accuracy, surpassing prior state-of-the-art Spatial-MLLM-4B (48.4%) and proprietary Gemini-1.5 Pro (48.8%). Improvements are consistent across all parameter scales, averaging +2 to +7 percentage points (Lian et al., 29 Sep 2025).

Per-subtask gains for RoboBrain2.0-Euclid-7B after Euclid30K fine-tuning:

| VSI-Bench Sub-Task | Pretrain Accuracy (%) | Post-Euclid30K Accuracy (%) |
| --- | --- | --- |
| Object Counting | 46.0 | 66.4 |
| Abs. Distance | 32.7 | 36.9 |
| Object Size | 58.9 | 66.3 |
| Room Size | 35.9 | 40.5 |
| Rel. Distance | 45.9 | 48.3 |
| Rel. Direction | 41.5 | 45.3 |
| Route Planning | 30.9 | 35.6 |
| Appearance Order | 55.2 | 57.8 |

These improvements are causally linked to the geometry-specific curriculum: ablation studies with matched-size, non-geometry spatial QA (Clevr-CoGenT) show 1–4 percentage point lower gains. The results suggest that transfer arises from the acquisition of fundamental Euclidean axioms and spatial priors (congruence, parallelism, metric inference), rather than generic multi-task RL or broader dataset scaling (Lian et al., 29 Sep 2025).

A notable limitation emerges for sub-tasks with an explicitly temporal component (e.g., appearance order), which saw only marginal improvement from static-geometry pretraining—indicating the need for further inclusion of spatio-temporal or 3D-rotation prior data for comprehensive coverage.

5. Technical Implementation and Model Training Regime

VSI-Bench-compliant fine-tuning pipelines for geometry-centric transfer employ Group Relative Policy Optimization (GRPO) with the following settings:

  • Reward Design (a schematic implementation appears after this list):
    • Symbolic answers: MathVerify checks LaTeX equivalence.
    • Numeric answers: score of 1 if relative error ≤ 1%, 0 otherwise.
    • Multiple-choice: 1 for an exact match, 0 otherwise.
  • Optimization Objective:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\,o_i}\left[\, \mathcal{J}(\theta) - \beta\, \mathrm{KL}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \,\right]$$

where $\mathcal{J}(\theta)$ is the token-level PPO-style surrogate reward and $\pi_{\mathrm{ref}}$ is the frozen reference policy.

  • Training Details:
    • 10 epochs on 64 × NVIDIA H100 GPUs
    • Adam optimizer (1e-6 learning rate, 1e-2 weight decay)
    • PPO clip $\epsilon = 0.2$, KL coefficient $\beta = 10^{-2}$
    • 8 rollouts per question, actor batch 128, max-grad-norm 1.0
    • Image resolution between $512 \times 512$ and $2048 \times 2048$
  • Evaluation:
    • All VSI-Bench results are zero-shot—no further adaptation is performed after surrogate task fine-tuning.
    • Prompting with explicit chain-of-thought (CoT) format is critical for extracting stepwise reasoning.
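The three-part reward design above can be sketched as follows. This is a minimal illustration under stated assumptions: the math_verify calls reflect my understanding of that library's interface, and the helper name is hypothetical, not the authors' code.

```python
# Schematic GRPO reward for the three answer types; illustrative only.
# The math_verify usage is an assumption about that library's interface.
from math_verify import parse, verify  # symbolic LaTeX equivalence checker

def grpo_reward(answer_type: str, prediction: str, target: str) -> float:
    if answer_type == "symbolic":
        # 1.0 if the predicted LaTeX expression is equivalent to the target.
        return 1.0 if verify(parse(target), parse(prediction)) else 0.0
    if answer_type == "numeric":
        # 1.0 only when the relative error is at most 1%.
        pred, gold = float(prediction), float(target)
        return 1.0 if abs(pred - gold) / abs(gold) <= 0.01 else 0.0
    if answer_type == "multiple_choice":
        # 1.0 for an exact letter match, e.g. "B" == "B".
        return 1.0 if prediction.strip() == target.strip() else 0.0
    raise ValueError(f"unknown answer type: {answer_type}")
```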

6. Role in the Broader Spatial Reasoning Landscape

VSI-Bench sits alongside other spatial and spatial–temporal benchmarks (e.g., Super-CLEVR, Omni3DBench, MindCube), but distinguishes itself through its strong orientation toward egocentric 3D video, real-world scan data, and its combination of metric, relational, and route-planning sub-tasks. The observed transfer from geometry-based pretraining via Euclid30K is robust even when compared against large-scale non-geometry multimodal QA (as shown in the causal ablation, Table 5) (Lian et al., 29 Sep 2025).

These findings suggest that fundamental geometric axioms and deductive reasoning skills provide transferable priors directly applicable to the wide array of spatial inference and navigation tasks in VSI-Bench.

7. Limitations and Future Directions

Despite robust gains from geometry-centric surrogate learning, VSI-Bench evaluations suggest persistent open challenges. Specifically, mental-rotation tasks and subtasks requiring explicit temporal memory benefit less from geometry-heavy pretraining, likely due to the predominance of plane-geometry problems and the lack of temporal or 3D-rotation representation within the current Euclid30K corpus. A plausible implication is that an augmented curriculum incorporating more solid (3D) geometry and temporally-dependent spatial tasks may yield further advances in spatio-temporal intelligence (Lian et al., 29 Sep 2025).
