
VSI-Bench: Vision-Language Spatial Reasoning

Updated 19 November 2025
  • VSI-Bench is a standardized evaluation suite assessing vision-language models' ability to perform spatial reasoning using egocentric video data across tasks like object counting and route planning.
  • It categorizes tasks into numerical (e.g., distance or size estimation measured by mean-relative accuracy) and classification (e.g., relative direction with exact-match accuracy) sub-tasks.
  • Geometry-centric fine-tuning with datasets like Euclid30K leads to notable performance gains, underscoring the importance of fundamental spatial priors for model performance.

Visual Spatial Intelligence Benchmark (VSI-Bench) is a comprehensive evaluation suite for vision-language models (VLMs) centered on physical spatial reasoning, relational judgment, and navigation in 3D visual environments. VSI-Bench focuses specifically on assessing models’ ability to parse, conceptualize, and operate on egocentric video data through a set of objectively scored, spatially grounded question-answering (QA) tasks. By standardizing datasets, metrics, and sub-task divisions, VSI-Bench enables detailed, quantitative comparison of spatial reasoning capacity across a wide range of multimodal models and approaches (Lian et al., 29 Sep 2025).

1. Dataset Structure and Task Taxonomy

VSI-Bench comprises approximately 5,130 video question–answer pairs, each constructed from short egocentric video sequences sourced from three major real-world 3D scan datasets: ARKitScenes, ScanNet, and ScanNet++. Each sample consists of a 32-frame video, with frames uniformly sampled from walkthroughs spanning varied trajectories and environments, together with a natural language question targeting fine-grained spatial properties, relations, or events within the scene.
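For concreteness, a single benchmark sample can be represented along the following lines. This is a hypothetical Python sketch; the field names are illustrative and not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class VSIBenchSample:
    """Illustrative record for one VSI-Bench QA pair; field names are hypothetical."""
    video_id: str                        # source scan from ARKitScenes, ScanNet, or ScanNet++
    frames: List[str]                    # paths to the 32 uniformly sampled frames
    question: str                        # natural-language spatial question
    sub_task: str                        # one of the eight sub-tasks, e.g. "object_counting"
    answer: Union[float, str]            # numeric ground truth or multiple-choice letter
    choices: Optional[List[str]] = None  # options, present only for multiple-choice items
```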

The overall QA tasks are partitioned into eight sub-tasks, organized into two broad classes:

  • Numerical Sub-Tasks (regression, mean-relative-accuracy as metric):
  1. Object counting
  2. Absolute distance estimation
  3. Object-size estimation
  4. Room-size estimation
  • Multiple-Choice Sub-Tasks (classification, exact-match accuracy as metric):
  1. Relative distance (“which object is closer?”)
  2. Relative direction (“which direction is X from Y?”)
  3. Route planning (“which path?”)
  4. Appearance order (“which object was seen first/last?”)

This sub-task partition covers both metric spatial computation and relational reasoning, as well as pseudo-temporal aspects via appearance order.
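The partition also determines which metric applies to each sub-task. A minimal illustrative mapping (the identifiers are mine, not the benchmark's):

```python
# Sub-task -> (task type, metric); identifiers are illustrative, not official.
VSI_BENCH_TASKS = {
    "object_counting":    ("numerical", "mean_relative_accuracy"),
    "absolute_distance":  ("numerical", "mean_relative_accuracy"),
    "object_size":        ("numerical", "mean_relative_accuracy"),
    "room_size":          ("numerical", "mean_relative_accuracy"),
    "relative_distance":  ("multiple_choice", "exact_match"),
    "relative_direction": ("multiple_choice", "exact_match"),
    "route_planning":     ("multiple_choice", "exact_match"),
    "appearance_order":   ("multiple_choice", "exact_match"),
}
```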

2. Evaluation Metrics and Protocols

Numerical sub-tasks use mean relative accuracy (MRA), computed across a set of confidence thresholds $\mathcal{C} = \{0.5, 0.55, \ldots, 0.95\}$ to quantify the proximity between model prediction and ground-truth answer.
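In the standard VSI-Bench formulation, with ground-truth value $y$ and model prediction $\hat{y}$:

$$\mathrm{MRA} = \frac{1}{|\mathcal{C}|} \sum_{\theta \in \mathcal{C}} \mathbf{1}\!\left[\frac{|\hat{y} - y|}{y} < 1 - \theta\right]$$

A prediction within 5% relative error thus satisfies all ten thresholds and scores 1.0, while one within 50% scores at least 0.1.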

For classification sub-tasks, exact-match accuracy is reported:

$$\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[a_i = \hat{a}_i],$$

where $a_i$ is the true answer and $\hat{a}_i$ is the model’s prediction.
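A compact reference implementation of both metrics, assuming the MRA formulation above (a sketch, not the official evaluation code):

```python
import numpy as np

THRESHOLDS = np.linspace(0.5, 0.95, 10)  # C = {0.50, 0.55, ..., 0.95}

def mean_relative_accuracy(y_pred: float, y_true: float) -> float:
    """Fraction of thresholds theta for which relative error is below 1 - theta."""
    rel_err = abs(y_pred - y_true) / abs(y_true)
    return float(np.mean(rel_err < (1.0 - THRESHOLDS)))

def exact_match_accuracy(preds, answers) -> float:
    """Share of multiple-choice predictions matching the ground truth exactly."""
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)

# A size estimate off by 20% passes the 6 loosest of the 10 thresholds:
print(mean_relative_accuracy(120.0, 100.0))          # 0.6
print(exact_match_accuracy(["A", "C"], ["A", "B"]))  # 0.5
```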

Zero-shot evaluation is employed for most benchmarked systems, where models receive only the video frames and the associated question as input. The standard evaluation pipeline uses the lmms-eval toolkit with temperature 0 and a maximum output length of 1024 tokens per QA instance, and employs prompts with explicit chain-of-thought markup (<think> tags for reasoning, \boxed{} for the final answer) to elicit stepwise spatial inference (Lian et al., 29 Sep 2025).
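A schematic of such a decoding configuration and prompt template (illustrative only; this is not the actual lmms-eval configuration):

```python
# Illustrative decoding settings and CoT prompt template; hypothetical,
# not taken from the lmms-eval configuration files.
GENERATION_KWARGS = {
    "temperature": 0.0,      # greedy decoding for reproducibility
    "max_new_tokens": 1024,  # cap on output length per QA instance
}

PROMPT_TEMPLATE = (
    "You are given 32 frames from an egocentric video of an indoor scene.\n"
    "{question}\n"
    "Reason step by step inside <think> ... </think> tags, then give the "
    "final answer inside \\boxed{{}}."
)

prompt = PROMPT_TEMPLATE.format(question="How many chairs are in the room?")
```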

3. Baseline Model Performance

VSI-Bench enables direct comparison of proprietary, open-source, and specialized spatially-tuned VLMs:

| Model (family/size) | Zero-Shot VSI-Bench Accuracy (%) |
| --- | --- |
| GPT-4o (OpenAI, proprietary) | 34.0 |
| Gemini-1.5 Pro | 48.8 |
| Gemini-2.0 Flash | 45.4 |
| LongVA-7B | 29.2 |
| Qwen2.5VL-3B (open-source) | 29.2 |
| Qwen2.5VL-7B | 24.8 |
| RoboBrain2.0-7B | 43.0 |
| RoboBrain2.0-32B | 43.1 |
| VILA-1.5-40B | 31.2 |
| LLaVA-OneVision-72B | 40.2 |
| LLaVA-Video-72B | 40.9 |
| Spatial-MLLM-4B (prior SOTA) | 48.4 |
| M2-Reasoning-7B | 42.3 |

Results vary substantially: proprietary or specially fine-tuned models attain nearly 49% accuracy, while most general-purpose VLMs remain under 40%. Spatially tuned systems are comparatively strong on the route-planning and spatial-relation sub-tasks, whereas the appearance-order (pseudo-temporal) sub-task improves less (Lian et al., 29 Sep 2025).

4. Impact of Geometry-Centric Surrogate Task Fine-Tuning

Fine-tuning with geometry-focused surrogate tasks, specifically via the Euclid30K dataset (≈30,000 multimodal geometry problems with symbolic and numeric answers), leads to statistically robust gains. Models such as RoboBrain2.0-Euclid-7B achieve 49.6% accuracy, surpassing prior state-of-the-art Spatial-MLLM-4B (48.4%) and proprietary Gemini-1.5 Pro (48.8%). Improvements are consistent across all parameter scales, averaging +2 to +7 percentage points (Lian et al., 29 Sep 2025).

Per-subtask gains for RoboBrain2.0-Euclid-7B after Euclid30K fine-tuning:

| VSI-Bench Sub-Task | Pretrain Accuracy (%) | Post-Euclid30K Accuracy (%) |
| --- | --- | --- |
| Object Counting | 46.0 | 66.4 |
| Abs. Distance | 32.7 | 36.9 |
| Object Size | 58.9 | 66.3 |
| Room Size | 35.9 | 40.5 |
| Rel. Distance | 45.9 | 48.3 |
| Rel. Direction | 41.5 | 45.3 |
| Route Planning | 30.9 | 35.6 |
| Appearance Order | 55.2 | 57.8 |

These improvements are causally linked to the geometry-specific curriculum: ablation studies with matched-size, non-geometry spatial QA (Clevr-CoGenT) show 1–4 percentage point lower gains. The results suggest that transfer arises from the acquisition of fundamental Euclidean axioms and spatial priors (congruence, parallelism, metric inference), rather than generic multi-task RL or broader dataset scaling (Lian et al., 29 Sep 2025).

A notable limitation emerges for sub-tasks with an explicitly temporal component (e.g., appearance order), which saw only marginal improvement from static-geometry pretraining—indicating the need for further inclusion of spatio-temporal or 3D-rotation prior data for comprehensive coverage.

5. Technical Implementation and Model Training Regime

VSI-Bench-compliant fine-tuning pipelines for geometry-centric transfer employ Group Relative Policy Optimization (GRPO) with the following settings:

  • Reward Design (a schematic implementation appears after this list):
    • Symbolic answers: MathVerify checks LaTeX equivalence.
    • Numeric answers: score of 1 if relative error ≤ 1%, 0 otherwise.
    • Multiple-choice: 1 for an exact match, 0 otherwise.
  • Optimization Objective:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\,o_i}\left[\, \mathcal{J}(\theta) - \beta\, \mathrm{KL}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \,\right]$$

where $\mathcal{J}(\theta)$ is the token-level PPO-style surrogate reward and $\pi_{\mathrm{ref}}$ is the frozen reference policy.

  • Training Details:
    • 10 epochs on 64 × NVIDIA H100 GPUs
    • Adam optimizer (1e-6 learning rate, 1e-2 weight decay)
    • PPO clip $\epsilon = 0.2$, KL coefficient $\beta = 10^{-2}$
    • 8 rollouts per question, actor batch 128, max-grad-norm 1.0
    • Image resolution between $512 \times 512$ and $2048 \times 2048$
  • Evaluation:
    • All VSI-Bench results are zero-shot—no further adaptation is performed after surrogate task fine-tuning.
    • Prompting with explicit chain-of-thought (CoT) format is critical for extracting stepwise reasoning.
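The three-part reward design above can be sketched as follows. This is a minimal illustration under stated assumptions: the math_verify calls reflect my understanding of that library's interface, and the helper name is hypothetical, not the authors' code.

```python
# Schematic GRPO reward for the three answer types; illustrative only.
# The math_verify usage is an assumption about that library's interface.
from math_verify import parse, verify  # symbolic LaTeX equivalence checker

def grpo_reward(answer_type: str, prediction: str, target: str) -> float:
    if answer_type == "symbolic":
        # 1.0 if the predicted LaTeX expression is equivalent to the target.
        return 1.0 if verify(parse(target), parse(prediction)) else 0.0
    if answer_type == "numeric":
        # 1.0 only when the relative error is at most 1%.
        pred, gold = float(prediction), float(target)
        return 1.0 if abs(pred - gold) / abs(gold) <= 0.01 else 0.0
    if answer_type == "multiple_choice":
        # 1.0 for an exact letter match, e.g. "B" == "B".
        return 1.0 if prediction.strip() == target.strip() else 0.0
    raise ValueError(f"unknown answer type: {answer_type}")
```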

6. Role in the Broader Spatial Reasoning Landscape

VSI-Bench sits alongside other spatial and spatial–temporal benchmarks (e.g., Super-CLEVR, Omni3DBench, MindCube), but distinguishes itself through its strong orientation toward egocentric 3D video, real-world scan data, and its combination of metric, relational, and route-planning sub-tasks. The observed transfer from geometry-based pretraining via Euclid30K is robust even when compared against large-scale non-geometry multimodal QA (as shown in the causal ablation, Table 5) (Lian et al., 29 Sep 2025).

These findings suggest that fundamental geometric axioms and deductive reasoning skills provide transferable priors directly applicable to the wide array of spatial inference and navigation tasks in VSI-Bench.

7. Limitations and Future Directions

Despite robust gains from geometry-centric surrogate learning, VSI-Bench evaluations suggest persistent open challenges. Specifically, mental-rotation tasks and subtasks requiring explicit temporal memory benefit less from geometry-heavy pretraining, likely due to the predominance of plane-geometry problems and the lack of temporal or 3D-rotation representation within the current Euclid30K corpus. A plausible implication is that an augmented curriculum incorporating more solid (3D) geometry and temporally-dependent spatial tasks may yield further advances in spatio-temporal intelligence (Lian et al., 29 Sep 2025).
