VSI-Bench: 3D Spatial Reasoning Benchmark
- Visual Spatial Intelligence Benchmark (VSI-Bench) is a large-scale, video-based evaluation suite that measures multimodal models' ability to perceive, memorize, and reason about 3D spaces.
- It comprises eight spatial tasks with over 5,000 QA pairs drawn from diverse indoor environments, rigorously probing configurational, metric, and spatiotemporal reasoning.
- The benchmark highlights current limitations in global scene integration while demonstrating that explicit cognitive mapping can enhance local spatial accuracy.
Visual Spatial Intelligence Benchmark (VSI-Bench) is a large-scale, video-based evaluation suite that characterizes the capacity of multimodal LLMs (MLLMs) to perceive, memorize, and reason about 3D spaces from egocentric video sequences. VSI-Bench is designed to be visually grounded and to measure genuine spatial reasoning in the absence of language or world-knowledge shortcuts. Its core contribution is a set of eight spatial intelligence tasks and over 5,000 question–answer pairs derived from annotated real-world RGB-D video datasets, providing a rigorous substrate for measuring and advancing spatial intelligence in vision–LLMs (Yang et al., 2024).
1. Dataset Composition and Benchmark Structure
VSI-Bench consists of 288 real-world indoor video sequences, drawn from ScanNet (88 videos), ScanNet++ (50 videos), and ARKitScenes (150 videos), each re-encoded to continuous 24–30 fps RGB at a resolution of 640×480. While the original data includes RGB-D streams and reconstructed 3D meshes, only RGB frames are given to models; 3D and depth information is used offline for QA generation and evaluation. Scene coverage spans residential, professional, and industrial spaces across multiple geographic regions. The full benchmark comprises 5,060 QA pairs.
Each QA pair targets a distinct spatial relation, measurement, or navigation property. Task coverage includes configurational understanding (object count, relative distance/direction, navigation steps), metric estimation (object/room size, absolute distance), and spatiotemporal reasoning (appearance order of objects).
2. Task Taxonomy and Probing Dimensions
VSI-Bench organizes its spatial cognition challenges into eight distinct tasks, grouped by the cognitive primitive they probe:
- Configurational (Multiple-Choice):
- Object Counting (“How many chairs are in this room?”)
- Relative Distance (“Which of {A, B, C, D} is closest to the TV?”)
- Relative Direction (three levels: binary, trinary, and quaternary sectors)
- Route Planning (“Fill in turn steps from bed to toilet…”)
- Measurement Estimation (Numerical):
- Object Size (“Length of sofa in centimeters”)
- Room Size (“Area of room in square meters”)
- Absolute Distance (“Distance between couch and window in meters”)
- Spatiotemporal:
- Appearance Order (“In what order do table, vase, lamp, chair first appear?”)
This structure probes a range from local geometric relations (adjacency, direction) to global layout (multi-step path planning, metric estimation) and temporally extended memory (object orderings).
3. QA Generation and Quality Control Pipeline
VSI-Bench adopts a hybrid, schema-driven annotation process:
- Meta-information Unification:
Object categories, bounding boxes, and reconstructed room point clouds are parsed into a unified schema. Rare or tiny objects are filtered and categories standardized.
- Template-based Auto-Annotation:
For all but Route Planning, QA is generated automatically by instantiating templates over scene annotations. For example, absolute distances are measured by finding the minimum Euclidean separation between random points in object bounding boxes; room area is computed using an alpha-shape polygon on the reconstructed floor mesh.
- Human Annotation for Route Planning:
Short, two-to-five step navigation problems are designed by annotators, who then convert part of each route into fill-in-the-blank sequences.
- Ambiguity Filtering and Review:
Heuristic thresholds on inter-object distances, angular separation, and temporal gaps filter ambiguous cases, followed by a human-in-the-loop review (blind annotation to ensure absence of answer leakage).
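The template-based distance measurement described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual pipeline: the box representation (`center`, `size`), the sample count, and both function names are assumptions.

```python
import numpy as np

def sample_box_points(center, size, n=256, rng=None):
    """Sample n points uniformly inside an axis-aligned 3D bounding box.
    `center` and `size` are length-3 arrays (illustrative schema)."""
    rng = rng or np.random.default_rng(0)
    center, size = np.asarray(center, float), np.asarray(size, float)
    return center + (rng.random((n, 3)) - 0.5) * size

def min_object_distance(box_a, box_b, n=256):
    """Approximate the minimum Euclidean separation between two objects
    by sampling random points inside their bounding boxes."""
    pa = sample_box_points(*box_a, n=n)
    pb = sample_box_points(*box_b, n=n)
    # All pairwise distances between the two point clouds (n x n matrix).
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return float(d.min())

# Example: two unit boxes whose centers are 3 m apart on the x-axis,
# so the gap between facing faces is 2 m.
dist = min_object_distance(([0, 0, 0], [1, 1, 1]), ([3, 0, 0], [1, 1, 1]))
```

With enough samples, the estimate converges toward the true face-to-face gap from above; denser sampling (or an exact box-to-box distance) trades speed for precision.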
4. Evaluation Metrics and Scoring Methodology
VSI-Bench defines two classes of evaluation metric:
- Multiple-Choice Answer (MCA) Tasks:
Measured using standard exact-match accuracy over the selected option:

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]$$

- Numerical Answer (NA) Tasks:
Quantified by Mean Relative Accuracy (MRA). For ground truth $y$ and prediction $\hat{y}$, the relative error is $|\hat{y} - y| / y$. For confidence thresholds $\theta \in \mathcal{C} = \{0.50, 0.55, \ldots, 0.95\}$:

$$\mathrm{MRA} = \frac{1}{|\mathcal{C}|} \sum_{\theta \in \mathcal{C}} \mathbb{1}\left[\frac{|\hat{y} - y|}{y} < 1 - \theta\right]$$

This averages the fraction of samples falling within progressively looser error tolerances.
For route planning, exact-match accuracy over the filled instruction sequence is reported.
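The two metric classes can be implemented in a few lines. This is a sketch of the scoring rules as defined above; the official evaluator may differ in answer parsing and edge-case handling.

```python
import numpy as np

# Confidence thresholds used by Mean Relative Accuracy: {0.50, 0.55, ..., 0.95}.
THETAS = np.linspace(0.50, 0.95, 10)

def mra(y_true, y_pred):
    """Mean Relative Accuracy: average, over samples and thresholds theta,
    of the indicator that |y_hat - y| / y < 1 - theta."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rel_err = np.abs(y_pred - y_true) / y_true
    # Rows index samples, columns index thresholds; mean over both.
    hits = rel_err[:, None] < (1.0 - THETAS)[None, :]
    return float(hits.mean())

def mca_accuracy(y_true, y_pred):
    """Exact-match accuracy for multiple-choice (MCA) tasks."""
    return float(np.mean([a == b for a, b in zip(y_true, y_pred)]))

perfect = mra([10.0], [10.0])   # zero error satisfies every threshold
partial = mra([10.0], [13.2])   # 32% error passes only the looser thresholds
```

The same exact-match scorer applies to route planning over the filled instruction sequence.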
5. Baseline Performance and Cognitive Mapping Technique
Human annotator performance establishes an upper bound: 79.2% average accuracy, with configurational tasks between 94% and 100%. The best proprietary MLLM, Gemini-1.5 Pro, achieves 45.4% overall; GPT-4o trails at 34.0%. Competitive open-source models (e.g., LLaVA-NeXT-Video-72B) reach 40.9%.
A cognitive mapping protocol, inspired by mental imagery literature, prompts MLLMs to explicitly generate a “map” by placing object category centers on a 10×10 grid. Models are asked to output a JSON dictionary of predicted centers for each object. Evaluation considers pairwise grid distances; local (adjacent) object relations are predicted correctly around 64% of the time, but accuracy declines for more global structures. When answering Relative Distance questions, two-stage prompting via explicit map construction boosts accuracy from 46% to 56%; when supplied with ground-truth maps, performance rises to 66%.
| Model | Count | AbsDist | ObjSize | RoomSize | RelDist | RelDir | RoutePlan | ApprOrder | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Human | 94.3 | 47.0 | 60.4 | 45.9 | 94.7 | 95.8 | 95.8 | 100.0 | 79.2 |
| Gemini-1.5 Pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
| LLaVA-NeXT-Video-72B | 48.9 | 22.8 | 57.4 | 35.3 | 42.4 | 36.7 | 35.0 | 48.6 | 40.9 |
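The cognitive-map evaluation described above can be sketched as follows. The JSON schema, the distance-agreement criterion, the tolerance value, and all function names here are illustrative assumptions, not the paper's exact evaluator.

```python
import json
import math

def parse_cogmap(raw_json):
    """Parse a model's JSON cognitive map, assumed to have the form
    {object_name: [col, row]} on a 10x10 grid."""
    cmap = json.loads(raw_json)
    return {k: (float(v[0]), float(v[1])) for k, v in cmap.items()}

def pairwise_agreement(pred, gt, tol=1.5):
    """Fraction of object pairs whose predicted grid distance is within
    `tol` cells of the ground-truth distance (an illustrative criterion)."""
    names = [n for n in gt if n in pred]
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    if not pairs:
        return 0.0
    def dist(m, a, b):
        return math.dist(m[a], m[b])
    ok = sum(abs(dist(pred, a, b) - dist(gt, a, b)) <= tol for a, b in pairs)
    return ok / len(pairs)

# Toy example: the adjacent bed-lamp pair is placed consistently, but the
# distant door is mislocated, mirroring the local-vs-global gap.
pred = parse_cogmap('{"bed": [2, 3], "lamp": [3, 3], "door": [9, 8]}')
gt = {"bed": (2.0, 2.0), "lamp": (3.0, 2.0), "door": (5.0, 5.0)}
score = pairwise_agreement(pred, gt)
```

Scoring pairwise distances rather than absolute positions makes the metric invariant to translation of the whole map, which matches the intent of comparing relational layout rather than coordinates.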
6. Limitations, Failure Modes, and Insights
Analysis of error types and failure clusters reveals:
- Spatial Reasoning Bottleneck:
Approximately 71% of errors are attributed to challenges in relational reasoning or egocentric/allocentric frame transformation, rather than object perception or linguistic misunderstanding.
- Inefficacy of Chained Language Reasoning:
Standard linguistic reasoning methods (chain-of-thought, self-consistency, tree-of-thoughts) degrade performance by 1–4%, indicating that failures are not due to lack of explicit linguistic chaining but to missing spatial world modeling.
- Local Spatial Awareness, Global Structure Weakness:
MLLMs exhibit robust local spatial coherence (adjacent object relations, immediate metric estimates), but perform poorly on tasks demanding integration of global scene geometry or extended planning (multi-step navigation/planning <40% accuracy, even for Gemini-1.5 Pro).
- Performance Impact of Explicit Spatial Representations:
Prompting models to produce explicit “mental maps” can meaningfully improve answers on relation-based tasks, but the effect is mostly local; global layout and long-horizon transformations remain major limitations.
Illustrative model behaviors include correct egocentric-to-allocentric transformations in kitchen scenes, alongside persistent failures in which a bedroom camera pan is reasoned about purely egocentrically, so the model never recovers the allocentric room layout.
7. Context within the Broader Spatial Intelligence Benchmark Landscape
VSI-Bench is a foundational benchmark for evaluating spatial intelligence in MLLMs and serves as a template for subsequent work on visual–spatial reasoning (Yang et al., 2024). Unlike image-centric VQA benchmarks or text-based spatial tests, VSI-Bench focuses on video-based, metric 3D cognition with minimal linguistic shortcutting.
Recent studies (e.g., SITE (Wang et al., 8 May 2025) and MMSI-Video-Bench (Lin et al., 11 Dec 2025)) have extended the probe space to broader taxonomies and more comprehensive sub-benchmarks. Critically, VSI-Bench has catalyzed the development of evaluation and debiasing tools that quantify and filter exploitable non-visual shortcuts (e.g., test-set stress tests), yielding derivative debiased versions that better isolate visual reasoning (Brown et al., 6 Nov 2025).
Advances in model architectures—such as those leveraging large-scale, diverse spatial datasets and explicit cognitively motivated map modules—have shown improvements, but the human-model performance gap remains significant (often >30% in absolute accuracy terms) (Yang et al., 2024).
VSI-Bench establishes a rigorous, metric-standardized evaluation substrate for emergent visual–spatial intelligence in MLLMs, sharply delineating current model limitations in spatial world modeling, relational reasoning, and global scene integration. As research iterates on world model objectives, map-centric memory modules, and richer, more robust spatial pre-training, VSI-Bench remains a central instrument for progress measurement and comparative analysis in the field of embodied visual intelligence.