
MMSI-Video-Bench: Video Spatial Intelligence Benchmark

Updated 18 December 2025
  • MMSI-Video-Bench is a human-annotated benchmark evaluating video-based spatial intelligence via a hierarchical framework covering perception, planning, prediction, and cross-video reasoning.
  • It compiles 1,106 challenging questions from 1,278 clips across diverse domains, rigorously reviewed by 3D vision experts with detailed explanatory rationales.
  • The benchmark reveals significant gaps in current MLLMs, outlines error taxonomies, and motivates research in adaptive frame-sampling and native 3D integration.

MMSI-Video-Bench is a fully human-annotated multiple-choice benchmark designed to rigorously evaluate the video-based spatial intelligence of multimodal LLMs (MLLMs). Developed to address deficiencies in prior benchmarks—which typically focus on static images or template-generated questions—MMSI-Video-Bench grounds 1,106 challenging questions in 1,278 video clips, curated from both public datasets and in-house recordings. Structured around a hierarchical four-level framework incorporating Perception, Planning, Prediction, and Cross-Video Reasoning, the benchmark diagnoses embodied spatial reasoning capabilities critical for general-purpose AI assistants in physical environments. All items are reviewed by 3D vision experts and include explanatory rationales to ensure stringent grounding and disambiguation. MMSI-Video-Bench further supports domain-oriented sub-benchmarks for targeted evaluation and provides a systematic testbed for benchmarking state-of-the-art MLLMs, revealing significant gaps between human and AI spatial intelligence (Lin et al., 11 Dec 2025).

1. Spatial Intelligence Framework

MMSI-Video-Bench operationalizes video-based spatial intelligence through four hierarchically organized levels:

  1. Perception. This level emphasizes:
    • Spatial Construction: Inferring the positions, orientations, and shapes of entities, as well as pairwise spatial relations (e.g., front/back, near/far) at fixed timestamps.
    • Motion Understanding: Reasoning about camera movement, individual and inter-instance motions across time.

Performance on these subtasks is measured as exact-match accuracy (a short computation sketch follows at the end of this section):

$$A = \frac{\#\{\text{correct predictions}\}}{\#\{\text{total questions}\}}$$

  2. Planning. Assesses goal-driven decision-making, where models must select actions, such as determining which door a robot should open based on sparse video cues in an unfamiliar environment.
  3. Prediction. Involves forecasting spatial states or outcomes under hypothetical or future conditions, demanding integration of physical priors, including inertia and occlusion, with observed dynamics.
  4. Cross-Video Reasoning. Measures two capabilities:
    • Memory Update: Retain/update scene knowledge from temporally discontinuous video segments.
    • Multi-View Integration: Merge streams from divergent viewpoints to build unified spatial representations.

This layered approach captures the multifaceted demands on embodied video spatial intelligence essential for real-world AI agents.
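
The exact-match accuracy metric defined above reduces to a simple ratio. The following is a minimal sketch, assuming predictions and ground-truth answers are available as parallel lists of option letters; the function and variable names are illustrative rather than taken from the benchmark's released tooling.

```python
def exact_match_accuracy(predictions, ground_truths):
    """Exact-match accuracy: a prediction counts as correct only if it is
    identical to the ground-truth answer option."""
    assert len(predictions) == len(ground_truths) and ground_truths
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# Illustrative usage with hypothetical answer letters.
preds = ["A", "C", "B", "D"]
golds = ["A", "B", "B", "D"]
print(exact_match_accuracy(preds, golds))  # 0.75
```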

2. Dataset Construction and Characteristics

MMSI-Video-Bench anchors its 1,106 questions in 1,278 clips—amassing over 400 human annotation hours. Composition is as follows:

  • Sources: 25 public datasets (e.g., ScanNet, Matterport3D, Waymo, nuScenes, Ego4D, EPIC-KITCHENS, MultiSports, DROID, RH20T) and 140 anonymized in-house videos.
  • Standardization: Clips average 72 seconds, standardized and downsampled to preserve critical events while maintaining feasible computational requirements.
  • Sampling Diversity: Spans indoor scans, outdoor driving, egocentric operations, human activities, and robotics manipulations.

| Dataset Type | # Clips | FPS | Avg. Duration (s) |
|---|---|---|---|
| Indoor Scan (ScanNet, etc.) | 450 | 1–2 | 100 |
| Outdoor Env. (Waymo, etc.) | 300 | 4–5 | 20 |
| Ego-Int. (Ego4D, EPIC) | 200 | 2–8 | 130 |
| Exo-HA (MultiSports) | 150 | 2–12 | 25 |
| Robotics (DROID, RH20T) | 178 | 4 | 85 |

Annotation protocols involve rigorous expert review, with explanatory rationales supporting precise, unambiguous question grounding (Lin et al., 11 Dec 2025).

3. Sub-Benchmarks and Task Taxonomy

To facilitate targeted diagnostics, MMSI-Video-Bench defines three domain-specific sub-benchmarks:

1. Indoor Scene Perception Bench (523 samples)

  • Static Instance-Centric: Tests object attributes and inter-object relations invariant to viewpoint.
  • Static Camera-Centric: Assesses relations between the camera and scene (e.g., relative positioning).
  • Dynamic Scene: Captures state changes from human actions or object rearrangement.

2. Robot Bench (204 samples)

  • Manipulation: Focuses on fine-grained reasoning about object-level interactive motions.
  • Navigation: Challenges path-planning in indoor spaces governed by spatial constraints.

3. Grounding Bench (335 samples)

  • Target Grounding: Requires spatially driven object localization beyond referential tagging.
  • Temporal Localization: Demands identification of precise time segments via spatial cues.

This structuring ensures holistic coverage while enabling precise evaluation of spatial, temporal, and action-centered video intelligence (Lin et al., 11 Dec 2025).

4. Evaluation Protocol and Baselines

MMSI-Video-Bench evaluates both proprietary and open-source MLLMs, including GPT-4o, Gemini 3 Pro/2.5 Flash, O3, GPT-5, Claude-haiku, InternVL, QwenVL, LLaVA-Video, Seed-1.6-Vision, and Doubao-1.5. Evaluation is performed under two settings:

  • Uniform-50: Exactly 50 uniformly sampled frames per video (a minimal sampling sketch follows this list).
  • Sufficient-Coverage: All frames used in human annotation.
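
To make the Uniform-50 setting concrete, the snippet below is a minimal sketch of uniform frame-index selection. It assumes only that the clip's total frame count is known; the decoding step is omitted and the helper name is illustrative, not part of the benchmark's released code.

```python
import numpy as np

def uniform_frame_indices(num_frames: int, k: int = 50) -> list[int]:
    """Pick k frame indices spread evenly across the whole clip.
    Clips shorter than k frames are returned in full."""
    if num_frames <= k:
        return list(range(num_frames))
    # k evenly spaced positions over [0, num_frames - 1], rounded to integers.
    return np.linspace(0, num_frames - 1, num=k).round().astype(int).tolist()

# Example: a 72 s clip decoded at 2 FPS yields ~144 frames.
print(uniform_frame_indices(144)[:5])  # [0, 3, 6, 9, 12]
```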

Metrics:

  • Exact-Match Accuracy: An answer is correct only if identical to the ground truth.
  • Random Guessing Baseline: Approximates 24% (inverse of the average number of answer options).
  • Human Baseline: Human annotators achieve 96.4% accuracy, establishing the ceiling.

Notably, performance is reported for key subcategories (Spatial, Motion, Overall Average). The best MLLMs only modestly exceed the ~24% chance level and lag far behind human reasoning, e.g., Gemini 3 Pro (38.0%) and GPT-5 (36.8%) versus the 96.4% human baseline. No substantial gain is observed from using the full set of frames; Uniform-50 often matches or slightly outperforms Sufficient-Coverage, reflecting information redundancy (Lin et al., 11 Dec 2025).

5. Error Taxonomy and Model Failures

Analysis of 520 errors by four representative models (GPT-4o, Gemini 2.5 Flash, O3, QwenVL2.5-72B) yields five non-overlapping categories:

  1. Detailed Grounding Error: Failure to identify/track objects or actions at specific frames.
  2. ID Mapping Error: Confusion in maintaining consistent object identities across temporally separated frames.
  3. Geometric Reasoning Error: Mistakes in inferring spatial relationships such as front/back and near/far.
  4. Prompt Alignment Error: Incorrect interpretation of question precursors or prompts.
  5. Latent Logical Inference Error: Failure to integrate indirect cues or to draw required commonsense inferences.

Geometric reasoning errors dominate static spatial construction tasks, while grounding and logical inference errors are particularly frequent in cross-video correspondence. Subtler or long-horizon motion tasks exacerbate grounding inaccuracies, and planning/prediction tasks often reveal prompt alignment deficits. The systematic nature of these errors underscores fundamental limitations in MLLM spatial reasoning capabilities (Lin et al., 11 Dec 2025).

6. Methodological Observations

Analysis of methodology produces several key findings:

  • Frame Sampling: Increasing uniform coverage from 1 to 50 frames markedly improves MLLM performance, and uniform sampling outperforms contiguous-segment selection. Adaptive Keyframe Sampling (AKS), which relies on semantic similarity, fails to improve results (e.g., GPT-4o: 31.6% with uniform sampling vs. 28.4% with AKS-50), suggesting that the frames critical for spatial reasoning are not easily identified by standard semantic metrics.
  • 3D Spatial Cues: Integrating VGGT-generated multi-view point-cloud renders (10 views/clip) as auxiliary model input increases performance by less than 1% for all four tested models, indicating a systemic issue in leveraging explicit 3D information without architectural specialization.
  • Chain-of-Thought (CoT) Prompting: A three-step CoT schema (Analyze→Gather Evidence→Solve) does not yield consistent improvements (<1%), highlighting intrinsic model limitations rather than prompt deficiencies (Lin et al., 11 Dec 2025).
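
As an illustration of the three-step CoT schema above, the template below shows one way such a prompt could be phrased; the wording and the example question are assumptions for illustration, not the exact prompt used in the paper.

```python
# Hypothetical template for the Analyze -> Gather Evidence -> Solve schema.
COT_TEMPLATE = """You are answering a multiple-choice question about a video.
Step 1 (Analyze): restate what the question asks about spatial layout or motion.
Step 2 (Gather Evidence): list the frames and visual cues that bear on the question.
Step 3 (Solve): combine the evidence and output only the letter of the chosen option.

Question: {question}
Options: {options}
"""

def build_cot_prompt(question: str, options: list[str]) -> str:
    """Fill the schema template with a single benchmark item."""
    return COT_TEMPLATE.format(question=question, options=" ".join(options))

print(build_cot_prompt(
    "Which door is closest to the camera at the end of the clip?",
    ["A) left", "B) right", "C) center", "D) behind the camera"],
))
```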

7. Limitations and Research Directions

Documented limitations include:

  • Generalization Gap: Spatially fine-tuned models such as SpaceQwen, Spatial-MLLM, and VLM3R do not generalize to MMSI-Video-Bench and sometimes perform worse, underscoring a stark domain shift.
  • Tool Reliance: Explicit spatial geometry tools (e.g., VGGT) remain brittle, particularly in dynamic or structurally complex scenes, indicating insufficient 3D reconstruction integration.
  • Episodic Memory and Reasoning: Cross-Video tasks expose the lack of comprehensive long-term video memory in current MLLMs.
  • Open Research Avenues:
    • Adaptive, reasoning-aware frame-sampling algorithms.
    • Architectures fusing 3D spatial representations natively with language-based reasoning.
    • Grounded CoT methodologies that enforce stepwise evidence anchoring.
    • Deeper integration of video-language pretraining and embodied policy learning, especially for decision-centric planning scenarios.

MMSI-Video-Bench establishes a demanding new standard for benchmarking and diagnosis, catalyzing advances in embodied video spatial intelligence and highlighting persistent gaps between current AI and human perceptual-planning faculties (Lin et al., 11 Dec 2025).
