VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Published 2 Apr 2026 in cs.CV and cs.MM | (2604.01569v1)

Abstract: Recent video multimodal LLMs achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answering generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: No model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.


Authors (10)

Summary

The paper introduces VideoZeroBench, a benchmark that rigorously tests video MLLMs on fine-grained evidence verification with spatio-temporal grounding, revealing significant shortcomings.
It employs a five-level evaluation protocol that decouples answer generation from evidence localization, exposing critical model deficiencies in spatial and temporal reasoning.
Experiments on 17 models show that none exceeds 1% accuracy when answers must be fully grounded (Level-5), highlighting a key challenge in video understanding.

VideoZeroBench: A Capability-Centric Benchmark for Fine-Grained, Evidence-Grounded Video MLLM Evaluation

Motivation and Benchmark Design

Despite the proliferation of multimodal LLMs (MLLMs) and their apparent success on various video understanding benchmarks, fundamental questions remain regarding genuine model capabilities. Prevailing evaluations measure answer correctness in QA, often with relatively short videos, neglecting the critical aspect of whether predictions are grounded in the actual spatio-temporal evidence that supports them. Consequently, deficiencies in fine-grained perception, reasoning, or evidence localization are obscured.

To address this, "VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification" (2604.01569) introduces VideoZeroBench, a rigorous hierarchical benchmark designed to expose the limits of current video MLLMs. It does so through three central innovations: (1) open-ended, capability-centric QA in long, complex videos; (2) exhaustive manual annotation of questions alongside temporally and spatially grounded supporting evidence; (3) a five-level evaluation protocol that decouples answer generation from evidence grounding, progressing from QA with evidence hints to stringent answer-plus-grounding requirements.

VideoZeroBench comprises 500 manually curated questions over 138 long videos (more than 667 s on average) from 13 diverse domains, annotated for 11 atomic visual and reasoning capabilities, including counting, small-object perception, orientation discrimination, and multi-segment reasoning (Figure 1).

Figure 1: Data construction and statistics for VideoZeroBench, covering 13 domains and 11 atomic abilities, as well as statistics on video length and minimal evidence span distributions.

Five-Level Evaluation Protocol

To ensure that evaluation does not conflate correct guessing or hallucination with authentic understanding, VideoZeroBench incorporates a five-level hierarchical testing regime:

Level-1: QA with both correct temporal and spatial evidence given.
Level-2: QA with only temporal evidence provided.
Level-3: Standard end-to-end QA (no evidence hints).
Level-4: Correct answer and temporally grounded evidence required (temporal IoU > 0.3).
Level-5: Correct answer with both accurate temporal and spatial grounding (temporal IoU > 0.3, visual IoU > 0.3).
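The grounding criteria in Level-4 and Level-5 can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's released code: it assumes one predicted interval and one predicted box per question, intervals as (start, end) in seconds, and boxes as (x1, y1, x2, y2). The function names and the single-interval, single-box simplification are assumptions; only the 0.3 thresholds come from the protocol above.

```python
from typing import Tuple

Interval = Tuple[float, float]            # (start_s, end_s), assumed format
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed format


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def passes_level4(answer_ok: bool, pred_iv: Interval, gt_iv: Interval,
                  thr: float = 0.3) -> bool:
    """Correct answer AND temporal IoU strictly above the threshold."""
    return answer_ok and temporal_iou(pred_iv, gt_iv) > thr


def passes_level5(answer_ok: bool, pred_iv: Interval, gt_iv: Interval,
                  pred_box: Box, gt_box: Box, thr: float = 0.3) -> bool:
    """Level-4 condition AND box IoU strictly above the threshold."""
    return (passes_level4(answer_ok, pred_iv, gt_iv, thr)
            and box_iou(pred_box, gt_box) > thr)
```

For example, a predicted interval of (10, 20) against a ground truth of (15, 25) has a temporal IoU of about 0.33 and clears Level-4 if the answer is correct, while a predicted box of (0, 0, 10, 10) against (5, 5, 15, 15) has an IoU of about 0.14 and fails Level-5.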

This protocol enables structured attribution of errors to reasoning failure, weak temporal search, or poor spatial localization. Level-5 captures the most rigorous, evidence-grounded setting for trustworthy model evaluation.
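One way to read this attribution is to compare a model's outcome on the same question across Levels 1 to 3. The sketch below is an interpretation, not the authors' procedure: the category labels and the `attribute_error` and `error_breakdown` helpers are hypothetical, and the logic assumes that removing a hint is the only difference between adjacent levels.

```python
from collections import Counter
from typing import Dict, Iterable


def attribute_error(l1_ok: bool, l2_ok: bool, l3_ok: bool) -> str:
    """Map per-level correctness on one question to a coarse failure label.

    l1_ok: correct with temporal and spatial evidence given (Level-1)
    l2_ok: correct with only temporal evidence given (Level-2)
    l3_ok: correct with no hints (Level-3)
    """
    if l3_ok:
        return "none"
    if l2_ok:
        # Succeeds once told when to look, so the missing piece is temporal search.
        return "temporal_search"
    if l1_ok:
        # Needs to be told where to look as well, so spatial localization is weak.
        return "spatial_localization"
    # Fails even with full evidence: perception or reasoning on the evidence itself.
    return "reasoning_or_perception"


def error_breakdown(records: Iterable[Dict[str, bool]]) -> Counter:
    """Count failure labels over records with boolean keys 'l1', 'l2', 'l3'."""
    return Counter(attribute_error(r["l1"], r["l2"], r["l3"]) for r in records)
```

Under this reading, a question answered correctly at Level-2 but not at Level-3 counts as a temporal search failure, and a question missed even at Level-1 points to a failure of reasoning or perception on the evidence itself.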

Empirical Study: Performance of State-of-the-Art Video MLLMs

The benchmark evaluates 17 leading architectures, including proprietary foundation models (Gemini-3/2.5, Seed-2.0, GPT-5.2), open-source baselines (Qwen3.5, Qwen3-VL, InternVL3.5), and reasoning-specialist models (Video-R1, VideoChat-R1.5, Video-o3, Open-o3-Video). Key outcomes indicate a systemic failure in both answering and grounding:

Under standard QA (Level-3), the best-performing model (Gemini-3-Pro) reaches under 17% accuracy, with the leading open-source models at roughly 10%.
Under full spatio-temporal grounding requirements (Level-5), no model exceeds 1% accuracy, and most are at zero.

Performance degrades monotonically from Level-3 to Level-5 for all models, highlighting a severe gap: even when correct answers are produced, models rarely identify the authentic supporting evidence, which indicates frequent reliance on hallucination or pattern-matching rather than robust reasoning. The bottleneck persists even when evidence hints are provided: Level-1 accuracy remains below 30%, suggesting that models make limited use of explicit cues and struggle to integrate fine-grained signals.

Figure 2: Comparative performance across atomic abilities, categories, and evidence spans.

Failure Modes, Fine-Grained Analysis, and Insightful Examples

Detailed dissection by atomic abilities confirms acute deficiencies:

Small-object perception and spatial orientation: Both are notably weak (Gemini-3-Pro ~12%), especially in challenging categories like driving.
Counting and multi-segment integration: Models underperform on tasks requiring aggregation across scenes or frames.
Agentic, multi-round reasoning (e.g., VideoChat-R1.5, Video-o3): Provides marginal gains (<2% improvement in QA), but no significant improvement for evidence grounding.

Furthermore, single-frame answerable questions are not universally easier: fleeting, crucial evidence frames in long videos are commonly missed, especially under uniform sampling strategies. Audio-visual integration remains primitive; full-video input boosts audio perception yet often reduces fine-grained visual task performance.
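The effect of uniform sampling on fleeting evidence can be illustrated with simple arithmetic. The sketch below assumes frames are taken at the centers of equal-length segments and uses a hypothetical budget of 64 frames on a 667 s video; neither the sampling scheme nor the budget is taken from the paper.

```python
from typing import List


def sampling_gap(duration_s: float, num_frames: int) -> float:
    """Spacing in seconds between consecutive uniformly sampled frames."""
    return duration_s / num_frames


def uniform_timestamps(duration_s: float, num_frames: int) -> List[float]:
    """Timestamps at the centers of num_frames equal-length segments."""
    step = sampling_gap(duration_s, num_frames)
    return [step * (i + 0.5) for i in range(num_frames)]


def evidence_is_sampled(timestamps: List[float],
                        start_s: float, end_s: float) -> bool:
    """True if at least one sampled frame falls inside the evidence span."""
    return any(start_s <= t <= end_s for t in timestamps)


# Hypothetical setting: a 667 s video sampled with 64 frames.
timestamps = uniform_timestamps(667.0, 64)
gap = sampling_gap(667.0, 64)                              # about 10.4 s between frames
missed = not evidence_is_sampled(timestamps, 100.0, 102.0)  # 2 s span falls between frames
```

In this setting the frames are about 10.4 s apart, so a 2 s evidence span from 100 s to 102 s falls between the samples at roughly 99.0 s and 109.4 s and is never seen by the model, even though a single frame would suffice to answer the question.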

Figure 3: Representative VideoZeroBench scenarios with ground-truth evidence and MLLM predictions, illustrating common reasoning and grounding errors, such as mislocalization, spatial misinterpretation, fragmented evidence integration across temporal segments, and inability to combine audio and visual cues.

Failures are not limited to inaccurate evidence localization; they also include cases where a correct answer is produced by guessing, without corresponding grounding. Error analysis across multiple figures shows models misinterpreting spatial relationships, miscounting under occlusion, and neglecting transient yet essential visual or audio cues.

Figure 4: Categorized examples exhibiting correct and incorrect spatio-temporal localization and answer generation, with color-coded evaluation (green: correct/within IoU threshold; red: incorrect/fails IoU).

Figure 5: Challenging cases—models struggle with chart interpretation and complex, constraint-bound counting tasks featuring highly confusable instances.

Figure 6: Scenarios requiring nuanced integration of audio and fleeting visual signals, where current models typically fail to localize key segments or perceive small objects under transient events.

Practical and Theoretical Implications

These findings carry several theoretical implications for video MLLM research:

High QA accuracy on extant benchmarks does not equate to true video understanding. Benchmarks that lack evidence verification are susceptible to answer-hallucination or pattern-matching artifacts.
Grounded, evidence-based reasoning and perception, particularly in long, cluttered, and semantically complex videos, remain a profound technical barrier.
Agentic "thinking-with-video" approaches that focus on iterative temporal zoom-in yield only incremental improvements due to underlying weaknesses in both spatial and temporal grounding.
Purely increasing sampling budgets or test-time scaling does not solve the core problem; the model's localization and integration capabilities are paramount.

From a practical standpoint, deployment of video MLLMs in domains requiring high reliability (e.g., autonomous driving, medical video analysis, surveillance) is fundamentally limited by their inability to produce verifiable evidence for their predictions.

Future Prospects

VideoZeroBench establishes a new, rigorous evaluation paradigm for video understanding. Future advances should target:

Architectures with explicit mechanisms for dynamic evidence-seeking, temporal–spatial search, and compositional reasoning.
Learning protocols that jointly optimize answer generation and grounding, potentially with explicit reward signals for evidence localization.
Richer agentic frameworks that integrate both temporal and spatial zoom-in and facilitate iterative, multi-modal hypothesis verification.

Controlled studies reveal a substantial gap to human performance (over 67% on a subset of questions versus under 17% for the best model), which underscores the magnitude of the challenge and the need for genuinely evidence-grounded video intelligence systems.

Conclusion

VideoZeroBench (2604.01569) rigorously demonstrates that current video MLLMs are far from genuine, trustworthy spatio-temporal understanding, with critical weaknesses in evidence-based perception, reasoning, and grounding. The field must move beyond superficial answer correctness and aggressively pursue explicit, verifiable evidence-gathering and integration as the cornerstone of future progress in video MLLMs.
