
MVBench: Video Understanding Benchmark

Updated 9 April 2026
  • MVBench is a multi-modal video understanding benchmark that transforms static tasks into dynamic, temporally grounded MCQA challenges.
  • It systematically converts classical visual QA into video-specific tasks, compelling models to reason across multiple frames.
  • A unified multiple-choice evaluation protocol, scored by exact-match accuracy, benchmarks temporal reasoning and has driven architectural advances such as causal cross-attention.

MVBench (Multi-modal Video Understanding Benchmark) is a comprehensive, large-scale evaluation suite designed to rigorously test the temporal reasoning, perception, and high-level cognition capabilities of multi-modal LLMs (MLLMs) on video. It specifically addresses the limitations of prior multimodal benchmarks that focus on static images and spatial reasoning by introducing 20 temporally grounded video tasks, each formulated to require explicit multi-frame understanding. MVBench provides a unified multiple-choice question answering (MCQA) protocol, accompanied by ground-truth video annotations, enabling fair and reproducible assessment of model performance on complex, dynamic video phenomena (Li et al., 2023).

1. Benchmark Structure and Task Taxonomy

MVBench covers 20 tasks systematically constructed via a "static-to-dynamic" paradigm, in which classical visual question answering tasks are transformed into video-specific variants that cannot be reliably solved from a single frame. The tasks are organized into four main groups (Li et al., 2023, Maaz et al., 2024):

  • Temporal Perception (13 tasks): Action Sequence, Action Prediction, Action Antonym, Fine-grained Action, Unexpected Action, Object Existence, Object Interaction, Object Shuffle, Moving Direction, Action Localization, Scene Transition, Action Count, Moving Count.
  • Attribute and State (3 tasks): Moving Attribute, State Change, Fine-grained Pose.
  • Symbolic Ordering (1 task): Character Order.
  • High-Level Cognition (3 tasks): Egocentric Navigation, Episodic Reasoning, Counterfactual Inference.

Tasks cover a spectrum of temporal skills, from motion detection and event counting to non-trivial causal and counterfactual inference. Representative examples include distinguishing the correct order of sequential actions ("Which event happened first?"), counting the repetitions of a specific action, and predicting hypothetical outcomes under interventions.

All questions are rendered in a multiple-choice format (typically 4 or 5 options), each with a unique correct answer and several distractors. Videos are drawn from large, diverse sources including Kinetics-710, Something-Something-v2, CLEVRER, NExT-QA, and others, with strict quality filtering (5–35 s duration, multiple perspectives, varied domains) (Li et al., 2023; Maaz et al., 2024).
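
As a concrete illustration, an MVBench-style MCQA item can be represented as a simple record. The field names and example values below are hypothetical, chosen for illustration only; they do not reproduce the official annotation schema.

```python
# Hypothetical representation of one MVBench-style MCQA item.
# Field names and values are illustrative, not the official schema.
from dataclasses import dataclass
from typing import List

@dataclass
class VideoMCQAItem:
    task: str            # e.g. "Action Sequence"
    video_path: str      # clip drawn from a source dataset such as CLEVRER
    question: str        # temporally grounded question
    options: List[str]   # one correct answer plus several distractors
    answer_index: int    # index of the ground-truth option

example = VideoMCQAItem(
    task="Action Sequence",
    video_path="videos/example_clip.mp4",
    question="Which event happened first in the video?",
    options=["The person opened the door", "The person picked up the cup",
             "The person sat down", "The person turned off the light"],
    answer_index=1,
)
```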

2. Dataset Construction and Annotation

MVBench adopts a largely automated task-generation pipeline to ensure scale, fairness, and precision:

  • Source datasets: Eleven public video datasets, encompassing first- and third-person viewpoints, indoor/outdoor scenes, movies, human activities, and synthetic environments.
  • Question–answer generation: Most subtasks employ templated prompt conversion, plugging existing ground-truth annotations (action labels, timestamps, object IDs) into question templates to systematically produce temporally sensitive MCQA items (see the sketch after this list).
  • Distractor curation: Distractors are designed to be similar in length and semantics to the ground-truth option; for open-ended tasks, ChatGPT is used to create plausible distractors.
  • Validation: Every item is checked by a secondary annotator to minimize ambiguity.
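
To make the templated conversion concrete, the following is a minimal sketch of how existing annotations (here, an ordered list of action labels) could be plugged into a question template and paired with same-clip distractors. The template wording and helper name are assumptions for illustration, not the actual MVBench pipeline code.

```python
import random

# Minimal sketch of templated MCQA generation from existing annotations.
# Template wording and helper name are illustrative assumptions.
def make_action_sequence_item(actions, seed=0):
    """Build an 'Action Sequence'-style question from an ordered action list."""
    rng = random.Random(seed)
    question = "Which event happened first in the video?"
    correct = actions[0]
    # Distractors come from later actions in the same clip, so they stay
    # similar in length and semantics to the correct option.
    options = [correct] + actions[1:4]
    rng.shuffle(options)
    return {"question": question, "options": options,
            "answer_index": options.index(correct)}

item = make_action_sequence_item(
    ["open the door", "pick up the cup", "sit down", "turn off the light"])
print(item["question"], item["options"], item["answer_index"])
```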

The benchmark contains 4,000 question–answer pairs (200 per sub-task), kept strictly separate from the source datasets' training splits; no MVBench training set is released, so all models are evaluated zero-shot or after external instruction tuning or fine-tuning (Fei et al., 2024; Ye et al., 24 Mar 2025).

3. Evaluation Protocol and Metrics

Evaluation is unified under the standard exact-match accuracy metric:

$$\mathrm{Accuracy} = \frac{\#\{\text{correctly answered questions}\}}{\#\{\text{total questions}\}}$$

For a model's predictions $\{\hat{y}_i\}_{i=1}^N$ and ground-truth labels $\{y_i\}_{i=1}^N$, per-task (and overall) accuracy is given by:

$$\mathrm{Accuracy}_{\mathrm{task}} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}[\hat{y}_i = y_i]$$

Only MCQA is considered; no free-form or captioning metrics are included. Accuracy is reported both per-task and as a macro-average across the 20 tasks.

Variants such as "top-1 multiple-choice accuracy," "macro-average accuracy," and per-prompt performance (in prompt-sensitivity experiments) are compatible and reduce to this core metric (Ismithdeen et al., 4 Sep 2025, Ye et al., 24 Mar 2025, He et al., 20 Mar 2026).
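
The metric is straightforward to compute. The sketch below derives per-task exact-match accuracy, the macro-average over tasks, and overall accuracy; the (task, prediction, label) record layout is an illustrative assumption, not an official evaluation script.

```python
from collections import defaultdict

# Sketch: per-task exact-match accuracy, macro-average over the 20 tasks,
# and overall accuracy. The record layout is an illustrative assumption.
def mvbench_accuracy(records):
    """records: iterable of (task_name, predicted_option, gold_option)."""
    per_task = defaultdict(lambda: [0, 0])  # task -> [num_correct, num_total]
    for task, pred, gold in records:
        per_task[task][0] += int(pred == gold)
        per_task[task][1] += 1
    task_acc = {t: c / n for t, (c, n) in per_task.items()}
    macro_avg = sum(task_acc.values()) / len(task_acc)
    overall = (sum(c for c, _ in per_task.values())
               / sum(n for _, n in per_task.values()))
    return task_acc, macro_avg, overall
```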

4. Model Performance and Benchmark Insights

Table: Representative model results on MVBench (all numbers are overall accuracy, %; highest open-source result in bold).

| Model | Accuracy (%) | Notes |
|---|---|---|
| Random | 27.3 | Baseline |
| VideoChat2-7B | 51.1 | Original baseline (Li et al., 2023) |
| VideoGPT+ | 58.7 | Dual video+image fusion (Maaz et al., 2024) |
| VideoChat2-HD | 62.3 | Enhanced baseline |
| Video-CCAM-9B | 64.6 | Causal cross-attention masking (Fei et al., 2024) |
| Flash-VStream | 65.4 | Real-time, long-video (Zhang et al., 30 Jun 2025) |
| MOSS-ChatV-7B | 67.6 | Process-reward RL (Tao et al., 25 Sep 2025) |
| EARL (Evidence RL) | 69.0 | Evidence-prioritized RL (Li et al., 17 Oct 2025) |
| Qwen2.5-VL-7B | 69.6 | Strong baseline |
| Leum-VL-8B | 70.0 | SV6D representation (He et al., 20 Mar 2026) |
| **STAVEQ2-7B** | **70.1** | Stacked temporal attention (Rasekh et al., 29 Oct 2025) |

Closed-source, proprietary models such as GPT-4V and Gemini 1.5 Pro, where evaluated, report accuracies ranging from 43.5% to 75.0% depending on test split and prompt engineering (He et al., 20 Mar 2026; Ye et al., 24 Mar 2025; Ismithdeen et al., 4 Sep 2025).

Key findings:

  • Temporal reasoning is pivotal: MVBench is specifically designed so that single-frame or static models perform near chance.
  • Prompt sensitivity is substantial: for many models, variation in prompt wording shifts accuracy by up to 15% (Ismithdeen et al., 4 Sep 2025).
  • Architectural advances specifically targeting temporal fusion, causal cross-attention, evidence selection, or process supervision yield substantial gains (up to ~70% overall accuracy).
  • Action recognition and temporal compositionality subtasks (counting, ordering, causal inference) are the most challenging and the most diagnostic of a model's underlying temporal modeling capability.

5. Algorithmic Advances and Interpretations

MVBench has catalysed the development of video-LLMs equipped with dedicated temporal processing. Notable methods demonstrating state-of-the-art results include:

  • Causal cross-attention masks (CCAM): Enforce temporally ordered fusion, with theoretical guarantees for variable-length consistency (Fei et al., 2024).
  • Evidence-aware RL (EARL): Dynamically selects minimal, high-purity frame evidence and performs local re-sampling, exploiting a multi-component reward for correctness and IoU-based purity, resulting in best-in-class accuracy (Li et al., 17 Oct 2025).
  • Process Reasoning Reward (PRR): DTW-based alignment between a model's chain-of-thought traces and a ground-truth reference, discouraging temporal "hacks" and improving reasoning coherence (Tao et al., 25 Sep 2025).
  • Goal-driven data selection (GDO): Upweights motion-rich, temporally necessary, and video-dependent samples to rapidly close the performance gap with far less data (Wu et al., 12 Mar 2026).
  • Stacked temporal attention and time gating: Vision encoders directly interleave spatial and per-patch temporal self-attention or learn input-adaptive time gates, enabling robust query-specific temporal focus and boosting MVBench accuracy (Rasekh et al., 29 Oct 2025, Hu et al., 2024).

A plausible implication is that high MVBench performance is tightly linked to the model’s ability to exploit localized, temporally relevant signal and to avoid global averaging.
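
To illustrate the causal-masking idea from the Video-CCAM entry above (without reproducing the authors' implementation), the sketch below builds a lower-triangular mask over temporally ordered frame tokens so that each query position can attend only to the current and earlier frames. The shapes, token-to-frame mapping, and naming are assumptions for illustration.

```python
import numpy as np

# Illustrative sketch of a causal (lower-triangular) cross-attention mask over
# temporally ordered frame tokens; not the authors' implementation.
def causal_frame_mask(num_queries, num_frames, tokens_per_frame):
    """Allow query i to attend only to frames up to its aligned frame position."""
    # Map each query to a frame index, spread evenly over the clip (assumption).
    query_frame = np.floor(np.linspace(0, num_frames - 1, num_queries)).astype(int)
    # Each key token belongs to one frame.
    key_frame = np.repeat(np.arange(num_frames), tokens_per_frame)
    # mask[i, j] is True where attention is permitted (key frame <= query frame).
    return key_frame[None, :] <= query_frame[:, None]

mask = causal_frame_mask(num_queries=8, num_frames=16, tokens_per_frame=4)
print(mask.shape)  # (8, 64): queries x frame tokens
```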

6. Limitations, Prompt Sensitivity, and Future Directions

Prompt Sensitivity:

Systematic probing with the 61-prompt "Promptception" suite has shown that proprietary models (GPT-4o, Gemini 1.5 Pro) are more prompt-sensitive (±3 pp standard deviation, up to 15% swing) than open-source models, but can exploit well-structured prompts to reach higher peak accuracy. Stable, concise prompts that emphasize video structure (observation-driven, chronological frame analysis) consistently yield the best MVBench performance (Ismithdeen et al., 4 Sep 2025).
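
A simple way to quantify this kind of prompt sensitivity is to evaluate the same model under each prompt variant and report the spread of the resulting accuracies. The sketch below computes the mean, standard deviation, and best-minus-worst swing over per-prompt scores; the input format is an illustrative assumption, not the Promptception tooling.

```python
import statistics

# Sketch: quantify prompt sensitivity from per-prompt accuracy scores.
# Input format is an illustrative assumption, not the Promptception suite.
def prompt_sensitivity(per_prompt_accuracy):
    """per_prompt_accuracy: dict mapping prompt id -> accuracy in [0, 1]."""
    scores = list(per_prompt_accuracy.values())
    return {
        "mean": statistics.mean(scores),
        "std": statistics.pstdev(scores),
        "swing": max(scores) - min(scores),  # best prompt minus worst prompt
    }

print(prompt_sensitivity({"p1": 0.58, "p2": 0.64, "p3": 0.49}))
```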

Unsolved Challenges:

MVBench still exposes significant model deficiencies: even the strongest open models plateau near 70% overall accuracy, and (as noted in Section 4) the counting, ordering, and causal-inference subtasks remain the most challenging.

Future Directions:

Open questions include the extension of MVBench to:

  • Long-form video (beyond 32 frames or 40s), continual and streaming scenarios (Zhang et al., 30 Jun 2025).
  • Multi-video or multi-view simultaneity (as in MVPBench), with richer cross-sequence compositional queries (Bai et al., 24 Mar 2026).
  • Integrated evaluation including subtitles, audio, and cross-modal reasoning, as new video-LLMs incorporate richer inputs.

MVBench continues to act as the de facto gold standard for temporal video understanding in MLLMs, and remains a critical diagnostic and training reference for benchmarking advances in video-language grounding, action recognition, and multi-frame inference.

7. Summary Table: MVBench Key Parameters and Best Results

| Aspect | Value / Protocol |
|---|---|
| # Tasks | 20 (temporal video MCQA) |
| # QA Pairs | 4,000 (200 per task) (Li et al., 2023; Maaz et al., 2024) |
| Video Length | 5–40 s typical; 8–32 frames sampled |
| Data Sources | ≥11 public datasets; no MVBench train split |
| Evaluation | Multiple-choice exact-match accuracy (per-task, macro-average, overall) |
| SOTA Accuracy | ≈70% (STAVEQ2-7B 70.1, Leum-VL-8B 70.0, EARL 69.0) |
| Access | https://github.com/OpenGVLab/Ask-Anything |

Accuracy figures and technical claims above are grounded in (Li et al., 2023; Maaz et al., 2024; Fei et al., 2024; Hu et al., 2024; Liang et al., 16 Mar 2025; Ismithdeen et al., 4 Sep 2025; Li et al., 17 Oct 2025; Rasekh et al., 29 Oct 2025; Zhang et al., 30 Jun 2025; He et al., 20 Mar 2026).
