MME-CoF: Chain-of-Frame Video Benchmark

Updated 3 July 2026

MME-CoF is a specialized evaluation benchmark that systematically assesses zero-shot, chain-of-frame reasoning in text-to-video generative models.
It comprises 59 curated prompts across 12 reasoning categories, testing spatial, geometric, and temporal capabilities in a fine-grained, stepwise manner.
The benchmark employs automated LLM-based scoring with metrics like Instruction Alignment and Temporal Consistency to diagnose model strengths and failure modes.

MME-CoF (Multi-Modal Evaluation–Chain of Frame) is a specialized benchmark developed to systematically and reproducibly assess the reasoning abilities of modern text-to-video generative models through fine-grained, stepwise, zero-shot visual reasoning. Unlike prior benchmarks that emphasize visual fidelity, MME-CoF targets "Chain-of-Frame" (CoF) reasoning, probing whether generative models can successfully follow instructions in a temporally coherent, causally faithful, frame-by-frame manner. This approach captures the capacity for explicit, interpretable reasoning over video sequences, enabling rigorous diagnosis of both strengths and failure modes in emerging video foundation models (Guo et al., 30 Oct 2025).

1. Motivation and Rationale

MME-CoF was introduced to address a distinct gap in the evaluation framework for video generative models. While major advances have resulted in models capable of synthesizing visually plausible and temporally smooth video, the capacity of such models to serve as zero-shot visual reasoners—executing multi-step logical, geometric, or physical instructions—is relatively unexplored. Traditional metrics penalize only superficial artifacts, neglecting failures in intermediate reasoning steps, such as violating geometric constraints or skipping logical progression. MME-CoF aims to provide a systematic, compact, and reproducible suite to expose these model limitations, focusing on the "Chain-of-Frame" paradigm: evaluating whether generative updates in each frame collectively realize complex, multi-step prompts without external model adaptation or auxiliary tools. The guiding design requirements include:

Comprehensive taxonomy of reasoning types (12 categories).
Uniform and unambiguous prompt protocol, minimizing linguistic bias.
Pure zero-shot evaluation (no fine-tuning).
Fully automated, cross-laboratory reproducibility via public code and data (Guo et al., 30 Oct 2025).

2. Benchmark Structure and Content

The MME-CoF corpus comprises 59 carefully curated prompts, spanning 12 reasoning dimensions:

Visual Detail Reasoning
Visual Trace (sequential path)
Real-world Spatial
3D Geometry
2D Geometry
Physics-based Reasoning
Rotation
Table & Chart
Object Counting
GUI Interaction
Embodied Manipulation/Affordances
Medical Image Reasoning

Each prompt was distilled from prominent video, spatial, and reasoning benchmarks (e.g., V*Bench, ChartQA, RBench-V). Expert review established imperative, unambiguous instructions, most specifying a static camera and duration (8 s at 1280×720, 24 FPS), as well as explicit disallowance of camera movements when not required. Annotation encodes the reasoning objective within the prompt itself (e.g., a navigation path, geometric transformation). There is no manual video labeling; evaluation is automated via a large vision-LLM (Guo et al., 30 Oct 2025).
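As a quick worked example of the clip settings quoted above, the frame budget implied by an 8 s video at 24 FPS can be computed directly. The `VideoSpec` name and its fields are illustrative assumptions and are not taken from the MME-CoF code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoSpec:
    """Hypothetical container for the clip settings quoted in the text."""
    duration_s: int = 8
    fps: int = 24
    width: int = 1280
    height: int = 720

    @property
    def num_frames(self) -> int:
        # Number of frames available for stepwise reasoning in one clip.
        return self.duration_s * self.fps

    @property
    def pixels_per_frame(self) -> int:
        return self.width * self.height


spec = VideoSpec()
print(spec.num_frames)        # 192
print(spec.pixels_per_frame)  # 921600
```

Under these settings a single clip offers 192 frames in which the requested reasoning steps can unfold.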

Benchmark Statistics

Attribute	Value	Notes
Total entries	59	~5 per category
Categories	12	Reasoning types
Avg. prompt length	36.7 tokens
Max entries in a category	7

This compact design enables in-depth quantitative and qualitative analysis, while remaining tractable for cross-model comparisons.

3. Evaluation Methodology and Metrics

Each model under test is evaluated in a pure zero-shot regime, with default parameters and APIs, generating six independent videos per prompt. Automated scoring is performed by an LLM verifier (e.g., Gemini-2.5-Pro) using five criteria:

Instruction Alignment ( $A_{ij}$ )
Temporal Consistency ( $T_{ij}$ )
Visual Stability ( $S_{ij}$ )
Content Fidelity ( $F_{ij}$ )
Focus Relevance ( $R_{ij}$ )

Each is rated 0–4, yielding a sample-level overall score:

$S^{(ij)} = \frac{A_{ij} + T_{ij} + S_{ij} + F_{ij} + R_{ij}}{5}$
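The sample-level formula above can be sketched in a few lines of Python. The criterion key names, the range check, and the example ratings are illustrative assumptions; only the five criteria, the 0–4 scale, and the unweighted average come from the text.

```python
CRITERIA = (
    "instruction_alignment",
    "temporal_consistency",
    "visual_stability",
    "content_fidelity",
    "focus_relevance",
)


def sample_score(ratings: dict) -> float:
    """Average the five 0-4 criterion ratings for one generated video."""
    for name in CRITERIA:
        value = ratings[name]  # raises KeyError if a criterion is missing
        if not 0 <= value <= 4:
            raise ValueError(f"{name} must be in [0, 4], got {value}")
    return sum(ratings[name] for name in CRITERIA) / len(CRITERIA)


# Hypothetical ratings for a single video (not real benchmark data).
example = {
    "instruction_alignment": 1,
    "temporal_consistency": 2,
    "visual_stability": 3,
    "content_fidelity": 2,
    "focus_relevance": 2,
}
print(sample_score(example))  # 2.0
```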

Aggregate scores are computed per prompt ( $S_i$ ) and per model ( $S_\text{model}$ ) as arithmetic means. A qualitative success rate, based on green/orange/red ratings of each video, is used for comparative human-aligned assessment:

$SR_i = \frac{\# \text{videos} \geq \text{orange}}{6}$
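A minimal sketch of the aggregation and success-rate computations follows. It assumes the qualitative labels are ordered red < orange < green, so that "at least orange" means orange or green; that ordering, the function names, and the example numbers are assumptions made for illustration. The denominator is the number of labels supplied, which equals 6 when six videos are generated per prompt.

```python
from statistics import mean

# Assumed ordering of the qualitative labels, worst to best.
LABEL_RANK = {"red": 0, "orange": 1, "green": 2}


def prompt_score(sample_scores):
    """S_i: mean of the per-video scores for one prompt."""
    return mean(sample_scores)


def model_score(prompt_scores):
    """S_model: mean of the per-prompt scores."""
    return mean(prompt_scores)


def success_rate(labels, threshold="orange"):
    """SR_i: fraction of videos labeled at least `threshold`."""
    passed = sum(LABEL_RANK[label] >= LABEL_RANK[threshold] for label in labels)
    return passed / len(labels)


# Hypothetical scores for two prompts, six videos each.
s_a = prompt_score([1.0, 2.0, 1.5, 0.5, 2.0, 2.0])
s_b = prompt_score([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
print(s_a, s_b)               # 1.5 0.5
print(model_score([s_a, s_b]))  # 1.0
print(success_rate(["green", "orange", "red", "red", "orange", "red"]))  # 0.5
```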

No validation or test splits are defined; all prompts are held-out for evaluation.

4. Chain-of-Frame (CoF) Reasoning Paradigm

CoF posits that effective visual reasoning in generative videos requires each frame to represent a discrete, interpretable step toward fulfilling a complex instruction. Formally, for prompt $p$ and horizon $T$, frames $\{f_t\}_{t=1}^{T}$ are generated recursively: $f_t = G(f_{<t}, p)$, where $G$ denotes the generative model. Inspection of $\{f_t\}_{t=1}^{T}$ enables assessment of stepwise adherence to the instruction, capturing failures such as skipped transitions, geometric violations, or improper object states. Progressive, causal continuity and strict constraint adherence are the target behaviors (Guo et al., 30 Oct 2025).
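The idea of generating frames one at a time, each conditioned on the prompt and on all earlier frames, and then inspecting the sequence for skipped transitions can be illustrated with a toy model. Here a "frame" is reduced to the grid position of a single pawn, and the two generator functions are hypothetical stand-ins for a video model; none of these names come from the benchmark.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Tuple[int, int]  # toy "frame": the (row, col) of a single pawn


def rollout(generate_next: Callable[[Sequence[Frame], str], Frame],
            prompt: str, first_frame: Frame, horizon: int) -> List[Frame]:
    """Generate `horizon` frames, each conditioned on all earlier frames."""
    frames = [first_frame]
    while len(frames) < horizon:
        frames.append(generate_next(frames, prompt))
    return frames


def skipped_transitions(frames: Sequence[Frame]) -> List[int]:
    """Return indices t where frame t is more than one cell from frame t-1."""
    bad = []
    for t in range(1, len(frames)):
        (r0, c0), (r1, c1) = frames[t - 1], frames[t]
        if abs(r1 - r0) + abs(c1 - c0) > 1:
            bad.append(t)
    return bad


def step_right(history, prompt):
    """Well-behaved toy generator: move exactly one cell per frame."""
    r, c = history[-1]
    return (r, c + 1)


def jumpy(history, prompt):
    """Faulty toy generator: jumps three cells when producing frame 2."""
    r, c = history[-1]
    return (r, c + (3 if len(history) == 2 else 1))


good = rollout(step_right, "move right cell by cell", (0, 0), 4)
bad = rollout(jumpy, "move right cell by cell", (0, 0), 4)
print(skipped_transitions(good))  # []
print(skipped_transitions(bad))   # [2]
```

The checker flags the frame index at which the intermediate step was omitted, which is the kind of stepwise failure the paradigm is meant to expose.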

5. Key Findings: Performance and Failure Analysis

State-of-the-art video generation models such as Veo-3 and Sora-2 achieve only modest reasoning performance, with model-level averages between 0.6 and 1.7 out of 4. Notably:

Strengths: Visual Stability (2.3/4), short-horizon spatial coherence (Visual Detail ~1.3/4), local temporal consistency for simple path tracing (~1.5/4), and one-step geometric/rotational manipulations (3D Geometry ~1.7/4, Rotation ~1.8/4 on best models).
Weaknesses: Instruction Alignment (lowest: ~0.5/4), long-horizon causal planning (Visual Trace falters beyond 4–5 steps), strict 2D/3D geometric reasoning, quantitative physical realism (energy/momentum often not conserved), embodied logic (e.g., manipulation, affordances), and domain-specific (especially medical) reasoning.
Failure Modes: Center-of-mass bias (model drifts toward salient but unintended regions), over-generalization (ignoring explicit constraints), hallucination of objects or scene modifications, geometric violations (e.g., face intersection in folding), and temporal jumps (omission of intermediate steps) (Guo et al., 30 Oct 2025).

6. Illustrative Examples

Illustrative prompts highlight CoF reasoning evaluation:

Visual Detail: Prompt: "Static front view of a red leather handbag…zoom in step-by-step until stitching is sharply visible." Correct sequence involves progressive zoom and color/material consistency.
Visual Trace: Prompt: "Top-down maze, blue pawn moves cell-by-cell without jumps." Correct CoF: a stepwise, continuous trace with no disallowed jumps.
3D Geometry: Prompt: "Show net of cube, fold stepwise along dotted lines with each face rotating 90°." Expected: discrete, physically plausible transformation steps, no face flipping.
Physics-Based: Prompt: "Steel ball slides, elastically reflects from wall at 45°." Expected: consistent, conserved trajectory before and after collision (Guo et al., 30 Oct 2025).
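For the physics-based prompt, the expected behavior can be stated as a small worked computation: an elastic reflection maps the incoming velocity $v$ to $v - 2(v \cdot n)n$ for unit wall normal $n$, which preserves speed and mirrors the angle of incidence. The sketch below is a reference calculation a reader might compare a trajectory against; it is not part of the benchmark's automated scoring, and the wall orientation and velocity values are assumptions.

```python
import math


def reflect(velocity, normal):
    """Elastic reflection: v' = v - 2 (v . n) n, with n normalized first."""
    vx, vy = velocity
    nx, ny = normal
    norm = math.hypot(nx, ny)
    nx, ny = nx / norm, ny / norm
    dot = vx * nx + vy * ny
    return (vx - 2 * dot * nx, vy - 2 * dot * ny)


def angle_to_normal_deg(velocity, normal):
    """Unsigned angle between the line of travel and the wall normal."""
    vx, vy = velocity
    nx, ny = normal
    cos_a = abs(vx * nx + vy * ny) / (math.hypot(vx, vy) * math.hypot(nx, ny))
    return math.degrees(math.acos(min(1.0, cos_a)))


wall_normal = (-1.0, 0.0)  # assumed vertical wall on the right, facing left
v_in = (1.0, 1.0)          # approaching the wall at 45 degrees
v_out = reflect(v_in, wall_normal)

print(v_out)  # (-1.0, 1.0)
# Speed is conserved and the outgoing angle matches the incoming one.
print(math.isclose(math.hypot(*v_in), math.hypot(*v_out)))  # True
print(math.isclose(angle_to_normal_deg(v_in, wall_normal),
                   angle_to_normal_deg(v_out, wall_normal)))  # True
```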

Comprehensive prompt, script, and output details are available in the MME-CoF public repository.

7. Model Benchmarking and Research Impact

MME-CoF has enabled reproducible, head-to-head benchmarking of generative models under a uniform, zero-shot, multi-dimensional reasoning challenge. The results indicate that, despite advances in fidelity and temporal realism, current state-of-the-art models remain unreliable as autonomous zero-shot visual reasoners, particularly for long-range, strictly logical, or domain-transfer reasoning. They nonetheless show promise as visual engines when coupled with higher-level reasoning or verification modules. The compact, publicly available benchmark has become a standard for diagnosing model weaknesses and guiding the next research steps in video model reasoning (Guo et al., 30 Oct 2025).

8. Usage and Extensibility

All code, data, and evaluation tools for MME-CoF are provided for reproducing results and testing new models. The workflow consists of cloning the repository, installing dependencies, configuring API keys, running batch video generation, running the automated LLM-based evaluation, and aggregating the resulting metrics. This pipeline enables laboratories to directly compare new releases against published baselines and facilitates cross-institutional assessment. A comprehensive suite of reproduction scripts supports full transparency and repeatability (Guo et al., 30 Oct 2025).
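The generate, evaluate, and aggregate stages of such a workflow can be outlined as follows. This is a sketch under stated assumptions, not the repository's actual interface: `run_benchmark`, `generate_video`, and `score_video` are hypothetical names, and the two stand-in functions replace real model and verifier API calls so the example runs without any keys.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


def run_benchmark(prompts: Sequence[str],
                  generate_video: Callable[[str], object],
                  score_video: Callable[[str, object], float],
                  samples_per_prompt: int = 6) -> Dict[str, object]:
    """Generate, score, and aggregate; both callables are placeholders."""
    per_prompt: Dict[str, float] = {}
    for prompt in prompts:
        scores: List[float] = []
        for _ in range(samples_per_prompt):
            video = generate_video(prompt)             # zero-shot generation
            scores.append(score_video(prompt, video))  # automated verifier
        per_prompt[prompt] = mean(scores)
    return {"per_prompt": per_prompt,
            "model": mean(list(per_prompt.values()))}


# Stand-ins so the sketch runs offline (not real models or scores).
def fake_generate(prompt):
    return f"video for: {prompt}"


def fake_score(prompt, video):
    return 1.0 if "maze" in prompt else 2.0


report = run_benchmark(["maze prompt", "cube prompt"], fake_generate, fake_score)
print(report["per_prompt"])  # {'maze prompt': 1.0, 'cube prompt': 2.0}
print(report["model"])       # 1.5
```

Passing the generator and verifier as callables keeps the aggregation logic independent of any particular model or scoring service.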

MME-CoF has established a reproducible, diagnosis-oriented paradigm for probing the zero-shot reasoning capabilities of modern text-to-video models in a manner orthogonal to traditional generative metrics, driving both analysis and future architectural innovation in generative video reasoning systems.

References (1)

Guo et al. (30 Oct 2025). Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark.
