
MT-Video-Bench Evaluation

Updated 2 March 2026
  • MT-Video-Bench denotes two related benchmarks for multi-turn video dialogue, one targeting multimodal video understanding and one targeting text-to-audio-video synthesis.
  • The MLLM variant tests six core competencies, including object reference, memory recall, and proactive interaction, using a dataset curated with automated scene segmentation and human validation.
  • The MTAVG-Bench variant focuses on multi-speaker dialogue synthesis, providing detailed error analysis in audio-visual fidelity, temporal consistency, and cinematic expression across thousands of QA pairs.

MT-Video-Bench is a name shared by two benchmarks for multi-turn dialogue over video: one evaluates multimodal LLM (MLLM) video understanding in multi-turn dialogue settings (Pan et al., 20 Oct 2025), and the other evaluates text-to-audio-video (T2AV) generation with multi-speaker dialogue (Zhou et al., 31 Jan 2026). Both address the need for rigorous, context-aware evaluation in video-centric AI, but each emphasizes different modalities and evaluation axes.

1. Benchmarking Objectives and Scope

MT-Video-Bench for MLLM evaluation (Pan et al., 20 Oct 2025) is constructed to address the limits of single-turn video QA by targeting six core multi-turn competencies: Object Reference (OR), Memory Recall (MR), Content Summary (CS), Answer Refusal (AR), Topic Shifting (TS), and Proactive Interaction (PI). These tasks simulate real-world human–AI assistant settings where conversational history and evolving context must be robustly managed.

For T2AV generation, MT-Video-Bench (referred to as MTAVG-Bench (Zhou et al., 31 Jan 2026)) addresses the challenge of evaluating synthesized audio-visual videos that depict multi-participant dialogues. It exposes failure modes, such as identity drift, unnatural turn transitions, and audio-visual misalignment, that single-speaker or human-recorded benchmarks cannot reveal. The design foregrounds multi-speaker interaction across four evaluation levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression.

2. Dataset Curation and Construction Pipeline

The MLLM-focused MT-Video-Bench (Pan et al., 20 Oct 2025) curates 987 multi-turn dialogues from 135 videos spanning five domains (Movies, TV, Sports, Knowledge, Life Record). Scene segmentation and merging are handled with PySceneDetect and Gemini 2.5 Flash, while YOLOv11 detection and Gemini captioning build a cross-scene object memory bank. Gemini 2.5 Pro drafts 5–8-round dialogues per scene segment, which are then filtered for task suitability and validated by human annotators, yielding 5,805 QA pairs.
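The scene-segmentation step can be reproduced with off-the-shelf tooling. Below is a minimal sketch assuming PySceneDetect >= 0.6 and its default content detector; the Gemini-based scene merging, YOLOv11 detection, and object memory bank are paper-specific stages not shown here.

```python
# Minimal scene-segmentation sketch using PySceneDetect's content detector.
# The benchmark pipeline additionally merges over-segmented scenes with
# Gemini 2.5 Flash; that LLM step is omitted here.
from scenedetect import detect, ContentDetector

def segment_scenes(video_path: str, threshold: float = 27.0):
    """Return (start, end) timecode strings for each detected scene."""
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    return [(start.get_timecode(), end.get_timecode()) for start, end in scenes]

if __name__ == "__main__":
    for i, (start, end) in enumerate(segment_scenes("example_video.mp4")):
        print(f"Scene {i}: {start} -> {end}")
```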

MTAVG-Bench (Zhou et al., 31 Jan 2026) employs a semi-automatic pipeline. Starting from “context + conversation” pairs specifying emotional states and dialogue, an LLM rewrites prompts to emphasize cinematic, speaker-centric details (character traits, environment, visual style, camera direction, non-verbal cues). T2AV models (Veo 3.1, Wan 2.5, Sora 2) synthesize ≈1.8k multi-turn, multi-speaker videos. Automated filtering removes flawless (i.e., uninformative) outputs, while remaining videos containing ≥1 suspected error are annotated by humans and mapped to nine fine-grained error dimensions. Each error yields a diagnostic question with LLM-generated candidates, finalized by human experts. The result is 2,410 QA pairs indexed across 1,880 videos.
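The automated filtering stage reduces annotation load by discarding outputs that offer nothing to annotate. The sketch below is a hypothetical rendering of that "keep only videos with at least one suspected error" rule; the field names are illustrative, not the paper's schema.

```python
# Hypothetical sketch of MTAVG-Bench's automated filtering stage: videos with
# no suspected errors are "flawless" and uninformative for a failure-driven
# benchmark, so only flagged videos proceed to human annotation.
from dataclasses import dataclass, field

@dataclass
class GeneratedVideo:
    video_id: str
    model: str                               # e.g. "Veo 3.1", "Wan 2.5", "Sora 2"
    suspected_errors: set[str] = field(default_factory=set)  # automated tags

def filter_for_annotation(videos: list[GeneratedVideo]) -> list[GeneratedVideo]:
    """Keep only videos carrying at least one suspected error."""
    return [v for v in videos if v.suspected_errors]
```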

3. Evaluation Levels, Task Design, and Taxonomy

The MLLM benchmark (Pan et al., 20 Oct 2025) operationalizes its six competencies through dialogue-based tasks in multi-turn settings:

  • Perceptivity:
    • OR (pronoun and referent resolution)
    • MR (tracking and reasoning over multi-turn information)
    • CS (cohesive video+dialogue summarization)
  • Interactivity:
    • AR (refusal on unanswerable queries)
    • TS (adapting to topic shifts)
    • PI (generating engagement-driving responses)

Each task is instantiated via 5–8 alternating user-assistant QA rounds per dialogue, with explicit reference to prior turns and multimodal evidence.
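One plausible layout for such a dialogue item is sketched below; the field names are assumptions for illustration, not the benchmark's released schema.

```python
# Illustrative data structure for one MT-Video-Bench dialogue item.
from dataclasses import dataclass

@dataclass
class QARound:
    turn: int                 # position within the 5-8 round dialogue
    question: str
    reference_answer: str
    task: str                 # one of {"OR", "MR", "CS", "AR", "TS", "PI"}
    checklist: list[str]      # the 5 yes/no sub-aspect checks (see Section 4)

@dataclass
class Dialogue:
    video_id: str
    domain: str               # Movies, TV, Sports, Knowledge, or Life Record
    rounds: list[QARound]     # alternating user-assistant QA rounds
```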

MTAVG-Bench (Zhou et al., 31 Jan 2026) introduces a four-level hierarchical taxonomy:

  1. Audio-Visual Signal Fidelity: Video sharpness, temporal stability, and speech prosody.
  2. Temporal Attribute Consistency: Scene stability, character identity persistence, and lip-sync accuracy.
  3. Social Interaction: Speaker-utterance alignment and natural turn-taking logic.
  4. Cinematic Expression: Expressive gesture/voice/face alignment to emotion, and adherence to cinematic camera conventions.

Failures exposed in error annotation span identity drift, turn-taking lapses, expression-prosody mismatches, and misaligned shots.
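The taxonomy can be written down as a plain mapping, as in the sketch below; the grouping follows the level descriptions above, but the identifier names and their partition into the nine annotated error dimensions are assumptions for illustration.

```python
# Illustrative encoding of the four-level MTAVG-Bench taxonomy; names are
# assumed, and the sub-dimensions shown need not match the paper's nine
# error dimensions one-to-one.
MTAVG_TAXONOMY = {
    "audio_visual_signal_fidelity": ["video_sharpness", "temporal_stability", "speech_prosody"],
    "temporal_attribute_consistency": ["scene_stability", "identity_persistence", "lip_sync"],
    "social_interaction": ["speaker_utterance_alignment", "turn_taking"],
    "cinematic_expression": ["expressive_alignment", "camera_conventions"],
}
```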

4. Annotation and Human-in-the-Loop Procedures

The MLLM benchmark (Pan et al., 20 Oct 2025) employs two-stage human validation: Stage 1 removes dialogues that can be answered from the dialogue history alone, without the video (leakage); Stage 2 validates grounding, unambiguous referents, and alignment with the intended tasks. Each QA is annotated with a reference answer and a 5-item yes/no checklist of task sub-aspects used for downstream scoring.

In MTAVG-Bench (Zhou et al., 31 Jan 2026), automated agents perform preliminary error tagging, followed by human confirmation and mapping to sub-dimensions. Each confirmed failure instance generates a diagnostic QA pair whose candidate answers and distractors are iteratively refined by human experts to ensure precision, unique testability, and explicit reference to observable video/audio evidence. QA items disallow hypothetical or synthetic errors; every question is grounded in an observed instance.

5. Quantitative Metrics and Model Evaluation

MT-Video-Bench for MLLMs (Pan et al., 20 Oct 2025) reports accuracy per checklist item, as well as Precision, Recall, and F1 for each core competency:

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$

$$\mathrm{F1} = 2 \times \frac{\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$

Overall accuracy (ACC) is averaged across all checklist items and reported per-task and overall.
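A minimal scoring sketch is given below, assuming per-item boolean judgments against each 5-item checklist and TP/FP/FN counts accumulated per competency; the exact aggregation granularity is an assumption where the text above is not explicit.

```python
# Checklist-based scoring sketch for the MLLM benchmark.
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, Recall, F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def checklist_acc(judgments: list[list[bool]]) -> float:
    """Overall ACC: mean over every checklist item of every QA."""
    items = [ok for checklist in judgments for ok in checklist]
    return sum(items) / len(items)
```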

MTAVG-Bench (Zhou et al., 31 Jan 2026) scores each QA item as

$$s_i = \begin{cases} \mathbf{1}[P_i = G_i], & \text{(single-choice or pairwise)} \\ \dfrac{\lvert P_i \cap G_i \rvert}{\lvert G_i \rvert}, & \text{(multiple-choice)} \end{cases}$$

with dimension-level and overall scores

$$\mathrm{Score}_d = \frac{1}{\lvert \mathcal{Q}_d \rvert} \sum_{i \in \mathcal{Q}_d} s_i, \qquad \mathrm{Avg} = \frac{1}{\lvert \mathcal{D} \rvert} \sum_{d \in \mathcal{D}} \mathrm{Score}_d$$
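These formulas translate directly into code; the sketch below is a literal transcription, with the QA record layout assumed for illustration.

```python
# Direct transcription of the MTAVG-Bench scoring formulas.
from collections import defaultdict

def item_score(qa_type: str, predicted: set[str], gold: set[str]) -> float:
    """s_i: exact match for single-choice/pairwise, partial credit otherwise."""
    if qa_type in ("single-choice", "pairwise"):
        return float(predicted == gold)           # 1[P_i = G_i]
    return len(predicted & gold) / len(gold)      # |P_i ∩ G_i| / |G_i|

def aggregate(items):
    """items: iterable of (dimension, qa_type, predicted, gold) tuples."""
    per_dim = defaultdict(list)
    for dimension, qa_type, predicted, gold in items:
        per_dim[dimension].append(item_score(qa_type, predicted, gold))
    score_d = {d: sum(s) / len(s) for d, s in per_dim.items()}   # Score_d
    avg = sum(score_d.values()) / len(score_d)                   # Avg (macro)
    return score_d, avg
```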

6. Baseline Results and Observed Failure Modes

On the MLLM benchmark (Pan et al., 20 Oct 2025), Gemini 2.5 Pro leads at 68.45% ACC, followed by Gemini 2.5 Flash (63.30%) and Qwen2.5-VL-72B as top open-source (58.48%). Average per-task performance highlights significant gaps in AR (44.6%) and PI (38.6%) versus perceptivity tasks (CS: 63.4%). Context-aware (golden history) models consistently outperform self-predicted history; performance declines with more scenes, turns, or video length, indicating challenges in long-term contextual integration. Systematic failures include hallucinated answers instead of correct refusals, pronoun ambiguity, topic shift mishandling, and low engagement.

For MTAVG-Bench (Zhou et al., 31 Jan 2026), Gemini 3 Pro achieves the best overall score (Avg = 56.84%), excelling in social alignment (SA: 68.63, TT: 60.83) and expressive alignment (EA: 58.03). Ola-Omni ranks as the best open-source model (Avg = 47.8%), remaining competitive on perceptual (VQ/SQ) metrics but underperforming on the social and cinematic dimensions. The fine-grained design enables precise attribution of errors such as identity drift, turn-taking lapses, misaligned cues, and cinematic incoherence.

7. Research Implications and Future Directions

Both MT-Video-Bench variants address previously unmet needs in video dialogue benchmarking by uncovering failure modes at the intersection of perception, social reasoning, and generation. The MLLM evaluation benchmark underscores the need for explicit long-range memory, calibrated refusal mechanisms, dialogue-aware encoders, self-refinement, and integration of multimodal generation. MTAVG-Bench (Zhou et al., 31 Jan 2026) demonstrates the necessity of evaluation taxonomies that transcend low-level realism, enabling systematic model diagnosis and supervised fine-tuning on social and cinematic attributes.

Future recommendations include the extension of MTAVG-Bench to more diverse domains (e.g., outdoor, crowded, or complex camera-choreographed scenes), leveraging QA-pair annotation as a reinforcement or supervised fine-tuning signal, and development of automated evaluators harnessing detailed failure-mode annotations. For the MLLM benchmark, directions include optimizing memory and reasoning for longer, cross-scene video dialogues, advancing refusal and proactivity policies, and incorporating multi-agent or adversarial evaluation components.

8. Tabular Comparison of MT-Video-Bench Variants

| Aspect | MLLM MT-Video-Bench (Pan et al., 20 Oct 2025) | MTAVG-Bench (Zhou et al., 31 Jan 2026) |
|---|---|---|
| Purpose | Multi-turn video dialogue understanding | Multi-speaker audio-video generation evaluation |
| Core evaluation axes | Perceptivity, Interactivity | AV fidelity, temporal consistency, social interaction, cinematic expression |
| Dataset size | 987 dialogues, 5,805 QA pairs | 1,880 videos, 2,410 QA pairs |
| Modalities | Video + dialogue (understanding) | Video + audio + generation prompts (synthesis) |
| Key metrics | ACC, Precision, Recall, F1 | Per-QA, per-dimension, and overall (Avg) scores |
| Leading models | Gemini 2.5 Pro (68.45% ACC) | Gemini 3 Pro (56.84% Avg), Ola-Omni (47.8% Avg) |

In sum, the two MT-Video-Bench variants provide comprehensive, failure-driven benchmarks across the spectrum of video-grounded dialogue systems, pushing model development toward genuine contextual and expressive competency in multi-turn, multimodal video dialogue (Pan et al., 20 Oct 2025, Zhou et al., 31 Jan 2026).
