MTAVG-Bench: Multi-Talker T2AV Evaluation
- MTAVG-Bench is a comprehensive benchmark for T2AV models that generate multi-speaker dialogue, emphasizing detailed failure annotation.
- It employs a semi-automatic pipeline, combining prompt generation, automated filtering, and expert QA mapping across nine error dimensions.
- Key evaluation metrics include audio-visual fidelity, lip-sync consistency, and turn-taking logic, guiding targeted refinements in generative systems.
MTAVG-Bench is a comprehensive diagnostic benchmark specifically designed to evaluate text-to-audio-video (T2AV) models in the challenging context of multi-talker, dialogue-centric generation. While previous evaluation protocols primarily addressed single-speaker or human-recorded audio-visual content, MTAVG-Bench introduces a structured, failure-focused evaluation pipeline for generated videos featuring two or more participants in interactive dialogue. It comprises 1,880 videos synthesized by state-of-the-art T2AV models, each accompanied by expert-annotated, fine-grained question–answer (QA) pairs that span nine error types across four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. This resource enables detailed error analysis and model comparison, and guides targeted refinements for next-generation multi-speaker T2AV systems (Zhou et al., 31 Jan 2026).
1. Dataset Construction and Curation
MTAVG-Bench was constructed via a semi-automatic pipeline comprising three stages:
a. Prompt Generation and Multi-Model Synthesis:
Seeded by thousands of short two-speaker dialogue sketches, an LLM rewrote each sketch into a structured prompt specifying character attributes (gender, age bracket, ethnicity, clothing), environment (location, time, ambience), and visual style (realistic/hyper-realistic). Speaker-centric camera instructions (e.g., "[Close shot]") were included for cinematic guidance. These prompts were input to three advanced T2AV generators—Veo 3.1, Wan 2.5, and Sora 2—producing over 3,000 diverse multi-actor, multi-turn dialogue videos spanning a broad range of real-world scenarios.
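An illustrative structured prompt of this kind is sketched below in Python; the field names and example values are assumptions for exposition, not the authors' exact prompt schema.

```python
# Hypothetical structured prompt for a two-speaker dialogue clip; field names
# and values are illustrative only, not the benchmark's actual schema.
prompt = {
    "characters": [
        {"id": "A", "gender": "female", "age_bracket": "30s",
         "ethnicity": "East Asian", "clothing": "grey blazer"},
        {"id": "B", "gender": "male", "age_bracket": "50s",
         "ethnicity": "Hispanic", "clothing": "denim jacket"},
    ],
    "environment": {"location": "office kitchen", "time": "morning",
                    "ambience": "low hum of a coffee machine"},
    "visual_style": "realistic",
    "dialogue": [
        {"speaker": "A", "camera": "[Close shot]",
         "line": "Did you read the release notes?"},
        {"speaker": "B", "camera": "[Close shot]",
         "line": "Just now. The rollout moved to Friday."},
    ],
}
```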
b. Automated Filtering for Failure Emphasis:
An agent-based filter discarded candidate videos with no detectable AI-generation errors (e.g., perfect lip-sync and no identity drift), guaranteeing that the retained corpus contains only videos with at least one substantial generation failure and thus directly supporting diagnostic evaluation.
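The failure-oriented filtering step can be read as the following sketch, where the detector functions are hypothetical placeholders for the automated checks rather than the paper's implementation.

```python
from typing import Callable, List

# Hypothetical anomaly detectors: each takes a video path and returns a list of
# flagged issues. They stand in for the automated checks described above
# (lip-sync, identity drift, abrupt transitions) and are assumptions, not the
# benchmark's actual code.
Detector = Callable[[str], List[str]]

def filter_for_failures(video_paths: List[str],
                        detectors: List[Detector]) -> List[str]:
    """Keep only videos for which at least one detector flags a candidate error."""
    kept = []
    for path in video_paths:
        flags = [issue for detect in detectors for issue in detect(path)]
        if flags:  # at least one substantial candidate failure -> retain
            kept.append(path)
    return kept
```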
c. Annotation of Failures and QA Generation:
Automated detectors flagged candidate anomalies such as lip–audio mismatches or abrupt cinematic transitions for manual review. Human annotators confirmed true positives, mapped failures to a taxonomy of nine error classes, and, assisted by LLMs, generated corresponding QA pairs in single-choice, multiple-choice, or pairwise comparison formats. Expert reviewers validated and refined these QA items—resulting in 2,410 high-quality, human-verified QA pairs covering all nine diagnostic dimensions, with each video containing at least one error and an associated QA item.
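One plausible way to represent the resulting QA items is sketched below; the schema and field names are assumptions for illustration, although the dimension codes follow the taxonomy in Section 2.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Hypothetical schema for an expert-verified QA item; field names are
# illustrative, not the benchmark's released format.
@dataclass(frozen=True)
class QAItem:
    video_id: str
    dimension: str                 # one of VQ, SQ, SC, CC, LS, SA, TT, EA, CA
    qa_format: str                 # "single", "multiple", or "pairwise"
    question: str
    options: Tuple[str, ...]
    ground_truth: FrozenSet[str]   # one option, or several for "multiple"

example = QAItem(
    video_id="clip_0427",          # placeholder identifier
    dimension="TT",
    qa_format="single",
    question="Does speaker B start talking before speaker A has finished?",
    options=("yes", "no"),
    ground_truth=frozenset({"yes"}),
)
```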
Data distribution:
- 1,880 multi-talker videos
- 2,410 QA pairs
- Dominant error types are lip-sync (Level 2, LS), turn-taking logic (Level 3, TT), and speaker–utterance mismatch (Level 3, SA), with coverage across all nine diagnostic dimensions (Zhou et al., 31 Jan 2026).
2. Hierarchical Benchmark Organization
MTAVG-Bench provides a four-level, hierarchical taxonomy of diagnostic criteria, each capturing a distinct aspect of audio-visual multi-speaker dialogue generation:
| Level | Dimension (Abbreviation) | Criterion Description |
|---|---|---|
| 1 | VQ (Perceptual Video Quality) | No visual artifacts (flicker, blur) |
| 1 | SQ (Perceptual Speech Quality) | Speech naturalness, uninterrupted audio |
| 2 | SC (Scene Consistency) | Coherence of background, lighting, props |
| 2 | CC (Character/Speaker Consistency) | Identity stability: voice, appearance |
| 2 | LS (Lip–Audio Synchronization) | Alignment of mouth motion to audio |
| 3 | SA (Speaker–Utterance Alignment) | Correct attribution of utterances |
| 3 | TT (Turn-Taking Logic) | Proper dialogue turn boundaries |
| 4 | EA (Expressive Alignment) | Match of gesture/emotion with prosody |
| 4 | CA (Camera Alignment) | Camera follows active speaker |
Scoring is defined as follows for each question $i$ with ground truth $G_i$ and model prediction $P_i$:

$$
s_i = \begin{cases} \mathbb{I}[P_i = G_i], & \text{(single-choice / pairwise)} \\[4pt] \dfrac{|P_i \cap G_i|}{|G_i|}, & \text{(multiple-choice)} \end{cases}
$$

Dimension-level and overall scores are then

$$
S_d = \frac{1}{|Q_d|} \sum_{i \in Q_d} s_i, \qquad S = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} S_d,
$$

where $Q_d$ denotes the set of QA pairs for dimension $d$ and $\mathcal{D}$ the nine dimensions.
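A minimal Python rendering of these formulae, assuming answers and ground truths are represented as sets of option labels (all names here are illustrative):

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def question_score(qa_format: str, prediction: Set[str], truth: Set[str]) -> float:
    """s_i: exact match for single-choice and pairwise items, partial credit otherwise."""
    if qa_format in ("single", "pairwise"):
        return 1.0 if prediction == truth else 0.0
    return len(prediction & truth) / len(truth)        # multiple-choice

def aggregate(scored: List[Tuple[str, float]]) -> Tuple[Dict[str, float], float]:
    """Average s_i within each dimension (S_d), then across the nine dimensions (S)."""
    per_dim: Dict[str, List[float]] = defaultdict(list)
    for dimension, s_i in scored:
        per_dim[dimension].append(s_i)
    dim_scores = {d: sum(v) / len(v) for d, v in per_dim.items()}
    overall = sum(dim_scores.values()) / len(dim_scores)
    return dim_scores, overall
```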
This hierarchical design enables both low-level (signal fidelity) and high-level (social, cinematic) diagnostic coverage, facilitating nuanced model analysis (Zhou et al., 31 Jan 2026).
3. Diagnostic Metrics and Measurement Protocols
Each of the nine dimensions functions as a targeted evaluation metric:
- VQ: Fraction of artifact-free videos.
- SQ: Fraction of clips with natural, uninterrupted speech.
- SC: Proportion of videos maintaining consistent environment (lighting, props) across frames.
- CC: Fraction where each speaker’s appearance and voice remain consistent; Identity Drift quantifies losses here.
- LS: Prevalence of lip–audio misalignment, i.e., mouth motion out of step with the speech track.
- SA: Rate of correct speaker–utterance attributions.
- TT: Dialogue sequences with valid turn-taking logic (no overlaps, missing turns).
- EA: Fraction matching gesture/emotion to prosody.
- CA: Fraction where camera focuses on the active speaker.
The annotation process blends automated failure detection (candidate anomaly proposals) with human confirmation and taxonomy mapping, followed by LLM-assisted QA item generation and expert refinement. Model performance is quantified by scoring model answers on these validated QA items using the aforementioned formulae (Zhou et al., 31 Jan 2026).
4. Model Benchmarking and Comparative Results
Twelve leading proprietary and open-source omni-modal models were evaluated on MTAVG-Bench by answering its QA items. The following table compares top models on each dimension (all values in percentages):
| Model | VQ | SQ | SC | CC | LS | SA | TT | EA | CA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Pro | 70.4 | 55.3 | 53.4 | 41.2 | 52.9 | 68.6 | 60.8 | 58.0 | 50.8 | 56.8 |
| Ola-Omni (7B) | 46.4 | 56.0 | 37.5 | 36.8 | 62.0 | 52.2 | 43.3 | 46.1 | 50.0 | 47.8 |
| Video-LLaMA2 | 48.8 | 50.0 | 48.0 | 45.1 | 47.5 | 50.9 | 39.8 | 45.9 | 51.2 | 47.5 |
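The Avg. column is consistent with an unweighted mean of the nine dimension scores, which can be checked directly; for instance, for the Gemini 3 Pro row:

```python
# Values copied from the table above; the mean reproduces the reported 56.8.
gemini_3_pro = [70.4, 55.3, 53.4, 41.2, 52.9, 68.6, 60.8, 58.0, 50.8]
print(round(sum(gemini_3_pro) / len(gemini_3_pro), 1))  # -> 56.8
```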
Key patterns observed:
- Signal fidelity (VQ, SQ) is consistently strong (≥ 45%) with relatively low variance (≤ 10 points).
- Interaction-level dimensions (SA, TT) exhibit large performance gaps (often >20 points), revealing persistent model weaknesses in assigning "who speaks when."
- Lip-sync (LS) and character consistency (CC) remain challenging: drift rates can exceed 60%.
- Cinematic metrics (EA, CA) remain low across systems (below 60% for EA, roughly 50% for CA).
- Typical errors include identity drift (CC), silent lip movement (LS), unprompted utterances (TT), and camera neglect of speaker (CA) (Zhou et al., 31 Jan 2026).
5. Evaluation Methodology and Best Practices
The recommended MTAVG-Bench protocol for T2AV evaluation consists of:
- Application of the complete four-level diagnostic taxonomy for comprehensive model assessment.
- Supplying both the original generation prompt and the synthesized clip to all evaluators (a request-assembly sketch follows this list); prompt–output misalignment can significantly degrade scores, with an ablation indicating a roughly 5-point penalty on CC and CA.
- Reporting both per-dimension and overall scores to expose specific system weaknesses.
- Adhering to the provided QA item templates (single-choice, multi-choice, pairwise) and scoring formulae for robust replicability.
- Blending automated and expert-validated annotation to ensure accuracy and fine-grained coverage.
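As a concrete illustration of the prompt-plus-clip and QA-template points above, here is a minimal sketch of how an evaluator request might be assembled; the function and field names are assumptions rather than part of the benchmark's tooling.

```python
# Hypothetical request builder illustrating the protocol: every QA item is
# presented together with both the original generation prompt and the
# synthesized clip. None of these field names come from the benchmark release.
def build_eval_request(qa_item: dict, generation_prompt: str, clip_path: str) -> dict:
    return {
        "video": clip_path,                       # synthesized clip
        "generation_prompt": generation_prompt,   # withholding this hurts CC/CA
        "question": qa_item["question"],
        "options": qa_item["options"],
        "qa_format": qa_item["qa_format"],        # single / multiple / pairwise
    }
```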
A significant implication is that strong low-level signal quality can mask critical high-level failures (such as incorrect speaker attribution or turn-taking flaws), necessitating a holistic metric profile in published comparisons (Zhou et al., 31 Jan 2026).
6. Limitations and Prospects for Benchmark Extension
Current limitations of MTAVG-Bench include:
- Evaluator out-of-distribution (OOD) gaps: Many vision-language models have not been exposed during training to the fine-grained lip-sync and time-alignment discrepancies characteristic of T2AV generations; future work can fine-tune evaluation modules with MTAVG-Bench.
- Imbalanced error prevalence: Lip-sync (LS) and turn-taking (TT) dominate the dataset; targeted prompt engineering in future releases could achieve more uniform coverage.
- Modal diversity: Present coverage is limited to English dialogues, standard accents, and single-scene clips under 15 seconds; future updates may integrate diverse languages, accents, crowd scenes, and longer/complex multi-shot narratives.
- Automation: Improving off-the-shelf metric proxies for automatic annotation (e.g., lip-sync detectors) may further reduce curation burden.
MTAVG-Bench serves as a foundational diagnostic framework for testing and refining multi-speaker, dialogue-driven T2AV models, enabling rigorous, failure-centric analysis and benchmarking for the evolving landscape of generative audio-visual research (Zhou et al., 31 Jan 2026).