VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

Published 12 Apr 2026 in cs.SD and cs.AI | (2604.10542v1)

Abstract: Video-to-Audio (V2A) generation is essential for immersive multimedia experiences, yet its evaluation remains underexplored. Existing benchmarks typically assess diverse audio types under a unified protocol, overlooking the fine-grained requirements of distinct audio categories. To address this gap, we propose VidAudio-Bench, a multi-task benchmark for V2A evaluation with four key features: (1) Broad Coverage: It encompasses four representative audio categories - sound effects, music, speech, and singing - under both V2A and Video-Text-to-Audio (VT2A) settings. (2) Extensive Evaluation: It comprises 1,634 video-text pairs and benchmarks 11 state-of-the-art generation models. (3) Comprehensive Metrics: It introduces 13 task-specific, reference-free metrics to systematically assess audio quality, video-audio consistency, and text-audio consistency. (4) Human Alignment: It validates all metrics through subjective studies, demonstrating strong consistency with human preferences. Experimental results reveal that current V2A models perform poorly in speech and singing compared to sound effects. Our VT2A results further highlight a fundamental tension between instruction following and visually grounded generation: stronger visual conditioning improves video-audio alignment, but often at the cost of generating the intended audio category. These findings establish VidAudio-Bench as a comprehensive and scalable framework for diagnosing V2A systems and provide new insights into multimodal audio generation.


Authors (4)

Summary

The paper introduces VidAudio-Bench, a novel multidimensional evaluation framework targeting fine-grained video-to-audio (V2A) and video-text-to-audio (VT2A) generation.
It segments audio tasks into SFX, music, speech, and singing, employing specialized metrics for audio fidelity, synchronization, and semantic alignment.
Experimental results show comparatively robust SFX and instrumental music generation but weak speech and singing generation, and the VT2A results expose a trade-off between visual grounding and instruction following.

VidAudio-Bench: A Comprehensive Benchmark for Fine-Grained Evaluation of Video-to-Audio and Video-Text-to-Audio Generation

Benchmark Scope and Motivation

The VidAudio-Bench framework addresses a persistent evaluation gap in multimodal audio generation, particularly in the video-to-audio (V2A) and video-text-to-audio (VT2A) paradigms. Existing benchmarks have predominantly applied monolithic, distribution-level metrics that disregard the heterogeneity of audio tasks, notably the distinct requirements of sound effects (SFX), music, speech, and singing. As a result, current practice fails to capture fine-grained attributes such as lip synchronization, semantic alignment, and task-specific generative fidelity. VidAudio-Bench is built to enable multi-perspective, category-aware analysis that keeps pace with recent modeling frameworks, including unified multimodal diffusion transformers and chain-of-thought-guided, LLM-based generation.

Dataset Organization and Task Taxonomy

VidAudio-Bench comprises 1,634 carefully curated, high-resolution, and audio-stripped video clips, distributed across four primary audio generation tasks: SFX, music (with Instrumental Performance and Background Music as subtypes), speech, and singing. The dataset leverages large-scale, publicly available sources (e.g., VGGSounder for SFX/instruments, HarmonySet for BGM, AVSpeech for speech, and Acappella for singing), filtered via strict criteria to ensure clear visual grounding and unambiguous audio-visual correlation.

The SFX subset spans 10 major categories and 29 subcategories, providing diverse and balanced audio-visual event coverage.

Figure 1: Distribution of sound effects data, highlighting the categorical and subcategorical diversity utilized for evaluation.

Instrumental music is further stratified by instrument classes, and a detailed demographic analysis is performed for speech and singing (e.g., age and gender distributions).

Figure 2: (a) Category breakdown of the Instrument Performance subset; (b) demographics for Speech and Singing categories.

V2A vs. VT2A Paradigms

A salient contribution is the explicit contrast between the V2A (video-only, with minimal task-level instructions) and VT2A (video plus dense visual caption) generation paradigms. The VT2A setup follows a zero-information-leak protocol: a vision-language model (Qwen3-VL) generates fine-grained, audio-agnostic scene descriptors, so the text conveys the visual scene without leaking explicit acoustic labels. Human/LLM validation confirms high semantic retention in VT2A-generated prompts for SFX, BGM, and speech, while singing remains more challenging.

Multi-Dimensional, Reference-Free Evaluation Framework

VidAudio-Bench introduces a comprehensive evaluation suite distributed over 13 dimensions, organized along three principal axes: Audio Quality (AQ), Video-Audio Consistency (VAC), and Text-Audio Consistency (TAC). The suite integrates both signal-level and semantic metrics:

AQ: Audio-MAE-based Fréchet Distance for fidelity, Audiobox-Aesthetics for production quality, STOI-Net for (non-intrusive) intelligibility, a normalized Musicality Score (integrating Pitch Class Histogram Entropy, Grooving Pattern Similarity, Empty Beat Rate), and DNSMOS-Pro/SingMOS-Pro for neural MOS estimation.
VAC: Event-level (DeSync) and lip-level (LatentSync) synchronization, rhythmic alignment for music/BGM, semantic correspondence in a FreeBind-augmented embedding space, and demographic/affective (emotion, intensity) alignment via MLLM-judged prompts.
TAC: CLAP-based audio-text embedding alignment and instruction-following verification through LLM-judge protocols.
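
Of the Musicality Score components listed under AQ, Pitch Class Histogram Entropy is the simplest to illustrate. The sketch below computes it as the Shannon entropy of a 12-bin pitch-class histogram. It assumes the generated audio has already been transcribed to MIDI note numbers, which is an assumption of this example; the function name is hypothetical, and the benchmark's own normalization and its combination with the other two components are not reproduced here.

```python
import math
from collections import Counter


def pitch_class_histogram_entropy(midi_pitches):
    """Shannon entropy (in bits) of the 12-bin pitch-class histogram.

    `midi_pitches` is a sequence of integer MIDI note numbers; obtaining
    them from audio (transcription) is outside this sketch.
    """
    if not midi_pitches:
        raise ValueError("need at least one note")
    counts = Counter(p % 12 for p in midi_pitches)  # fold octaves together
    total = len(midi_pitches)
    # Sum p * log2(1/p) over the pitch classes that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A single repeated note has zero entropy (perfectly "tonal").
print(pitch_class_histogram_entropy([60, 60, 72, 48]))    # 0.0
# A full chromatic scale hits the maximum, log2(12), about 3.585 bits.
print(pitch_class_histogram_entropy(list(range(60, 72))))
```

Lower entropy indicates that a passage concentrates on a few pitch classes, which is usually read as clearer tonality; higher entropy indicates a flatter, less key-centered distribution.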

Dimensions that require higher-level judgment are scored by an MLLM-as-a-Judge model, which applies multi-step reasoning so that its cross-modal decisions remain interpretable.

Figure 3: System-level overview of the VidAudio-Bench evaluation architecture, outlining task-specific metric groupings and LLM-judged alignment dimensions.

Experimental Results and Diagnostic Insights

The benchmarking campaign covers 11 models: eight SOTA V2A/VT2A models (AudioX, FoleyCrafter, HunyuanVideo-Foley, Kling-Foley, MMAudio, ReWaS, ThinkSound, UniFlow-Audio) and three specialized video-to-music (V2M) models. The results show a clear stratification of difficulty by task: models are robust on SFX and, to a lesser extent, instrumental music, while speech and singing generation lag substantially, as reflected both in objective metrics (e.g., low intelligibility and sync) and in human-aligned perceptual scores.

Task-wise radar plots further show divergent strengths and trade-offs among models: for example, Kling-Foley achieves the highest video-audio semantic alignment, while AudioX and ThinkSound lead in aesthetic and perceptual quality. No single model is best on every dimension; models with improved visual consistency do not necessarily retain high fidelity or category-level instruction compliance.

Figure 4: Task-specific performance distributions (radar plots) for SFX, music, speech, and singing across representative models.
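
The observation that no model leads on every dimension can be checked mechanically with a dominance test over per-model score vectors. The following is a minimal sketch; the model names and scores are hypothetical placeholders, not results from the paper, and all metrics are assumed to be oriented so that higher is better.

```python
def dominates(a, b):
    """True if score vector `a` is at least as good as `b` on every
    dimension and strictly better on at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def dominating_models(scores):
    """Return the models that dominate every other model in `scores`."""
    return [name for name, vec in scores.items()
            if all(dominates(vec, other)
                   for other_name, other in scores.items()
                   if other_name != name)]


# Hypothetical normalized scores, in the order:
# (aesthetic quality, video-audio semantic alignment, instruction following)
scores = {
    "model_a": (0.9, 0.6, 0.5),
    "model_b": (0.6, 0.8, 0.7),
    "model_c": (0.5, 0.5, 0.9),
}

print(dominating_models(scores))  # [] : each model loses on some dimension
```

An empty result means every model is beaten by some other model on at least one dimension, which is the situation the radar plots describe.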

Tension Between Visual Grounding and Instruction Following

A critical empirical observation is the antagonistic relationship between visual grounding and instruction following under VT2A evaluation. Providing dense visual descriptions often increases V-A semantic correspondence scores but lowers the probability that the model generates audio of the intended target category, especially for BGM and singing. The paper attributes this to the distraction or semantic bias induced by detailed but non-categorical visual cues. For instance, in BGM, action-centric captions lead to event-driven SFX generation rather than background music, violating the task condition.

Human Alignment Validation

VidAudio-Bench is validated with human subjective studies that employ controlled groupwise annotation. Correlation analyses yield high Pearson coefficients between benchmark metric values and human win rates across the semantic, realism, and sync dimensions, supporting the perceptual validity of the metrics.

Figure 5: Pearson correlation between human preference rates and VidAudio-Bench metric scores, confirming perceptual and instructional alignment.
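
For reference, the Pearson coefficient used in this kind of validation can be computed as follows. This is a minimal sketch with hypothetical inputs: `metric_scores` and `human_win_rates` are illustrative names and values, one entry per model, and do not come from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model values: an automatic metric and the share of
# pairwise comparisons each model won in a human study.
metric_scores = [0.20, 0.40, 0.60, 0.80]
human_win_rates = [0.10, 0.35, 0.50, 0.90]

print(round(pearson(metric_scores, human_win_rates), 3))
```

A coefficient close to 1 means that models ranked higher by the metric also tend to be preferred by annotators.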

Instruction following is further verified via binary classification against human-labeled categories, with F1 scores of $\geq 0.8$ across all classes.
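
One way to read this check is as a one-vs-rest F1 score per audio category, comparing the judge's predicted category with the human label. The sketch below follows that reading, which is an assumption about the setup rather than the paper's documented procedure; the labels are toy data.

```python
def f1_one_vs_rest(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class and all other
    labels as negative."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: human-assigned category vs. category predicted by the judge.
human = ["speech", "speech", "music", "sfx", "music", "speech"]
judge = ["speech", "music", "music", "sfx", "music", "speech"]

for category in ("speech", "music", "sfx"):
    print(category, round(f1_one_vs_rest(human, judge, category), 3))
```

With these toy labels, speech and music each score 0.8 and sfx scores 1.0.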

Practical and Theoretical Implications

The benchmark provides the first large-scale, multidimensional, category-stratified testbed for V2A/VT2A system design and validation. The results indicate that unified V2A or multimodal generation systems are limited by the joint challenge of optimizing audio fidelity, cross-modal grounding, and robust instruction compliance, especially in voice-based and highly structured musical tasks. The observed breakdowns in VT2A reinforce the need for stronger alignment objectives and for targeted negative prompting or adaptive input representations. Future work may build on these findings to develop compositional bridging architectures, regularization schemes, or user-in-the-loop fine-tuning mechanisms that better arbitrate the trade-off between category control and semantic grounding.

Conclusion

VidAudio-Bench establishes a new standard for benchmarking multimodal audio generation tasks. By systematically unifying data construction, input paradigms, and automated, human-aligned metric design, it exposes the measurable limitations of current V2A/VT2A models and provides actionable diagnostics for future research and model improvement. The framework helps the community focus on nuanced, perceptually aligned, and category-specific audio generation, especially as multimodal content synthesis systems continue to expand in both scope and application.

Reference: "VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories" (2604.10542)
