AV-ConfuseBench: Audio-Visual Benchmark
- The paper introduces AV-ConfuseBench, a benchmark that exposes MLLMs' reliance on visual cues and frequent hallucination of audio in mismatched scenarios.
- The benchmark utilizes binary and open-ended QA formats to assess performance on audio-muted and audio-modified confusions, employing metrics like Accuracy, Yes-Rate, A-Acc, and V-Acc.
- RL-CoMM applies a reinforcement learning-based protocol with audio-only reasoning to significantly boost cross-modal alignment, achieving up to +24.5% accuracy improvement.
AV-ConfuseBench is a benchmark designed to systematically evaluate the ability of Multimodal LLMs (MLLMs) to discern "audio-visual confusion," a phenomenon where objects visually present in a video are rendered absent or mismatched in the audio modality. The benchmark exposes critical limitations in MLLMs, which typically rely on visually dominated reasoning and frequently hallucinate the presence of audio features that do not exist within the corresponding sound tracks. This framework is paired with RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM solution intended to alleviate such confusion through the explicit integration of audio-only reasoning and reward-driven optimization. AV-ConfuseBench introduces two principal scenarios—audio-muted and audio-modified confusions—to probe and quantify modeling failures and improvements in cross-modal semantic alignment.
1. Dataset Construction
AV-ConfuseBench comprises two primary confusion scenarios: Audio-Muted Confusion and Audio-Modified Confusion. In Audio-Muted Confusion, 39 music video clips containing multiple sounding objects (e.g., cello, guitar, piano) form the basis. In each clip, one instrument is muted, erasing its audio track entirely while preserving the visual representation. This yields 73 question–answer pairs, always querying the presence of the muted-object’s sound, with ground truth consistently "No." These pairs span a variety of musical instruments and are solely constructed to induce confusion, with no control (unmodified) set in this subtask.
Audio-Modified Confusion is constructed by replacing the entire soundtrack of 20 real-world video clips with one of five unrelated environmental sounds (wind, bird, rain, electric drill, thunder). Each of the 20 × 5 video–sound combinations yields a free-description prompt, "Describe what you see and what you hear," accumulating 100 QA pairs manually labeled for both visual and audio content. The absence of any control scenarios in both subtasks means all test samples present confounding cross-modal signals.
Task formats are binary ("Is there a/an {muted-object} sound?") for Audio-Muted Confusion and open-ended description for Audio-Modified Confusion.
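For concreteness, the following minimal Python sketch shows one plausible representation of the two QA-pair formats; the field names and example values are illustrative assumptions, not the benchmark's actual schema.

```python
# Illustrative (hypothetical) representations of AV-ConfuseBench QA pairs.

# Audio-Muted Confusion: a binary question about the instrument whose track was erased.
muted_example = {
    "video_id": "music_clip_017",            # hypothetical identifier
    "muted_object": "cello",
    "question": "Is there a cello sound?",   # template: "Is there a/an {muted-object} sound?"
    "ground_truth": "No",                    # always "No" by construction
    "subtask": "audio-muted",
}

# Audio-Modified Confusion: the soundtrack is replaced with an unrelated environmental
# sound, and the model must describe both modalities in free text.
modified_example = {
    "video_id": "realworld_clip_004",        # hypothetical identifier
    "replacement_sound": "rain",             # one of: wind, bird, rain, electric drill, thunder
    "question": "Describe what you see and what you hear.",
    "visual_ground_truth": "A busy street with cars and pedestrians.",  # illustrative
    "audio_ground_truth": "Rain falling.",                              # illustrative
    "subtask": "audio-modified",
}
```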
2. Benchmark Protocols and Evaluation Metrics
AV-ConfuseBench introduces rigorous evaluation protocols for both confusion settings. In Audio-Muted Confusion, the principal metrics are Accuracy (Acc), the fraction of yes/no answers that match the ground truth ($\text{Acc} = N_{\text{correct}} / N_{\text{total}}$), and Yes-Rate, the proportion of model responses that incorrectly affirm the muted object's sound. A high Yes-Rate directly correlates with visually dominated reasoning and hallucination.
In Audio-Modified Confusion, textual audio and visual descriptions generated by the models are graded by GPT-4 against manually curated ground truths on a 0–5 scale, reported as Audio Accuracy (A-Acc) and Visual Accuracy (V-Acc). The propensity for cross-modal confusion and hallucination is further quantified by elevated Yes-Rates and reduced A-Acc values, marking an inability of MLLMs to decipher mismatched or absent audio information when the visual signal is strong.
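As a reference for how these metrics combine, here is a minimal sketch (an assumed implementation, not the authors' evaluation code) that computes Acc and Yes-Rate for the audio-muted subtask from raw yes/no responses; the A-Acc/V-Acc scores additionally require an LLM judge and are only represented by a hypothetical prompt template.

```python
# Sketch of the Audio-Muted Confusion metrics; the normalization heuristic is an assumption.

def normalize_yes_no(response: str) -> str:
    """Map a free-form model response to 'yes' or 'no' (simplistic heuristic)."""
    return "yes" if response.strip().lower().startswith("yes") else "no"

def audio_muted_metrics(responses: list[str]) -> dict[str, float]:
    """Accuracy and Yes-Rate for the audio-muted subtask.

    Every ground truth is 'No' by construction, so Acc is the fraction of 'no'
    answers and Yes-Rate is the fraction of hallucinated 'yes' answers; the two
    sum to 100%.
    """
    preds = [normalize_yes_no(r) for r in responses]
    yes_rate = sum(p == "yes" for p in preds) / len(preds)
    return {"Acc": 100.0 * (1.0 - yes_rate), "Yes-Rate": 100.0 * yes_rate}

# Hypothetical judge prompt for A-Acc/V-Acc (the paper's exact prompt is not given here):
JUDGE_PROMPT = (
    "Rate from 0 to 5 how well the model's audio description matches this ground "
    "truth: {ground_truth}\nModel description: {description}\nScore:"
)

# Example: 3 of 4 responses hallucinate the muted instrument -> Acc 25%, Yes-Rate 75%.
print(audio_muted_metrics(["Yes, I can hear a cello.", "No.", "Yes", "yes, clearly"]))
```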
3. Baseline Model Performance
Evaluation of open- and closed-source MLLMs on AV-ConfuseBench reveals pronounced flaws. For Audio-Muted Confusion:
| Model | Acc (%) ↑ | Yes-Rate (%) ↓ |
|---|---|---|
| Video-LLaMA2-7B | 2.73% | 97.27 |
| Baichuan-Omni-7B | 5.47% | 94.53 |
| Qwen2.5-Omni-7B | 9.59% | 90.41 |
| Gemini 2.5 Flash | 28.76% | 71.24 |
| Gemini 2.5 Pro | 68.50% | 31.50 |
| Random baseline | 50.00% | 50.00 |
All open-source omni-models achieve near-zero accuracy (below 10%), far under the 50% random baseline, predominantly hallucinating audio features that correspond to the muted visual object. Even closed-source state-of-the-art models such as Gemini 2.5 Pro exhibit substantial visual bias, still answering roughly a third of questions incorrectly.
In Audio-Modified Confusion, A-Acc and V-Acc scores fall well short of the 5-point maximum (roughly 1–4 on the 0–5 grading scale), with A-Acc in particular remaining low, signifying systemic failures in matching the actual audio with the described visual content.
4. RL-CoMM: Reinforcement Learning-Based Collaborative Multi-MLLM
RL-CoMM is introduced as a countermeasure to the visually dominated reasoning described above. The architecture pairs a policy model (Qwen2.5-Omni-3B) that undergoes RL fine-tuning with a reference model (a Large Audio Language Model, LALM, e.g., Qwen2-Audio) that generates audio-only reasoning.
Training progresses through a staged protocol:
- Warm-Up: Supervised fine-tuning on ~100 high-quality audio-visual QA pairs, enforcing structured output with the segments `<a-think>`, `<v-think>`, and `<answer>`.
- Stage I (Step-wise Reasoning Reward): RL is orchestrated with composite scalar rewards (sketched in code after this list):
  - Format Reward ($R_{\text{format}}$): 1 if the output matches the required tag structure; 0 otherwise.
  - Audio Reasoning Rationality (ARR, $R_{\text{ARR}}$): 1 if the cosine similarity between the policy's audio-think segment and the LALM reference audio-think exceeds 0.8 and the answer matches the ground truth; 0 otherwise.
  - Audio-Visual Correlation (AVC, $R_{\text{AVC}}$): 1 + coherence score for a correct answer, the coherence score alone for an incorrect one, and 0 otherwise. Semantic coherence is measured with Qwen3-Embedding. The total reward is $R = R_{\text{format}} + R_{\text{ARR}} + R_{\text{AVC}}$.
- Stage II (Answer-centered Confidence Optimization, Ans-CO): Combines a negative log-likelihood loss with entropy minimization on the answer tokens (Eq. 4 of the paper), reducing answer uncertainty.
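The paper's exact reward formulas and Eq. 4 are not reproduced above, so the following Python sketch is an assumed implementation that simply follows the description in this list: a tag-structure check, an ARR reward gated on a 0.8 cosine-similarity threshold against the LALM's audio reasoning, an AVC reward built from an embedding-based coherence score, and an Ans-CO loss combining NLL with entropy minimization over answer tokens. Helper names, the entropy weight, and the tensor shapes are assumptions.

```python
import re

import torch
import torch.nn.functional as F

# Required output structure: <a-think>...</a-think><v-think>...</v-think><answer>...</answer>
TAG_PATTERN = re.compile(
    r"<a-think>.*?</a-think>\s*<v-think>.*?</v-think>\s*<answer>.*?</answer>", re.S
)

def format_reward(output: str) -> float:
    """1 if the output follows the required tag structure, 0 otherwise."""
    return 1.0 if TAG_PATTERN.search(output) else 0.0

def arr_reward(policy_athink_emb: torch.Tensor, lalm_athink_emb: torch.Tensor,
               answer_correct: bool) -> float:
    """Audio Reasoning Rationality: 1 if the policy's <a-think> embedding is close to the
    audio-only LALM reference (cosine similarity > 0.8) AND the final answer is correct."""
    sim = F.cosine_similarity(policy_athink_emb, lalm_athink_emb, dim=-1).item()
    return 1.0 if (sim > 0.8 and answer_correct) else 0.0

def avc_reward(coherence: float, answer_correct: bool) -> float:
    """Audio-Visual Correlation: 1 + coherence for a correct answer, coherence alone
    otherwise (coherence is an embedding-based score, here assumed to lie in [0, 1])."""
    return (1.0 + coherence) if answer_correct else coherence

def total_reward(output: str, policy_athink_emb: torch.Tensor,
                 lalm_athink_emb: torch.Tensor, coherence: float,
                 answer_correct: bool) -> float:
    """Composite Stage I scalar reward R = R_format + R_ARR + R_AVC."""
    return (format_reward(output)
            + arr_reward(policy_athink_emb, lalm_athink_emb, answer_correct)
            + avc_reward(coherence, answer_correct))

def ans_co_loss(answer_logits: torch.Tensor, answer_ids: torch.Tensor,
                entropy_weight: float = 0.1) -> torch.Tensor:
    """Stage II (Ans-CO) sketch: NLL on the answer tokens plus an entropy penalty that
    sharpens the answer distribution. `entropy_weight` is a hypothetical coefficient.

    answer_logits: [T, V] logits over the vocabulary for the T answer tokens.
    answer_ids:    [T]    ground-truth answer token ids.
    """
    nll = F.cross_entropy(answer_logits, answer_ids)
    probs = answer_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return nll + entropy_weight * entropy
```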
Policy updates employ Group Relative Policy Optimization (GRPO) gradient ascent, omitting the KL-divergence penalty because the policy and reference models have heterogeneous architectures.
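A minimal sketch of such an update is given below. It uses a simplified REINFORCE-style surrogate with group-normalized advantages and no KL term; full GRPO additionally applies PPO-style ratio clipping, which is omitted here for brevity, and the function name and shapes are assumptions.

```python
import torch

def grpo_step_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    """Group-relative policy-gradient loss without a KL penalty (sketch).

    seq_logprobs: [G] total log-probability of each of G responses sampled for
                  the same question under the current policy.
    rewards:      [G] scalar Stage I rewards (format + ARR + AVC) for those responses.
    """
    # Advantages are computed relative to the group sampled for this question,
    # so no value network or reference-model KL term is required.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Maximize advantage-weighted log-likelihood (gradient ascent on the surrogate).
    return -(advantages.detach() * seq_logprobs).mean()

# Usage: for each training question, sample G responses from the policy, score them
# with total_reward(...), then backpropagate grpo_step_loss through the policy model.
```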
5. Experimental Outcomes and Ablation Analyses
Empirical results demonstrate that RL-CoMM substantially mitigates audio-visual confusion in AV-ConfuseBench and related tasks. In Music-AVQA subtasks (Exist, Localize, Count, Compare, Temporal):
| Method | Music-AVQA Audio-Visual Avg. | AVQA |
|---|---|---|
| Qwen2.5-Omni-3B | 54.95% | 83.78% |
| + SFT | 70.41% (+15.46) | 90.41% |
| + GRPO | 70.05% | 85.31% |
| + RL-CoMM (ours) | 79.46% (+24.51) | 95.87% (+12.09) |
On AVQA-R (head/tail splits), RL-CoMM shows +10.52% improvement. On AVHBench (audio→video, video→audio, matching), RL-CoMM outperforms baselines by 5–13% absolute accuracy.
For AV-ConfuseBench:
| Strategy | Audio-Muted Acc ↑ | Yes-Rate (%) ↓ | A-Acc (0–5) ↑ | V-Acc (0–5) ↑ |
|---|---|---|---|---|
| Qwen2.5-Omni-3B | 8.22% | 91.78% | 1.14 | 4.10 |
| + SFT | 5.48% | 94.52% | – | – |
| + GRPO | 15.07% | 84.93% | 1.84 | 4.47 |
| + RL-CoMM | 27.40% (+19.18) | 72.60% | 2.36 | 4.54 |
Ablation studies confirm that each individual module (format + accuracy reward, ARR + AVC, Ans-CO) contributes approximately 4–5 percentage points to mean accuracy, with composite improvements reaching +24.5 points over the baseline.
Reward dynamics during RL training indicate steadily increasing AVC alignment, while ARR presents greater early fluctuations, consistent with initial difficulty in overcoming entrenched visual bias.
6. Insights and Implications
AV-ConfuseBench exposes a fundamental defect in current MLLMs: over-reliance on vision leads to consistent audio hallucination whenever the visual context and audio cues are dissonant or misaligned. RL-CoMM demonstrates that stimulating independent audio reasoning through an LALM reference path, together with semantic-alignment rewards (ARR) and audio-visual correlation mechanisms (AVC), can partially overcome these limitations. Further, answer-centered entropy regularization stabilizes outputs and reduces answer uncertainty. RL-CoMM achieves consistent 10–30 percentage-point absolute improvements in cross-modal question-answering accuracy with limited additional RL fine-tuning over a 3B-parameter backbone model.
A plausible implication is that future multimodal architectures should explicitly disentangle—and subsequently realign—audio and visual reasoning pathways, leveraging external reference models and reward engineering to address persistent confusion. This suggests the necessity of benchmarks like AV-ConfuseBench as a standard for measuring true cross-modal understanding and hallucination resistance in MLLMs (Ye et al., 13 Nov 2025).