Audio Hallucination QA Dataset

Updated 17 October 2025
  • Audio Hallucination QA Dataset is a benchmark that quantifies hallucinations in audio-language models using standardized QA queries and expert human annotations.
  • It employs a three-type taxonomy to classify output errors and assesses performance using metrics like the F1 score, precision, and recall.
  • The dataset enables evaluation of mitigation strategies such as Audio-Aware Decoding and Adaptive Vector Steering to improve audio grounding in multimodal systems.

The Audio Hallucination QA Dataset is a specialized benchmark designed to systematically evaluate and analyze hallucinations arising in large audio-language and multimodal models when answering questions about audio content. Hallucinations, in this context, refer to model-generated outputs that are not grounded in the actual audio signal but instead reflect spurious, invented, or visually biased information. The dataset enables rigorous measurement of hallucination prevalence, categorization of error types, and assessment of mitigation methods across diverse model architectures and tasks.

1. Dataset Construction and Annotation Protocol

The canonical Audio Hallucination QA Dataset, as established in "On the Audio Hallucinations in Large Audio-Video LLMs" (Nishimura et al., 18 Jan 2024), comprises 1,000 QA instances derived by prompting Video-LLaMA on videos from the FAVDBench test split. Each instance employs the standardized query “What do you hear?”, focusing the model on audio perception over visual cues. Output sentences are annotated by human experts with binary labels: hallucinated (audio description generated without support from actual audio cues) or non-hallucinated. For hallucinated responses, a secondary annotation specifies one of three taxonomy-driven hallucination types: Type (A) – both object and action hallucinated; Type (B) – object correct, action hallucinated; Type (C) – action correct, object hallucinated.

| Split | Total Instances | Hallucinated | Hallucination Categorization |
|-------|-----------------|--------------|------------------------------|
| FAVDBench QA | 1,000 | 323–332 | Types (A), (B), (C) |

The annotation protocol includes noun and verb extraction (excluding formulaic phrases), facilitating statistical analysis of specific error signatures for each hallucination type.
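
For concreteness, a single annotated instance can be represented by a small record along the lines of the sketch below; the field names and the example identifier are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AudioHallucinationQAInstance:
    """One annotated QA instance (illustrative schema, not the official one)."""
    video_id: str                        # FAVDBench test-split video the clip comes from
    question: str = "What do you hear?"  # standardized audio-focused query
    model_response: str = ""             # sentence generated by the audio-video LLM
    hallucinated: bool = False           # binary expert label
    hallucination_type: Optional[str] = None          # "A", "B", or "C" when hallucinated
    nouns: List[str] = field(default_factory=list)    # extracted sound-source objects
    verbs: List[str] = field(default_factory=list)    # extracted actions/events

# Example: a Type (B) error -- the object is audible but the action is invented.
example = AudioHallucinationQAInstance(
    video_id="favd_test_0001",           # hypothetical identifier
    model_response="I hear a woman playing a harp.",
    hallucinated=True,
    hallucination_type="B",
    nouns=["woman", "harp"],
    verbs=["playing"],
)
```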

2. Hallucination Taxonomies and Benchmark Extensions

Taxonomical organization of hallucination types, validated across several works (Sahoo et al., 15 May 2024, Kuan et al., 21 Oct 2024, Sung-Bin et al., 23 Oct 2024), underpins benchmark design and cross-paper comparisons. The three-type schema has become canonical for audio hallucinations in multimodal settings, supported by both classification experiments and error analyses:

| Type | Error Manifestation | Example | Key Characteristics |
|------|---------------------|---------|---------------------|
| A | Objects and actions hallucinated | "a baby crying" | Generic/ambient hallucinations |
| B | Object correct, action hallucinated | "woman playing a harp" | Action misattribution |
| C | Action correct, object hallucinated | "trumpet" vs. "tuba" | Instrument/source substitution |
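
Because each type is fully determined by which of the two components (object, action) is grounded in the audio, the mapping can be written as a small helper; this is an illustrative convenience, not code from any of the cited works.

```python
from typing import Optional

def hallucination_type(object_correct: bool, action_correct: bool) -> Optional[str]:
    """Map per-component grounding to the three-type taxonomy (illustrative helper).

    Returns None when both components are grounded, i.e. no hallucination.
    """
    if object_correct and action_correct:
        return None   # non-hallucinated response
    if not object_correct and not action_correct:
        return "A"    # both object and action hallucinated
    if object_correct:
        return "B"    # object correct, action hallucinated
    return "C"        # action correct, object hallucinated

assert hallucination_type(False, False) == "A"
assert hallucination_type(True, False) == "B"
assert hallucination_type(False, True) == "C"
```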

Further extensions (e.g., AVHBench (Sung-Bin et al., 23 Oct 2024)) introduce multiple judgment tasks: audio-driven video hallucination, video-driven audio hallucination, audio-visual matching, and audio-visual captioning. These tasks enable disambiguation of cross-modal confounds and compositional reasoning errors, particularly relevant for multi-sensory environments.

3. Model Evaluation Protocols and Classification Methodologies

The dataset enables evaluation via both zero-shot embedding-based classifiers and fine-tuning pipelines. In embedding-based approaches, pre-trained encoders (MS-CLAP, LAION-CLAP) project audio ($E_a(A)$) and textual ($E_t(T)$) features into a joint space; a cosine similarity $\cos(h_a, h_t)$ below a threshold $\alpha$ implies hallucination. Fine-tuned models employ MLP layers atop frozen encoders, with element-wise combination and sigmoid-based prediction:

$$\hat{h}_a = F_a(h_a), \quad \hat{h}_t = F_t(h_t), \quad \hat{h}_{at} = \hat{h}_a \odot \hat{h}_t, \quad \hat{y} = F_{at}(\hat{h}_{at}), \quad L = \text{BCE}(\hat{y}, y)$$

Experimental results indicate that fine-tuned MS-CLAP achieves 87.9% F1, decisively surpassing zero-shot classification (approximately 52.2–52.9% F1) and a random baseline (40.3% F1).
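
A compact sketch of both pipelines in PyTorch follows; the encoder wrappers are omitted, and the embedding dimensions, hidden width, and threshold value are placeholder assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Zero-shot detection: cosine similarity in the joint CLAP embedding space ---
def zero_shot_hallucination(h_a: torch.Tensor, h_t: torch.Tensor, alpha: float = 0.5) -> bool:
    """Flag a hallucination when cos(h_a, h_t) falls below the threshold alpha.

    h_a, h_t: audio and text embeddings from a pre-trained encoder pair
    (e.g. MS-CLAP or LAION-CLAP); alpha is a placeholder threshold.
    """
    sim = F.cosine_similarity(h_a.unsqueeze(0), h_t.unsqueeze(0)).item()
    return sim < alpha

# --- Fine-tuned detection: MLP heads on top of frozen encoder embeddings --------
class HallucinationClassifier(nn.Module):
    """F_a and F_t project the frozen embeddings; their element-wise product feeds F_at."""
    def __init__(self, audio_dim: int, text_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.f_a = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.f_t = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        self.f_at = nn.Linear(hidden_dim, 1)   # sigmoid is folded into BCEWithLogitsLoss below

    def forward(self, h_a: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
        h_at = self.f_a(h_a) * self.f_t(h_t)   # element-wise combination (odot)
        return self.f_at(h_at).squeeze(-1)     # logit for "hallucinated"

# Training-step sketch: y = 1 for hallucinated, 0 for grounded responses.
model = HallucinationClassifier(audio_dim=1024, text_dim=1024)
criterion = nn.BCEWithLogitsLoss()
h_a, h_t = torch.randn(8, 1024), torch.randn(8, 1024)   # stand-ins for frozen-encoder outputs
y = torch.randint(0, 2, (8,)).float()
loss = criterion(model(h_a, h_t), y)
loss.backward()
```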

Benchmarks such as AVHBench utilize accuracy, precision, recall, and F1 for binary tasks, and METEOR/CIDEr/GAVIE-A for captioning. Paired and before-after question design (as in MATCH (Kuan et al., 21 Oct 2024)) enables nuanced analysis of response discrimination and consistency (C-C, C-I metrics).
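
For the binary judgment tasks, these headline metrics reduce to standard confusion-matrix arithmetic. The snippet below is a minimal reference computation (treating "hallucinated" as the positive class), not the benchmarks' official evaluation code.

```python
from typing import Dict, Sequence

def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary judgments (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(binary_metrics(y_true=[1, 0, 1, 1, 0], y_pred=[1, 0, 0, 1, 1]))
```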

4. Hallucination Mitigation Strategies and Empirical Safeguards

Recent mitigation approaches include:

  • Audio-Aware Decoding (AAD): contrastive decoding that reweights each decoding step toward audio-conditioned evidence by amplifying the gap between logits computed with and without the audio input:

$$p_\text{AAD}^{(t)} = \text{softmax}\left((1+\alpha) \cdot \text{logit}_\text{with audio}^{(t)} - \alpha \cdot \text{logit}_\text{without audio}^{(t)}\right)$$

    Empirically, AAD achieves F1 increases from 0.046 to 0.428 over baseline on object hallucination datasets (a minimal decoding sketch follows this list).

  • Adaptive Vector Steering (AVS) (Lin et al., 14 Oct 2025): Layer-wise steering of model activations using contrastive representations (audio vs. silent input) with adaptive intervention strength. Later layers receive higher steering weights, calibrated by effect-size analysis. AVS yields F1 improvements (Gemma: 0.550 → 0.619; Qwen: 0.626 → 0.632) and an 8% accuracy gain on MMAU.
  • Post-processing in ASR (Barański et al., 20 Jan 2025): The Bag of Hallucinations (BoH) captures frequent spurious outputs; combined with delooping and string search (Aho–Corasick), post-processing robustly suppresses recurring ASR hallucinations, reducing WER by 6.5–9.4%.
  • Attention-Guided Explainable Scores (Huang et al., 21 May 2025): RePPL attributes token-wise uncertainty to semantic propagation and generation. The uncertainty scores detect chaotic patterns typical of hallucination and can be mapped to audio QA by aligning scores with spectrogram regions or transcribed tokens.
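
As a concrete illustration of the AAD adjustment above, the following sketch applies the contrastive formula at a single decoding step. It assumes the model exposes per-step logits for the same prompt with and without the audio input; the function name, vocabulary size, and alpha value are placeholders, not values from the cited work.

```python
import torch
import torch.nn.functional as F

def aad_next_token_distribution(
    logits_with_audio: torch.Tensor,     # shape (vocab_size,): audio-conditioned logits at step t
    logits_without_audio: torch.Tensor,  # same prompt with the audio input removed or silenced
    alpha: float = 1.0,                  # contrast strength (placeholder value)
) -> torch.Tensor:
    """Audio-Aware Decoding: amplify token evidence that depends on the audio.

    p_AAD = softmax((1 + alpha) * logit_with_audio - alpha * logit_without_audio)
    """
    contrasted = (1 + alpha) * logits_with_audio - alpha * logits_without_audio
    return F.softmax(contrasted, dim=-1)

# Usage: sample the next token from the contrasted distribution instead of the raw one.
vocab_size = 32000
p = aad_next_token_distribution(torch.randn(vocab_size), torch.randn(vocab_size))
next_token = torch.multinomial(p, num_samples=1)
```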

5. Cross-Modal and Acoustic Condition Benchmarks

Benchmarks such as RePOPE-Spk (Park et al., 19 Sep 2025) extend evaluation to spoken queries under variable acoustic conditions. Hallucination error rates escalate as queries shift from text to speech (by 3% under clean speech, and by up to 20% with environmental noise). Experiments demonstrate that input order and query duration affect robustness, with longer spoken queries helping mitigate but not eliminate performance drops.

AVHBench (Sung-Bin et al., 23 Oct 2024) systematically addresses cross-modal hallucinations in audio-visual LLMs. It shows that models often perform near chance in multimodal settings, whereas unimodal or text-converted inputs enhance reliability. Ablations reveal that improved audio-to-LLM alignment and LoRA fine-tuning dramatically boost F1 and captioning metrics.

6. Broader Implications and Future Directions

The Audio Hallucination QA Dataset exposes specific weaknesses in current model architectures, notably the over-reliance on visual cues, susceptibility to compositional confusion, and vulnerability to acoustic noise and input ordering in voice-driven interfaces. Improvements in model design—closer integration of audio features, refined multi-turn and chain-of-thought prompting, and informed post-processing—are advancing reliability but not fully solving the issue.

Future work is anticipated to extend benchmarks with more challenging, diverse scenarios and to investigate methods that enhance modality-specific attention, noise robustness, and dynamic input handling. The taxonomy-driven evaluation and systematic mitigation approaches established in these datasets will inform both diagnostic toolchain development and the training of more grounded, trustworthy audio and multimodal LLMs.
