Audio Hallucination Attacks: Probing the Reliability of Large Audio Language Models

Published 31 Mar 2026 in cs.SD | (2603.29263v1)

Abstract: Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA) together with AHA-Eval, an attack suite comprising 6.5K QA pairs designed to test whether LALMs genuinely ground their responses in the audio input. AHA targets two attack surfaces: (i) query-based attacks, which exploit question structure to induce hallucinations about absent sounds, and (ii) audio-based attacks, which inject synthetic speech describing non-existent events into the audio stream. Evaluating state-of-the-art LALMs, including Audio Flamingo 3 and Gemini 3 Pro, we observe high attack success rates of 95.35% and 79.65%, respectively, revealing a reliability gap that is hidden by standard benchmark performance. To mitigate this, we propose a 120K QA post-alignment dataset, AHA-Guard, which successfully reduces attack success rates by up to 49%.


Authors (8)

Summary

The paper introduces Audio Hallucination Attacks (AHA) to systematically probe LALM vulnerabilities with both explicit and implicit query methods.
The evaluation shows that implicit and audio-based attacks sharply raise the attack success rate (ASR); for example, Gemini 3 Pro's ASR on random hallucinated sounds rises from 10.88% under explicit queries to 59.67% under implicit ones.
Mitigation strategies, such as Direct Preference Optimization on the AHA-Guard dataset, substantially reduce ASR, emphasizing the need for better audio-grounded model validation.

Audio Hallucination Attacks: Evaluating the Reliability of Large Audio Language Models

Introduction

This work presents a systematic evaluation of Large Audio Language Models (LALMs) under adversarial and contextually complex listening scenarios, introducing Audio Hallucination Attacks (AHA) as a new suite of evaluation and training resources. The study targets a critical and under-explored reliability gap: the tendency of LALMs to hallucinate absent sounds, especially when a query presupposes a nonexistent audio event. Standard benchmarks often fail to probe this vulnerability because they are limited to explicit questions, which do not reflect the semantic traps that arise in realistic user interactions.

AHA Framework and Benchmark Suite

AHA introduces a set of attack methodologies that systematically target the grounding abilities of LALMs. The framework considers two main attack surfaces:

Query-based attacks: Divided into explicit attacks (e.g., "Is there a dog barking in the audio?") and implicit attacks (e.g., "How loud is the dog barking?"), the latter of which critically assesses whether the model grounds its inference in the actual audio or relies excessively on language priors.
Audio-based attacks: These manipulate the input stream by prepending text-to-speech utterances that mention nonexistent events to the audio, priming models to accept the hallucinated event as true.
Figure 1: The AHA data curation and attack generation pipeline, including filtering, hallucinated sound generation, QA construction, and the injection of manipulative audio cues.
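The audio-based attack described above can be sketched as plain waveform concatenation. This is a minimal illustration, not the authors' pipeline: it assumes both signals are mono sample lists at the same sample rate, the "prime" is a placeholder for synthesized speech (the TTS system is not specified here), and the function name `prepend_speech_prime` and the silence gap are hypothetical.

```python
from typing import List


def prepend_speech_prime(prime: List[float], clip: List[float],
                         sample_rate: int, gap_seconds: float = 0.25) -> List[float]:
    """Return prime + short silence + clip as one mono waveform.

    Hypothetical helper: the gap length is an illustrative choice.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    gap = [0.0] * int(round(gap_seconds * sample_rate))
    return list(prime) + gap + list(clip)


# Toy stand-ins for real audio (assumed values, not data from the paper).
sample_rate = 16000
prime = [0.1] * 8000    # 0.5 s placeholder for a spoken false claim
clip = [0.0] * 16000    # 1.0 s placeholder for the original clip
attacked = prepend_speech_prime(prime, clip, sample_rate)
print(len(attacked) / sample_rate)  # duration of the attacked audio in seconds
```

The original clip is left untouched at the end of the waveform, so any answer that affirms the spoken event still contradicts the underlying audio.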

The data curation pipeline draws from three diverse corpora—AudioCaps, Clotho, and MusicCaps—and uses LLM-based filtering to ensure high-annotation agreement for each clip. For each filtered clip, hallucinated sound events are generated in two taxonomies: contextually adversarial (plausible but absent) and random (implausible and absent). Both explicit and implicit queries are derived for every hallucinated event, and manipulations are performed at both the text and audio modality levels.
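As a rough sketch of how explicit and implicit queries might be derived from one hallucinated event, the snippet below fills two fixed templates. The templates reproduce the two example questions quoted earlier; the `HallucinatedEvent` type, its fields, and the `build_queries` function are hypothetical, and the actual query generation in the paper may differ.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HallucinatedEvent:
    source: str    # thing making the sound, e.g. "dog"
    action: str    # what it is doing, e.g. "barking"
    category: str  # "adversarial" (plausible but absent) or "random" (implausible and absent)


def build_queries(event: HallucinatedEvent) -> Dict[str, str]:
    """Build one explicit and one implicit query for an absent sound event."""
    if event.category not in {"adversarial", "random"}:
        raise ValueError(f"unknown category: {event.category}")
    return {
        # Explicit: asks whether the event exists at all.
        "explicit": f"Is there a {event.source} {event.action} in the audio?",
        # Implicit: presupposes the event and asks about one of its properties.
        "implicit": f"How loud is the {event.source} {event.action}?",
    }


queries = build_queries(HallucinatedEvent("dog", "barking", "random"))
print(queries["explicit"])
print(queries["implicit"])
```

A grounded model should reject the presupposition in the implicit query rather than describe the loudness of a sound that is not there.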

AHA-Eval, the evaluation benchmark, contains 6.5K QA pairs spanning query-based and audio-based attacks. Complementing this, AHA-Guard (120K QA preference pairs) is designed for post-alignment training and addresses the need for robust fine-tuning datasets that do not induce trivial rejection biases.

Experimental Protocol

The evaluation covers a spectrum of state-of-the-art LALMs, including open-source models (Qwen2.5-Omni, Qwen3-Omni, R1-AQA, Audio Flamingo 3) and closed-source counterparts (Gemini 3 Pro, GPT-4 Audio). Attack Success Rate (ASR) is the central metric. It is computed with an LLM-as-Judge protocol that shows high agreement with human annotators, which makes the evaluation both scalable and reliable.
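The exact ASR formula is not spelled out here. A minimal sketch, assuming ASR is the percentage of attack queries for which the judge marks the response as hallucinated (the function name and the boolean verdict format are illustrative):

```python
from typing import Iterable


def attack_success_rate(verdicts: Iterable[bool]) -> float:
    """Percentage of attacks judged successful.

    Each verdict is True when the judge decides the model affirmed
    or described the absent sound (assumed convention).
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(verdicts) / len(verdicts)


# Toy verdicts, not results from the paper.
verdicts = [True, True, False, True]
print(attack_success_rate(verdicts))  # 75.0
```

Under this convention a lower ASR is better: it means the model rejected more of the false presuppositions.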

The pipeline enforces strict train–test separation by drawing AHA-Eval and AHA-Guard from segregated data sources. Two mitigation interventions are studied: test-time Chain-of-Thought (CoT) prompting and alignment-phase Direct Preference Optimization (DPO) fine-tuning.

Results and Analysis

Hallucination Vulnerabilities

The ASR results show persistent vulnerabilities across all model families:

Implicit attacks are dramatically more effective than explicit ones. For instance, Gemini 3 Pro's ASR on random hallucinated sounds increases from 10.88% (explicit) to 59.67% (implicit). The effect is consistent across models and attack surfaces.
Figure 2: Visual comparison of explicit versus implicit query vulnerabilities; LALMs often fail to reject the presupposed existence of hallucinated sounds under implicit queries.
Audio-based attacks are more potent than query-based (text) attacks. Even when models show relative robustness to explicit queries, manipulation at the audio stream level (e.g., speech primes about nonexistent events) induces high hallucination rates.
Contextually adversarial (plausible) absent events further increase ASR, with models more likely to hallucinate when the fabricated event matches language priors.
Qualitative examples (Figure 3) expose the persistence and escalation of hallucinations in multi-turn dialogues when primed with adversarial cues.
Figure 3: A Gemini 3 Pro conversation in which exposure to manipulated audio primes leads to compounding erroneous inferences about nonexistent sounds.
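To put the explicit-versus-implicit gap reported above for Gemini 3 Pro in perspective, the short calculation below expresses it both as an absolute difference in percentage points and as a ratio. The two inputs are the figures quoted in the text; the derived values are simple arithmetic, not additional results from the paper.

```python
explicit_asr = 10.88  # Gemini 3 Pro, random sounds, explicit queries (%)
implicit_asr = 59.67  # Gemini 3 Pro, random sounds, implicit queries (%)

absolute_gain = implicit_asr - explicit_asr  # in percentage points
ratio = implicit_asr / explicit_asr          # how many times higher

print(f"{absolute_gain:.2f} points, {ratio:.2f}x")  # 48.79 points, 5.48x
```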

Hallucination Mechanisms

Analysis of model internals reveals:

Reduced attention to true audio tokens during implicit attacks. Model intervention studies show mean attention weights on audio input decrease as queries become more implicit, indicating the model relies more on language priors than auditory verification.
Higher confidence in hallucinated responses with adversarial sounds. ASR correlates with increased logit confidence for 'yes' tokens when presupposed events are contextually plausible.
Figure 4: (Top) LALMs attend less to audio tokens during implicit attacks. (Bottom) Models display higher confidence in hallucinated "yes" responses, especially for contextually plausible hallucinations.
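The two diagnostics above can be illustrated with generic measurements. This is a sketch under stated assumptions, not the authors' analysis code: which layers, heads, and token positions are inspected is not given here, and restricting the confidence to a two-way "yes"/"no" softmax is an illustrative simplification. Both function names are hypothetical.

```python
import math
from typing import Sequence


def audio_attention_share(attn_row: Sequence[float], is_audio: Sequence[bool]) -> float:
    """Fraction of one query position's attention mass that lands on audio tokens."""
    if len(attn_row) != len(is_audio):
        raise ValueError("attention row and mask must have the same length")
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive mass")
    return sum(w for w, a in zip(attn_row, is_audio) if a) / total


def yes_confidence(yes_logit: float, no_logit: float) -> float:
    """Probability of 'yes' under a softmax over just the 'yes' and 'no' logits."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


# Toy values, not measurements from the paper.
row = [0.1, 0.2, 0.3, 0.4]         # attention weights over four input tokens
mask = [True, True, False, False]  # first two tokens are audio tokens
share = audio_attention_share(row, mask)
conf = yes_confidence(2.0, 0.0)
print(share, conf)
```

Averaging the share over heads, layers, and examples would give a single number per query type that can be compared between explicit and implicit attacks.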

Strategies for Hallucination Mitigation

Evaluation of mitigation techniques reveals limited efficacy for test-time intervention and emphasizes the importance of alignment-phase data augmentation:

Chain-of-Thought (CoT) prompting slightly improves explicit attack resistance but often increases susceptibility to implicit attacks.
Direct Preference Optimization (DPO) on AHA-Guard yields significant ASR reductions, particularly for implicit queries, reducing ASR in Qwen2.5-Omni by up to 49% without introducing a generic rejection bias.
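For readers unfamiliar with DPO, the standard per-pair objective is sketched below in plain Python. It is the generic DPO loss, not a reproduction of the paper's training setup: the value of `beta` and the toy log-probabilities are assumptions. In this setting the "chosen" response would be one that correctly rejects the absent sound and the "rejected" response one that hallucinates it.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under the frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy log-probabilities (assumed values).
no_preference = dpo_loss(-5.0, -5.0, -5.0, -5.0)       # policy equals reference
prefers_grounded = dpo_loss(-4.0, -8.0, -5.0, -5.0)    # policy favors the grounded answer
print(no_preference, prefers_grounded)
```

The loss falls below log 2 as soon as the policy raises the grounded answer's likelihood relative to the hallucinated one, measured against the reference model.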

Theoretical and Practical Implications

The findings delineate a critical misalignment: current LALMs systematically omit an audio-grounded verification step, especially under implicit or contextually loaded queries. The study exposes the limitations of prevailing evaluation benchmarks and the necessity of adversarial and nuanced probing for measuring true grounding capabilities. Practically, these reliability gaps pose risks for the deployment of LALMs in interactive settings, including assistive technologies, security-sensitive applications, and automated transcription or summarization.

On the theoretical side, the results highlight the importance of joint training and alignment strategies that explicitly penalize language-prior-driven generations and reward evidence-based validation, especially for compositional, multi-modal reasoning.

The proposed AHA-Guard dataset and associated mitigation pipeline also provide a foundation for further research into robust alignment under adversarial and contextually deceptive input conditions, broadening the prospects for truly multimodal alignment protocols.

Conclusion

Audio Hallucination Attacks (AHA) represent a comprehensive suite for stress-testing the robustness of LALMs. Through carefully constructed explicit/implicit queries and manipulative audio cues, AHA reveals a substantial hallucination vulnerability that is not captured by current benchmarks. The work advocates for adversarial and context-aware evaluation protocols and demonstrates that training-phase alignment using rich adversarial datasets like AHA-Guard can substantially improve model reliability. These findings have immediate implications for the evaluation, deployment, and further development of robust audio-language systems.
