Audio-aware Large Language Models
- Audio-aware Large Language Models (ALLMs) are architectures that fuse audio processing with text generation to enable audio-question answering, though they are prone to object hallucination.
- Audio-Aware Decoding (AAD) uses a contrastive decoding approach, comparing the model's logits under the actual audio versus a blank (silent) audio so that generated tokens are more firmly grounded in the real audio input.
- Empirical results reveal that AAD can significantly improve F1 scores and accuracy in audio QA tasks, demonstrating robust performance enhancements across different models.
Audio-aware LLMs (ALLMs) are architectures that combine audio input processing with the capabilities of LLMs to perform a wide range of audio question answering and reasoning tasks. Despite strong performance on conventional benchmarks, ALLMs commonly suffer from object hallucination: they produce outputs describing sounds, objects, or events not actually present in the supplied audio. To address this, Audio-Aware Decoding (AAD) has been proposed as a lightweight, inference-time method that systematically reduces hallucination without model retraining or reliance on specific prompts.
1. Problem of Object Hallucination in ALLMs
Object hallucination in ALLMs arises when a model generates assertions (e.g., “there is a bird chirping”) unsupported or contradicted by the input audio. As shown on recent benchmarks, hallucination is prevalent even among state-of-the-art models such as SALMONN-7B/13B and Qwen2-Audio-7B, especially for yes/no sound detection or object presence questions. This failure mode is considered dangerous for high-stakes scenarios, such as assistive technology or automated scene analysis, where false positives can undermine trust and operational safety.
2. Audio-Aware Decoding (AAD): Contrastive Decoding Framework
Audio-Aware Decoding (AAD) is an inference-time strategy designed to reduce hallucination by making generation more tightly conditioned on the audio signal. AAD is based on contrastive decoding principles and functions by modifying the token selection process at each generation step using the difference in model confidence with and without the audio input.
Mechanism
At each decoding timestep $t$, AAD computes the next-token logits twice, conditioning on the same prompt $x$ and prefix $y_{<t}$ but on two different audio inputs, $\mathrm{logit}_\theta(y_t \mid y_{<t}, x, a)$ and $\mathrm{logit}_\theta(y_t \mid y_{<t}, x, a_\varnothing)$, where:
- $a$: the actual audio input,
- $a_\varnothing$: a "blank" (silent) audio signal of equal duration.
The final token probability for step $t$ is then:
$$p_{\mathrm{AAD}}(y_t \mid y_{<t}, x) = \mathrm{softmax}\!\Big[(1+\alpha)\,\mathrm{logit}_\theta(y_t \mid y_{<t}, x, a) - \alpha\,\mathrm{logit}_\theta(y_t \mid y_{<t}, x, a_\varnothing)\Big]$$
where $\alpha$ is a tunable scalar (typically 0.5 or 1.0) controlling the strength of the audio-induced contrast. A higher $\alpha$ more aggressively promotes tokens whose likelihood increases in the presence of the actual audio.
This process "anchors" decoding on the evidence provided by the actual audio, promoting tokens that are genuinely supported by it and demoting generic or hallucinated continuations that the language prior alone would make likely.
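Below is a minimal sketch of one AAD decoding step, assuming a generic PyTorch-style interface in which `model(text_ids, audio)` returns next-token logits; the real SALMONN and Qwen2-Audio interfaces differ, so the call signature, `make_blank_audio`, and `aad_step` here are purely illustrative.

```python
import torch
import torch.nn.functional as F

def make_blank_audio(waveform: torch.Tensor) -> torch.Tensor:
    # "Blank" negative context: a silent signal of the same duration/shape.
    return torch.zeros_like(waveform)

def aad_step(model, text_ids, audio, blank_audio, alpha: float = 1.0):
    """One audio-aware contrastive decoding step (hypothetical model interface)."""
    with torch.no_grad():
        logits_audio = model(text_ids, audio)        # conditioned on real audio
        logits_blank = model(text_ids, blank_audio)  # conditioned on silence
    # Promote tokens whose likelihood rises when the real audio is present,
    # and demote continuations driven mainly by the language prior.
    contrastive = (1 + alpha) * logits_audio - alpha * logits_blank
    return F.softmax(contrastive, dim=-1)            # next-token distribution
```

The resulting distribution can then be plugged into any standard decoding loop (greedy, sampling, or beam search) in place of the usual softmax over logits.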
3. Empirical Evaluation and Performance Metrics
AAD was tested on the object hallucination benchmark as well as general audio QA benchmarks (e.g., Clotho-AQA), with three representative models:
| Model | Default F1 | AAD F1 | Absolute F1 Gain |
|---|---|---|---|
| SALMONN-7B | 0.233 | 0.737 | +0.504 |
| SALMONN-13B | 0.384 | 0.676 | +0.292 |
| Qwen2-Audio-7B | 0.302 | 0.737 | +0.435 |
In adversarial and popular object sampling regimes, F1 score improvements of up to +0.428 were observed. Across all splits, AAD consistently improved “no” F1 score (i.e., correct rejection of absent objects), often more than doubling performance compared to both default and prompt-engineering-only baselines.
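For reference, the per-class F1 scores quoted above can be computed with standard tooling by treating "no" (or "yes") as the positive class; the labels below are hypothetical placeholders, not benchmark data.

```python
from sklearn.metrics import f1_score

gold = ["no", "yes", "no", "no", "yes"]   # hypothetical ground-truth answers
pred = ["no", "yes", "yes", "no", "yes"]  # hypothetical model answers

no_f1 = f1_score(gold, pred, pos_label="no")    # rewards correct rejections
yes_f1 = f1_score(gold, pred, pos_label="yes")  # rewards correct detections
print(f"no-F1={no_f1:.3f}  yes-F1={yes_f1:.3f}")
```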
On general Clotho-AQA, AAD provided an accuracy boost of 5.4%–10.3% over the default decoder, showing that its effect is not limited to hallucination mitigation and does not impair general audio QA ability.
4. Component Ablations and Robustness
AAD was subjected to in-depth ablations to isolate the sources of its improvement:
- Contrastive strength ($\alpha$): F1 improvement peaked at an intermediate value of $\alpha$, balancing faithfulness and recall, while larger values of $\alpha$ excessively suppressed affirmative answers (a sweep sketch appears below).
- Prefix prompt necessity: AAD gains are additive to those from explicit prompting focused on audio context; omitting the prompt induces a drop in F1 but AAD still provides improvement over default decoding.
- Prompt sensitivity: Even with minimal or generic prompts (e.g., “Listen.”), AAD robustly outperformed non-contrastive methods, whereas prompt-only baselines saw sharp degradation.
This indicates the method is robust to prompt selection—a notable advantage over earlier prompt-centric mitigation strategies.
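The contrast-strength ablation can be reproduced in outline with a simple sweep; `answer_fn` (wrapping the full AAD decode loop) and `eval_fn` (e.g., the per-class F1 computation shown earlier) are hypothetical helpers, and the dataset schema is an assumption.

```python
def sweep_alpha(answer_fn, eval_fn, dataset, alphas=(0.25, 0.5, 1.0, 2.0)):
    """Run the same presence questions at several contrast strengths.

    answer_fn(example, alpha) -> "yes" / "no"   (AAD decoding at that alpha)
    eval_fn(gold, pred)       -> metric dict    (e.g. yes/no F1 scores)
    dataset                   -> list of dicts with an "answer" key
    """
    gold = [ex["answer"] for ex in dataset]
    return {
        alpha: eval_fn(gold, [answer_fn(ex, alpha) for ex in dataset])
        for alpha in alphas
    }
```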
5. Practical Implications and Limitations
AAD offers several practical advantages:
- Inference-time deployment: Requires no retraining or modification of model weights. It can be adopted in black-box settings, with only parallel model inference (with and without audio).
- Prompt-robust: Substantially reduces sensitivity to prompt wording or engineering.
- General applicability: Benefits a range of architectures and scenarios, as demonstrated with both SALMONN and Qwen2-Audio models.
Limitations include:
- Increased computation: Generation time is roughly doubled, since two model forward passes are required per decoding step. For latency-sensitive deployments this overhead should be considered (a possible batched mitigation is sketched after this list).
- Task specificity: Evaluated primarily for yes/no QA and object detection; extension to open-ended tasks or complex generation remains to be validated.
- Audio “blanking”: The choice of a blank/silence audio as the negative context is an engineering decision; alternatives may be explored for robustness.
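One possible mitigation for the doubled per-step cost, not evaluated in the source and stated here only as an engineering assumption, is to batch the real and blank audio into a single forward pass; the sketch below reuses the hypothetical interface from the earlier `aad_step` example.

```python
import torch
import torch.nn.functional as F

def aad_step_batched(model, text_ids, audio, blank_audio, alpha: float = 1.0):
    """Same contrastive step, but with one batched forward pass.

    Assumes `model` accepts a leading batch dimension on both inputs and that
    `text_ids` has shape [1, seq_len]. Batching reduces per-step latency but
    not the total amount of computation.
    """
    audio_batch = torch.stack([audio, blank_audio], dim=0)   # [2, ...]
    text_batch = text_ids.expand(2, -1)                      # same text twice
    with torch.no_grad():
        logits = model(text_batch, audio_batch)               # [2, vocab]
    contrastive = (1 + alpha) * logits[0] - alpha * logits[1]
    return F.softmax(contrastive, dim=-1)
```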
6. Future Directions
Future work may include:
- Extending AAD to open-form QA, captioning, and multi-turn audio dialogues.
- Reducing computational cost, perhaps by approximation or batched/partial contrastive evaluation.
- Exploring “semantic blanking” strategies for different audio domains.
A plausible implication is that AAD or similar contrastive decoding techniques may become a standard post-processing layer for trustworthy inference in ALLMs, applying to other modalities beyond audio.
Key Equation: Audio-Aware Contrastive Decoding
$$p_{\mathrm{AAD}}(y_t \mid y_{<t}, x) = \mathrm{softmax}\!\Big[(1+\alpha)\,\mathrm{logit}_\theta(y_t \mid y_{<t}, x, a) - \alpha\,\mathrm{logit}_\theta(y_t \mid y_{<t}, x, a_\varnothing)\Big]$$
Performance Highlights
- F1 score increases by 0.046–0.428 on object hallucination benchmarks compared to default decoding.
- 5.4–10.3% accuracy gain on general audio QA tasks.
- Robustness to prompt engineering and model choice.
AAD represents an inference-centric solution for reducing hallucination in ALLMs, emphasizing alignment between model outputs and actual audio evidence.