Audio-aware Large Language Models

Updated 30 June 2025
  • Audio-aware Large Language Models (ALLMs) are architectures that fuse audio processing with text generation to enable audio-question answering, though they are prone to object hallucination.
  • Audio-Aware Decoding (AAD) applies a contrastive decoding approach, comparing logits computed with the actual audio versus a blank (silent) audio so that outputs are more firmly grounded in the real audio input.
  • Empirical results reveal that AAD can significantly improve F1 scores and accuracy in audio QA tasks, demonstrating robust performance enhancements across different models.

Audio-aware LLMs (ALLMs) are architectures that combine audio input processing with the capabilities of LLMs to perform a wide range of audio-question answering and reasoning tasks. Despite strong performance on conventional benchmarks, ALLMs commonly suffer from object hallucination: producing outputs that describe sounds, objects, or events not actually present in the supplied audio. To address this, Audio-Aware Decoding (AAD) has been proposed as a lightweight, inference-time method that systematically reduces hallucination without model retraining or reliance on specific prompts.

1. Problem of Object Hallucination in ALLMs

Object hallucination in ALLMs arises when a model generates assertions (e.g., “there is a bird chirping”) unsupported or contradicted by the input audio. As shown on recent benchmarks, hallucination is prevalent even among state-of-the-art models such as SALMONN-7B/13B and Qwen2-Audio-7B, especially for yes/no sound detection or object presence questions. This failure mode is considered dangerous for high-stakes scenarios, such as assistive technology or automated scene analysis, where false positives can undermine trust and operational safety.

2. Audio-Aware Decoding (AAD): Contrastive Decoding Framework

Audio-Aware Decoding (AAD) is an inference-time strategy designed to reduce hallucination by making generation more tightly conditioned on the audio signal. AAD is based on contrastive decoding principles and functions by modifying the token selection process at each generation step using the difference in model confidence with and without the audio input.

Mechanism

At each decoding timestep $t$, AAD computes:

$$
\begin{aligned}
\text{logit}^{(t)}_{\text{with-audio}} &= \text{logit}\!\left( y_t \mid \mathcal{A}, \mathbf{x}, \mathbf{y}_{<t} \right) \\
\text{logit}^{(t)}_{\text{without-audio}} &= \text{logit}\!\left( y_t \mid \mathcal{A}_{\text{blank}}, \mathbf{x}, \mathbf{y}_{<t} \right)
\end{aligned}
$$

where:

  • $\mathcal{A}$: the actual audio input,
  • $\mathcal{A}_{\text{blank}}$: a "blank" (silent) audio signal of equal duration,
  • $\mathbf{x}$: the text prompt, and $\mathbf{y}_{<t}$: the tokens generated so far.

The final token probability at step $t$ is then:

$$
\mathbf{p}^{(t)}_{\text{AAD}} = \operatorname{softmax}\!\left[ (1+\alpha)\,\text{logit}^{(t)}_{\text{with-audio}} - \alpha\,\text{logit}^{(t)}_{\text{without-audio}} \right]
$$

where $\alpha$ is a tunable scalar (typically 0.5 or 1.0) controlling the strength of the audio-induced contrast. A higher $\alpha$ more aggressively promotes tokens whose likelihood increases when the real audio is present.

This procedure anchors decoding on evidence from the actual audio, promoting tokens that are genuinely supported by it and demoting generic or hallucinated continuations that stem from the language-model prior alone.
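To make the mechanism concrete, the following is a minimal sketch of a single AAD decoding step, assuming access to the model's next-token logits under both the real audio and an equal-length silent clip; the function and variable names are illustrative and not taken from any specific implementation.

```python
import numpy as np

def aad_probs(logits_with_audio: np.ndarray,
              logits_without_audio: np.ndarray,
              alpha: float = 1.0) -> np.ndarray:
    """Audio-Aware Decoding step: contrast next-token logits computed with the
    real audio against logits computed with a silent ("blank") clip of equal
    duration, then renormalize with a softmax."""
    contrasted = (1.0 + alpha) * logits_with_audio - alpha * logits_without_audio
    contrasted = contrasted - contrasted.max()   # shift for numerical stability
    exp = np.exp(contrasted)
    return exp / exp.sum()

# Toy vocabulary: ["yes", "no", "bird", "car"]. "bird" is plausible under the
# text-only prior (silent audio) but not supported by the real audio, so the
# contrast demotes it.
logits_with_audio = np.array([2.0, 1.5, 0.5, -1.0])   # conditioned on real audio
logits_blank      = np.array([1.0, 0.2, 2.5, -1.0])   # conditioned on silent audio

p_default = np.exp(logits_with_audio) / np.exp(logits_with_audio).sum()
p_aad     = aad_probs(logits_with_audio, logits_blank, alpha=1.0)
print("default:", p_default.round(3))
print("AAD:    ", p_aad.round(3))
```

In a full decoder loop, the two logit vectors would come from two forward passes at every step (one with the true audio, one with a silent clip of the same duration), and greedy decoding or sampling would then proceed from the returned distribution.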

3. Empirical Evaluation and Performance Metrics

AAD was tested on the object hallucination benchmark as well as general audio QA benchmarks (e.g., Clotho-AQA), with three representative models:

| Model | Default F1 | AAD F1 ($\alpha = 1.0$) | Absolute F1 Gain |
|----------------|-------|-------|--------|
| SALMONN-7B | 0.233 | 0.737 | +0.504 |
| SALMONN-13B | 0.384 | 0.676 | +0.292 |
| Qwen2-Audio-7B | 0.302 | 0.737 | +0.435 |

In adversarial and popular object sampling regimes, F1 score improvements of up to +0.428 were observed. Across all splits, AAD consistently improved “no” F1 score (i.e., correct rejection of absent objects), often more than doubling performance compared to both default and prompt-engineering-only baselines.

On the general Clotho-AQA benchmark, AAD provided an accuracy boost of 5.4%–10.3% over the default decoder, showing that its benefit is not limited to hallucination mitigation and that it does not impair general audio QA ability.

4. Component Ablations and Robustness

AAD was subjected to in-depth ablations to isolate the sources of its improvement:

  • Contrastive strength ($\alpha$): F1 improvement peaked at $\alpha \approx 1.0$, balancing faithfulness and recall. Larger $\alpha$ excessively suppressed affirmative answers.
  • Prefix prompt necessity: AAD gains are additive to those from explicit prompting focused on audio context; omitting the prompt induces a drop in F1 but AAD still provides improvement over default decoding.
  • Prompt sensitivity: Even with minimal or generic prompts (e.g., “Listen.”), AAD robustly outperformed non-contrastive methods, whereas prompt-only baselines saw sharp degradation.

This indicates the method is robust to prompt selection—a notable advantage over earlier prompt-centric mitigation strategies.

5. Practical Implications and Limitations

AAD offers several practical advantages:

  • Inference-time deployment: Requires no retraining or modification of model weights. It can be adopted in black-box settings, with only parallel model inference (with and without audio).
  • Prompt-robust: Substantially reduces sensitivity to prompt wording or engineering.
  • General applicability: Benefits a range of architectures and scenarios, as demonstrated with both SALMONN and Qwen2-Audio models.

Limitations include:

  • Increased computation: Generation time is roughly doubled, since two model forward passes are required per decoding step. For latency-sensitive deployments this overhead should be weighed; a batched mitigation is sketched after this list.
  • Task specificity: Evaluated primarily for yes/no QA and object detection; extension to open-ended tasks or complex generation remains to be validated.
  • Audio “blanking”: The choice of a blank/silence audio as the negative context is an engineering decision; alternatives may be explored for robustness.
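The doubled-compute limitation can be partially mitigated by running the with-audio and blank-audio contexts as a single batch of size two, so the second pass is parallelized rather than serialized. This batching trick is not described in the source, and the helper names below (`aad_step_batched`, `forward_fn`, `toy_forward`) are hypothetical; the snippet is a PyTorch-flavored sketch, not a reference implementation.

```python
import torch

def aad_step_batched(forward_fn, audio_feats, blank_feats, prefix_ids, alpha=1.0):
    """One AAD decoding step with the real-audio and blank-audio contexts
    stacked into a single batch of size 2, so the extra forward pass mostly
    rides along with the first on parallel hardware (total FLOPs unchanged).
    `forward_fn` is a hypothetical callable: (audio_batch, token_batch) ->
    next-token logits of shape [batch, vocab]; substitute the actual ALLM here."""
    audio_batch = torch.stack([audio_feats, blank_feats])    # [2, frames, dim]
    token_batch = prefix_ids.unsqueeze(0).expand(2, -1)      # same prefix twice
    logits = forward_fn(audio_batch, token_batch)            # [2, vocab]
    contrasted = (1.0 + alpha) * logits[0] - alpha * logits[1]
    return torch.softmax(contrasted, dim=-1)

# Stand-in model so the sketch runs end to end; replace with a real ALLM call.
def toy_forward(audio_batch, token_batch, vocab=8):
    torch.manual_seed(0)
    proj = torch.nn.Linear(audio_batch.shape[-1], vocab)
    return proj(audio_batch.mean(dim=1))                     # [2, vocab]

audio = torch.randn(50, 16)        # 50 frames of 16-dim features (placeholder)
blank = torch.zeros_like(audio)    # "blank" audio: silence of equal duration
probs = aad_step_batched(toy_forward, audio, blank, torch.tensor([1, 2, 3]))
print(probs)                       # adjusted next-token distribution
```

Batching hides latency on accelerators with spare throughput but does not reduce total computation, and per-step memory roughly doubles; whether it pays off depends on the deployment.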

6. Future Directions

Future work may include:

  • Extending AAD to open-form QA, captioning, and multi-turn audio dialogues.
  • Reducing computational cost, perhaps by approximation or batched/partial contrastive evaluation.
  • Exploring “semantic blanking” strategies for different audio domains.

A plausible implication is that AAD or similar contrastive decoding techniques may become a standard inference-time component for trustworthy ALLM outputs, with analogous approaches applicable to modalities beyond audio.


Key Equation: Audio-Aware Contrastive Decoding

$$
\mathbf{p}^{(t)}_{\text{AAD}} = \operatorname{softmax}\!\left[ (1+\alpha)\,\text{logit}^{(t)}_{\text{with-audio}} - \alpha\,\text{logit}^{(t)}_{\text{without-audio}} \right]
$$

Performance Highlights

  • F1 score increases by 0.046–0.428 on object hallucination benchmarks compared to default decoding.
  • 5.4–10.3% accuracy gain on general audio QA tasks.
  • Robustness to prompt engineering and model choice.

AAD represents an inference-centric solution for reducing hallucination in ALLMs, emphasizing alignment between model outputs and actual audio evidence.
