Large Audio-Language Models (LALMs)
- Large Audio-Language Models (LALMs) are advanced multimodal systems that fuse audio and text processing using specialized audio encoders and LLM backbones.
- They employ modular architectures and innovations like soft prompting, state-space models, and dual-codebook tokenization to enhance cross-modal understanding and performance.
- Ongoing challenges include modality fusion, audio hallucination, text bias, and adversarial robustness, which steer future research in training and evaluation.
Large Audio-Language Models (LALMs) are multimodal machine learning systems that extend LLMs with the ability to perceive, process, and reason over audio signals alongside text. These models address a broad range of tasks involving speech, environmental sounds, music, and multimodal dialogue, supporting applications from audio question answering to complex spoken language reasoning and robust human–computer interaction. Their development is characterized by innovations in model design, benchmarking, robustness evaluation, and post-training strategies, as well as persistent challenges in modality fusion, bias, reliability, and generalization.
1. Model Architectures and Modality Integration
LALMs typically adopt a modular structure in which audio encoders (e.g., Audio Spectrogram Transformer, Whisper, DASS) are coupled with LLM backbones (e.g., Llama, GPT-family), sometimes via modality adapters such as Q-Former blocks. Advanced models such as GAMA integrate information from multiple intermediate layers of audio encoders, aggregating both surface-level and semantic cues through attention and feedforward fusion mechanisms (Ghosh et al., 17 Jun 2024). Soft prompt techniques inject high-level audio semantics, such as event tags, into language prompts, enhancing reasoning about complex acoustic signals.
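As a concrete illustration, the following is a minimal PyTorch sketch of this modular pattern; the dimensions, the two-layer placeholder backbone, and the linear projector standing in for a Q-Former-style adapter are all illustrative assumptions, not any published model's actual configuration.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps audio-encoder features into the LLM embedding space.
    A single linear layer stands in for a Q-Former-style adapter."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_feats)

class ToyLALM(nn.Module):
    """Modular LALM: audio encoder features -> projector -> LLM backbone.
    The tiny transformer here is a placeholder for a pretrained LLM."""
    def __init__(self, audio_dim=768, llm_dim=512, vocab=32000):
        super().__init__()
        self.projector = AudioProjector(audio_dim, llm_dim)
        self.text_embed = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, audio_feats, text_ids):
        # Projected audio tokens are prepended to the text embeddings,
        # acting as a soft prompt that carries acoustic semantics.
        audio_tok = self.projector(audio_feats)          # (B, Ta, D)
        text_tok = self.text_embed(text_ids)             # (B, Tt, D)
        fused = torch.cat([audio_tok, text_tok], dim=1)  # (B, Ta+Tt, D)
        return self.lm_head(self.backbone(fused))

# Example: 100 frames of Whisper-style 768-d features plus 16 text tokens.
model = ToyLALM()
logits = model(torch.randn(2, 100, 768), torch.randint(0, 32000, (2, 16)))
```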
State-space models (SSMs) have emerged as an alternative to transformers in LALMs, offering linear time and memory complexity and efficient propagation of long audio sequences. SSMs can be employed in both perception and language modules, either fully replacing transformers (ssLALM) or in hybrid configurations with LoRA-based adapters to minimize parameter overhead while preserving competitive performance (Bhati et al., 24 Nov 2024).
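The LoRA-based adapters mentioned above follow the standard low-rank update recipe; a generic PyTorch sketch (the rank and scaling values are hypothetical, and this is not the ssLALM implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x)). Standard LoRA formulation."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)        # update starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.B(self.A(x))

# Wrapping a 512x512 projection trains only 2 * 8 * 512 = 8192 parameters.
adapted = LoRALinear(nn.Linear(512, 512), r=8)
```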
End-to-end LALMs, exemplified by Step-Audio-AQAA, integrate dual-codebook tokenization (linguistic and semantic) and neural vocoders to support Audio Query–Audio Answer (AQAA) in a single, coherent pipeline—enabling both input and output in the audio domain and supporting joint text–audio generation (Huang et al., 10 Jun 2025).
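How two codebook streams can be merged into a single sequence for joint modeling is sketched below; the vocabulary offset and the fixed interleaving ratio are illustrative assumptions, not Step-Audio-AQAA's actual tokenization scheme.

```python
def interleave_codebooks(linguistic_ids, semantic_ids,
                         linguistic_vocab=1024, ratio=(2, 3)):
    """Merge linguistic and semantic codebook streams into one sequence.

    Semantic ids are offset past the linguistic vocabulary so the two
    codebooks share one token space; the 2:3 ratio is an assumption.
    """
    merged, i, j = [], 0, 0
    while i < len(linguistic_ids) or j < len(semantic_ids):
        merged += linguistic_ids[i:i + ratio[0]]
        merged += [t + linguistic_vocab for t in semantic_ids[j:j + ratio[1]]]
        i += ratio[0]
        j += ratio[1]
    return merged
```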
2. Benchmarking and Evaluation Methodologies
Robust benchmarking is central to measuring LALM capabilities and limitations:
- Closed and Open-ended Audio QA: AIR-Bench is a prominent benchmark that spans foundation (19 single-task audio challenges, ~19k queries) and chat (2k open-ended QAs) settings, employing unified GPT-4-based automated evaluation of generated outputs with human-level alignment (Yang et al., 12 Feb 2024).
- Object Hallucination: Discriminative (binary “Yes”/“No”) and generative (captioning) tasks quantify the model's tendency to hallucinate non-existent audio objects using metrics like ECHO_I (instance-level hallucination), ECHO_S (sentence-level), and cover rates (Kuan et al., 12 Jun 2024).
- Joint Modal Understanding: SSEU-Bench evaluates ASR, acoustic scene classification, and event tagging under both independent and joint settings, with explicit control over foreground–background energy and SNR, simulating real-world auditory mixtures (Yin et al., 16 Sep 2025).
- Dialogue and Ambiguity: ADU-Bench assesses open-ended audio dialogue across scenarios, multilinguality, skill domains (math, coding, law, etc.), and ambiguity types (intonation, pause, homophone), with LLM-based scoring (Gao et al., 6 Dec 2024).
- Conflict Resolution: MCR-Bench systematically exposes LALMs to faithful, adversarial, and irrelevant text paired with audio to measure text bias and the model's ability to resolve cross-modal conflict (Wang et al., 21 Aug 2025).
- Systematic Evaluation Toolkits: LALM-Eval is an open-source, high-throughput evaluation system supporting 380+ tasks, standardized prompt protocols, and novel categories such as LLM-adaptive diarization and spoken language reasoning, revealing persistent model gaps in temporal understanding and reasoning (Surapaneni et al., 9 Sep 2025).
Current evaluation practice leans heavily on LLM-as-judge protocols (e.g., GPT-4), which achieve >96% agreement with human raters, though positional bias and reproducibility remain ongoing concerns.
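A minimal sketch of such a judge call, assuming the `openai` Python client; the rubric wording and score parsing are illustrative, not any benchmark's official protocol.

```python
import re
from openai import OpenAI  # assumes the openai>=1.0 client library

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = ("You grade answers to audio questions. Given a question, a "
          "reference answer, and a model answer, reply with one integer "
          "score from 1 (wrong) to 5 (fully correct).")

def judge(question: str, reference: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic scoring aids reproducibility
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (f"Question: {question}\n"
                                         f"Reference: {reference}\n"
                                         f"Model answer: {answer}")},
        ],
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 0
```

Randomizing the order in which candidate answers are presented across repeated calls is one common mitigation for the positional bias noted above.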
3. Audio Hallucination, Modality Bias, and Reliability
A persistent limitation of LALMs is their susceptibility to audio hallucination—generating output not grounded in the provided audio. Comparative evaluations show that:
- Models demonstrate high-quality audio captioning but struggle with discriminative (presence/absence) tasks, often yielding high recall but low precision, reflecting a “Yes” bias regardless of ground truth (Kuan et al., 12 Jun 2024).
- Hallucinations extend to silent inputs, with models occasionally outputting detailed scene descriptions for audio containing no content at all (Kuan et al., 21 Oct 2024).
- Contrastive decoding via Audio-Aware Decoding (AAD) reduces hallucination by comparing the predicted token distributions with and without the real audio input, promoting output grounded in auditory evidence and yielding F1 improvements of up to 0.428 on object hallucination tasks (Hsu et al., 8 Jun 2025); a sketch of the contrastive step follows this list.
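A sketch of that contrastive step, assuming a model whose forward pass accepts optional audio features and returns per-position logits; the mixing weight `alpha` is a hypothetical hyperparameter.

```python
import torch

@torch.no_grad()
def audio_aware_logits(model, text_ids, audio_feats, alpha=1.0):
    """Contrast next-token logits computed with vs. without the audio.

    Tokens whose likelihood depends on the real audio are boosted, and
    tokens the text prior alone would emit are damped. The interface
    and `alpha` are assumptions, not the paper's exact formulation.
    """
    with_audio = model(audio_feats=audio_feats, text_ids=text_ids)[:, -1, :]
    no_audio = model(audio_feats=None, text_ids=text_ids)[:, -1, :]
    return (1 + alpha) * with_audio - alpha * no_audio
```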
Text bias is another acute issue: LALMs systematically prioritize textual input in the presence of modality conflict, with adversarial text inputs reducing accuracy from >85% to below 2% in audio-centric tasks; models remain overconfident in such settings (as measured by maximal token probabilities) (Wang et al., 21 Aug 2025). Supervised fine-tuning with conflict-enriched data and bias-aware prompting modestly mitigates but does not eliminate this susceptibility.
Reliability in LALMs (the ability to recognize their knowledge boundaries and refuse to answer when uncertain) can be enhanced through multi-modal chain-of-thought (MCoT) prompting, which decomposes reasoning steps and increases transparency, as well as through supervised fine-tuning on curated “I don’t know” (IDK) datasets (Ma et al., 25 May 2025). The Reliability Gain Index (RGI) provides a nuanced metric for balancing conservative behavior (rejecting answers that would have been correct) against humble behavior (rejecting answers that would have been wrong).
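One generic way to curate such IDK data is to relabel questions the model currently gets wrong with an explicit refusal target; this recipe is a sketch in the spirit of the cited work, with hypothetical field names, not its exact pipeline.

```python
IDK_TARGET = "I don't know."

def build_idk_sft(samples, predictions):
    """Relabel incorrectly answered items with a refusal target.

    `samples` are dicts with "audio", "question", and "answer" keys;
    `predictions` are the model's current answers (hypothetical format).
    """
    sft_data = []
    for sample, pred in zip(samples, predictions):
        correct = pred.strip() == sample["answer"].strip()
        sft_data.append({
            "audio": sample["audio"],
            "question": sample["question"],
            # Keep the true answer when the model knows it; otherwise
            # teach an explicit refusal at its knowledge boundary.
            "target": sample["answer"] if correct else IDK_TARGET,
        })
    return sft_data
```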
4. Training Paradigms, Data Construction, and Post-Training
Contemporary LALMs are trained on large-scale, multimodal corpora. Key innovations include:
- Text-only Supervision: MATS demonstrates that pairing pre-trained CLAP encoders with a memory-based Santa mechanism that bridges the modality gap allows text-only training to endow the LLM with audio comprehension (Wang et al., 19 Feb 2025). This reduces data collection costs while achieving competitive zero-shot and open-ended performance.
- Self-Generated Cross-Modal Alignment: DeSTA2.5-Audio introduces a pipeline where the backbone LLM generates its own training targets (responses) using structured metadata and an instruction pool. This approach maintains stylistic consistency and minimizes catastrophic forgetting of language ability, enabling strong zero-shot generalization and efficient training relative to data scale (Lu et al., 3 Jul 2025).
- Audio-Contribution-Aware Post-Training: Many LALMs achieve strong test accuracy by leveraging textual content alone (“zero audio-contribution”) rather than the audio itself. AudioMCQ provides a diagnostic dataset for this phenomenon, and Audio-Contribution Filtering partitions training data into weak and strong subsets (see the sketch after this list). Training regimes such as Weak-to-Strong (SFT on text-easy data, RL on audio-hard data) and Mixed-to-Strong have achieved new state-of-the-art results in audio QA (He et al., 25 Sep 2025).
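A sketch of the filtering idea: if a text-only model already answers an item correctly from the question and choices alone, that item's audio contribution is weak. The model interface and field names are hypothetical.

```python
def audio_contribution_split(dataset, text_only_model):
    """Partition MCQ items by whether the audio is actually needed.

    Items solvable without listening go to the weak (text-easy) subset;
    the remainder form the strong (audio-hard) subset.
    """
    weak, strong = [], []
    for item in dataset:
        pred = text_only_model.answer(item["question"], item["choices"])
        (weak if pred == item["answer"] else strong).append(item)
    return weak, strong

# Weak-to-Strong regime from the text (function names hypothetical):
# weak, strong = audio_contribution_split(train_set, text_llm)
# run_sft(lalm, weak)   # supervised fine-tuning on text-easy items
# run_rl(lalm, strong)  # reinforcement learning on audio-hard items
```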
Curriculum learning and direct mixing, both of which incorporate small amounts of paired speech into training, are effective for improving spoken language understanding in low-resource settings, with curriculum learning providing more stable improvements when paired speech data is extremely limited (Choi et al., 18 Sep 2025).
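A sketch of one such schedule, where the speech share of each batch grows linearly over training; the linear ramp and 50% cap are illustrative assumptions.

```python
import random

def curriculum_batch(text_pool, speech_pool, epoch, total_epochs,
                     batch_size=32, max_speech_frac=0.5):
    """Sample a mixed batch whose speech share grows with the epoch.

    Epoch 0 is (almost) all text; by the final epoch half of each batch
    is paired speech data. Both the ramp and cap are hypothetical.
    """
    ramp = epoch / max(1, total_epochs - 1)
    n_speech = int(batch_size * max_speech_frac * min(1.0, ramp))
    return (random.sample(speech_pool, n_speech) +
            random.sample(text_pool, batch_size - n_speech))
```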
5. Robustness, Safety, and Adversarial Threats
Robustness evaluation is increasingly recognized as essential for LALMs deployed in adversarial or noisy settings:
- Audio Injection Attacks: Systematic evaluations across major threat scenarios (audio interference, instruction following, context injection, and judgment hijacking) reveal that no single model excels across all threats. Attack effectiveness is highly sensitive to the position of injected content, with early-sequence placement being most damaging (a position-sweep sketch follows this list). An inverse correlation exists between strong instruction-following ability and adversarial robustness; safety-aligned models tend to be less susceptible (Hou et al., 26 May 2025).
- Stealthy Adversarial Jailbreaks: AdvWave introduces a dual-phase optimization (at the token and waveform levels) enabling effective end-to-end adversarial attacks even in the presence of discretizing (non-differentiable) audio encoders. The approach leverages a gradient retention loss and classifier-guided perceptual constraints to create audio perturbations that sound natural yet elicit illicit model behavior, achieving a 40% higher jailbreak success rate than prior baselines (Kang et al., 11 Dec 2024).
- Meta Ability of Reliability: The ability to confidently reject uncertain queries when poorly informed is shown to transfer across modalities (e.g., from speech to music), suggesting the emergence of transferable “meta abilities” valuable for more generalizable and robust AI (Ma et al., 25 May 2025).
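The position sensitivity noted above suggests a simple evaluation harness: plant the same injected content at different offsets in the input context and compare task accuracy. The sketch below operates on a generic text context for brevity (the cited work injects audio segments), and the model interface is a hypothetical stand-in.

```python
def injection_position_sweep(model, tasks, injected_text):
    """Measure accuracy when an injection lands early, middle, or late.

    `tasks` is a list of (context, question, answer) triples; lower
    accuracy at a position indicates higher vulnerability there.
    """
    results = {}
    for where in ("early", "middle", "late"):
        correct = 0
        for context, question, answer in tasks:
            cut = {"early": 0,
                   "middle": len(context) // 2,
                   "late": len(context)}[where]
            poisoned = context[:cut] + injected_text + context[cut:]
            correct += int(model.answer(poisoned, question) == answer)
        results[where] = correct / len(tasks)
    return results
```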
6. Advanced Reasoning, Dialogue, and Joint Modal Tasks
Contemporary LALMs are increasingly evaluated on complex reasoning and dialogue understanding:
- Complex Reasoning: CompA-R and CompA-R-test datasets support instruction tuning geared for high-level, multi-hop reasoning over audio, with GAMA and related models achieving improvements of 1–84% over prior art on classification and QA (Ghosh et al., 17 Jun 2024).
- Open-Ended Audio Dialogue: ADU-Bench advances the evaluation of LALMs’ capabilities in multilingual, skill-specific, and ambiguous spoken dialogues, revealing marked gaps in performance on skills requiring symbolic reasoning (mathematics, coding), human-behavioral insight (roleplay), and nuanced acoustic ambiguity (intonation, pause, homophone) (Gao et al., 6 Dec 2024).
- Joint Audio Understanding: SSEU-Bench uniquely evaluates models on the joint recognition of speech, scenes, and events within a single clip, simulating real-world mixtures with controlled SNR; chain-of-thought decomposed inference improves performance over monolithic outputs, highlighting the need for explicit intermediate reasoning representations (Yin et al., 16 Sep 2025). A sketch of such decomposed inference follows this list.
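A sketch of decomposed inference: intermediate descriptions are elicited first, and the final answer is conditioned on them. The prompts and two-stage interface are illustrative assumptions, not the benchmark's exact protocol.

```python
def decomposed_inference(lalm, audio, question):
    """Answer a joint-understanding question in two explicit stages.

    Stage 1 elicits intermediate evidence (transcript, scene, events);
    stage 2 answers conditioned on that evidence. The `lalm.generate`
    interface is a hypothetical stand-in.
    """
    transcript = lalm.generate(audio, "Transcribe any speech in the clip.")
    scene = lalm.generate(audio, "Name the acoustic scene in one phrase.")
    events = lalm.generate(audio, "List the sound events you hear.")

    prompt = (f"Transcript: {transcript}\nScene: {scene}\n"
              f"Events: {events}\nQuestion: {question}\nAnswer:")
    return lalm.generate(audio, prompt)
```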
7. Future Directions and Open Challenges
The field continues to evolve with several critical areas for improvement:
- Enhanced Fusion and Modality Balancing: Mitigating the dominance of text in multimodal fusion and achieving better calibration between modalities are essential for trustworthy deployment, especially where text may be adversarial or noisy (Wang et al., 21 Aug 2025).
- Efficient and Adaptive Training: Leveraging data-efficient and self-aligned data construction (e.g., DeSTA2.5-Audio), curriculum learning, and text-only strategies (MATS) can greatly reduce annotation burden and improve cross-lingual and low-resource performance (Wang et al., 19 Feb 2025, Lu et al., 3 Jul 2025, Choi et al., 18 Sep 2025).
- Robustness and Defenses: Architectural and method-level innovations such as integrating both SSMs and transformers, developing tailored system prompts, and constructing multi-modal defense mechanisms are crucial to resisting adversarial manipulation and cross-modal attacks (Hou et al., 26 May 2025, Bhati et al., 24 Nov 2024).
- Refined Evaluation: Continued development of holistic, standardized evaluation toolkits (e.g., LALM-Eval), improved human–aligned metrics, and systematic benchmarks that account for ambiguity, joint reasoning, and real-world complexity are necessary for progress (Surapaneni et al., 9 Sep 2025).
In sum, Large Audio-Language Models are establishing themselves as a new multimodal foundation technology, enabled by a confluence of architectural, training, and evaluation innovations. Persistent issues with hallucination, text bias, robustness, and generalization motivate ongoing research into cross-modal fusion, data construction, adaptive post-training, and systematic evaluation. Robust, general-purpose LALMs will require further improvements in reliability, multi-stage reasoning, adversarial resistance, and the nuanced integration of auditory and linguistic modalities across real-world application domains.