Audio-Text LLMs: Bridging Audio and Text
- Audio-Text LLMs are AI models that seamlessly fuse auditory and textual modalities, enabling advanced speech recognition, synthesis, and retrieval tasks.
- They employ unified token-based, hybrid, and adapter-driven architectures to map audio signals into language model token spaces for effective in-context learning.
- Innovative training strategies and large-scale data automation drive robust cross-modal alignment and safe deployment across applications like translation, music production, and interactive systems.
Audio-Text LLMs are a category of artificial intelligence systems designed to jointly process, relate, and generate information across auditory and textual modalities. These models extend the foundational architecture and techniques of LLMs to tasks that require comprehending, generating, or aligning speech, environmental sounds, music, or other audio signals with natural language. This field has rapidly expanded, now encompassing diverse applications such as speech recognition, text-conditioned audio generation, multimodal retrieval, music production, and the compositional synthesis of audio from text. The progress is driven by advances in model architectures, representation learning, large-scale dataset construction, and innovative integration strategies linking audio and text spaces.
1. Model Architectures and Modality Integration
Audio-Text LLMs span a variety of architectures that reflect different integration philosophies:
- Unified Token-Based Architectures: Systems such as AudioPaLM jointly model text and audio as sequences of discrete tokens, extending a text LLM’s vocabulary to include audio tokens derived via k-means clustering or similar quantization techniques. AudioPaLM fuses a pre-trained text-only LLM (PaLM-2) with a speech-based model (AudioLM), allowing the Transformer decoder to process and generate both modalities (Rubenstein et al., 2023). This approach supports applications including ASR, speech-to-speech translation (S2ST), and text-to-speech (TTS), and retains the flexibility of a shared, modality-agnostic backbone.
- Hybrid and Coupled Architectures: Other models focus on explicit separation and coupling, where a pre-trained LLM is used as a text encoder and deployed in conjunction with a specialized speech codec model. For instance, the most effective approach for endowing an LLM with speech synthesis capability is to couple the LLM (as a text semantic encoder) with a speech synthesis decoder (such as VALL-E), passing the LLM’s embeddings directly as input to the speech decoder. Experimental evidence demonstrates that this “coupled” method outperforms treating codec tokens as a language modeling target (Hao et al., 2023).
- Adapter-Based and Prompt-Driven Methods: Composition via lightweight adaptation modules—such as LoRA (Low-Rank Adaptation)—enables parameter-efficient integration of audio and text components. LoRA is frequently used to adapt LLMs’ attention layers to audio-text fusion with minimal additional trainable parameters, supporting architectures for ASR (Fathullah et al., 2023), audio-visual speech recognition (Cappellazzo et al., 18 Sep 2024), and multimodal recommender systems (Qin, 13 Sep 2024). A minimal sketch of the shared adapter pattern appears after this list.
- Cross-Modal Representation Mapping: Some systems, such as UniAudio 1.5, employ learned codecs to quantize audio directly into an LLM’s pre-existing token space, effectively treating audio as a “foreign language” and enabling in-context, few-shot audio tasks without retraining (Yang et al., 14 Jun 2024). Others like MATS rely on a pre-trained audio-language alignment encoder (CLAP), training LLMs solely on text and projecting audio representations into the LLM’s space at inference time using adapters and the SANTA modality-transfer mechanism (Wang et al., 19 Feb 2025).
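The adapter and projection approaches above share a common pattern: a frozen audio encoder produces features, and a small trainable module maps them into the LLM's embedding space, where they are consumed alongside text tokens. The following is a minimal PyTorch sketch of that pattern; the module names, dimensions, and pooling choice are illustrative assumptions rather than the architecture of any specific system cited above.

```python
# Minimal sketch (not any paper's actual code): project a frozen audio
# encoder's features into an LLM's embedding space and prepend them to the
# text embeddings as a soft prefix. All dimensions are illustrative.
import torch
import torch.nn as nn

class AudioToLLMAdapter(nn.Module):
    def __init__(self, audio_dim: int = 512, llm_dim: int = 4096, num_prefix: int = 8):
        super().__init__()
        # Downsample variable-length audio features to a fixed number of
        # "audio tokens", then project them into the LLM embedding space.
        self.pool = nn.AdaptiveAvgPool1d(num_prefix)
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) from a frozen audio encoder
        pooled = self.pool(audio_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                 # (batch, num_prefix, llm_dim)

# Usage: prepend the projected audio tokens to the text token embeddings and
# feed the concatenated sequence to the (frozen or LoRA-adapted) LLM.
adapter = AudioToLLMAdapter()
audio_feats = torch.randn(2, 300, 512)           # stand-in for encoder output
text_embeds = torch.randn(2, 16, 4096)           # stand-in for LLM token embeddings
llm_inputs = torch.cat([adapter(audio_feats), text_embeds], dim=1)
print(llm_inputs.shape)                          # torch.Size([2, 24, 4096])
```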
2. Key Methodologies: Learning, Alignment, and Data Strategies
- Supervised Paired Learning: Most early and current systems rely on paired audio-text datasets to learn cross-modal correspondences; this is essential for text-to-audio (T2A) generation, S2ST, and TTS tasks. However, such datasets remain limited in size compared to image-text resources.
- Text-Only Supervision and Bridging Modalities: MATS demonstrates that training on text alone (using CLAP embeddings) can transfer audio comprehension to an LLM, provided careful alignment and representation transfer are performed at inference time. The SANTA algorithm fuses audio embeddings with semantically related noisy text representations to minimize the modality gap and support open-ended QA and captioning (Wang et al., 19 Feb 2025).
- Structured Event Extraction and Augmentation: Make-An-Audio 2 employs LLMs to extract and encode explicit <event, order> pairs from free-form captions, improving semantic and (especially) temporal precision in generated audio. The scarcity of temporally annotated audio is addressed through LLM-driven augmentation—synthesizing new event mixtures and paraphrased captions, thus boosting model robustness (Huang et al., 2023).
- Scripted Composition and Modular Orchestration: WavJourney parses high-level text instructions into structured “audio scripts” (JSON), which are compiled and executed like computer programs—each line invoking a domain-specific model (e.g., TTS, text-to-music, text-to-audio)—to support narrative, compositional audio generation with interpretable and editable intermediate formats (Liu et al., 2023); a simplified example of such a script appears after this list.
- Few-shot and In-Context Learning: By aligning audio tokens to LLM vocabulary (as in UniAudio 1.5) or through prompt-based workflows, LLMs can perform audio classification, enhancement, and TTS using only a few demonstration pairs, without additional parameter updates (Yang et al., 14 Jun 2024).
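As a concrete illustration of the script-based orchestration described above, the snippet below shows how a structured audio script might be parsed and dispatched to per-domain generators. The JSON fields and renderer functions are hypothetical stand-ins, not WavJourney's actual schema or models.

```python
# Illustrative interpreter for an "audio script": an LLM emits structured
# JSON, and a simple loop dispatches each entry to a domain-specific
# generator (TTS / text-to-music / text-to-audio), building a timeline.
import json

script_json = """
[
  {"type": "speech", "text": "Welcome to the rainforest.", "voice": "narrator"},
  {"type": "sound_effect", "description": "birds chirping, light rain", "length_sec": 6},
  {"type": "music", "description": "calm ambient pad", "length_sec": 10}
]
"""

def render_speech(entry):        # placeholder for a TTS model call
    return f"<speech voice={entry['voice']!r}: {entry['text']!r}>"

def render_sound_effect(entry):  # placeholder for a text-to-audio model call
    return f"<sfx {entry['description']!r}, {entry['length_sec']}s>"

def render_music(entry):         # placeholder for a text-to-music model call
    return f"<music {entry['description']!r}, {entry['length_sec']}s>"

DISPATCH = {
    "speech": render_speech,
    "sound_effect": render_sound_effect,
    "music": render_music,
}

timeline = [DISPATCH[entry["type"]](entry) for entry in json.loads(script_json)]
print("\n".join(timeline))
```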
3. Evaluation Metrics and Benchmarking
Evaluation depends on the task:
- Audio-Text Retrieval: Models are measured using Recall@K (e.g., R@1), mean average precision (mAP), and normalized discounted cumulative gain (nDCG), which reflect how well models align or retrieve items across modalities; a short sketch of the Recall@K computation follows this list. For example, models trained on AudioSetCaps achieve R@1 of 46.3% (text-to-audio) and 59.7% (audio-to-text) on AudioCaps (Bai et al., 28 Nov 2024).
- Audio Generation: Objective metrics include Fréchet Audio Distance (FAD), CLAP score (audio-text semantic similarity), Inception Score, and Kullback-Leibler divergence, complemented by Mean Opinion Score (MOS) for subjective human evaluation. Make-An-Audio 2 shows top performance across these metrics, with significant gains in both faithfulness to the prompt and audio quality (Huang et al., 2023).
- Speech and Audio Recognition: Word Error Rate (WER) is the standard metric for ASR and AVSR tasks. Llama-AVSR matches or surpasses previous state-of-the-art results with 0.81% WER (ASR) and 0.77% WER (AVSR), using only ~40M trainable parameters while keeping the modality encoders and the LLM frozen (Cappellazzo et al., 18 Sep 2024).
- Latent Alignment: The ALAS metric quantifies the cross-modal alignment of model representations at each transformer layer, enabling deeper diagnostic and model selection for spoken language understanding (Mousavi et al., 26 May 2025).
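For reference, the retrieval metric quoted most often above, Recall@K, can be computed as in the following sketch. The embeddings here are random stand-ins; a real evaluation would use paired audio and text embeddings produced by the model under test.

```python
# Self-contained sketch of Recall@K for text-to-audio retrieval.
import numpy as np

def recall_at_k(audio_emb: np.ndarray, text_emb: np.ndarray, k: int = 1) -> float:
    # Row i of each matrix is assumed to be a matched audio-text pair.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = t @ a.T                         # text-to-audio similarity matrix
    ranks = np.argsort(-sims, axis=1)      # best-matching audio per text query
    hits = (ranks[:, :k] == np.arange(len(t))[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(100, 512))
text_emb = audio_emb + 0.1 * rng.normal(size=(100, 512))   # toy "aligned" pairs
print(f"text-to-audio R@1: {recall_at_k(audio_emb, text_emb, k=1):.3f}")
```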
4. Safety, Robustness, and Test-Time Adaptivity
- Representation-Space Safety Alignment: Safety in LALMs is addressed through unsupervised fine-tuning strategies such as Reshaping Representation Space (RRS), which pushes latent representations of harmful queries toward a refusal cluster and those of benign queries away from it. RRS achieves significant safety gains (e.g., reducing attack success rate from 56.65% to 7.54% on Qwen-Audio) with only minimal increases in over-rejection and negligible impact on helpfulness (Yang et al., 26 May 2025).
- Test-Time Compute Algorithms: For tasks with high auditory cognitive load (e.g., listening recall in noise), test-time compute (TTC) methods such as chain-of-thought prompting, majority sampling, beam search with log-likelihood weighting, and LLM-based verifier scoring can yield improvements of 9–150% over baseline accuracy without retraining, and can bring weaker models closer to human performance or the GPT-4o baseline (Dang et al., 30 Mar 2025); a minimal sketch of majority sampling follows this list.
- Noise and Overlap Robustness: Hybrid autoregressive/non-autoregressive (AR-NAR) decoding, transcription prompt conditioning, and the use of pre-trained ASR experts as tokenizers help reduce hallucinations, repetitions, and insertion errors in noisy ASR, cutting character error rate (CER) by up to 12.2% and eliminating output repetitions (Li et al., 18 Aug 2024).
- Spatial Audio Reasoning: Models can now incorporate intensity vectors derived from multichannel first-order ambisonics (FOA) input, enabling 3D sound source localization (mean angular error as low as 2.70°), far-field ASR, and spatially guided speech extraction, supporting applications in robotics and AR/VR (Tang et al., 12 Jun 2024).
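Among the test-time compute strategies listed above, majority sampling is the simplest to illustrate: draw several stochastic answers and return the most frequent one. In the sketch below, `sample_answer` is a placeholder for an actual LALM call at non-zero temperature.

```python
# Minimal majority-sampling (self-consistency) sketch: sample several
# candidate answers and return the most frequent one.
from collections import Counter
import random

def sample_answer(question: str) -> str:
    # Placeholder for a stochastic model call (e.g., temperature sampling).
    return random.choice(["seven", "seven", "seven", "six", "eight"])

def majority_vote(question: str, n_samples: int = 9) -> str:
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(majority_vote("How many words did the second speaker say?"))
```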
5. Large-Scale Data Resources and Automation
The acquisition of rich, aligned data is essential for high-capacity audio-text models:
- Automated Data Generation Pipelines: The AudioSetCaps pipeline integrates LALMs (e.g., Qwen-Audio-Chat) for fine-grained content extraction, LLMs (Mistral 7B) for caption generation, and CLAP for semantic refinement, iteratively producing and filtering audio captions at scale (1.9M pairs for AudioSetCaps, extended to 6M+ pairs with YT-8M and VGGSound). Prompt chaining ensures comprehensive annotation of speech, emotion, and music attributes, and the public release of the data facilitates benchmarking and further research (Bai et al., 28 Nov 2024).
- Caption Quality and Coverage: Structured pipelines and CLAP-based filtering yield captions with high mean opinion scores and greater lexical richness—crucial for training and evaluating downstream retrieval and captioning models; a simplified filtering step is sketched after this list.
- LLM-Driven Description Expansion: For egocentric or cross-modal retrieval, LLMs generate audio-centric captions from visual class labels or visual-centric tags, bridging annotation modality gaps and markedly improving retrieval accuracy (Oncescu et al., 29 Feb 2024).
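At its core, the CLAP-based filtering stage mentioned above reduces to a similarity threshold over candidate audio-caption pairs. The sketch below shows that step with placeholder encoders; the `embed_audio`/`embed_text` functions and the threshold value are assumptions, not the AudioSetCaps pipeline's actual components.

```python
# Sketch of similarity-based caption filtering: keep a generated caption only
# if its text embedding is close enough to the corresponding audio embedding.
import numpy as np

def embed_audio(path: str) -> np.ndarray:    # placeholder for a CLAP audio encoder
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    return rng.normal(size=512)

def embed_text(caption: str) -> np.ndarray:  # placeholder for a CLAP text encoder
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    return rng.normal(size=512)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_captions(pairs, threshold: float = 0.3):
    kept = []
    for audio_path, caption in pairs:
        score = cosine(embed_audio(audio_path), embed_text(caption))
        if score >= threshold:               # illustrative cutoff, tuned in practice
            kept.append((audio_path, caption, score))
    return kept

candidates = [("clip_001.wav", "a dog barks while rain falls"),
              ("clip_002.wav", "upbeat jazz piano with brushed drums")]
print(filter_captions(candidates))
```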
6. Applications and Emerging Directions
Audio-Text LLMs enable a broad spectrum of applications, including:
- Speech and Language Processing: Automatic speech recognition, speech-to-speech and speech-to-text translation, multilingual and zero-shot settings, speaker identification, and naturalistic voice transfer.
- Audio Generation: Text-to-audio synthesis for creative or assistive scenarios, compositional authoring via script-based orchestration or event parsing, as seen in WavJourney and Make-An-Audio 2.
- Retrieval and Captioning: Audio-visual question answering, music information search, automatic captioning, and cross-modal recommendation.
- Human-Machine Interaction: Real-time, user-controllable audio generation for narrative content (WavJourney), voice-guided personalized AI assistants (AudioPaLM), and co-creative content workflows.
- Music Production: Zero-shot mapping of language to audio effects parameters (Text2Fx), making music production accessible and explainable for non-experts (Doh et al., 27 May 2025); a toy sketch of this idea follows this list.
- Pronunciation Assessment and Education: Multi-modal LLMs enable end-to-end, align-free evaluation of fluency and accuracy, leveraging prompt tokens and modality adapters for robust scoring (Fu et al., 12 Jul 2024).
- Safety-Critical and Social Applications: Safe deployment in open-ended conversation, robust action under adversarial or harmful inputs, and transparency/auditability via representation-space analysis.
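To make the zero-shot "language to effects parameters" idea more concrete, the toy sketch below optimizes a crude three-band gain so that the processed audio's embedding moves toward a text prompt's embedding in a shared space. The random linear "encoders" and the simplistic equalizer are stand-ins for a CLAP-style model and a differentiable effect, not Text2Fx's actual method or components.

```python
# Toy sketch: search differentiable effect parameters by gradient descent so
# that the processed audio's embedding aligns with a text prompt's embedding.
import torch

torch.manual_seed(0)
audio = torch.randn(16000)                     # 1 s of toy audio at 16 kHz
text_embedding = torch.randn(128)              # stand-in for a text embedding
audio_encoder = torch.nn.Linear(16000, 128)    # stand-in for an audio encoder

def toy_eq(x: torch.Tensor, gains: torch.Tensor) -> torch.Tensor:
    # Apply three learnable band gains to the spectrum: a crude stand-in for a
    # differentiable equalizer.
    spec = torch.fft.rfft(x)
    n_bins = spec.shape[0]
    gain_per_bin = gains.repeat_interleave(n_bins // 3 + 1)[:n_bins]
    return torch.fft.irfft(spec * gain_per_bin, n=x.shape[0])

gains = torch.ones(3, requires_grad=True)      # the "effect parameters" being searched
optimizer = torch.optim.Adam([gains], lr=0.05)
for _ in range(100):
    processed = toy_eq(audio, gains)
    embedding = audio_encoder(processed)
    # Move the processed audio's embedding toward the text prompt's embedding.
    loss = -torch.nn.functional.cosine_similarity(embedding, text_embedding, dim=0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("learned band gains:", [round(g, 3) for g in gains.detach().tolist()])
```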
7. Limitations, Metrics, and Open Problems
- Cross-Modal Alignment: Strong semantic mapping between modalities is critical. Metrics such as ALAS quantify alignment layer by layer and help distinguish task-specific requirements (semantic vs. non-semantic alignment); an illustrative layerwise diagnostic is sketched after this list.
- Data Scarcity and Scaling: While automated pipelines have alleviated audio-text data scarcity, representation heterogeneity remains a challenge for low-resource or semantic-specific tasks.
- Transfer and Ceiling Effects: Text-based LLMs with code intermediation can generate simple audio (e.g., musical notes), but performance declines sharply with increasing audio complexity—indicating the limits of textual world knowledge and the need for explicit multimodal pretraining (Anbazhagan et al., 4 May 2025).
- Safety-Usability Trade-off: Core challenges in safety alignment, including over-rejection and loss of helpfulness, have motivated new unsupervised latent-space calibration approaches such as RRS.
- Future Directions: Dynamic, task-aware alignment strategies, development of reward models for audio, extension to broader modalities (including vision and 3D), and further investigation of in-context, parameter-free multimodal reasoning are active areas of research.
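As an example of the kind of layerwise alignment diagnostic discussed above, the sketch below mean-pools audio-token and text-token hidden states at each layer and reports their cosine similarity. This is an illustrative proxy, not the exact ALAS formulation; the hidden states here are random stand-ins.

```python
# Illustrative layerwise alignment diagnostic: cosine similarity between
# mean-pooled audio and text hidden states at each transformer layer.
import numpy as np

def layerwise_alignment(audio_hidden, text_hidden):
    # audio_hidden / text_hidden: lists of (tokens, dim) arrays, one per layer.
    scores = []
    for a, t in zip(audio_hidden, text_hidden):
        a_mean, t_mean = a.mean(axis=0), t.mean(axis=0)
        cos = a_mean @ t_mean / (np.linalg.norm(a_mean) * np.linalg.norm(t_mean))
        scores.append(float(cos))
    return scores

rng = np.random.default_rng(0)
n_layers, dim = 12, 256
audio_hidden = [rng.normal(size=(50, dim)) for _ in range(n_layers)]   # stand-ins
text_hidden = [rng.normal(size=(12, dim)) for _ in range(n_layers)]    # stand-ins
print([round(s, 3) for s in layerwise_alignment(audio_hidden, text_hidden)])
```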
Audio-Text LLMs continue to extend the capabilities of multimodal intelligence, bridging advances in representation learning, dataset augmentation, safe deployment, and robust, user-aligned generation. Continuing research aims to further close modality gaps, enrich cross-domain alignment, and make large-scale audio-text AI systems accessible and reliable across domains.