UALM-Gen: Unified Text-to-Audio Model
- UALM-Gen is a text-to-audio model that unifies waveform generation with audio understanding through an autoregressive LLM architecture.
- It applies classifier-free guidance, direct preference optimization, and a VAE-based upsampler to enhance audio fidelity.
- Trained on 30 million text-audio pairs and integrated into a unified pipeline, UALM-Gen supports creative synthesis and advanced multimodal reasoning.
UALM-Gen is a text-to-audio LLM introduced within the Unified Audio LLM (UALM) framework (Tian et al., 13 Oct 2025). UALM-Gen aims to unify high-quality text-to-audio generation with audio understanding and multimodal reasoning, representing a significant advancement in the integration of generative and comprehension capabilities within a single LLM architecture.
1. Model Architecture
UALM-Gen is built on a decoder-only autoregressive transformer backbone, initialized from a pretrained text LLM such as Qwen2.5 (1.5B-parameter variant). Its architecture incorporates three key modules:
- Audio Input Processing: Raw 16kHz mono waveforms are transformed by a pretrained acoustic encoder (from prior ALM work) into continuous embeddings at 25 Hz. These are projected via a single-layer MLP adapter into the LLM’s hidden space.
- Audio Output Prediction: Text-to-audio generation proceeds by predicting discrete audio tokens, with the model’s vocabulary extended to include codec audio tokens. The X-codec audio codec discretizes audio via residual vector quantization (RVQ): each 50 Hz frame is represented by eight discrete tokens. UALM-Gen employs a “delay pattern” for generation, staggering intra-frame token prediction to balance sequence length constraints with audio fidelity (a minimal sketch of this staggering appears at the end of this section).
- Enhancement Module: A VAE-based upsampler post-processes the generated 16kHz mono waveform to 48kHz stereo, applying spatialization and restoration, further improving perceptual quality.
The newly added adapter and audio-embedding layers are randomly initialized and trained, while the pretrained LLM weights are initially kept frozen and later updated during full multimodal fine-tuning (see the modality alignment curriculum in Section 2). Integration across modalities is realized by augmenting the token vocabulary and embedding space to accommodate audio and text jointly.
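A minimal sketch of the delay pattern mentioned above, assuming the common one-step-per-codebook offset: each of the eight RVQ codebooks is shifted right by its index before the tokens are interleaved into the LLM sequence. The offsets, padding token, and helper names here are illustrative, not taken from the UALM-Gen implementation.

```python
import numpy as np

def apply_delay_pattern(codes: np.ndarray, pad_id: int) -> np.ndarray:
    """Stagger RVQ codebooks so codebook k is shifted right by k steps.

    codes: (num_codebooks, num_frames) integer token matrix from the codec.
    Returns a (num_codebooks, num_frames + num_codebooks - 1) matrix padded
    with `pad_id`, ready to be flattened frame-by-frame for the LLM.
    Offsets and padding conventions are illustrative assumptions.
    """
    q, t = codes.shape
    out = np.full((q, t + q - 1), pad_id, dtype=codes.dtype)
    for k in range(q):
        out[k, k:k + t] = codes[k]          # codebook k delayed by k steps
    return out

def undo_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Invert the staggering to recover the aligned (q, t) code matrix."""
    q, total = delayed.shape
    t = total - q + 1
    return np.stack([delayed[k, k:k + t] for k in range(q)])

# Toy example: 8 codebooks (as in X-codec at 50 Hz), 5 frames.
codes = np.arange(8 * 5).reshape(8, 5)
delayed = apply_delay_pattern(codes, pad_id=-1)
assert (undo_delay_pattern(delayed) == codes).all()
```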
2. Training Techniques
UALM-Gen relies on a set of training techniques that allow it to match the quality of state-of-the-art diffusion-based models for text-to-audio synthesis.
- Data Blending and Scale: UALM-Gen is trained on approximately 30 million text-audio pairs (totaling ~80k hours, an order of magnitude larger than typical diffusion-model training sets), complemented by text-only reasoning samples. Data blending balances modality coverage and stabilizes training.
- Classifier-Free Guidance (CFG): At inference, CFG interpolates conditional and unconditional audio token generation probabilities (a sampling sketch follows this list):

$$\log \tilde{p}(x_t \mid x_{<t}, c) = \log p(x_t \mid x_{<t}) + \gamma \left[ \log p(x_t \mid x_{<t}, c) - \log p(x_t \mid x_{<t}) \right]$$

Here $\gamma$ tunes the strength of guidance.
- Direct Preference Optimization (DPO): DPO finetunes the model via a reinforcement-style preference loss. Winning ($y_w$) and losing ($y_l$) generations per prompt are identified via audio quality metrics (CLAP, AudioBox-Aesthetic), optimizing (a loss sketch follows this list):

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $\beta$ controls the distance between the finetuned model $\pi_\theta$ and the reference model $\pi_{\mathrm{ref}}$.
- Modality Alignment Curriculum: Prior to full multimodal fine-tuning, the model undergoes a phase in which only its adapters and new audio embeddings are trained (transformer backbone frozen), mitigating convergence mismatch across modalities (a parameter-freezing sketch follows this list).
- Sampling and Self-Adaptation: Sequence packing and top-$k$ audio token sampling improve inference speed and output diversity.
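The CFG rule above acts directly on next-token logits at each decoding step. The following is a minimal sketch, assuming a Hugging Face-style causal LM that exposes `.logits`; the guidance scale of 3.0 and the top-$k$ cutoff are illustrative defaults, not the paper's settings.

```python
import torch

@torch.no_grad()
def cfg_next_token(model, cond_ids, uncond_ids, guidance_scale=3.0, top_k=50):
    """One step of classifier-free guided sampling over audio-token logits.

    Two forward passes: one with the text condition, one with the prompt
    replaced by an unconditional/empty context. Logits are combined as
        l = l_uncond + guidance_scale * (l_cond - l_uncond),
    then a top-k multinomial sample is drawn. The guidance scale and top-k
    cutoff are illustrative values, not UALM-Gen's reported settings.
    """
    l_cond = model(cond_ids).logits[:, -1, :]      # (batch, vocab)
    l_uncond = model(uncond_ids).logits[:, -1, :]
    logits = l_uncond + guidance_scale * (l_cond - l_uncond)

    # Top-k filtering followed by sampling for output diversity.
    topk_vals, topk_idx = logits.topk(top_k, dim=-1)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx.gather(-1, choice)             # sampled audio-token ids
```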
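The DPO objective reduces to a few lines once summed per-sequence log-probabilities are available. This is a sketch of the standard DPO loss as written above; `beta=0.1` is a common default rather than the value reported for UALM-Gen, and the preference pairs would come from the CLAP/AudioBox-Aesthetic ranking described in the bullet.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of whole
    generations (winning `w` or losing `l`) under the trainable policy or
    the frozen reference model. `beta` controls how far the policy may
    drift from the reference; 0.1 is a common default, not the paper's value.
    """
    policy_margin = policy_logp_w - policy_logp_l
    ref_margin = ref_logp_w - ref_logp_l
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```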
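The modality alignment curriculum amounts to a standard parameter-freezing setup before full fine-tuning. The submodule names below (`backbone`, `audio_adapter`, `audio_embeddings`) are hypothetical placeholders for the corresponding UALM components, assuming a PyTorch module layout.

```python
def freeze_backbone_for_alignment(model):
    """Modality-alignment phase: train only the audio adapter and the new
    audio token embeddings while the pretrained transformer stays frozen.

    Assumes a torch.nn.Module wrapper exposing `.backbone`, `.audio_adapter`,
    and `.audio_embeddings`; these attribute names are illustrative.
    Returns the list of trainable parameters for the optimizer.
    """
    for p in model.backbone.parameters():
        p.requires_grad = False
    for module in (model.audio_adapter, model.audio_embeddings):
        for p in module.parameters():
            p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]
```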
3. Performance Metrics
UALM-Gen is quantitatively and qualitatively evaluated alongside diffusion-based text-to-audio systems:
- Objective evaluation uses Fréchet Distance (FD; computed over OpenL3 embeddings; a computation sketch follows this list), KL divergence (PaSST classifier), Inception Score (IS; PANNs classifier), CLAP Score (CL; LAION-CLAP text-audio alignment), and AudioBox-Aesthetic Score (AES).
- Subjective Mean Opinion Score (MOS) ratings for overall quality (OVL) and prompt relevance (REL) are gathered from human listeners.
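For concreteness, the Fréchet Distance used in the objective evaluation is computed from the Gaussian statistics of two embedding sets. The sketch below assumes precomputed OpenL3 embeddings stored as NumPy arrays; it is a generic FD implementation, not the paper's evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet Distance between two sets of audio embeddings (e.g. OpenL3),
    modeling each set as a Gaussian:
        FD = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)).
    Inputs are (num_clips, dim) embedding matrices for generated and
    reference audio respectively.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    cov_sqrt, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(cov_sqrt):                 # drop numerical imaginary parts
        cov_sqrt = cov_sqrt.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))
```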
Results indicate that, once CFG, DPO finetuning, and the VAE enhancement module are applied, UALM-Gen performs competitively with, and sometimes surpasses, leading diffusion-based models. The model preserves strong text reasoning capability while achieving high fidelity and prompt relevance in audio synthesis.
4. Integration within the Unified Audio LLM (UALM) Framework
UALM-Gen is the generative module in the unified UALM pipeline, integrated as follows:
- Unified Model Backbone: The underlying transformer is sequentially trained for text, audio understanding, text-to-audio generation (UALM-Gen), and multimodal reasoning (UALM-Reason), facilitated by modality-aligned adapters and vocabulary.
- Interleaved Joint Training: Multimodal datasets enable UALM to process both audio and text for understanding and generation, maintaining balanced performance and cross-modal reasoning capabilities.
- UALM-Reason Extension: In later stages, reasoning over both text and audio (with cross-modal intermediate steps) is enabled, demonstrating UALM’s capacity for generative multimodal reasoning.
5. Mathematical Components
Key formulas central to UALM-Gen’s design are summarized below (a toy RVQ encoding sketch follows the table):

| Component | Formula/Notation | Role |
|---|---|---|
| Classifier-Free Guidance | $\log p(x_t \mid x_{<t}) + \gamma\,[\log p(x_t \mid x_{<t}, c) - \log p(x_t \mid x_{<t})]$ | Inference guidance for token generation |
| Direct Preference Opt. | $-\mathbb{E}\!\left[\log \sigma\!\left(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$ | RL-style refinement based on preferences |
| RVQ Tokenization | each 50 Hz frame $\mapsto (q_1, \dots, q_8)$, eight residual codebook indices | Audio frame-to-token encoding |
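The RVQ row can be made concrete with a toy encoder: each stage quantizes the residual left by the previous codebook, yielding the eight indices per 50 Hz frame described in Section 1. The codebooks below are random placeholders, not the trained X-codec codebooks.

```python
import numpy as np

def rvq_encode(frame: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Residual vector quantization of one latent frame.

    Each codebook (shape (codebook_size, dim)) quantizes the residual left
    by the previous stage, yielding one token index per stage. With eight
    codebooks this gives the eight tokens per frame described above.
    Codebook contents here are illustrative placeholders.
    """
    residual = frame.copy()
    tokens = []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)   # nearest code vector
        idx = int(dists.argmin())
        tokens.append(idx)
        residual = residual - cb[idx]                   # pass on the residual
    return tokens

# Toy usage: 8 random codebooks of 1024 codes over a 64-dim latent frame.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((1024, 64)) for _ in range(8)]
frame = rng.standard_normal(64)
print(rvq_encode(frame, codebooks))   # eight discrete tokens for this frame
```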
6. Applications and Implications
The UALM-Gen model enables several advanced audio AI capabilities:
- Creative Synthesis: Rapid generation of soundtracks, sound effects, and music from textual prompts for film, gaming, and virtual experiences.
- Integrated Multimedia Authoring: UALM can comprehend existing audio, generate new audio, and iteratively refine outputs, supporting workflows akin to artistic composition and review.
- Assistive Interactive Systems: Multimodal reasoning supports dialog-based, interactive refinement of audio synthesis, relevant for accessibility tools and creative co-pilots.
- Foundation for Multimodal Audio AI: Unification of audio understanding and generation paves the way for richer audio representations and more holistic evaluation protocols for generative models.
A plausible implication is the move toward unified, controllable, cross-modal AI systems capable of supporting complex tasks that require iterative synthesis and comprehension of multimodal data.
7. Significance and Future Directions
UALM-Gen demonstrates that autoregressive LLMs, with proper architectural augmentation (audio adapters, discrete token prediction), data scaling, and reinforcement-style preference optimization, can rival diffusion-based approaches in text-to-audio fidelity, with added benefits in reasoning and cross-modal synthesis. This work establishes a blueprint for unified multimodal audio generation and suggests future research directions into enhanced audio representation learning, evaluation metrics, and interactive generative models.