AudioLM: Token-Based Audio Generation
- AudioLM is a token-based language modeling framework that converts audio into semantic and acoustic tokens for coherent synthesis.
- It employs a two-stage hierarchical modeling process leveraging neural codecs and autoregressive transformers to capture both long-term structure and fine-grained details.
- Empirical evaluations demonstrate enhanced audio fidelity, speaker consistency, and subjective quality compared to baseline models.
AudioLM is a language modeling framework that treats audio generation as a sequence modeling task in the discrete representation space obtained from neural audio tokenizers. It extends the language modeling paradigm to the audio domain by leveraging discrete tokenization methods—specifically, neural codecs and self-supervised representation learning—to capture both high-level semantic and low-level acoustic information. AudioLM enables high-fidelity, long-range, and coherent audio synthesis—including speech, music, and general audio—without requiring text transcripts or symbolic representations. Its two-stage, hierarchical modeling approach has become a foundational blueprint for subsequent advances in codec-based audio generative models (Borsos et al., 2022, Wu et al., 2024).
1. Motivation and Conceptual Framework
AudioLM was introduced to address the challenge of synthesizing audio with long-term structure, semantic plausibility, and high acoustic fidelity. The core insight is that audio can be discretized into complementary token streams: semantic tokens from a self-supervised representation model, which capture high-level structure and content, and acoustic tokens from a state-of-the-art neural codec (such as SoundStream), which encode fine-grained acoustic details (prosody, speaker identity, timbre) (Borsos et al., 2022, Wu et al., 2024). By modeling these sequences with Transformer language models, AudioLM enables natural, prompt-conditioned continuation and generation without explicit text-level supervision.
The workflow, therefore, comprises two steps (a minimal code sketch follows the list):
- Tokenizing audio into semantic and acoustic code sequences.
- Autoregressive language modeling over these sequences with hierarchical conditional dependencies.
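As a minimal sketch, the two steps can be wired together as below. The vocabulary sizes and frame rates follow the description in the next section; the `tokenize` and `continue_tokens` stubs (random draws) are placeholders for the real tokenizers and Transformer stages, not AudioLM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Vocabulary sizes and frame rates as described in the next section:
# 1024-way semantic tokens at 25 Hz, 1024-way acoustic codes at 50 Hz, Q RVQ levels.
K_SEMANTIC, K_ACOUSTIC, Q = 1024, 1024, 12
SEM_HZ, AC_HZ = 25, 50

def tokenize(duration_s: float):
    """Step 1 (stub): map audio of the given duration to token sequences."""
    semantic = rng.integers(0, K_SEMANTIC, size=int(duration_s * SEM_HZ))      # (T_s,)
    acoustic = rng.integers(0, K_ACOUSTIC, size=(int(duration_s * AC_HZ), Q))  # (T_a, Q)
    return semantic, acoustic

def continue_tokens(prompt, n_new, vocab):
    """Step 2 (stub): autoregressive continuation; random draws stand in for the LM."""
    return np.concatenate([prompt, rng.integers(0, vocab, size=n_new)])

semantic, acoustic = tokenize(3.0)                                 # 3 s prompt
semantic_full = continue_tokens(semantic, 5 * SEM_HZ, K_SEMANTIC)  # 5 s continuation
print(semantic.shape, acoustic.shape, semantic_full.shape)         # (75,) (150, 12) (200,)
```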
2. Neural Audio Tokenization Pipeline
AudioLM employs a hybrid tokenization strategy based on two neural tokenizers:
- Semantic Tokens: Derived from a pre-trained w2v-BERT model, which provides embeddings at 25 Hz. After masked language modeling (MLM) pretraining on audio, the hidden vectors from an intermediate layer are quantized via K-means (typically 1024 clusters), yielding a sequence of token IDs that encode long-term content, structure, and semantics.
- Acoustic Tokens: Generated by a neural codec, specifically SoundStream, which encodes waveform features at 50 Hz. SoundStream employs a residual vector quantizer (RVQ) with Q parallel codebooks (e.g., Q = 12, each with 1024 codes), capturing pitch, timbre, noise, and higher-order acoustic properties.
The resulting representation, whose shapes are illustrated in the sketch below, consists of:
- A semantic token sequence $z = (z_1, \ldots, z_{T_S})$ with $z_t \in \{1, \ldots, K_S\}$ (e.g., $K_S = 1024$ K-means clusters at 25 Hz).
- An acoustic token matrix $Y = (y_t^q)$ with $y_t^q \in \{1, \ldots, K_A\}$, $t = 1, \ldots, T_A$, $q = 1, \ldots, Q$ (e.g., $K_A = 1024$, $Q = 12$, at 50 Hz).
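A minimal numpy sketch of how these two streams arise and what shapes they take, assuming random stand-ins for the w2v-BERT features and the K-means centroids (the real centroids are learned offline on large audio corpora):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for a 10 s clip: w2v-BERT-style features at 25 Hz
# (feature dimension assumed to be 1024) and K-means centroids learned offline.
T_S, D, K_S = 250, 1024, 1024
features = rng.standard_normal((T_S, D))
centroids = rng.standard_normal((K_S, D))

# Semantic tokens z_t: nearest-centroid assignment of each frame.
d2 = ((features ** 2).sum(1, keepdims=True)
      - 2.0 * features @ centroids.T
      + (centroids ** 2).sum(1))            # squared distances, shape (T_S, K_S)
semantic_tokens = d2.argmin(axis=1)          # z_t in {0, ..., K_S - 1}

# Acoustic tokens y_t^q: what an RVQ codec such as SoundStream emits at 50 Hz,
# with Q codebooks of K_A entries each (random placeholder values here).
T_A, Q, K_A = 500, 12, 1024
acoustic_tokens = rng.integers(0, K_A, size=(T_A, Q))

print(semantic_tokens.shape, acoustic_tokens.shape)  # (250,) (500, 12)
```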
SoundStream is trained with a multi-term loss comprising reconstruction, adversarial loss via HiFi-GAN style discriminators, and vector quantization commitment loss (Wu et al., 2024, Borsos et al., 2022).
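A hedged sketch of how such a multi-term loss could be composed; the loss weights and the exact reconstruction and adversarial formulations here are illustrative placeholders, not the published SoundStream recipe.

```python
import torch
import torch.nn.functional as F

def codec_loss(x, x_hat, disc_fake_logits, encoder_out, quantized,
               w_rec=1.0, w_adv=1.0, w_commit=0.25):
    """Illustrative composition of the three loss terms; weights are placeholders."""
    # Reconstruction: simple L1 on the waveform (the real codec also uses
    # multi-scale spectral reconstruction terms).
    rec = F.l1_loss(x_hat, x)
    # Adversarial term (generator side), least-squares style as used with
    # HiFi-GAN-like discriminators.
    adv = ((1.0 - disc_fake_logits) ** 2).mean()
    # Vector-quantization commitment: pull encoder outputs toward their
    # (detached) quantized counterparts.
    commit = F.mse_loss(encoder_out, quantized.detach())
    return w_rec * rec + w_adv * adv + w_commit * commit

# Toy tensors standing in for one training batch.
x = torch.randn(2, 16000)
x_hat = x + 0.01 * torch.randn_like(x)
disc_logits = torch.randn(2, 10)
enc = torch.randn(2, 50, 128)
quant = enc + 0.1 * torch.randn_like(enc)
print(codec_loss(x, x_hat, disc_logits, enc, quant).item())
```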
3. Hierarchical Language Modeling Architecture
AudioLM's generative process factors the modeling of the full audio token sequence into a hierarchy corresponding to semantic and acoustic stages:

$$p(z, Y) = p(z)\, p(Y \mid z),$$

where $p(z)$ is modeled autoregressively by the semantic stage and $p(Y \mid z)$ by the acoustic stage(s), each factorized left-to-right over its tokens.
Each stage employs a decoder-only Transformer with standard language modeling objectives (autoregressive cross-entropy), embedding lookup tables for codebook indices, and relative positional encodings (Borsos et al., 2022, Wu et al., 2024).
The architecture typically comprises 12–24 Transformer layers (hidden size 512–1024, 8–16 heads) per stage. Semantic and acoustic token streams can be aligned in time using appropriate upsampling/duplication schedules to synchronize frame rates.
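For example, with the 25 Hz semantic and 50 Hz acoustic rates above, one simple alignment is to duplicate each semantic token twice so both streams share a 50 Hz time grid:

```python
import numpy as np

# Semantic tokens at 25 Hz, acoustic frames at 50 Hz: duplicate each semantic
# token twice so both streams are defined on the same time grid.
semantic_25hz = np.array([7, 7, 13, 42, 42])
semantic_50hz = np.repeat(semantic_25hz, 2)
print(semantic_50hz)  # [ 7  7  7  7 13 13 42 42 42 42]
```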
A hybrid modeling scheme further subdivides acoustic generation into coarse and fine quantizer levels; an illustrative packing of the three stages is sketched after the list:
- Stage 1: Semantic token modeling (long-term structure)
- Stage 2: Coarse acoustic token modeling (first Q' quantizers, speaker identity/prosody)
- Stage 3: Fine acoustic token modeling (remaining quantizers, highest fidelity)
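One illustrative way to pack these stages into flat token sequences for decoder-only modeling is sketched below. The coarse/fine split (`Q_COARSE = 4`) and the vocabulary-offset scheme are assumptions made for the example, not necessarily the paper's exact layout; what it shows is the prefix-conditioning pattern: each stage consumes the previous stage's tokens as a prefix.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 1024              # codebook size shared by semantic and acoustic tokens here
Q, Q_COARSE = 12, 4   # Q_COARSE = 4 is an assumed split of the "first Q'" quantizers

T_S, T_A = 75, 150    # 3 s of audio at 25 Hz (semantic) and 50 Hz (acoustic)
semantic = rng.integers(0, K, size=T_S)
acoustic = rng.integers(0, K, size=(T_A, Q))       # (T_a, Q) from the RVQ codec

def flatten(tokens_2d, id_offset):
    """Flatten RVQ codes frame by frame, shifting each codebook into a disjoint
    ID range so a single embedding table can host every token type."""
    t, q = tokens_2d.shape
    return (tokens_2d + id_offset + K * np.arange(q)).reshape(t * q)

coarse = flatten(acoustic[:, :Q_COARSE], id_offset=K)                 # after semantic IDs
fine = flatten(acoustic[:, Q_COARSE:], id_offset=K * (1 + Q_COARSE))

# Stage 2 consumes the semantic tokens as a prefix and predicts coarse tokens;
# Stage 3 consumes the coarse tokens as a prefix and predicts fine tokens.
stage2_sequence = np.concatenate([semantic, coarse])
stage3_sequence = np.concatenate([coarse, fine])
print(stage2_sequence.shape, stage3_sequence.shape)  # (675,) (1800,)
```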
4. Training and Inference Procedures
Training Objective:
Both semantic and acoustic stages are trained independently using next-token prediction (autoregressive cross-entropy). No reconstruction or auxiliary VQ loss is applied during LM training; only the codec parameters are trained on raw waveform data (Borsos et al., 2022, Wu et al., 2024).
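The shared objective is plain next-token cross-entropy; a minimal sketch follows, with random logits standing in for the output of a decoder-only Transformer.

```python
import torch
import torch.nn.functional as F

# Next-token prediction shared by all stages: predict the token at position t+1
# from positions <= t. Random logits stand in for a decoder-only Transformer.
B, T, V = 4, 256, 1024 * 13        # e.g. one semantic + 12 acoustic codebooks
tokens = torch.randint(0, V, (B, T))
logits = torch.randn(B, T, V)      # in a real setup: logits = model(tokens)

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, V),  # predictions for positions 1 .. T-1
    tokens[:, 1:].reshape(-1),      # targets shifted by one
)
print(loss.item())
```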
Dataset:
Large-scale corpora such as Libri-Light (60k hours; the unlab-60k split) for speech, and tens of thousands of hours of music or general audio.
Generation/Inference Workflow:
- Extract prompt tokens from a short audio context (seed).
- Autoregressively generate new semantic tokens as continuation.
- Conditional on the full semantic sequence, generate new acoustic tokens.
- Decode the merged token stream via the SoundStream decoder to obtain a waveform.
Temperature-based sampling, top-k/nucleus strategies, and removal of consecutive duplicate tokens in the semantic stream are routinely used to balance diversity and coherence. The inference pipeline is inherently sequential due to the hierarchical autoregressive structure, which imposes latency constraints for long sequences (Borsos et al., 2022).
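A minimal sketch of such a sampling loop: the random-logits `lm` placeholder stands in for the semantic-stage Transformer, and `sample_next` and `dedup_consecutive` are hypothetical helpers for the strategies described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next(logits, temperature=0.8, top_k=64):
    """Temperature + top-k sampling over a single logits vector."""
    logits = logits / temperature
    keep = np.argsort(logits)[-top_k:]                 # indices of the top-k logits
    probs = np.exp(logits[keep] - logits[keep].max())  # stable softmax over kept logits
    probs /= probs.sum()
    return int(rng.choice(keep, p=probs))

def dedup_consecutive(tokens):
    """Collapse runs of repeated semantic tokens (e.g. [7, 7, 13] -> [7, 13])."""
    tokens = np.asarray(tokens)
    keep = np.ones(len(tokens), dtype=bool)
    keep[1:] = tokens[1:] != tokens[:-1]
    return tokens[keep]

# Toy autoregressive continuation: `lm` is a placeholder returning random logits.
K_S = 1024
def lm(context):
    return rng.standard_normal(K_S)

prompt = list(rng.integers(0, K_S, size=75))   # 3 s of semantic tokens at 25 Hz
generated = list(prompt)
for _ in range(125):                           # continue for 5 s at 25 Hz
    generated.append(sample_next(lm(generated)))
print(len(dedup_consecutive(generated)))
```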
5. Evaluation and Empirical Results
AudioLM demonstrates strong empirical performance across a wide spectrum of metrics and task settings:
- Speech Continuation and TTS:
- Fréchet Audio Distance (FAD) scores 30–50% lower (better) than one-stage LM baselines.
- Subjective human evaluation: listeners could not reliably distinguish generated continuations from real speech (51.2% classification success, close to chance).
- Word Error Rate (WER) via ASR for acoustic generation: CER = 3.4%, WER = 6.0% (comparable or superior to prior approaches such as GSLM).
- Speaker preservation: continuation and prompt match accuracy 92.6%.
- Zero-shot linguistic probes (sWUGGY, sBLIMP): AudioLM with causal scoring achieves sWUGGY all/in-vocab = 71.5%/83.7% and sBLIMP = 64.7% (Borsos et al., 2022).
- Music and Non-Speech Audio:
- Trained on large piano datasets, AudioLM achieves 83.3% subjective preference over acoustic-only baselines for piano continuations.
- Versatility:
AudioLM generalizes beyond speech to non-verbal and musical audio domains, contingent on the semantic tokenization pipeline (Borsos et al., 2022, Wu et al., 2024).
A table summarizing key metrics:
| Metric | AudioLM | GSLM | Baseline |
|---|---|---|---|
| FAD (↓) | 30–50% lower | — | — |
| Human real/synthetic discrimination | 51.2% (≈ chance) | — | — |
| WER (Speech cont.) | 6.0% | 6.6% | — |
| Speaker Pres. (%) | 92.6 | — | Random |
| sWUGGY All (%) | 71.5 | < prior | < prior |
| A/B Pref (Piano) | 83.3% | — | — |
6. Limitations and Extensions
AudioLM's main limitations include:
- Quadratic attention cost in Transformer architectures, limiting feasible sequence duration (approx. 10–15 s context).
- Requirement for pre-trained tokenizers (w2v-BERT, SoundStream) that remain frozen during language-model training, precluding joint codec–LM optimization.
- Sensitivity of the semantic tokenization to the audio domain; generalization to environmental or highly polyphonic sound remains challenging.
- Multi-stage inference incurs computational cost and increased latency relative to single-pass models (see Llama-Mimi for alternatives (Sugiura et al., 18 Sep 2025)).
Security and ethical concerns involve potential spoofing or speaker impersonation, though detector classifiers can identify generated samples with ≈98.6% accuracy (Borsos et al., 2022).
Open research directions:
- End-to-end joint training of codec and LLM.
- Scaling to universal (speech, music, sound effect) audio generative models.
- Multi-lingual, cross-domain transfer, and fine-grained prosody modeling.
- Efficient, parallel, or streaming-capable model architectures (SoundStorm, Llama-Mimi as exemplars) (Borsos et al., 2023, Sugiura et al., 18 Sep 2025).
7. Influence and Successor Approaches
AudioLM's two-stage, codec-conditioned Transformer paradigm has catalyzed a range of follow-on works, including VALL-E, AudioPaLM, MusicLM, Llama-Mimi, and SoundStorm. All share the principle of operating on neural codec tokens, often introducing refinements such as non-autoregressive or parallel generation, interleaved token streams, or integration with text-based LLMs (Sugiura et al., 18 Sep 2025, Borsos et al., 2023, Wu et al., 2024). Comparative results indicate that while AudioLM achieves competitive or superior quality on core metrics, newer models obtain dramatic speedups (e.g., SoundStorm being 100× faster), improved acoustic consistency, or end-to-end text+audio modeling in a single stack.
A plausible implication is that the codec-token-based modeling paradigm will remain dominant, with innovations focusing on the balance of modeling capacity, computational efficiency, and alignment of semantic and acoustic representations across domains.