AudioLM: A Language Modeling Approach to Audio Generation
The paper introduces AudioLM, a framework for audio generation that combines high-fidelity output with the long-term coherence of language models. By casting audio generation as a language modeling task over discrete tokens, AudioLM melds semantic and acoustic tokenization to address the multi-scale abstraction challenges inherent in audio synthesis. This approach yields audio that is both acoustically rich and structurally coherent, and it extends to both speech and music synthesis.
Technical Insights
Hybrid Tokenization Scheme
AudioLM leverages a dual tokenization scheme to reconcile the trade-off between reconstruction quality and long-term structure (a toy sketch follows the list):
- Semantic Tokens: Derived from a self-supervised masked language model trained on audio (w2v-BERT), these tokens capture long-term dependencies but offer poor reconstruction quality.
- Acoustic Tokens: Produced by a SoundStream neural codec, these tokens afford high-quality synthesis by capturing fine acoustic details, albeit at the cost of long-term coherence.
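The sketch below illustrates the two token streams. In AudioLM, semantic tokens come from k-means quantization of w2v-BERT embeddings and acoustic tokens from SoundStream's residual vector quantizer; the random projections used here are purely illustrative stand-ins so the script runs without either model.

```python
# Minimal sketch of AudioLM's dual tokenization with stand-in encoders.
# The input waveform is ignored by these placeholders; only the shapes and
# the coarse-vs-fine rate difference between the two streams are meaningful.
import numpy as np

rng = np.random.default_rng(0)

def semantic_tokens(audio, n_frames=50, vocab=1024):
    """Low-rate tokens: one discrete id per ~20 ms frame (stand-in for
    w2v-BERT embeddings assigned to their nearest k-means centroid)."""
    feats = rng.normal(size=(n_frames, 16))      # fake frame embeddings
    centroids = rng.normal(size=(vocab, 16))     # fake k-means codebook
    dists = ((feats[:, None, :] - centroids[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)                  # shape (n_frames,)

def acoustic_tokens(audio, n_frames=150, n_quantizers=8, vocab=1024):
    """High-rate tokens: a stack of residual codebook ids per frame
    (stand-in for SoundStream's residual vector quantizer, where early
    codebooks carry coarse acoustics and later ones carry fine detail)."""
    return rng.integers(0, vocab, size=(n_frames, n_quantizers))

audio = np.zeros(24_000)            # 1 s placeholder waveform at 24 kHz
sem = semantic_tokens(audio)        # long-term structure, poor reconstruction
acou = acoustic_tokens(audio)       # high fidelity, weak long-term structure
print(sem.shape, acou.shape)        # (50,) and (150, 8)
```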
Hierarchical Model Architecture
The framework deploys a three-stage hierarchy of decoder-only Transformers for autoregressive prediction (a runnable sketch of the control flow follows the list):
- Semantic Modeling: Autoregressively predicts semantic tokens, capturing global structure such as linguistic content and syntax.
- Coarse Acoustic Modeling: Predicts the coarse SoundStream quantizer levels conditioned on the semantic tokens, establishing high-level acoustics such as speaker identity and recording conditions.
- Fine Acoustic Modeling: Predicts the remaining fine quantizer levels conditioned on the coarse tokens, recovering the detail needed for high-quality reconstruction.
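The sketch below shows only the chaining of the three stages; `sample_lm` is a stand-in for the stage-specific Transformers and emits random token ids, so none of the actual model code is reproduced here.

```python
# Minimal sketch of AudioLM's three-stage generation pipeline.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 1024

def sample_lm(prefix, n_new):
    """Stand-in autoregressive sampler: appends n_new random token ids."""
    return np.concatenate([prefix, rng.integers(0, VOCAB, size=n_new)])

def generate(semantic_prompt, coarse_prompt, n_new=100, n_coarse_q=4, n_fine_q=4):
    # Stage 1: semantic modeling -- continue the semantic token stream,
    # which fixes long-term structure (content, syntax, melody).
    semantic = sample_lm(semantic_prompt, n_new)

    # Stage 2: coarse acoustic modeling -- predict the first SoundStream
    # quantizer levels conditioned on the semantic sequence and on the
    # acoustic prompt (which pins down speaker identity / conditions).
    coarse_ctx = np.concatenate([semantic, coarse_prompt])
    coarse = sample_lm(coarse_ctx, n_new * n_coarse_q)[len(coarse_ctx):]

    # Stage 3: fine acoustic modeling -- predict the remaining quantizer
    # levels conditioned on the coarse tokens, adding fine detail.
    fine = sample_lm(coarse, n_new * n_fine_q)[len(coarse):]

    # coarse + fine tokens would then be decoded to a waveform by the
    # SoundStream decoder.
    return semantic, coarse, fine

sem, coarse, fine = generate(rng.integers(0, VOCAB, 25), rng.integers(0, VOCAB, 30))
print(sem.shape, coarse.shape, fine.shape)
```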
Empirical Evaluation
Through extensive experimentation on speech and piano music datasets, AudioLM demonstrates remarkable capabilities:
- Speech Generation: The model generates syntactically and semantically coherent continuations of speech while preserving the speaker identity and prosody of the prompt across varying speakers and recording conditions. Quantitative evaluations that transcribe the continuations with an ASR system show low word and character error rates (WER, CER), reflecting high semantic fidelity; a toy error-rate computation is sketched after this list.
- Linguistic Probing: AudioLM surpasses previous state-of-the-art models on the sWUGGY and sBLIMP benchmarks, indicating strong lexical and syntactic competence without text supervision.
- Piano Continuation: The model's efficacy extends to music generation, where it produces musically coherent piano continuations, indicating its adaptability to non-speech audio domains.
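The WER and CER cited above are standard edit-distance metrics over ASR transcripts. The toy implementation below (pure Python, no dependencies) shows how they are computed on a reference/hypothesis pair; it is not the paper's evaluation code.

```python
# Word and character error rates via Levenshtein distance.
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    ref = list(reference)
    return edit_distance(ref, list(hypothesis)) / len(ref)

ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumps over a lazy dog"
print(f"WER={wer(ref, hyp):.3f}  CER={cer(ref, hyp):.3f}")
```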
Implications and Future Directions
The findings pave the way for advances in audio applications such as multilingual speech synthesis, polyphonic music generation, and audio event modeling. The clean separation of semantic and acoustic representations also lends itself to encoder-decoder extensions for tasks such as text-to-speech and cross-lingual speech translation.
However, the ability to generate high-quality, structurally plausible audio introduces risks, particularly the misuse of synthetic speech for impersonation or other malicious purposes. The paper addresses this by training a classifier that distinguishes AudioLM-generated speech from real speech with high accuracy, contributing to responsible deployment.
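The paper does not spell out the detector here; the following is a hypothetical sketch of the general recipe, training a binary real-vs-synthetic classifier on pooled audio features with scikit-learn and placeholder data.

```python
# Hypothetical real-vs-synthetic speech detector (not the paper's model).
# Random feature vectors stand in for pooled audio embeddings so that the
# script runs end to end.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder features: 500 real clips and 500 synthetic clips, each
# summarized by a 64-dim embedding (e.g. mean-pooled codec features).
real = rng.normal(loc=0.0, size=(500, 64))
synthetic = rng.normal(loc=0.3, size=(500, 64))   # slight shift = detectable cue
X = np.vstack([real, synthetic])
y = np.array([0] * 500 + [1] * 500)               # 0 = real, 1 = synthetic

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("detection accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```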
Conclusion
AudioLM represents a significant stride in audio generation technology, effectively bridging the gap between quality and coherence. By integrating semantic understanding with high-fidelity acoustic modeling, it offers a versatile tool for generating diverse audio content, while simultaneously addressing ethical challenges through detection safeguards. As such, it sets a new benchmark in the synthesis of natural and structured audio across various applications.