MusicLM: Generative Music from Text
- MusicLM is a text-conditional music generation model that uses a hierarchical autoregressive approach to produce minute-scale, high-fidelity audio.
- It employs multi-level tokenization—including semantic, coarse, and fine acoustic tokens—conditioned on text, melody, and brain activity for versatile music synthesis.
- Evaluation using metrics like FAD and MCC demonstrates MusicLM’s superior performance, though its heavy sampling cost poses challenges for real-time applications.
MusicLM is a state-of-the-art generative audio LLM for high-fidelity music generation conditioned on rich text descriptions. Developed by Google Research, MusicLM implements a hierarchical sequence-to-sequence autoregressive architecture over discrete audio representations, enabling consistent minute-scale music generation at 24 kHz and supporting multi-modal conditioning, including melody and, via surrogate embeddings, even brain activity.
1. Hierarchical Architecture and Tokenization
MusicLM formalizes text-conditional music generation as hierarchical autoregressive modeling over quantized token sequences at three abstraction levels: semantic, coarse acoustic, and fine acoustic. Each hierarchy is modeled by a dedicated decoder-only Transformer, which captures progressively finer-grained temporal and acoustic dependencies (Agostinelli et al., 2023, Lam et al., 2023).
- Semantic Tokens: Derived by k-means quantization (1024 centroids) of 25 Hz representations from a 600M-parameter w2v-BERT model, targeting high-level musical structure.
- Coarse Acoustic Tokens: The first residual vector quantization (RVQ) levels obtained by passing 24 kHz audio through SoundStream (12 codebooks × 1024 entries, 50 Hz frame rate), capturing intermediate-fidelity waveform detail.
- Fine Acoustic Tokens: The remaining RVQ levels at the same 50 Hz frame rate, which refine the coarse reconstruction and restore full fidelity (a tokenization sketch follows at the end of this section).
A MuLan joint audio-text embedding model (128-dim, 10 s windows) encodes text prompts into discrete tokens via RVQ, serving as conditioning information at generation time.
Hierarchical generation chain: $p(S, C, F \mid M_T) = p(S \mid M_T)\, p(C \mid S, M_T)\, p(F \mid C, S, M_T)$, where $S$, $C$, $F$ denote the semantic, coarse acoustic, and fine acoustic token sequences, each stage is autoregressive, and each stage is conditioned on the outputs of previous stages and the MuLan tokens $M_T$.
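To make the tokenization above concrete, the following minimal sketch shows nearest-centroid assignment for the semantic level and residual vector quantization for the acoustic levels, using random placeholder embeddings and codebooks. Array sizes mirror the rates quoted above, but the names, embedding dimensions, and the 4/8 coarse/fine split are illustrative assumptions, not MusicLM's actual implementation.

```python
import numpy as np

def kmeans_assign(frames, centroids):
    # frames: (T, D), centroids: (K, D) -> nearest-centroid token ids, shape (T,)
    d2 = (frames ** 2).sum(1, keepdims=True) - 2.0 * frames @ centroids.T + (centroids ** 2).sum(1)
    return d2.argmin(axis=1)

def rvq_encode(frames, codebooks):
    # frames: (T, D), codebooks: (Q, K, D) -> token ids of shape (T, Q)
    residual = frames.copy()
    tokens = []
    for codebook in codebooks:               # one codebook per RVQ level
        ids = kmeans_assign(residual, codebook)
        tokens.append(ids)
        residual = residual - codebook[ids]  # each level quantizes what previous levels missed
    return np.stack(tokens, axis=1)

# Illustrative shapes for a 10 s clip: 25 Hz semantic frames, 50 Hz SoundStream frames.
rng = np.random.default_rng(0)
w2v_frames = rng.normal(size=(250, 1024))              # placeholder w2v-BERT embeddings
semantic_centroids = rng.normal(size=(1024, 1024))     # 1024 k-means centroids
soundstream_frames = rng.normal(size=(500, 128))       # placeholder SoundStream embeddings
acoustic_codebooks = rng.normal(size=(12, 1024, 128))  # 12 RVQ levels x 1024 entries each

semantic_tokens = kmeans_assign(w2v_frames, semantic_centroids)       # shape (250,)
acoustic_tokens = rvq_encode(soundstream_frames, acoustic_codebooks)  # shape (500, 12)
coarse, fine = acoustic_tokens[:, :4], acoustic_tokens[:, 4:]         # AudioLM-style split (assumption)
```

The key property is that each RVQ level quantizes the residual left by the previous levels, which is what makes the coarse/fine factorization of the acoustic stream possible.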
2. Training Objectives and Loss Functions
Each stage in MusicLM’s hierarchy is trained via cross-entropy with teacher forcing to maximize the conditional log-likelihood of the respective token sequences. Additional upstream encoder modules employ their own objectives:
- MuLan: A contrastive loss aligns paired audio and text embeddings, $\mathcal{L}_{\text{MuLan}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(a_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(a_i, t_j)/\tau)}$, with $\mathrm{sim}(\cdot,\cdot)$ as cosine similarity and $\tau$ a temperature (a generic sketch follows this list).
- SoundStream and w2v-BERT: Use reconstruction and masked language modeling losses, respectively.
- Melody Conditioning: A small ViT-based encoder, trained using a semi-hard triplet loss, allows the model to support text-plus-melody synthesis.
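As a concrete reference for the contrastive objective above, here is a minimal, generic implementation of a symmetric InfoNCE-style loss over cosine similarities of paired audio/text embeddings; the batch construction and temperature value are illustrative assumptions rather than MuLan's exact training recipe.

```python
import numpy as np
from scipy.special import logsumexp

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of each batch is a matched audio/text pair."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)   # unit-normalize
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature                      # (N, N) scaled cosine similarities
    diag = np.arange(len(a))                            # matched pairs sit on the diagonal
    log_p_a2t = logits - logsumexp(logits, axis=1, keepdims=True)      # audio -> text
    log_p_t2a = logits.T - logsumexp(logits.T, axis=1, keepdims=True)  # text -> audio
    return -0.5 * (log_p_a2t[diag, diag].mean() + log_p_t2a[diag, diag].mean())
```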
Probabilistic Formulation: training maximizes the conditional log-likelihood of each token sequence, $\mathcal{L} = \mathbb{E}\big[\log p_\theta(S \mid M_T) + \log p_\theta(C \mid S, M_T) + \log p_\theta(F \mid C, S, M_T)\big]$, with each term factorized autoregressively over token positions.
3. Generation and Conditioning Modalities
At inference, MusicLM accepts text prompts encoded into MuLan tokens. Optionally, it can be conditioned on melody tokens extracted from whistled, hummed, or played melodies. The generation proceeds by rendering semantic tokens (temperature=1.0), followed by coarse acoustic (temperature=0.95) and then fine acoustic tokens (temperature=0.4), each sampled autoregressively. Synthesized tokens are decoded into the waveform using SoundStream (Agostinelli et al., 2023).
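The per-stage temperatures above trade diversity against fidelity. A minimal sketch of temperature-scaled categorical sampling from a model's next-token logits is given below; the logits are random placeholders and the function is generic, not MusicLM's sampler.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng=None):
    """Sample one token id from temperature-scaled logits (lower T -> more deterministic)."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Per-stage temperatures as described above, applied to placeholder logits.
logits = np.random.default_rng(0).normal(size=1024)
semantic_id = sample_with_temperature(logits, temperature=1.0)
coarse_id = sample_with_temperature(logits, temperature=0.95)
fine_id = sample_with_temperature(logits, temperature=0.4)
```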
Long-duration generation employs sliding-window sampling with context concatenation, and "story mode" lets the text prompt change at fixed intervals while keeping the music coherent across transitions (a sketch follows the pseudocode below).
Pseudocode Outline:
```python
# Pseudocode sketch of the generation pipeline (function names are illustrative).
M_T = MuLan_text_RVQ(text)                               # discrete MuLan tokens from the prompt
L = Melody_RVQ(melody_audio) if melody_audio else None   # optional melody conditioning tokens
S = sample_semantic_tokens(M_T, L)                       # stage 1: semantic tokens
C = sample_coarse_tokens(S, M_T, L)                      # stage 2: coarse acoustic tokens
F = sample_fine_tokens(C, S, M_T, L)                     # stage 3: fine acoustic tokens
x = SoundStream_decoder(C, F)                            # decode acoustic tokens to 24 kHz audio
```
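Long-form and story-mode generation can be viewed as repeated continuation with a (possibly changing) MuLan prompt and a trailing token context. The sketch below extends the pseudocode above in the same style; the helper signatures, overlap length, and per-second token counts are assumptions for illustration, not MusicLM's published interface.

```python
# Sliding-window "story mode" sketch (hypothetical helpers, same style as above).
SEMANTIC_PER_SEC = 25      # w2v-BERT frame rate
COARSE_PER_SEC = 50 * 4    # assumption: 50 Hz SoundStream frames x 4 coarse RVQ levels

def generate_story(prompts, segment_seconds=15, overlap_seconds=5):
    semantic, coarse, fine = [], [], []
    for text in prompts:                                  # the prompt may change per segment
        M_T = MuLan_text_RVQ(text)
        S = sample_semantic_tokens(M_T, context=semantic[-overlap_seconds * SEMANTIC_PER_SEC:],
                                   seconds=segment_seconds)
        C = sample_coarse_tokens(S, M_T, context=coarse[-overlap_seconds * COARSE_PER_SEC:])
        F = sample_fine_tokens(C, S, M_T)
        semantic += S; coarse += C; fine += F             # concatenate token streams across windows
    return SoundStream_decoder(coarse, fine)              # decode the full token stream
```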
4. Evaluation, Metrics, and Memorization
MusicLM is evaluated on the MusicCaps dataset (5.5k music-text pairs, expert-annotated) using both objective and subjective criteria:
- Objective Metrics:
- Fréchet Audio Distance (FAD) on Trill and VGGish embeddings (lower is better).
- KL divergence on AudioSet classifier outputs.
- MuLan Cycle Consistency (MCC): cosine similarity between the MuLan embedding of the generated audio and that of the text prompt (both FAD and MCC are sketched at the end of this section).
- Human Preference: Blind A/B comparison for musicality, audio quality, and text adherence.
| Model | FAD (Trill) ↓ | FAD (VGGish) ↓ | KLD ↓ | MCC ↑ | Human Wins (600) |
|---|---|---|---|---|---|
| Riffusion | 0.76 | 13.4 | 1.19 | 0.34 | 158 |
| Mubert | 0.45 | 9.6 | 1.58 | 0.32 | 97 |
| MusicLM | 0.44 | 4.0 | 1.01 | 0.51 | 312 |
| Ground truth | — | — | — | — | 472 |
Ablation demonstrates that excluding the semantic stage reduces long-range musical structure. Memorization tests show <0.2% verbatim replay against the training set for 10s prompts, with <1% approximate matches, indicating minimal memorization.
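For concreteness, the two main objective metrics can be computed from precomputed embeddings as follows. This is a generic sketch of the standard Fréchet-distance and cosine-similarity formulas, assuming the embeddings have already been extracted with VGGish/Trill (for FAD) and MuLan (for MCC); it is not the evaluation code used in the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(ref_emb, gen_emb):
    """Fréchet distance between Gaussians fit to reference and generated embedding sets."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g).real                  # matrix square root of the product
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2.0 * covmean))

def mulan_cycle_consistency(text_emb, audio_emb):
    """MCC: mean cosine similarity between MuLan text and generated-audio embeddings."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    return float((t * a).sum(axis=1).mean())
```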
5. Computational Efficiency and Successors
A limitation of MusicLM is its heavy sampling cost, as all three stages are fully autoregressive, requiring ∼625 forward passes per second of audio (e.g., 6,250 passes for a 10s clip) (Lam et al., 2023). This computational burden has motivated research into more efficient paradigms.
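The figure of roughly 625 passes per second is consistent with the token rates quoted earlier, under the assumption of an AudioLM-style split of the 12 RVQ levels into 4 coarse and 8 fine levels (the exact split is an assumption here):

```latex
\underbrace{25}_{\text{semantic}}
+ \underbrace{4 \times 50}_{\text{coarse acoustic}}
+ \underbrace{8 \times 50}_{\text{fine acoustic}}
= 625 \ \text{tokens (autoregressive steps) per second of audio}
```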
MeLoDy is a successor model that retains MusicLM’s semantic LM but replaces both acoustic LMs with a non-autoregressive, conditioned dual-path diffusion process in a VAE-GAN latent space. This innovation reduces the required forward passes by 95.7% for 10 s and over 99% for 30 s segments while matching or exceeding MusicLM’s quality on standard metrics.
6. Extensions: fMRI Conditioning and Brain2Music
MusicLM exhibits flexible conditioning capabilities beyond text and melody. Brain2Music demonstrates adaptation of MusicLM to reconstruct music from brain activity captured by fMRI (Denk et al., 2023). The pipeline applies the following steps:
- Preprocesses fMRI signals with motion correction, drift removal, registration, and ROI selection.
- Trains a voxelwise ridge regression that maps fMRI response vectors to 128-dim MuLan embeddings (sketched after this list).
- Injects averaged predicted MuLan embeddings into all three MusicLM stages as “prefix” or cross-attention tokens.
- Reconstructs music exhibiting semantic similarity (genre, instrumentation, mood) to the experienced stimulus.
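A minimal sketch of the decoding regression in the second step, assuming preprocessed fMRI response vectors and target MuLan embeddings have already been paired per stimulus; the array shapes, names, and regularization grid are illustrative placeholders, not Brain2Music's actual data or settings.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# X: (n_stimuli, n_voxels) preprocessed fMRI responses; Y: (n_stimuli, 128) MuLan embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(480, 6000))   # placeholder fMRI response vectors (ROI voxels)
Y = rng.normal(size=(480, 128))    # placeholder 128-dim MuLan embedding targets

# L2-regularized linear map from voxel space to the MuLan embedding space,
# with the regularization strength selected by cross-validation.
decoder = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X, Y)
predicted_mulan = decoder.predict(X[:1])   # (1, 128): predicted embedding for one stimulus
# The predicted embedding then conditions MusicLM in place of text-derived MuLan conditioning.
```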
High-level music semantics are decodable from fMRI and generable via MusicLM, but low-level acoustic (e.g., note-level) details are not reliably recoverable from brain signals alone.
7. Applications and Limitations
MusicLM’s text-conditional generation supports creative tasks such as music synthesis, editing, interactive musical dialog, and music retrieval via MuLan embeddings. Its modular conditioning design enables rich multi-modal applications, as evidenced by melody transformation and the neuroscientific reconstruction demonstrated in Brain2Music.
However, the stacked, fully autoregressive hierarchy makes MusicLM computationally expensive, hindering real-time or interactive applications without architectural modifications or efficient distillation. Furthermore, while high-level musical properties are well modeled, precise recreation of low-level acoustic details remains an open challenge, particularly when conditioning inputs are noisy or underspecified (e.g., fMRI rather than exact text or melody).
References:
- "MusicLM: Generating Music From Text" (Agostinelli et al., 2023)
- "Efficient Neural Music Generation" (Lam et al., 2023)
- "Brain2Music: Reconstructing Music from Human Brain Activity" (Denk et al., 2023)