- The paper introduces a hierarchical VQ-VAE framework that compresses raw audio into three abstraction levels to efficiently handle long-range dependencies.
- It leverages autoregressive Transformers and upsamplers to generate high-fidelity music at 44.1 kHz, using models with billions of parameters.
- Conditioning on artist, genre, and lyrics enables precise control over the music generation, paving the way for innovative applications in music production and voice synthesis.
An Overview of "Jukebox: A Generative Model for Music"
The paper "Jukebox: A Generative Model for Music" presents a multifaceted approach for generating music, featuring vocals, directly in the raw audio domain. The cornerstone of this work is a multi-scale Vector Quantized Variational Autoencoder (VQ-VAE) coupled with autoregressive Transformers to handle the massive complexity of modeling raw audio. Their model, Jukebox, manifests notable advancements in generating high-fidelity music with maintaining coherence across longer time scales, spanning several minutes.
Methodology
The main approach employs a hierarchical VQ-VAE to compress the raw audio waveform into discrete latent codes at three levels of abstraction. This design drastically reduces the dimensionality of the raw audio, mitigating the computational challenges posed by the long-range dependencies inherent in music. Each VQ-VAE level applies residual networks with dilated convolutions, compressing the audio at progressively coarser temporal resolutions (a minimal sketch follows the list):
- Bottom-Level VQ-VAE: Captures fine-grained details with an 8x compression rate.
- Middle-Level VQ-VAE: Further compresses the audio by 32x.
- Top-Level VQ-VAE: Captures the highest level of abstraction with 128x compression.
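To make the hierarchy concrete, here is a minimal PyTorch sketch of a single VQ-VAE level. The layer widths, dilation schedule, and codebook size are illustrative placeholders rather than the paper's exact configuration (which also includes random-restart codebook updates and spectral losses):

```python
import math

import torch
import torch.nn as nn


class DilatedResBlock(nn.Module):
    """Residual block with a dilated 1-D convolution."""

    def __init__(self, channels, dilation):
        super().__init__()
        self.net = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=dilation, padding=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.net(x)


class VQVAELevel(nn.Module):
    """One level of the hierarchy: strided downsampling, a dilated residual
    stack, and nearest-neighbor quantization against a learned codebook."""

    def __init__(self, hop=8, channels=64, codebook_size=2048):
        super().__init__()
        layers, in_ch = [], 1
        # log2(hop) stride-2 convolutions set the level's overall
        # compression rate (8x, 32x, or 128x in the paper).
        for _ in range(int(math.log2(hop))):
            layers += [nn.Conv1d(in_ch, channels, 4, stride=2, padding=1),
                       nn.ReLU()]
            in_ch = channels
        layers.append(DilatedResBlock(channels, dilation=3))
        self.encoder = nn.Sequential(*layers)
        self.codebook = nn.Embedding(codebook_size, channels)

    def encode(self, audio):
        # audio: (batch, 1, samples) -> codes: (batch, samples // hop)
        z = self.encoder(audio).transpose(1, 2)              # (B, T, C)
        # Nearest-neighbor lookup: each timestep becomes the index of the
        # closest codebook vector -- the discrete token the priors model.
        dists = ((z.unsqueeze(2) - self.codebook.weight) ** 2).sum(-1)
        return dists.argmin(-1)                              # (B, T) codes
```

One second of 44.1 kHz audio becomes roughly 5,500 codes at the bottom level (hop=8) but only about 345 at the top (hop=128), which is what makes modeling minutes of music tractable.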
To generate music, the model employs a combination of autoregressive Transformers (the sampling pipeline is sketched after this list):
- Top-Level Prior: Models long-term dependencies.
- Upsamplers: These reconstruct finer audio details at the middle and bottom levels.
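Put together, generation proceeds by ancestral sampling through the hierarchy. The sketch below is schematic: the `sample` methods, `cond` argument, and `codes_above` keyword are hypothetical stand-ins for the models' real interfaces, and the actual system samples long pieces in overlapping windows:

```python
import torch


@torch.no_grad()
def sample_song(top_prior, mid_upsampler, bot_upsampler, vqvae,
                conditioning, n_top_tokens):
    # 1. The top-level prior autoregressively samples the coarsest codes,
    #    carrying the long-term structure of the piece.
    z_top = top_prior.sample(n_top_tokens, cond=conditioning)

    # 2. Each upsampler samples the next-finer codes conditioned on the
    #    level above; 128x -> 32x -> 8x compression means 4x more tokens
    #    at each step down.
    z_mid = mid_upsampler.sample(4 * n_top_tokens, cond=conditioning,
                                 codes_above=z_top)
    z_bot = bot_upsampler.sample(16 * n_top_tokens, cond=conditioning,
                                 codes_above=z_mid)

    # 3. The bottom-level VQ-VAE decoder maps discrete codes back to a
    #    raw 44.1 kHz waveform.
    return vqvae.decode(z_bot)
```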
Conditioning and Control
Jukebox is designed to be highly controllable through conditioning on:
- Artist and Genre: Steering the generated music to match specific styles.
- Lyrics: Posing the lyrics-to-singing (LTS) task of aligning raw text with generated singing.
The LTS task is handled with an encoder-decoder architecture: the encoder processes the lyrics, and the decoder generates music tokens while attending to the encoded lyrics. This structure produces coherent singing that aligns well with the provided lyrics.
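A minimal sketch of how this conditioning can be wired into a decoder layer, assuming PyTorch and illustrative dimensions (in the paper, a single encoder-decoder attention layer suffices, and artist/genre enter as learned embeddings rather than through attention):

```python
import torch
import torch.nn as nn


class LyricConditionedLayer(nn.Module):
    """Decoder layer whose music-token stream attends to the output of a
    lyrics encoder (dimensions here are illustrative, not the paper's)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, music, lyric_memory, causal_mask):
        # Causal self-attention over the music codes generated so far.
        music = music + self.self_attn(music, music, music,
                                       attn_mask=causal_mask)[0]
        # Cross-attention: queries come from the music codes, keys/values
        # from the lyric encodings, letting the singing track the text.
        music = music + self.cross_attn(music, lyric_memory,
                                        lyric_memory)[0]
        return music + self.ff(music)


# Usage: 2 songs of 16 music tokens attending to 40 lyric characters.
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)
out = LyricConditionedLayer()(torch.randn(2, 16, 512),
                              torch.randn(2, 40, 512), mask)
```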
Key Results
The authors release numerous samples demonstrating various capabilities of Jukebox, including songs across diverse genres like rock, hip hop, and jazz. Quantitative evaluations, such as spectral convergence (sketched after this list), support the qualitative findings:
- High fidelity is achieved with 44.1 kHz VQ-VAE models and 1B-parameter upsamplers, overcoming earlier limitations such as grainy textures.
- The 5B-parameter top-level prior extends the model's capacity to generate diverse and more coherent music.
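For reference, spectral convergence measures the relative Frobenius-norm error between the magnitude spectrograms of a reference and a reconstructed waveform (lower is better). A minimal NumPy version, with STFT parameters chosen arbitrarily rather than taken from the paper:

```python
import numpy as np


def spectral_convergence(ref, gen, n_fft=1024, hop=256):
    """||S_ref - S_gen||_F / ||S_ref||_F over magnitude spectrograms.
    Assumes equal-length 1-D waveforms of at least n_fft samples."""
    def stft_mag(x):
        window = np.hanning(n_fft)
        frames = [x[i:i + n_fft] * window
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

    S_ref, S_gen = stft_mag(ref), stft_mag(gen)
    return np.linalg.norm(S_ref - S_gen) / np.linalg.norm(S_ref)
```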
Implications and Future Work
Practical Implications:
- Music Production: Jukebox provides tools for musicians and producers to synthesize novel and high-quality music, potentially transforming creative workflows.
- Voice Synthesis: The framework could be adapted for applications in voice-assisted technologies, generating realistic and stylistically diverse vocal outputs.
Theoretical Implications:
- Representation Learning: The hierarchical VQ-VAE introduces a novel way to compress and understand high-dimensional data like music, contributing to the field of unsupervised learning.
- Scalable Transformers: The simplified sparse (factorized) attention patterns could stimulate efficient implementations in other domains, such as image and video generation; the core idea is sketched below this list.
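The factorization idea behind these simplified sparse patterns can be illustrated with a small mask-construction sketch. This shows the generic row/column scheme over a reshaped sequence, not the paper's exact three-pattern (row, column, previous-row) attention, and it omits causal masking:

```python
import numpy as np


def factorized_masks(seq_len, block):
    """Boolean attention masks for a (seq_len // block, block) view of the
    sequence: one layer attends within rows, the next within columns, so
    each layer costs O(seq_len * block) instead of O(seq_len ** 2)."""
    idx = np.arange(seq_len)
    rows, cols = idx // block, idx % block
    row_mask = rows[:, None] == rows[None, :]  # same block (row) positions
    col_mask = cols[:, None] == cols[None, :]  # strided (column) positions
    return row_mask, col_mask
```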
Future Developments:
- Improving Long-Term Structure: The current model excels at mid-range coherence but lacks traditional long-term musical structures (e.g., repeating choruses). Enhancing this will be crucial.
- Efficiency: Sampling remains computationally intensive, with top-level token generation and upsampling both requiring significant time, which points to a clear path for optimization.
- Expanding Dataset: Training on more diverse datasets, including non-English songs, could broaden the model's applicability and robustness.
In conclusion, the research presents substantial progress in AI-driven music generation, balancing coherence, diversity, and style mimicry. Future work aims to refine these capabilities, making Jukebox a more powerful tool for both AI research and practical music production.