Simple and Controllable Music Generation
Introduction
This paper introduces MusicGen, a single-stage language model (LM) for conditional music generation. MusicGen's primary innovation is generating high-quality music with a simplified architecture that avoids cascading several models. It operates over compressed discrete music representations (tokens) and employs an efficient codebook interleaving strategy that removes the complexity of hierarchical or upsampling approaches. By conditioning the model on textual descriptions and melodic features, MusicGen offers enhanced control over the generated music.
Methodology
Model Architecture
MusicGen employs an autoregressive transformer decoder conditioned on text or melody representations. Audio is tokenized by EnCodec, a neural audio codec that provides high-fidelity reconstruction from a low-frame-rate discrete representation. Quantization uses Residual Vector Quantization (RVQ), in which each quantizer encodes the residual left by the previous one, yielding several parallel streams of codebook tokens.
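To make the mechanism concrete, below is a minimal, illustrative sketch of residual vector quantization (toy codebooks and dimensions; not the EnCodec implementation):

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one latent frame with residual vector quantization.

    frame:     (d,) latent vector, e.g. one frame from an audio encoder.
    codebooks: list of K arrays, each of shape (codebook_size, d).
    Returns K token indices, one per codebook; each stage quantizes the
    residual left over by the previous stage.
    """
    tokens = []
    residual = frame.copy()
    for codebook in codebooks:
        # Pick the nearest codeword for the current residual.
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))
        tokens.append(idx)
        # The next quantizer only sees what this stage failed to capture.
        residual = residual - codebook[idx]
    return tokens

# Toy usage: four codebooks of 2048 entries over an illustrative
# 128-dimensional latent frame, yielding four parallel token streams.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(2048, 128)) for _ in range(4)]
print(rvq_encode(rng.normal(size=128), codebooks))  # e.g. [idx1, idx2, idx3, idx4]
```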
Codebook Interleaving Patterns
The paper introduces several codebook interleaving patterns to handle the multiple parallel streams of quantized audio tokens. These patterns determine how tokens from the different codebooks are interleaved or predicted in parallel, trading generation quality against computational cost. The patterns include "parallel", "delay", "partial delay", and "flattening".
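As a concrete illustration, the sketch below applies the "delay" pattern to a matrix of RVQ token streams (a simplified stand-in, not the audiocraft implementation): each codebook stream is shifted by one extra step, so every decoding step predicts all codebooks in parallel while codebook k still follows codebook k-1 of the same frame.

```python
import numpy as np

PAD = -1  # placeholder token for positions with no real token yet

def delay_interleave(tokens):
    """Apply the "delay" interleaving pattern to RVQ token streams.

    tokens: (K, T) array, one row per codebook.
    Returns a (K, T + K - 1) array in which row k is shifted right by k
    steps, so the model can predict all K codebooks at every step while
    keeping higher codebooks conditioned on lower ones of the same frame.
    """
    K, T = tokens.shape
    out = np.full((K, T + K - 1), PAD, dtype=tokens.dtype)
    for k in range(K):
        out[k, k:k + T] = tokens[k]
    return out

# Toy usage with K=4 codebooks and T=5 frames.
tokens = np.arange(20).reshape(4, 5)
print(delay_interleave(tokens))
```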
Model Conditioning
MusicGen supports both text and melody conditioning. Text conditioning uses a pretrained text encoder (e.g., T5, FLAN-T5, or CLAP) to transform textual descriptions into embedding representations. Melody conditioning leverages a chromagram-based representation of a reference track to guide the generated music toward its melodic and harmonic structure. Because the chromagram is extracted directly from audio, this approach is unsupervised and requires no labeled melody data.
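A sketch of how a chromagram-based melody condition might be extracted (using librosa purely for illustration; the paper's exact preprocessing, filtering, and quantization choices may differ):

```python
import librosa
import numpy as np

def melody_condition(path, hop_length=512):
    """Extract a coarse melody signal from a reference track.

    Computes a 12-bin chromagram and keeps only the dominant pitch class
    per frame, a rough stand-in for an unsupervised melody condition.
    """
    y, sr = librosa.load(path, sr=32000, mono=True)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)  # (12, frames)
    return np.argmax(chroma, axis=0)  # one dominant pitch class per frame

# Hypothetical usage: the per-frame pitch-class sequence would be fed to
# the decoder alongside the text embedding as the melody condition.
# melody = melody_condition("reference.wav")
```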
Experimental Setup
Datasets
The training data comprises 20,000 hours of licensed music, combining an internal set of high-quality tracks with the ShutterStock and Pond5 music collections. The MusicCaps benchmark was used for evaluation; it contains expert-annotated music samples and includes a genre-balanced subset.
Baselines and Metrics
The paper compares MusicGen to several baselines, including Riffusion and Mousai, and uses objective metrics such as Fréchet Audio Distance (FAD), Kullback-Leibler Divergence (KL), and CLAP score to evaluate performance. Subjective evaluations assessed overall quality and relevance to the text input.
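For illustration, the KL metric compares the label distributions that a pretrained audio classifier assigns to reference and generated audio; below is a minimal sketch of that comparison (the classifier and its label probabilities are assumed given, and this is not the paper's exact implementation):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two label probability distributions.

    p: classifier label probabilities for the reference audio.
    q: classifier label probabilities for the generated audio.
    Lower values mean the generated audio triggers similar labels.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy usage with four hypothetical class probabilities.
print(kl_divergence([0.6, 0.2, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]))
```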
Results
MusicGen consistently outperformed the evaluated baselines in subjective human ratings for both audio quality and adherence to text descriptions. Interestingly, incorporating melody conditioning showed no significant impact on human ratings but improved control over the melodic structure, as indicated by the cosine similarity metric for melody adherence.
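A melody-adherence score of this kind can be sketched as frame-wise cosine similarity between the chromagrams of the reference and the generated audio (an illustrative approximation, not necessarily the paper's exact metric):

```python
import numpy as np

def chroma_cosine_similarity(chroma_ref, chroma_gen, eps=1e-8):
    """Average frame-wise cosine similarity between two chromagrams.

    chroma_ref, chroma_gen: (12, T) chroma matrices (e.g. from librosa),
    assumed to cover the same time span. Higher values indicate closer
    adherence to the reference melody.
    """
    T = min(chroma_ref.shape[1], chroma_gen.shape[1])
    a, b = chroma_ref[:, :T], chroma_gen[:, :T]
    num = np.sum(a * b, axis=0)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
    return float(np.mean(num / denom))
```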
Ablation Studies
The authors conducted extensive ablation studies on design choices such as model size, text augmentation strategies, text encoders, and codebook interleaving patterns. The "flattening" pattern yielded the best generation quality but at a markedly higher computational cost, since it multiplies the sequence length by the number of codebooks. D-Adaptation-based optimization improved quality for smaller models but did not carry over to larger ones.
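For context, D-Adaptation estimates the step size on the fly instead of requiring a hand-tuned peak learning rate; the sketch below shows how it would plug into a generic training loop (using the open-source dadaptation package; the model, data, and optimizer settings here are placeholders, not the paper's configuration):

```python
import torch
from dadaptation import DAdaptAdam  # pip install dadaptation

model = torch.nn.Linear(128, 2048)  # stand-in for the transformer decoder
# lr=1.0 lets D-Adaptation determine the effective step size itself.
optimizer = DAdaptAdam(model.parameters(), lr=1.0)

for step in range(10):  # illustrative loop only
    x = torch.randn(8, 128)
    target = torch.randint(0, 2048, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```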
Implications and Future Work
MusicGen's simplification of the music generation process allows for more efficient training and inference. While current results are promising, future research could explore further improvements in controllability and generalization across diverse music genres. Implementing advanced data augmentation techniques, especially for melody conditioning, and extending the model to support more sophisticated musical structures are potential areas for development.
The paper also highlights ethical considerations, emphasizing the use of legally sourced and diverse training data. Ensuring fair competition and accessibility of generative models in music creation is critical. Open research and the development of intuitive controls could make these models valuable for both amateur and professional musicians.
Conclusion
MusicGen represents a significant advancement in the field of conditional music generation. By leveraging simplified architectures and innovative token interleaving strategies, it achieves high-quality music generation with enhanced controllability. The paper's comprehensive evaluation and ablation analyses provide valuable insights into the model's performance and potential areas for future research, making it a substantial contribution to the domain of AI-driven music generation.
Music samples, code, and models are made available at github.com/facebookresearch/audiocraft.