Simple and Controllable Music Generation
Introduction
This paper introduces MusicGen, a single-stage language model (LM) for conditional music generation. MusicGen's primary innovation is generating high-quality music with a simplified architecture that avoids cascading several models. It operates over compressed discrete music representations (tokens) and employs an efficient codebook interleaving strategy that removes the complexity of hierarchical or upsampling approaches. By conditioning the model on textual descriptions and melodic features, MusicGen offers enhanced control over the generated music.
Methodology
Model Architecture
MusicGen employs an autoregressive transformer decoder conditioned on text or melody representations. Audio is tokenized by EnCodec, a neural audio codec that provides high-fidelity reconstruction from a low-frame-rate discrete representation. Quantization uses Residual Vector Quantization (RVQ), in which each quantizer encodes the residual left by the previous one, yielding several parallel streams of codebook tokens.
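To make the mechanism concrete, below is a minimal, illustrative sketch of residual vector quantization (toy codebooks and dimensions; not the EnCodec implementation):

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one latent frame with residual vector quantization.

    frame:     (d,) latent vector, e.g. one frame from an audio encoder.
    codebooks: list of K arrays, each of shape (codebook_size, d).
    Returns K token indices, one per codebook; each stage quantizes the
    residual left over by the previous stage.
    """
    tokens = []
    residual = frame.copy()
    for codebook in codebooks:
        # Pick the nearest codeword for the current residual.
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))
        tokens.append(idx)
        # The next quantizer only sees what this stage failed to capture.
        residual = residual - codebook[idx]
    return tokens

# Toy usage: four codebooks of 2048 entries over an illustrative
# 128-dimensional latent frame, yielding four parallel token streams.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(2048, 128)) for _ in range(4)]
print(rvq_encode(rng.normal(size=128), codebooks))  # e.g. [idx1, idx2, idx3, idx4]
```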
Codebook Interleaving Patterns
The paper introduces several codebook interleaving patterns to handle the multiple parallel streams of quantized audio tokens. These patterns determine how tokens from the different codebooks are interleaved or predicted in parallel, trading generation quality against computational cost. The patterns include "parallel", "delay", "partial delay", and "flattening".
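As a concrete illustration, the sketch below applies the "delay" pattern to a matrix of RVQ token streams (a simplified stand-in, not the audiocraft implementation): each codebook stream is shifted by one extra step, so every decoding step predicts all codebooks in parallel while codebook k still follows codebook k-1 of the same frame.

```python
import numpy as np

PAD = -1  # placeholder token for positions with no real token yet

def delay_interleave(tokens):
    """Apply the "delay" interleaving pattern to RVQ token streams.

    tokens: (K, T) array, one row per codebook.
    Returns a (K, T + K - 1) array in which row k is shifted right by k
    steps, so the model can predict all K codebooks at every step while
    keeping higher codebooks conditioned on lower ones of the same frame.
    """
    K, T = tokens.shape
    out = np.full((K, T + K - 1), PAD, dtype=tokens.dtype)
    for k in range(K):
        out[k, k:k + T] = tokens[k]
    return out

# Toy usage with K=4 codebooks and T=5 frames.
tokens = np.arange(20).reshape(4, 5)
print(delay_interleave(tokens))
```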
Model Conditioning
MusicGen supports both text and melody conditioning. Text conditioning uses a pretrained text encoder (e.g., T5, FLAN-T5, or CLAP) to transform textual descriptions into embedding representations. Melody conditioning leverages a chromagram-based representation of a reference track to guide the generated music toward its melodic and harmonic structure. Because the chromagram is extracted directly from audio, this approach is unsupervised and requires no labeled melody data.
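A sketch of how a chromagram-based melody condition might be extracted (using librosa purely for illustration; the paper's exact preprocessing, filtering, and quantization choices may differ):

```python
import librosa
import numpy as np

def melody_condition(path, hop_length=512):
    """Extract a coarse melody signal from a reference track.

    Computes a 12-bin chromagram and keeps only the dominant pitch class
    per frame, a rough stand-in for an unsupervised melody condition.
    """
    y, sr = librosa.load(path, sr=32000, mono=True)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)  # (12, frames)
    return np.argmax(chroma, axis=0)  # one dominant pitch class per frame

# Hypothetical usage: the per-frame pitch-class sequence would be fed to
# the decoder alongside the text embedding as the melody condition.
# melody = melody_condition("reference.wav")
```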
Experimental Setup
Datasets
The training data comprises 20,000 hours of licensed music, combining an internal set of high-quality tracks with the ShutterStock and Pond5 music collections. The MusicCaps benchmark was used for evaluation; it contains expert-annotated music samples and includes a genre-balanced subset.
Baselines and Metrics
The paper compares MusicGen to several baselines, including Riffusion and Mousai, and uses objective metrics such as Fréchet Audio Distance (FAD), Kullback-Leibler Divergence (KL), and CLAP score to evaluate performance. Subjective evaluations assessed overall quality and relevance to the text input.
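For illustration, the KL metric compares the label distributions that a pretrained audio classifier assigns to reference and generated audio; below is a minimal sketch of that comparison (the classifier and its label probabilities are assumed given, and this is not the paper's exact implementation):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two label probability distributions.

    p: classifier label probabilities for the reference audio.
    q: classifier label probabilities for the generated audio.
    Lower values mean the generated audio triggers similar labels.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy usage with four hypothetical class probabilities.
print(kl_divergence([0.6, 0.2, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]))
```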
Results
MusicGen consistently outperformed the evaluated baselines in subjective human ratings for both audio quality and adherence to text descriptions. Interestingly, incorporating melody conditioning showed no significant impact on human ratings but improved control over the melodic structure, as indicated by the cosine similarity metric for melody adherence.
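A melody-adherence score of this kind can be sketched as frame-wise cosine similarity between the chromagrams of the reference and the generated audio (an illustrative approximation, not necessarily the paper's exact metric):

```python
import numpy as np

def chroma_cosine_similarity(chroma_ref, chroma_gen, eps=1e-8):
    """Average frame-wise cosine similarity between two chromagrams.

    chroma_ref, chroma_gen: (12, T) chroma matrices (e.g. from librosa),
    assumed to cover the same time span. Higher values indicate closer
    adherence to the reference melody.
    """
    T = min(chroma_ref.shape[1], chroma_gen.shape[1])
    a, b = chroma_ref[:, :T], chroma_gen[:, :T]
    num = np.sum(a * b, axis=0)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
    return float(np.mean(num / denom))
```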
Ablation Studies
The authors conducted extensive ablation studies on design choices such as model size, text augmentation strategies, text encoders, and codebook interleaving patterns. The "flattening" pattern yielded the best generation quality but at a markedly higher computational cost, since it multiplies the sequence length by the number of codebooks. D-Adaptation-based optimization improved quality for smaller models but did not carry over to larger ones.
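For context, D-Adaptation estimates the step size on the fly instead of requiring a hand-tuned peak learning rate; the sketch below shows how it would plug into a generic training loop (using the open-source dadaptation package; the model, data, and optimizer settings here are placeholders, not the paper's configuration):

```python
import torch
from dadaptation import DAdaptAdam  # pip install dadaptation

model = torch.nn.Linear(128, 2048)  # stand-in for the transformer decoder
# lr=1.0 lets D-Adaptation determine the effective step size itself.
optimizer = DAdaptAdam(model.parameters(), lr=1.0)

for step in range(10):  # illustrative loop only
    x = torch.randn(8, 128)
    target = torch.randint(0, 2048, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```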
Implications and Future Work
MusicGen's simplification of the music generation process allows for more efficient training and inference. While current results are promising, future research could explore further improvements in controllability and generalization across diverse music genres. Implementing advanced data augmentation techniques, especially for melody conditioning, and extending the model to support more sophisticated musical structures are potential areas for development.
The paper also highlights ethical considerations, emphasizing the use of legally sourced and diverse training data. Ensuring fair competition and accessibility of generative models in music creation is critical. Open research and the development of intuitive controls could make these models valuable for both amateur and professional musicians.
Conclusion
MusicGen represents a significant advancement in the field of conditional music generation. By leveraging simplified architectures and innovative token interleaving strategies, it achieves high-quality music generation with enhanced controllability. The paper's comprehensive evaluation and ablation analyses provide valuable insights into the model's performance and potential areas for future research, making it a substantial contribution to the domain of AI-driven music generation.
Music samples, code, and models are made available at github.com/facebookresearch/audiocraft.