Introduction
Researchers at Google DeepMind present MusicRL, a music generation system fine-tuned with reinforcement learning from human feedback. This is a significant step for the text-to-music field because it tackles a challenge specific to the domain: the subjective nature of musical appreciation. By collecting and incorporating individual preferences at scale, MusicRL aims to generate music that aligns more closely with human tastes.
Approach
MusicRL builds on MusicLM, a base model already capable of high-fidelity, text-conditioned music generation. To improve the quality of its outputs and their adherence to text prompts, the researchers fine-tune this base model with reinforcement learning. The first variant, MusicRL-R, optimizes automatic reward functions that score text adherence and audio quality. The pivotal step is deploying the model to users and collecting a large dataset of roughly 300,000 pairwise preferences, which is then used to fine-tune MusicRL-U through Reinforcement Learning from Human Feedback (RLHF). Finally, applying the two stages sequentially produces MusicRL-RU, the model showing the strongest alignment with human preferences.
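To make the preference-learning step more concrete, the sketch below shows one common way such pairwise preferences can be turned into a scalar reward signal: training a reward model with a Bradley-Terry style objective, as is typical in RLHF pipelines. This is a minimal illustration under assumptions, not the paper's actual code; the class name AudioRewardModel, the embedding dimensions, and the random tensors are hypothetical placeholders.

```python
# Minimal sketch of preference-based reward-model training (Bradley-Terry style).
# All names and dimensions here are illustrative assumptions, not MusicRL's code.
import torch
import torch.nn as nn

class AudioRewardModel(nn.Module):
    """Maps a (prompt embedding, audio embedding) pair to a scalar reward."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, prompt_emb, audio_emb):
        return self.score(torch.cat([prompt_emb, audio_emb], dim=-1)).squeeze(-1)

def preference_loss(model, prompt_emb, preferred_emb, rejected_emb):
    """Bradley-Terry loss: the preferred clip should receive the higher reward."""
    r_pref = model(prompt_emb, preferred_emb)
    r_rej = model(prompt_emb, rejected_emb)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()

# Toy usage with random embeddings standing in for real prompt/audio features.
model = AudioRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
prompt, better, worse = (torch.randn(8, 512) for _ in range(3))
loss = preference_loss(model, prompt, better, worse)
loss.backward()
opt.step()
```

In an RLHF setup, the scalar reward produced by such a model would then drive the fine-tuning of the generator with a policy-optimization method.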
Results
The benefit of the MusicRL approach is evident in the human evaluations. In pairwise comparisons, raters preferred MusicRL-R and MusicRL-U over the MusicLM baseline 65% and 58.6% of the time, respectively. Crucially, the combined model, MusicRL-RU, was preferred over the baseline MusicLM in 66.7% of comparisons. These figures illustrate the merit of integrating human feedback into generative music models.
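For clarity, a preference rate here is simply the fraction of pairwise comparisons in which raters chose one model over the other. The toy snippet below illustrates the arithmetic with made-up judgments; the paper's handling of ties is not reflected here.

```python
# Illustrative only: how a pairwise preference rate such as "66.7% vs. MusicLM"
# is computed from rater judgments. The judgment list below is fabricated.
judgments = ["candidate", "baseline", "candidate", "candidate", "baseline", "candidate"]
wins = sum(j == "candidate" for j in judgments)
preference_rate = 100 * wins / len(judgments)
print(f"Preference rate for candidate model: {preference_rate:.1f}%")  # 66.7%
```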
Conclusion
In essence, MusicRL demonstrates the potential of integrating large-scale human feedback into the fine-tuning of generative music models. While rewards for text adherence and audio quality yield measurable improvements, the work acknowledges that musical appreciation is more complex than these criteria capture, and it points to further research that leverages human feedback at various stages of model training and refinement. The research underscores the need for more nuanced fine-tuning methods that account for the diverse and subjective facets of human musical preference.