YuE: Scaling Open Foundation Models for Long-Form Music Generation

Published 11 Mar 2025 in eess.AS, cs.AI, cs.MM, and cs.SD | (2503.08638v1)

Abstract: We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe that converges and generalizes. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations perform well on music understanding tasks, where YuE's results match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
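The abstract does not spell out how structural progressive conditioning is laid out at the token level; a minimal sketch, assuming the song is split into labeled sections whose lyrics are injected immediately before that section's audio tokens (section names, lyrics, and token values below are purely illustrative):

```python
# Hypothetical sketch of structural progressive conditioning: rather than
# conditioning once on the full lyric sheet, each labeled section's lyrics
# are placed directly before that section's audio tokens, so lyrics and
# music stay aligned even over a long context. All values are invented
# for illustration and are not from the paper.

SECTIONS = [
    ("[verse]", "city lights fading slow", [11, 12, 13]),
    ("[chorus]", "hold on to the night", [21, 22, 23]),
]

def build_sequence(sections):
    """Interleave section labels, section lyrics, and audio tokens."""
    seq = []
    for label, lyrics, audio_tokens in sections:
        seq.append(label)          # structure tag, e.g. [verse]
        seq.append(lyrics)         # lyrics for just this section
        seq.extend(audio_tokens)   # audio tokens for this section
    return seq

seq = build_sequence(SECTIONS)
# the chorus lyrics appear mid-sequence, next to the chorus audio tokens
assert seq[0] == "[verse]" and seq[5] == "[chorus]"
```

Interleaving conditioning with generation this way keeps each lyric line close to the audio it governs, instead of relying on attention to reach back to a single long prompt at the start of the sequence.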

Summary

  • The paper introduces innovative strategies for scaling open foundation models to generate long-form music with robust lyrical alignment and structural coherence.
  • It uses track-decoupled next-token prediction and structural progressive conditioning to model vocals/accompaniment separately and preserve long-context lyrical alignment.
  • The study employs a multitask, multiphase pre-training approach combining TTS, music generation, and lyrics-to-song tasks for enhanced model convergence and versatility.
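The exact token layout for track-decoupled next-token prediction is not given here; a minimal sketch, assuming vocal and accompaniment codec tokens are interleaved per time step so a single autoregressive model predicts each track's token in turn while attending to both (token values are illustrative):

```python
# Hypothetical sketch of track-decoupled next-token prediction:
# instead of modeling one dense mixture token stream, the vocal and
# accompaniment streams are kept separate and interleaved frame by
# frame, so each prediction targets a single track. Values are invented
# for illustration and are not from the paper.

def interleave_tracks(vocal_tokens, accomp_tokens):
    """Interleave two per-frame token streams into one training sequence."""
    assert len(vocal_tokens) == len(accomp_tokens)
    seq = []
    for v, a in zip(vocal_tokens, accomp_tokens):
        seq.extend([v, a])  # one vocal token, then one accompaniment token
    return seq

def deinterleave(seq):
    """Recover the two track streams from an interleaved sequence."""
    return seq[0::2], seq[1::2]

vocal = [101, 102, 103]
accomp = [201, 202, 203]
seq = interleave_tracks(vocal, accomp)
assert seq == [101, 201, 102, 202, 103, 203]
assert deinterleave(seq) == (vocal, accomp)
```

Because each position in the sequence belongs to exactly one track, the model never has to disentangle a summed mixture signal at prediction time, which is the difficulty the paper attributes to dense mixture tokens.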
