MUSE: Open-Source Song Generation System
- MUSE is an open-source framework for controllable long-form song generation that leverages mixed text and audio tokens for unified sequence modeling.
- It employs a Qwen-based Transformer extended with discrete audio tokens from MuCodec to generate structured musical segments with fine-grained prompts.
- The system delivers reproducible research assets, including a licensed synthetic dataset, comprehensive training pipelines, and detailed evaluation protocols for segment-level style control.
Muse is a fully open-source system for long-form song generation with fine-grained style conditioning, introduced to address the non-reproducibility that has characterized much academic work in this area. It combines a licensed synthetic dataset, training and evaluation pipelines, and an easy-to-deploy song generation model. The system is trained by single-stage supervised finetuning of a Qwen-based LLM extended with discrete audio tokens from MuCodec, and it targets controllable segment-level generation across musical structures such as Intro, Verse, Chorus, Bridge, and Outro (Jiang et al., 7 Jan 2026).
1. Definition and research setting
Muse is positioned around a specific research problem: recent commercial systems such as Suno demonstrate strong capabilities in long-form song generation, while academic research remains largely non-reproducible due to the lack of publicly available training data. Muse addresses that gap by releasing a fully open-source system that includes a licensed synthetic dataset, model weights, and training and evaluation pipelines (Jiang et al., 7 Jan 2026).
The system is designed for long-form song generation rather than short audio continuation or isolated clip synthesis. Its conditioning interface is fine-grained: generation is controlled not only by global style labels, but also by segment-level natural-language style descriptions aligned with lyrical structure. The paper’s central claim is that a simple training recipe—single-stage supervised finetuning, no task-specific losses, no auxiliary objectives, and no additional architectural components—can nonetheless achieve competitive performance on phoneme error rate, text–music style similarity, and audio aesthetic quality. This suggests that the paper treats data construction and conditioning format as central design variables, not merely implementation details (Jiang et al., 7 Jan 2026).
A common misconception is to equate reproducibility with code release alone. In Muse, reproducibility is framed more broadly: the release covers the synthetic corpus, preprocessing, training scripts, evaluation wrappers, and model weights. In that sense, the work is not only a model paper but also an infrastructure paper for controllable long-form song generation research (Jiang et al., 7 Jan 2026).
2. Backbone, tokenization, and sequence formulation
Muse starts from Qwen3-0.6B, a 24-layer, decoder-only Transformer with hidden size 1,024 and 16 attention heads. The standard Qwen vocabulary, around 64K subword tokens, is augmented by 16,384 discrete audio codes from MuCodec, an open-source VQ-VAE audio codec. The joint vocabulary size becomes (Jiang et al., 7 Jan 2026).
The model consumes mixed sequences of text and audio tokens. Token embeddings are formed from an extended embedding table together with absolute positional embeddings, and the stacked Transformer layers produce logits over the joint vocabulary. Beyond extending the token embeddings, no extra architectural components are added; the paper explicitly contrasts this with designs that might introduce cross-modal attention or specialized diffusion blocks (Jiang et al., 7 Jan 2026).
This architectural choice has two implications. First, Muse treats song generation as a unified causal sequence modeling problem over text and discrete audio tokens. Second, the absence of extra modules means that controllability is not delegated to specialized control heads or transition networks. A plausible implication is that the observed segment-level control arises from the conversational training format and segmented supervision rather than from architectural specialization.
3. Synthetic corpus and fine-grained style conditioning
The training corpus consists of 116,489 fully licensed synthetic songs, approximately 7,770 hours in total, synthesized by SunoV5. The language split is 49,692 Chinese songs and 66,797 English songs. Song duration ranges from 2–6 minutes, with an average of approximately 4 minutes (Jiang et al., 7 Jan 2026).
For each song, GPT-5 Mini generates two kinds of text supervision: a set of global style labels—genre, mood, instruments, and vocal type—and lyrics organized into segments such as Intro, Verse, Chorus, Bridge, and Outro. SunoV5 then consumes the text prompt to produce a single-track audio recording, comprising vocals and accompaniment, together with time-aligned lyrics (Jiang et al., 7 Jan 2026).
Style annotation is refined after synthesis. Global styles are re-annotated via Qwen3-Omni-30B audio-to-text and filtered by MuQ-MuLan similarity to ensure at least $0.25$ similarity. Segment-level style is also generated by Qwen3-Omni as a natural-language description for each lyrical segment, and low-similarity cases are re-annotated until all segments exceed $0.25$ (Jiang et al., 7 Jan 2026).
Each song is serialized as a multi-turn “chat.” Turn 1 is a user turn containing global style labels; Turn 2 is an empty assistant turn; subsequent user turns encode each segment in the format "[SegmentName] [desc:…] [lyrics:…] [phoneme:…]"; and the corresponding assistant turns are empty in the input but serve as targets whose content is the audio token span for that segment. This formulation is central to the system: the model learns segment-conditioned audio generation within a causal dialogue template rather than within a specialized symbolic-music or sequence-to-sequence architecture (Jiang et al., 7 Jan 2026).
4. Training regimen, decoding, and controllability
Muse is trained by single-stage supervised fine-tuning from Qwen3-0.6B weights with a single cross-entropy loss over text and audio tokens:
The reported hyperparameters are: 7 total epochs with validation-loss-based checkpoint selection; batch size 1 song with gradient accumulation 8, giving effective batch 8; learning rate with 5% linear warmup and linear decay to zero; AdamW with , , and weight decay $0.1$; and maximum sequence length 15,000 tokens over text and audio. Training uses DeepSpeed ZeRO-3 for memory efficiency (Jiang et al., 7 Jan 2026).
At inference time, the user provides a global instruction of the form “Please generate a song with global style: …” and then, for each structural segment in order, a segment prompt in the format "[SegmentName] [desc:…] [lyrics:…]". Muse emits [BOA]…audio-tokens…[EOA] for each segment and then proceeds to the next segment (Jiang et al., 7 Jan 2026).
Long-form decoding is implemented by concatenating user and assistant turns into one causal context and sampling tokens autoregressively. To ensure reproducibility, greedy decoding with is used by default; failed samples, reported as 6.7%, are re-sampled with and top-$0.25$0. The model uses no explicit transition module. Instead, the paper states that Muse learns to maintain global style and smooth transitions through supervised exposure to fully segmented songs. This is one of the work’s key methodological claims: segment coherence is treated as an emergent property of segmented supervision rather than as the output of a dedicated transition mechanism (Jiang et al., 7 Jan 2026).
5. Evaluation protocol and empirical performance
Muse is evaluated with three classes of metrics. Phoneme Error Rate (PER) compares reference phonemes from ground-truth lyrics against phonemes from Qwen3-ASR transcription using edit distance:
$0.25$1
where $0.25$2, $0.25$3, and $0.25$4 are substitutions, deletions, and insertions, and $0.25$5 is the reference length. Text–Music Style Similarity is measured with MuLan-T using MuQ-MuLan embeddings, both globally and per segment. Audio aesthetic quality is measured with Meta Audiobox Aesthetics—CE, CU, PC, PQ—and SongEval—CO, MU, ME, CL, NA (Jiang et al., 7 Jan 2026).
| Measure | Muse | Comparison |
|---|---|---|
| PER | 0.16 | YuE 0.37, ACE-Step 0.39, LeVo 0.14, DiffRhythm 0.15, Suno 0.09–0.11 |
| Global style similarity | 0.33 | DiffRhythm2 0.37, Suno 0.39–0.42 |
| Segment similarity | 0.31 | Suno 0.34–0.36 |
| CE/CU/PC/PQ | 7.49/7.68/6.61/8.14 | Outperforms YuE and ACE-Step on CE/CU/PC; matches DiffRhythm2 on PQ |
| CO/MU/ME/CL/NA | 4.06/3.88/3.98/3.93/3.87 | Prior open models 3.1–3.5; Suno 4.5–4.6 |
The reported results characterize Muse as competitive rather than uniformly dominant. On PER, Muse at 0.16 is substantially better than YuE and ACE-Step, close to LeVo and DiffRhythm, but still behind Suno’s 0.09–0.11. On global style similarity, Muse reaches 0.33, which the paper describes as competitive with DiffRhythm2 at 0.37 and Suno at 0.39–0.42. On segment similarity, Muse achieves 0.31 versus Suno’s 0.34–0.36, and the paper identifies it as the first open-source system to report segment-level style scores (Jiang et al., 7 Jan 2026).
Audio aesthetics and holistic song evaluation are stronger relative to prior open models. Muse reports Meta Audiobox scores of 7.49/7.68/6.61/8.14 for CE/CU/PC/PQ, outperforming YuE and ACE-Step on CE, CU, and PC, and matching DiffRhythm2 on PQ. In SongEval it reports 4.06/3.88/3.98/3.93/3.87 for CO/MU/ME/CL/NA, significantly above prior open models in the 3.1–3.5 range and narrowing the gap to Suno at 4.5–4.6 (Jiang et al., 7 Jan 2026).
The ablation results support the conditioning design. Removing phoneme supervision slightly lowers PER; removing segment prompts drops segment similarity; and removing global style reduces global similarity and audio scores. Scaling experiments from 0.6B to 8B improve PER from 0.16 to 0.12 and global similarity from 0.33 to 0.35, while segment similarity remains approximately 0.31–0.32. This suggests that segment-level control may be less sensitive to scale than intelligibility and global style matching (Jiang et al., 7 Jan 2026).
6. Reproducibility, future directions, and nomenclature
The reproducibility package is unusually explicit. The GitHub repository includes data-generation scripts, a processed 116k-song JSONL file, training and evaluation pipelines, and model weights. The replication recipe is specified as six steps: clone the repository and download the synthetic dataset; install dependencies including Python, PyTorch, DeepSpeed, tokenizers, and MuCodec; run preprocessing to convert Suno outputs to conversational JSONL and extract phonemes via EmotiVoice; launch training with the provided DeepSpeed script using the Qwen3-0.6B checkpoint; evaluate on a held-out 100-song test set with deterministic decoding and fallback sampling; and compute PER, MuLan-T, Meta Audiobox, and SongEval with the provided wrappers (Jiang et al., 7 Jan 2026).
The paper also states several future directions. These include broader style coverage through more granular tags such as instrumentation mixes and era descriptors; robustness to very long durations through a segment-transition loss or monotonic alignment objective; human-in-the-loop style editing via real-time segment re-prompting to adjust prosody or arrangement; and a voice-cloning extension that safely adapts to specific timbres with few-shot speaker prompts while preserving ethical constraints (Jiang et al., 7 Jan 2026).
A recurrent source of confusion is nomenclature. “MUSE” is a reused name across multiple unrelated research programs, including “MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction” (Zhou et al., 29 Jun 2026), “Muse: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles” (Wang et al., 2024), “MUSE: A Unified Agentic Harness for MLLMs” (Lu et al., 2 Jun 2026), the “Multi-Unit Spectroscopic Explorer” (Kelz et al., 2015), and the “Multi-slit Solar Explorer” (Pontieu et al., 2021). This suggests that disambiguation by title or arXiv identifier is necessary when searching the literature.
Within that broader naming landscape, Muse (Jiang et al., 7 Jan 2026) denotes a reproducible long-form song generation system whose distinguishing properties are a fully licensed synthetic corpus, a conversational segmentation format, single-stage supervised finetuning on joint text–audio tokens, and explicit segment-level style control.