The paper presents a unified framework for text-to-song generation that addresses the inherent challenges of generating coherent vocals and musical accompaniment in a single-stage process. The model, SongGen, leverages an auto-regressive transformer architecture operating over discrete audio tokens produced by an audio codec (X-Codec). It supports fine-grained controllability via conditioning on lyrics, text descriptions, and an optional voice reference. The approach is distinguished by its capability to generate both vocals and accompaniment under a single inference pipeline, thus simplifying traditional multi-stage generation systems.
The model operates in two distinct modes:
- Mixed Mode: In mixed mode, the model directly predicts a single sequence of mixed audio tokens, which represent the superposition of vocals and accompaniment. To mitigate the inherent learning bias (where the model favors high-energy accompaniment over the sparser, semantically rich vocal components), an auxiliary vocal token prediction branch is introduced. This branch, active only during training, enforces additional supervision on vocal features and improves vocal clarity. The training loss is formulated as $\mathcal{L}_{\text{mixed\_pro}} = \mathcal{L}_{\text{mixed}} + \lambda \, \mathcal{L}_{\text{vocal}}$, where:
- $\mathcal{L}_{\text{mixed}}$: weighted sum of cross-entropy losses over the codebooks, with early codebooks prioritized using weights $w_k$ such that $w_k \geq w_{k+1}$ for $k = 1, \dots, K-1$, where $K$ is the number of codebooks.
- $\mathcal{L}_{\text{vocal}}$: auxiliary loss computed on vocal tokens extracted via frame-level alignment with the mixed audio tokens.
- $\lambda$: hyperparameter controlling the contribution of the vocal loss.
- Dual-Track Mode: The model generates vocals and accompaniment as two separate token tracks, combined using one of two patterns:
- Parallel Pattern: Tokens for vocals and accompaniment are concatenated along the codebook dimension with variants that control the temporal alignment via deliberate delays between the tracks.
- Interleaving Pattern: Vocal and accompaniment tokens are interlaced in the temporal dimension; empirical results indicate that when the accompaniment tokens precede the vocal tokens at each frame, the interleaving variant achieves superior vocal quality and competitive fidelity overall.
- The corresponding training loss is defined as the mean of the vocal and accompaniment losses: $\mathcal{L}_{\text{dual}} = \tfrac{1}{2}\left(\mathcal{L}_{\text{vocal}} + \mathcal{L}_{\text{acc}}\right)$. A minimal sketch of both training losses follows this list.
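As a concrete illustration of the two loss formulations, here is a minimal PyTorch-style sketch; the tensor shapes, the helper name `codebook_ce`, and the default `lam` value are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def codebook_ce(logits, targets, weights):
    """Weighted sum of cross-entropy losses over K codebooks.
    logits:  (B, K, T, V) per-codebook token logits
    targets: (B, K, T)    ground-truth codec tokens
    weights: (K,)         per-codebook weights, non-increasing over k
    """
    K = targets.shape[1]
    losses = torch.stack([
        F.cross_entropy(logits[:, k].reshape(-1, logits.shape[-1]),
                        targets[:, k].reshape(-1))
        for k in range(K)
    ])                                   # (K,)
    return (weights * losses).sum()

def mixed_pro_loss(mixed_logits, mixed_tokens,
                   aux_vocal_logits, vocal_tokens,
                   weights, lam=0.5):
    """Mixed mode: main loss on mixed tokens plus the auxiliary
    vocal-token loss from the training-only branch."""
    l_mixed = codebook_ce(mixed_logits, mixed_tokens, weights)
    l_vocal = codebook_ce(aux_vocal_logits, vocal_tokens, weights)
    return l_mixed + lam * l_vocal

def dual_track_loss(vocal_logits, vocal_tokens,
                    acc_logits, acc_tokens, weights):
    """Dual-track mode: mean of the vocal and accompaniment losses."""
    l_vocal = codebook_ce(vocal_logits, vocal_tokens, weights)
    l_acc   = codebook_ce(acc_logits, acc_tokens, weights)
    return 0.5 * (l_vocal + l_acc)
```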
Conditioning Mechanisms
The model incorporates multiple modalities for control:
- Lyrics Conditioning: Input lyrics are tokenized with a 6681-token VoiceBPE scheme and encoded using a compact transformer-based encoder. This design is shown to capture phoneme-level duration and pitch variations essential for sung vocals.
- Voice Conditioning: A frozen MERT encoder extracts robust voice features from a 3-second reference clip, enabling control over timbre and singing technique.
- Text Conditioning: Detailed musical attributes (e.g., instrumentation, genre, mood) are embedded using a frozen FLAN-T5 encoder. The final conditioned embedding is obtained by a temporal concatenation of the projected lyrics, voice, and text embeddings, $E_{\text{cond}} = \mathrm{Concat}\big(E_{\text{lyrics}}, E_{\text{voice}}, E_{\text{text}}\big)$,
where each $E_{(\cdot)}$ denotes the corresponding projected embedding and $\mathrm{Concat}(\cdot)$ indicates concatenation along the temporal dimension.
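A minimal sketch of this conditioning path follows, assuming offline-extracted MERT voice features and hypothetical projection layers, hidden sizes, and checkpoint choices (e.g., `google/flan-t5-large`); only the frozen-encoder choices and the 6681-token VoiceBPE vocabulary are taken from the description above.

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel

class ConditionEncoder(nn.Module):
    """Projects lyrics, voice, and text features to a shared width and
    concatenates them along the time axis. Dimensions are assumptions."""
    def __init__(self, d_model=1024, d_text=1024, d_voice=1024, d_lyrics=512):
        super().__init__()
        # Frozen FLAN-T5 encoder for the text description (checkpoint assumed).
        self.text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")
        self.text_encoder.requires_grad_(False)
        # Small trainable transformer as a stand-in for the compact lyrics
        # encoder operating on VoiceBPE token ids (vocabulary size 6681).
        self.lyrics_embed = nn.Embedding(6681, d_lyrics)
        self.lyrics_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_lyrics, nhead=8, batch_first=True),
            num_layers=4)
        # Per-modality projections into the decoder width.
        self.proj_text   = nn.Linear(d_text, d_model)
        self.proj_voice  = nn.Linear(d_voice, d_model)   # frozen MERT features
        self.proj_lyrics = nn.Linear(d_lyrics, d_model)

    def forward(self, lyric_ids, voice_feats, text_ids, text_mask):
        e_lyrics = self.proj_lyrics(self.lyrics_encoder(self.lyrics_embed(lyric_ids)))
        e_voice  = self.proj_voice(voice_feats)          # (B, T_v, d_voice)
        e_text   = self.proj_text(
            self.text_encoder(input_ids=text_ids,
                              attention_mask=text_mask).last_hidden_state)
        # Temporal concatenation of the projected embeddings.
        return torch.cat([e_lyrics, e_voice, e_text], dim=1)
```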
Data Preprocessing and Training Scheme
Addressing data scarcity in text-to-song generation, the work proposes an automated pipeline that:
- Sources audio from diverse public datasets.
- Applies state-of-the-art source separation (using Demucs) to extract vocals and accompaniment.
- Uses voice activity detection (VAD) and energy-based filtering to segment and select high-quality clips (average duration around 15 seconds).
- Employs dual ASR transcriptions (via Whisper-large models) to filter clips based on an edit-distance criterion, ensuring reliable lyric transcriptions (a sketch of this consistency check follows the list).
- Supplements samples with accurate captions, generating pseudo-captions when necessary using a dedicated music captioning model.
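As an illustration of the transcription-consistency filter, here is a minimal sketch using the `openai-whisper` package; the specific checkpoints, the word-level edit distance, and the 0.2 threshold are assumptions for demonstration rather than the paper's exact setup.

```python
import whisper

def word_edit_distance(a, b):
    """Levenshtein distance between two word sequences."""
    a, b = a.lower().split(), b.lower().split()
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (wa != wb))
    return dp[-1]

# Two independent Whisper-large transcriptions of the separated vocal track.
model_a = whisper.load_model("large-v2")
model_b = whisper.load_model("large-v3")   # second checkpoint is an assumption

def keep_clip(vocal_path, max_norm_dist=0.2):
    """Keep a clip only if the two transcriptions roughly agree.
    The 0.2 normalized-edit-distance threshold is illustrative."""
    t_a = model_a.transcribe(vocal_path)["text"]
    t_b = model_b.transcribe(vocal_path)["text"]
    denom = max(len(t_a.split()), len(t_b.split()), 1)
    return word_edit_distance(t_a, t_b) / denom <= max_norm_dist
```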
The final dataset comprises approximately 540K English-voiced clips (over 2K hours). The training strategy involves:
- Modality Alignment: Jointly training the transformer decoder and conditioning modules on the complete dataset.
- Voice-Free Adaptation: Randomly dropping the voice input with 50% probability during training so that the system can also operate without a voice reference.
- High-Quality Fine-Tuning (HQFT): Refining the model on a filtered subset of approximately 100K samples based on strict energy, transcription accuracy (edit distance), and CLAP score thresholds.
- Curriculum Learning for Codebook Losses: Gradually adjusting the per-codebook loss weights to focus on the most semantically significant codebooks first, before balancing them for finer audio-detail reconstruction (a weight-schedule sketch follows this list).
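Below is a minimal sketch of the voice-free adaptation and the codebook-loss curriculum, assuming a linear annealing schedule and zeroing-out as the drop mechanism; the schedule shape, endpoint weights, and function names are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def codebook_weights(step, total_steps, num_codebooks=8, final_weight=1.0):
    """Curriculum for the per-codebook loss weights: start by strongly
    favoring the first (most semantic) codebooks, then anneal toward a
    flatter weighting so later codebooks refine acoustic detail."""
    k = torch.arange(num_codebooks, dtype=torch.float32)
    start = 1.0 / (k + 1.0)                      # 1, 1/2, 1/3, ... (front-loaded)
    end = torch.full_like(start, final_weight)   # flat weighting at the end
    alpha = min(step / max(total_steps, 1), 1.0)
    return (1 - alpha) * start + alpha * end

def maybe_drop_voice(voice_embed, p_drop=0.5, training=True):
    """Voice-free adaptation: with probability p_drop, replace the voice
    reference embedding with zeros so the model learns to sing without one."""
    if training and torch.rand(()) < p_drop:
        return torch.zeros_like(voice_embed)
    return voice_embed
```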
Experimental Results and Analysis
Objective metrics include Fréchet Audio Distance (FAD), Kullback-Leibler divergence (KL), CLAP score for audio-text alignment, Phoneme Error Rate (PER), and Speaker Embedding Cosine Similarity (SECS). In experiments conducted on a filtered subset of the MusicCaps benchmark:
- The Mixed Pro model achieves an FAD of 1.71, PER of 40.58, and improved vocal quality scores.
- In dual-track mode, the Interleaving (A-V) pattern records competitive performance with FAD of 1.87, enhanced vocal quality, and slightly better perceptual metrics compared to the mixed mode.
- Detailed attention analysis reveals that, in the dual-track interleaving approach, lower transformer layers capture inter-track interactions whilst higher layers refine intra-track characteristics, as evidenced by checkerboard and diagonal patterns in attention maps.
Further ablation studies validate:
- The effectiveness of the HQFT and curriculum learning strategies, with significant improvements across FAD, KL, and PER metrics.
- The superiority of integrating lyrics using a VoiceBPE tokenizer combined with cross-attention (as opposed to pre-pending tokens) for stable and clear vocal generation.
- The advantage of X-Codec, which integrates both acoustic and semantic information, over other codecs (EnCodec, DAC) in both final audio quality and training convergence.
Limitations and Future Work
The current implementation restricts generated song durations to a maximum of 30 seconds and utilizes an audio codec at 16 kHz, limiting fidelity. Future research directions include extending song length, enhancing the audio renderer for higher sampling rates, and exploring more sophisticated upsampling techniques.
Impact Statement
While SongGen provides a promising framework for accessible, controllable music generation—including zero-shot voice cloning—it also raises important considerations regarding copyright, misuse, and deepfake audio generation. The authors thus emphasize the need for appropriate safeguards and ethical constraints when deploying such systems.
This comprehensive approach establishes a new baseline for text-to-song generation, revealing valuable insights for controlling musical attributes in an integrated end-to-end model.