The paper presents a unified framework for text-to-song generation that addresses the inherent challenges of generating coherent vocals and musical accompaniment in a single-stage process. The model, SongGen, leverages an auto-regressive transformer architecture operating over discrete audio tokens produced by an audio codec (X-Codec). It supports fine-grained controllability via conditioning on lyrics, text descriptions, and an optional voice reference. The approach is distinguished by its capability to generate both vocals and accompaniment under a single inference pipeline, thus simplifying traditional multi-stage generation systems.
The model operates in two distinct modes:
- Mixed Mode: In mixed mode, the model directly predicts a single sequence of mixed audio tokens, which represent the superposition of vocals and accompaniment. To mitigate the inherent learning bias (where the model favors high-energy accompaniment over the sparser, semantically rich vocal components), an auxiliary vocal token prediction branch is introduced. This branch, active only during training, enforces additional supervision on vocal features and improves vocal clarity. The training loss is formulated as $\mathcal{L}_{\text{mixed\_pro}} = \mathcal{L}_{\text{mixed}} + \lambda \, \mathcal{L}_{\text{vocal}}$, where:
- $\mathcal{L}_{\text{mixed}}$: weighted sum of cross-entropy losses over the codebooks, with early codebooks prioritized using weights $w_k$ such that $w_k \geq w_{k+1}$ for $k = 1, \dots, K-1$, where $K$ is the number of codebooks.
- $\mathcal{L}_{\text{vocal}}$: auxiliary loss computed on vocal tokens extracted via frame-level alignment with the mixed audio tokens.
- $\lambda$: hyperparameter controlling the contribution of the vocal loss.
- Dual-Track Mode: The model generates vocals and accompaniment as two separate token tracks, combined using one of two patterns:
- Parallel Pattern: Tokens for vocals and accompaniment are concatenated along the codebook dimension with variants that control the temporal alignment via deliberate delays between the tracks.
- Interleaving Pattern: Vocal and accompaniment tokens are interlaced in the temporal dimension; empirical results indicate that when the accompaniment tokens precede the vocal tokens at each frame, the interleaving variant achieves superior vocal quality and competitive fidelity overall.
- The corresponding training loss is defined as the mean of the vocal and accompaniment losses: $\mathcal{L}_{\text{dual}} = \tfrac{1}{2}\left(\mathcal{L}_{\text{vocal}} + \mathcal{L}_{\text{acc}}\right)$. A minimal sketch of both training losses follows this list.
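As a concrete illustration of the two loss formulations, here is a minimal PyTorch-style sketch; the tensor shapes, the helper name `codebook_ce`, and the default `lam` value are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def codebook_ce(logits, targets, weights):
    """Weighted sum of cross-entropy losses over K codebooks.
    logits:  (B, K, T, V) per-codebook token logits
    targets: (B, K, T)    ground-truth codec tokens
    weights: (K,)         per-codebook weights, non-increasing over k
    """
    K = targets.shape[1]
    losses = torch.stack([
        F.cross_entropy(logits[:, k].reshape(-1, logits.shape[-1]),
                        targets[:, k].reshape(-1))
        for k in range(K)
    ])                                   # (K,)
    return (weights * losses).sum()

def mixed_pro_loss(mixed_logits, mixed_tokens,
                   aux_vocal_logits, vocal_tokens,
                   weights, lam=0.5):
    """Mixed mode: main loss on mixed tokens plus the auxiliary
    vocal-token loss from the training-only branch."""
    l_mixed = codebook_ce(mixed_logits, mixed_tokens, weights)
    l_vocal = codebook_ce(aux_vocal_logits, vocal_tokens, weights)
    return l_mixed + lam * l_vocal

def dual_track_loss(vocal_logits, vocal_tokens,
                    acc_logits, acc_tokens, weights):
    """Dual-track mode: mean of the vocal and accompaniment losses."""
    l_vocal = codebook_ce(vocal_logits, vocal_tokens, weights)
    l_acc   = codebook_ce(acc_logits, acc_tokens, weights)
    return 0.5 * (l_vocal + l_acc)
```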
Conditioning Mechanisms
The model incorporates multiple modalities for control:
- Lyrics Conditioning: Input lyrics are tokenized with a 6681-token VoiceBPE scheme and encoded using a compact transformer-based encoder. This design is shown to capture phoneme-level duration and pitch variations essential for sung vocals.
- Voice Conditioning: A frozen MERT encoder extracts robust voice features from a 3-second reference clip, enabling control over timbre and singing technique.
- Text Conditioning: Detailed musical attributes (e.g., instrumentation, genre, mood) are embedded using a frozen FLAN-T5 encoder. The final conditioned embedding is obtained by a temporal concatenation of the projected lyrics, voice, and text embeddings, $E_{\text{cond}} = \mathrm{Concat}\big(E_{\text{lyrics}}, E_{\text{voice}}, E_{\text{text}}\big)$,
where each $E_{(\cdot)}$ denotes the corresponding projected embedding and $\mathrm{Concat}(\cdot)$ indicates concatenation along the temporal dimension.
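A minimal sketch of this conditioning path follows, assuming offline-extracted MERT voice features and hypothetical projection layers, hidden sizes, and checkpoint choices (e.g., `google/flan-t5-large`); only the frozen-encoder choices and the 6681-token VoiceBPE vocabulary are taken from the description above.

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel

class ConditionEncoder(nn.Module):
    """Projects lyrics, voice, and text features to a shared width and
    concatenates them along the time axis. Dimensions are assumptions."""
    def __init__(self, d_model=1024, d_text=1024, d_voice=1024, d_lyrics=512):
        super().__init__()
        # Frozen FLAN-T5 encoder for the text description (checkpoint assumed).
        self.text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")
        self.text_encoder.requires_grad_(False)
        # Small trainable transformer as a stand-in for the compact lyrics
        # encoder operating on VoiceBPE token ids (vocabulary size 6681).
        self.lyrics_embed = nn.Embedding(6681, d_lyrics)
        self.lyrics_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_lyrics, nhead=8, batch_first=True),
            num_layers=4)
        # Per-modality projections into the decoder width.
        self.proj_text   = nn.Linear(d_text, d_model)
        self.proj_voice  = nn.Linear(d_voice, d_model)   # frozen MERT features
        self.proj_lyrics = nn.Linear(d_lyrics, d_model)

    def forward(self, lyric_ids, voice_feats, text_ids, text_mask):
        e_lyrics = self.proj_lyrics(self.lyrics_encoder(self.lyrics_embed(lyric_ids)))
        e_voice  = self.proj_voice(voice_feats)          # (B, T_v, d_voice)
        e_text   = self.proj_text(
            self.text_encoder(input_ids=text_ids,
                              attention_mask=text_mask).last_hidden_state)
        # Temporal concatenation of the projected embeddings.
        return torch.cat([e_lyrics, e_voice, e_text], dim=1)
```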
Data Preprocessing and Training Scheme
Addressing data scarcity in text-to-song generation, the work proposes an automated pipeline that:
- Sources audio from diverse public datasets.
- Applies state-of-the-art source separation (using Demucs) to extract vocals and accompaniment.
- Uses voice activity detection (VAD) and energy-based filtering to segment and select high-quality clips (average duration around 15 seconds).
- Employs dual ASR transcriptions (via Whisper-large models) to filter clips based on an edit-distance criterion, ensuring reliable lyric transcriptions (a sketch of this consistency check follows the list).
- Supplements samples with accurate captions, generating pseudo-captions when necessary using a dedicated music captioning model.
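As an illustration of the transcription-consistency filter, here is a minimal sketch using the `openai-whisper` package; the specific checkpoints, the word-level edit distance, and the 0.2 threshold are assumptions for demonstration rather than the paper's exact setup.

```python
import whisper

def word_edit_distance(a, b):
    """Levenshtein distance between two word sequences."""
    a, b = a.lower().split(), b.lower().split()
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (wa != wb))
    return dp[-1]

# Two independent Whisper-large transcriptions of the separated vocal track.
model_a = whisper.load_model("large-v2")
model_b = whisper.load_model("large-v3")   # second checkpoint is an assumption

def keep_clip(vocal_path, max_norm_dist=0.2):
    """Keep a clip only if the two transcriptions roughly agree.
    The 0.2 normalized-edit-distance threshold is illustrative."""
    t_a = model_a.transcribe(vocal_path)["text"]
    t_b = model_b.transcribe(vocal_path)["text"]
    denom = max(len(t_a.split()), len(t_b.split()), 1)
    return word_edit_distance(t_a, t_b) / denom <= max_norm_dist
```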
The final dataset comprises approximately 540K English-voiced clips (over 2K hours). The training strategy involves:
- Modality Alignment: Jointly training the transformer decoder and conditioning modules on the complete dataset.
- Voice-Free Adaptation: Randomly dropping the voice input with 50% probability during training so that the system can also operate without a voice reference.
- High-Quality Fine-Tuning (HQFT): Refining the model on a filtered subset of approximately 100K samples based on strict energy, transcription accuracy (edit distance), and CLAP score thresholds.
- Curriculum Learning for Codebook Losses: Gradually adjusting the per-codebook loss weights to focus on the most semantically significant codebooks first, before balancing them for finer audio-detail reconstruction (a weight-schedule sketch follows this list).
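Below is a minimal sketch of the voice-free adaptation and the codebook-loss curriculum, assuming a linear annealing schedule and zeroing-out as the drop mechanism; the schedule shape, endpoint weights, and function names are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def codebook_weights(step, total_steps, num_codebooks=8, final_weight=1.0):
    """Curriculum for the per-codebook loss weights: start by strongly
    favoring the first (most semantic) codebooks, then anneal toward a
    flatter weighting so later codebooks refine acoustic detail."""
    k = torch.arange(num_codebooks, dtype=torch.float32)
    start = 1.0 / (k + 1.0)                      # 1, 1/2, 1/3, ... (front-loaded)
    end = torch.full_like(start, final_weight)   # flat weighting at the end
    alpha = min(step / max(total_steps, 1), 1.0)
    return (1 - alpha) * start + alpha * end

def maybe_drop_voice(voice_embed, p_drop=0.5, training=True):
    """Voice-free adaptation: with probability p_drop, replace the voice
    reference embedding with zeros so the model learns to sing without one."""
    if training and torch.rand(()) < p_drop:
        return torch.zeros_like(voice_embed)
    return voice_embed
```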
Experimental Results and Analysis
Objective metrics include Fréchet Audio Distance (FAD), Kullback-Leibler divergence (KL), CLAP score for audio-text alignment, Phoneme Error Rate (PER), and Speaker Embedding Cosine Similarity (SECS). In experiments conducted on a filtered subset of the MusicCaps benchmark:
- The Mixed Pro model achieves an FAD of 1.71, PER of 40.58, and improved vocal quality scores.
- In dual-track mode, the Interleaving (A-V) pattern records competitive performance with FAD of 1.87, enhanced vocal quality, and slightly better perceptual metrics compared to the mixed mode.
- Detailed attention analysis reveals that, in the dual-track interleaving approach, lower transformer layers capture inter-track interactions whilst higher layers refine intra-track characteristics, as evidenced by checkerboard and diagonal patterns in attention maps.
Further ablation studies validate:
- The effectiveness of the HQFT and curriculum learning strategies, with significant improvements across FAD, KL, and PER metrics.
- The superiority of integrating lyrics using a VoiceBPE tokenizer combined with cross-attention (as opposed to pre-pending tokens) for stable and clear vocal generation.
- The advantage of X-Codec, which integrates both acoustic and semantic information, over other codecs (EnCodec, DAC) in both final audio quality and training convergence.
Limitations and Future Work
The current implementation restricts generated song durations to a maximum of 30 seconds and utilizes an audio codec at 16 kHz, limiting fidelity. Future research directions include extending song length, enhancing the audio renderer for higher sampling rates, and exploring more sophisticated upsampling techniques.
Impact Statement
While SongGen provides a promising framework for accessible, controllable music generation—including zero-shot voice cloning—it also raises important considerations regarding copyright, misuse, and deepfake audio generation. The authors thus emphasize the need for appropriate safeguards and ethical constraints when deploying such systems.
This comprehensive approach establishes a new baseline for text-to-song generation, revealing valuable insights for controlling musical attributes in an integrated end-to-end model.