
DiTSinger: Scalable Singing Synthesis

Updated 13 October 2025
  • The paper introduces DiTSinger as a scalable SVS system that utilizes a diffusion transformer architecture with an implicit phoneme-to-acoustic alignment mechanism for alignment-free synthesis.
  • It employs innovative techniques such as Rotary Positional Encoding, qk-normalization, and Adaptive Layer Normalization to ensure model stability and high fidelity across varying scales.
  • The system integrates a two-stage data pipeline—combining LLM-generated diverse lyrics with professional recordings—to achieve improved objective metrics and subjective quality.

DiTSinger is a scalable singing voice synthesis (SVS) system centered on a diffusion transformer architecture with implicit phoneme-to-acoustic alignment. It advances the state of the art by combining systematic model scaling, large-scale data generation with diverse linguistic content, and an alignment strategy that bypasses explicit phoneme-level duration labels. The system demonstrates alignment-free, high-fidelity synthesis, with improved objective metrics and subjective quality on approximately 530 hours of Chinese singing drawn from both human recordings and model-generated samples (Du et al., 10 Oct 2025).

1. Diffusion Transformer Architecture

At the core of DiTSinger is a latent diffusion model parameterized by a transformer stack (Diffusion Transformer, "DiT"). Key architectural components include:

  • Rotary Positional Encoding (RoPE): Injects relative positional information directly into the query and key projections of the Multi-Head Self-Attention (MHSA) modules by rotating channel pairs, so that attention scores depend on relative offsets; this supports the long temporal dependencies required for singing synthesis.
  • qk-normalization (qk-norm): Normalizes queries and keys so that attention logits remain stable in both the self-attention and cross-attention branches as the model is scaled in depth (number of layers) and width (hidden dimension). A minimal sketch of RoPE with qk-norm follows this list.
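
The sketch below shows one common way to combine RoPE and qk-normalization inside an attention computation. It is an illustrative PyTorch fragment, not the authors' implementation; the rotation scheme (rotate-half) and the fixed logit scale are assumptions.

```python
import torch
import torch.nn.functional as F

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional encoding for x of shape (batch, heads, time, dim).

    Channel pairs are rotated by an angle proportional to the frame index, so
    dot products between rotated queries and keys depend on relative position.
    """
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)  # (half,)
    angles = torch.arange(t, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def qk_norm_attention(q, k, v, scale: float = 10.0):
    """Attention with L2-normalized (qk-norm) and rotated (RoPE) queries/keys.

    Normalizing q and k bounds the attention logits, which helps keep training
    stable as depth and width grow; `scale` is a tunable logit temperature.
    """
    q = rope(F.normalize(q, dim=-1))
    k = rope(F.normalize(k, dim=-1))
    logits = scale * torch.einsum("bhqd,bhkd->bhqk", q, k)
    return torch.softmax(logits, dim=-1) @ v

# Example: 2 utterances, 4 heads, 128 mel frames, 64 channels per head
q, k, v = (torch.randn(2, 4, 128, 64) for _ in range(3))
out = qk_norm_attention(q, k, v)   # (2, 4, 128, 64)
```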

Each DiTBlock in the denoising network contains:

  • MHSA (with RoPE and qk-norm)
  • Masked Multi-Head Cross-Attention (MHCA), integrating detailed phoneme and score conditions, and also using qk-norm
  • Pointwise FeedForward Network (FFN)

Each branch is further modulated by Adaptive Layer Normalization (AdaLN) with learnable parameters $[\gamma_i, \beta_i]$ and scaled by residual weights $[\alpha_i]$.
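
A minimal PyTorch sketch of this block structure follows. It mirrors the description above (MHSA, masked MHCA over the conditions, pointwise FFN, each branch wrapped in AdaLN modulation with residual gates), but uses stock PyTorch attention modules and omits RoPE/qk-norm for brevity; it is an illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Self-attention, masked cross-attention to phoneme/score conditions, and a
    pointwise FFN; each branch is modulated by AdaLN parameters (gamma, beta) and
    gated by a residual weight alpha, all predicted from the global condition
    (e.g. diffusion timestep + speaker embedding)."""

    def __init__(self, dim: int, heads: int = 8, ffn_mult: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim))
        # AdaLN head: (gamma, beta, alpha) for each of the three branches.
        self.ada = nn.Linear(dim, 9 * dim)

    def forward(self, x, cond_seq, global_cond, attn_bias=None):
        # x:           (B, L_mel, dim)  noisy latent frames
        # cond_seq:    (B, L_ph, dim)   fine-grained conditions (h_local)
        # global_cond: (B, dim)         timestep + speaker embedding
        # attn_bias:   (B*heads, L_mel, L_ph) additive mask M from Section 2
        g1, b1, a1, g2, b2, a2, g3, b3, a3 = self.ada(global_cond).chunk(9, dim=-1)
        mod = lambda h, g, b: h * (1 + g.unsqueeze(1)) + b.unsqueeze(1)

        h = mod(self.norm1(x), g1, b1)
        x = x + a1.unsqueeze(1) * self.self_attn(h, h, h, need_weights=False)[0]

        h = mod(self.norm2(x), g2, b2)
        x = x + a2.unsqueeze(1) * self.cross_attn(
            h, cond_seq, cond_seq, attn_mask=attn_bias, need_weights=False)[0]

        h = mod(self.norm3(x), g3, b3)
        return x + a3.unsqueeze(1) * self.ffn(h)

block = DiTBlock(dim=256)
x = torch.randn(2, 200, 256)      # noisy mel latents
cond = torch.randn(2, 40, 256)    # encoded phoneme/score conditions
g = torch.randn(2, 256)           # timestep + speaker embedding
y = block(x, cond, g)             # (2, 200, 256)
```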

Fine-grained conditioning is achieved by summing the embeddings for pitch, phoneme, word duration, and slur indicators:

h_{local} = Enc_{cond}\big(E_p(p) + E_{ph}(ph) + E_w(w) + E_{sl}(sl)\big)
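
A sketch of this conditioning path is shown below, assuming illustrative vocabulary sizes and a small Transformer encoder standing in for Enc_cond; these specifics are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class LocalCondition(nn.Module):
    """Sum per-token embeddings for pitch, phoneme, word duration, and slur,
    then encode them into h_local (sizes here are illustrative)."""

    def __init__(self, dim=256, n_pitch=128, n_phone=100, n_dur=512, n_slur=2):
        super().__init__()
        self.pitch = nn.Embedding(n_pitch, dim)   # E_p
        self.phone = nn.Embedding(n_phone, dim)   # E_ph
        self.dur = nn.Embedding(n_dur, dim)       # E_w (quantized word duration)
        self.slur = nn.Embedding(n_slur, dim)     # E_sl
        self.enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)

    def forward(self, p, ph, w, sl):
        # All inputs: (B, L_ph) integer indices aligned to the phoneme sequence.
        h = self.pitch(p) + self.phone(ph) + self.dur(w) + self.slur(sl)
        return self.enc(h)   # h_local: (B, L_ph, dim)

cond = LocalCondition()
ix = lambda n: torch.randint(0, n, (2, 40))
h_local = cond(ix(128), ix(100), ix(512), ix(2))   # (2, 40, 256)
```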

Cross-attention integrates a mask $M$ constructed by the implicit alignment mechanism:

\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left( \frac{QK^T}{\sqrt{d}} + M \right) V

This structure allows DiTSinger to robustly condition noise prediction at each denoising step on both fine-grained and coarse-grained inputs, including speaker identity and diffusion timestep.

2. Implicit Alignment Mechanism

To avoid reliance on phoneme-level duration labels and enhance robustness under noisy temporal annotations, DiTSinger employs an implicit alignment mechanism:

  • Phonemes inherit the temporal span of their corresponding characters, defined by start time $t_{start}$ and duration $d_{char}$, adjusted by a tunable offset $\delta$ and the previous character's duration $d_{prev}$:

\tilde{t}_{start} = t_{start} - \min(\delta, d_{char}, d_{prev}), \qquad t_{end} = t_{start} + d_{char}

  • An attention mask $M \in \mathbb{R}^{L_{mel} \times L_{ph}}$ is constructed such that:

M_{i,j} = \begin{cases} 0 & \text{if } t_i \in [\tilde{t}_{start}, t_{end}] \\ -\infty & \text{otherwise} \end{cases}

This mask, applied in cross-attention, restricts phoneme-to-acoustic attention to valid character intervals, replacing direct duration supervision and enhancing alignment under noisy or uncertain data.
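
The sketch below constructs such a mask from character-level spans and a phoneme-to-character mapping; the frame timing, hop size, offset value, and the treatment of the first character's d_prev are assumptions made for illustration. The resulting M is added to the attention logits exactly as in the cross-attention formula of Section 1.

```python
import torch

def build_alignment_mask(ph_char_idx, char_start, char_dur,
                         n_mel_frames, frame_hop_s=0.01, delta=0.05):
    """Build the additive cross-attention mask M of shape (L_mel, L_ph).

    ph_char_idx: (L_ph,) index of the character each phoneme belongs to
    char_start, char_dur: (N_char,) character onsets and durations in seconds
    A mel frame may attend to a phoneme only if the frame time falls inside the
    (slightly widened) span of that phoneme's character; otherwise the logit is -inf.
    """
    # Widen each start by min(delta, d_char, d_prev); for the first character,
    # its own duration stands in for d_prev (an assumption of this sketch).
    prev_dur = torch.cat([char_dur[:1], char_dur[:-1]])
    offset = torch.minimum(torch.full_like(char_dur, delta),
                           torch.minimum(char_dur, prev_dur))
    t_start = char_start - offset
    t_end = char_start + char_dur

    # Each phoneme inherits the span of its parent character.
    ph_start = t_start[ph_char_idx]            # (L_ph,)
    ph_end = t_end[ph_char_idx]                # (L_ph,)

    # Mel frame centre times.
    t = torch.arange(n_mel_frames) * frame_hop_s                        # (L_mel,)
    inside = (t[:, None] >= ph_start[None, :]) & (t[:, None] <= ph_end[None, :])
    mask = torch.zeros(inside.shape)
    mask[~inside] = float("-inf")
    return mask

# Example: 3 characters, 5 phonemes, 200 mel frames at a 10 ms hop
ph_char_idx = torch.tensor([0, 0, 1, 2, 2])
char_start = torch.tensor([0.00, 0.40, 0.90])
char_dur = torch.tensor([0.40, 0.50, 0.60])
M = build_alignment_mask(ph_char_idx, char_start, char_dur, n_mel_frames=200)
```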

3. Data Construction and Training Pipeline

DiTSinger utilizes a two-stage scalable data pipeline:

  • Recording-Fitting Phase: A compact seed dataset is assembled by pairing fixed melodies with linguistically diverse lyrics generated by an LLM. Professional singers record these melody-lyric pairs, ensuring clean, high-quality vocal samples. Melody-specific models ("PseudoSinger") are then trained on each fixed-melody group to produce accurate melody-conditioned synthesis.
  • Data Expansion Phase: The trained PseudoSinger models are used to synthesize large volumes of singing data by rendering additional LLM-generated lyrics paired with the fixed melodies. This efficiently produces an extensive corpus (~500 hours) with high phonetic diversity and consistent melodic alignment. A base model is pre-trained on the publicly available M4Singer dataset and fine-tuned on both the natural and synthesized samples.

This pipeline allows for large-scale, diverse, and high-quality SVS training data without exhaustive manual annotation.

4. Model Scalability and Fidelity

Scalability is achieved by scaling both data and model capacity:

  • Model Scaling: DiTSinger systematically increases the depth (number of DiTBlocks), width (hidden dimension), and temporal resolution of its transformer backbone. Experiments contrast Small (depth 4), Base (depth 8), and Large (depth 16) configurations; a higher temporal resolution (e.g., the S_2 model) can sometimes yield better fidelity than depth scaling alone. A configuration sketch follows this list.
  • Data Scaling: Scaling the training data from 30 to ~530 hours yields monotonically improved performance, both objectively and subjectively.
  • Conditioning Strategy: Integrating both fine- and coarse-grained conditions (linguistic, phonetic, prosodic, speaker identity, and diffusion timestep) yields rich expressiveness.
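
As a concrete way to organize these scale settings, the sketch below uses the depths reported above; the widths and head counts are placeholders, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class DiTSingerConfig:
    depth: int   # number of DiTBlocks (Small/Base/Large: 4 / 8 / 16, per the paper)
    dim: int     # hidden width -- placeholder, not reported here
    heads: int   # attention heads -- placeholder, not reported here

CONFIGS = {
    "S": DiTSingerConfig(depth=4,  dim=256, heads=4),
    "B": DiTSingerConfig(depth=8,  dim=384, heads=6),
    "L": DiTSingerConfig(depth=16, dim=512, heads=8),
}
```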

Experiments indicate that DiTSinger achieves lower Mel Cepstral Distortion (MCD), lower F0 root mean square error (F0RMSE), and higher mean opinion scores (MOS) compared to contemporary SVS models such as DiffSinger, StyleSinger, and TCSinger.

5. Experimental Validation

The effectiveness of DiTSinger is validated through comprehensive experiments:

  • Trained on ~530 hours from 40 professional vocalists, using a mixture of model-generated and publicly available singing data.
  • Objective metrics: MCD, FFE (F0 Frame Error: the fraction of frames with pitch or voicing errors), and F0RMSE; minimal sketches of MCD and F0RMSE follow this list.
  • Subjective metric: MOS (1-5 scale).
  • Scaling experiments confirm monotonic improvements with increased data and model size; DiTSinger_L achieves the best scores.
  • Ablation studies with PseudoSinger training show that approximately 20 fixed-melody groups provide optimal balance: too few result in unstable articulation, while too many reduce generalization due to sample scarcity per group.
  • Compared against state-of-the-art baselines, DiTSinger demonstrates superior performance across both objective and subjective criteria.
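
The sketches below give minimal NumPy implementations of two of the objective metrics; the exact variants used in the paper (time alignment, voicing handling, dB constant) may differ.

```python
import numpy as np

def mcd(mcep_ref: np.ndarray, mcep_syn: np.ndarray) -> float:
    """Mel Cepstral Distortion (dB) between time-aligned mel-cepstra of shape
    (frames, order); the 0th (energy) coefficient is excluded."""
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

def f0_rmse(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    """Root mean square F0 error (Hz) over frames voiced in both signals
    (unvoiced frames are marked with 0)."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_syn[voiced]) ** 2)))

# Example with random stand-in features
rng = np.random.default_rng(0)
print(mcd(rng.normal(size=(200, 25)), rng.normal(size=(200, 25))))
print(f0_rmse(rng.uniform(100, 300, 500), rng.uniform(100, 300, 500)))
```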

6. Significance and Implications

DiTSinger’s design addresses key challenges in SVS:

  • Alignment-Free Synthesis: The implicit alignment mechanism removes the need for painstaking phoneme-level timing annotation, relying instead on character-level spans and attention bias, providing robustness against label noise.
  • Scalable Data Generation: The combination of LLM-generated diverse lyrics, professional recordings, and model-driven rendering enables efficient large-scale corpus construction, enhancing phonetic and emotional diversity.
  • Objective and Subjective Gains: Improved fidelity is demonstrated across standard SVS metrics and user ratings. The enhanced architecture and training regime enable consistent, high-quality synthesis as scale increases.

A plausible implication is that similar implicit alignment and scalable data approaches could be extended to other generative tasks requiring fine-grained temporal-linguistic conditioning, where annotation is costly or imprecise.

7. Comparative Overview

The following table summarizes features that distinguish DiTSinger from baseline SVS systems:

| System | Alignment Approach | Model Core | Data Scaling |
|---|---|---|---|
| DiTSinger | Implicit (character-level mask) | Transformer-based diffusion (RoPE, qk-norm) | LLM-generated lyrics + recorded fixed melodies (~500 h) |
| DiffSinger | Explicit (phoneme duration labels) | Shallow diffusion acoustic model | Real singing dataset + shallow diffusion |
| StyleSinger / TCSinger | Explicit | Various deep learning models | Standard SVS datasets |

DiTSinger advances on alignment robustness, scalability, and fidelity in SVS relative to previously published methods (Liu et al., 2021, Du et al., 10 Oct 2025).
