DisCo-Speech: Zero-Shot Disentangled TTS

Updated 17 December 2025
  • The paper introduces DisCo-Speech, a zero-shot TTS system that disentangles linguistic content, prosody, and speaker timbre at the codec level.
  • It employs parallel encoders with fine-grained quantization and soft orthogonality losses to achieve clean separation and high-fidelity waveform reconstruction.
  • The model integrates an autoregressive Transformer LM with BigVGAN-style synthesis for flexible prosody continuation and robust voice cloning.

DisCo-Speech is a zero-shot, controllable text-to-speech (TTS) architecture that addresses the challenge of disentangling linguistic content, prosody, and speaker timbre at the codec level. The approach introduces a disentangled speech codec (DisCodec) paired with an autoregressive Transformer language model (LM), achieving independent prosody control and voice cloning in a unified framework. This separation of prosody and timbre is accomplished through parallel encoding, quantization, and carefully constructed loss objectives, establishing state-of-the-art flexibility in zero-shot speech synthesis and voice conversion (Li et al., 15 Dec 2025).

1. Disentangled Speech Codec Architecture

DisCodec factorizes an input waveform $x$ into three distinct streams: content ($c$), prosody ($p$), and timbre ($t$), utilizing parallel encoders and finite scalar quantization (FSQ).

  • Content Encoder ($E_c$): Employs convolutional blocks inspired by DAC, downsampling $x$ into frame-level latents $h_c$, quantized to discrete content tokens $q_c = Q_c(h_c)$. Supervision leverages a phone recognizer (fine-tuned Wav2Vec) via cross-entropy loss, ensuring that $q_c$ encodes purely linguistic information.
  • Prosody Encoder ($E_p$): Utilizes dilated causal convolutions (WaveNet-style architecture) to produce latent $h_p$, quantized by two-stage residual FSQ into $q_{p1}$ and $q_{p2}$, summed as $q_p = q_{p1} + q_{p2}$. Supervision comprises a frame-level F0 (pitch) regression loss, a correlation loss controlling overlap between $q_{p1}$ and $q_{p2}$ (with target similarity $\alpha = 0.2$), and a gradient-reversal layer (GRL) to minimize timbre leakage.
  • Timbre Encoder ($E_t$): Based on ECAPA-TDNN applied to Mel-spectrograms, with global aggregation via cross-attention using learnable queries. The resulting global timbre vector $g_t$ is quantized to a 48-dimensional token $q_t$ and optimized with a speaker-classification loss so that it encodes only speaker identity. A schematic sketch of the three parallel streams follows this list.
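
A minimal sketch of this tri-stream factorization, assuming simplified convolutional stand-ins for the DAC-style, WaveNet-style, and ECAPA-TDNN encoders and a schematic FSQ grid; module choices and tensor shapes are illustrative, not the paper's exact configuration:

```python
# Illustrative sketch of DisCodec's tri-stream factorization (not the authors' code).
import torch
import torch.nn as nn


class DisCodecSketch(nn.Module):
    def __init__(self, dim=256, timbre_dim=48):
        super().__init__()
        self.enc_c = nn.Conv1d(1, dim, kernel_size=480, stride=320)  # content stream (frame-level)
        self.enc_p = nn.Conv1d(1, dim, kernel_size=480, stride=320)  # prosody stream (frame-level)
        self.enc_t = nn.Conv1d(80, dim, kernel_size=3, padding=1)    # timbre stream (Mel input)
        self.timbre_proj = nn.Linear(dim, timbre_dim)                # 48-d global timbre token

    @staticmethod
    def fsq(h, levels=8):
        # Stand-in for finite scalar quantization: bound, snap to a uniform grid,
        # and pass gradients straight through.
        h = torch.tanh(h)
        grid = torch.round((h + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
        return h + (grid - h).detach()

    def forward(self, wav, mel):
        # wav: (B, 1, T_samples); mel: (B, 80, T_frames)
        q_c = self.fsq(self.enc_c(wav))          # discrete content tokens q_c
        h_p = self.enc_p(wav)
        q_p1 = self.fsq(h_p)                     # first prosody quantization stage
        q_p2 = self.fsq(h_p - q_p1)              # second (residual) stage
        q_p = q_p1 + q_p2                        # summed prosody tokens q_p
        g_t = self.enc_t(mel).mean(dim=-1)       # global pooling (cross-attention in the paper)
        q_t = self.fsq(self.timbre_proj(g_t))    # global timbre token q_t
        return q_c, q_p, q_t
```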

Tri-factor disentanglement losses use soft orthogonality regularization:

$$\mathcal{L}_{soft}^{p,c} = \Bigl(\frac{1}{BL} \sum_{b,l} \bigl|\cos(\ell_p^{(b,l)}, \ell_c^{(b,l)})\bigr| - \beta_c \Bigr)^2$$

$$\mathcal{L}_{soft}^{p,t} = \Bigl(\frac{1}{BL} \sum_{b,l} \bigl|\cos(\ell_p^{(b,l)}, q_t^{(b)})\bigr| - \beta_t \Bigr)^2$$

with $\beta_c = 0.01$ and $\beta_t = 0.0001$, balancing decoupling against information retention.
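
A direct sketch of this regularizer, assuming frame-level latents of shape (batch, frames, dim) and a global timbre token broadcast over frames:

```python
import torch
import torch.nn.functional as F


def soft_orthogonality(l_p, l_other, beta):
    """Mean absolute cosine similarity between prosody latents and another stream,
    pulled toward a small target beta rather than exactly zero.
    l_p:     (B, L, D) frame-level prosody latents
    l_other: (B, L, D) content latents, or (B, D) global timbre tokens
    """
    if l_other.dim() == 2:                            # broadcast a global timbre token over frames
        l_other = l_other.unsqueeze(1).expand_as(l_p)
    cos = F.cosine_similarity(l_p, l_other, dim=-1)   # (B, L)
    return (cos.abs().mean() - beta) ** 2


# Targets from the paper: beta_c = 0.01 (prosody vs. content), beta_t = 0.0001 (prosody vs. timbre).
```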

The first-stage DAC-style decoder recombines $(q_c, q_p, q_t)$ to reconstruct the waveform, supervised by multi-scale Mel-spectrogram and waveform reconstruction objectives.
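
A hedged sketch of a multi-scale Mel reconstruction objective of the kind described; the sample rate, FFT sizes, and equal weighting are assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn.functional as F
import torchaudio


def multi_scale_mel_loss(x_hat, x, sample_rate=24000, n_ffts=(512, 1024, 2048)):
    """L1 distance between log-Mel spectrograms of the reconstruction and the
    reference at several analysis resolutions. x_hat, x: (B, T) waveforms."""
    loss = 0.0
    for n_fft in n_ffts:
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft,
            hop_length=n_fft // 4, n_mels=80).to(x.device)
        loss = loss + F.l1_loss(torch.log(mel(x_hat) + 1e-5),
                                torch.log(mel(x) + 1e-5))
    return loss / len(n_ffts)
```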

2. Fusion for Language Modeling and Token Reconstruction

In the second stage, DisCo-Speech fuses content and prosody into a single token sequence $z_{cp}$, optimized for joint prediction by the LM. The encoders are frozen, and a new decoder $D_2$ (built from Transformer blocks and a BigVGANv2 generator) takes the tokens $(z_{cp}, q_t)$ and reconstructs the waveform. Fusion is performed as $z_{cp} = Q_{cp}(\mathrm{Dequantize}(q_c) + \mathrm{Dequantize}(q_p))$, where $Q_{cp}$ is an FSQ quantizer with codebook size 65,536, and the reconstruction objective combines adversarial, feature-matching, and multi-scale spectrogram losses as in BigVGAN.
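
A hedged sketch of the fusion step. The per-dimension level factorization (8·8·8·8·4·4 = 65,536) and the choice to quantize only part of the embedding are illustrative guesses chosen so the product codebook size matches the reported value:

```python
import torch


def fuse_content_prosody(e_c, e_p, levels=(8, 8, 8, 8, 4, 4)):
    """Fuse dequantized content and prosody embeddings and re-quantize with FSQ.
    e_c, e_p: (B, L, D) continuous embeddings; only the first len(levels) dimensions
    are quantized in this sketch."""
    z = torch.tanh(e_c + e_p)                                  # bound the fused embedding
    k = len(levels)
    lv = torch.tensor(levels, dtype=z.dtype, device=z.device)  # levels per quantized dimension
    z01 = (z[..., :k] + 1) / 2                                 # map to [0, 1]
    q = torch.round(z01 * (lv - 1)) / (lv - 1) * 2 - 1         # snap to per-dimension uniform grids
    z_cp = z.clone()
    z_cp[..., :k] = z[..., :k] + (q - z[..., :k]).detach()     # straight-through gradient
    return z_cp
```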

This staged design resolves the entanglement-reconstruction trade-off by first optimizing $\mathcal{L}_{rec}^{(1)}$ for disentangled representation and then $\mathcal{L}_{rec}^{(2)}$ for fidelity, while keeping the disentangled features fixed.

3. Transformer LM Integration and Inference Pipeline

A standard autoregressive Transformer LM (initialized from Qwen2.5-1.5B) models the conditional distribution over text tokens $t_c$ and fused content–prosody tokens $z_{cp}$. The model is trained using next-token cross-entropy with input sequences of the form $[S, t_c, T, z_{cp}, E]$, where $S$, $T$, and $E$ denote start, turn, and end tokens, respectively.
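
A hedged sketch of how one training example could be packed for next-token prediction; the special-token IDs and the assumption that text and speech tokens share a single vocabulary are placeholders:

```python
import torch
import torch.nn.functional as F


def build_lm_example(text_ids, zcp_ids, s_id, t_id, e_id):
    """Pack one example as [S, t_c, T, z_cp, E] and derive next-token targets
    by shifting the sequence one position."""
    seq = torch.cat([torch.tensor([s_id]), text_ids,
                     torch.tensor([t_id]), zcp_ids,
                     torch.tensor([e_id])])
    return seq[:-1], seq[1:]                      # (inputs, targets)


def next_token_loss(logits, targets):
    # logits: (L, vocab_size); targets: (L,)
    return F.cross_entropy(logits, targets)
```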

During inference, the procedure is:

  1. Style prompt: Given a reference speech sample $x'$ and its transcript $t_c'$, extract $(z_{cp}', q_t')$.
  2. Context construction: Build the sequence $[S, t_c', T, z_{cp}', T, t_c^{sys}]$, where $t_c^{sys}$ is the new target text.
  3. Prosodic continuation: The LM autoregressively generates $z_{cp}^{sys}$, extending the prosody pattern of the prompt onto the target text.
  4. Timbre injection: The decoder $D_2$ synthesizes the waveform from $(z_{cp}^{sys}, q_t^{target})$, injecting the desired timbre (a schematic sketch of this pipeline follows the list).
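
A schematic sketch of the inference loop; the lm, decoder_d2, and codec objects are assumed to expose simplified encode / generate / synthesize interfaces for illustration only:

```python
import torch


@torch.no_grad()
def disco_speech_infer(lm, decoder_d2, codec, ref_wav, ref_text_ids, sys_text_ids,
                       target_mel, s_id, t_id, e_id, max_new_tokens=2000):
    """Illustrative zero-shot inference pipeline (interfaces are assumptions)."""
    # 1) Style prompt: fused content-prosody tokens z_cp' from the reference speech.
    zcp_ref = codec.encode_content_prosody(ref_wav)
    # 2) Context construction: [S, t_c', T, z_cp', T, t_c^sys].
    context = torch.cat([torch.tensor([s_id]), ref_text_ids,
                         torch.tensor([t_id]), zcp_ref,
                         torch.tensor([t_id]), sys_text_ids])
    # 3) Prosodic continuation: autoregressively generate z_cp^sys until the end token.
    zcp_sys = lm.generate(context, eos_id=e_id, max_new_tokens=max_new_tokens)
    # 4) Timbre injection: decode with the target speaker's timbre token q_t.
    q_t_target = codec.encode_timbre(target_mel)
    return decoder_d2.synthesize(zcp_sys, q_t_target)
```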

This design ensures clean separation of prosody continuation (via LM) and timbre injection (via the decoder), supporting zero-shot prosody transfer with arbitrary style prompts.

4. Training Objectives, Regularization, and Optimization

Training proceeds in two stages:

  • Stage 1 (Disentanglement): Joint optimization of reconstruction, phone, F0, correlation, soft orthogonality, and speaker losses: $\mathcal{L}^{(1)} = \mathcal{L}_{rec}^{(1)} + \lambda_{pho}\mathcal{L}_{pho} + \lambda_{f0}\mathcal{L}_{f0} + \lambda_{cor}\mathcal{L}_{cor} + \lambda_{sc}\mathcal{L}_{soft}^{p,c} + \lambda_{st}\mathcal{L}_{soft}^{p,t} + \lambda_{spk}\mathcal{L}_{spk}$.
  • Stage 2 (Fusion and Fidelity): Only the fusion decoder $D_2$ is trained (with the encoders frozen), using the BigVGAN-style loss.
  • LM: Pre-trained and fine-tuned on 120,000 hours of speech data. Training uses the Adam optimizer (learning rates $1\times10^{-4}$ for Stages 1 and 2, $2\times10^{-4}$ for the LM), batch size 176 on eight A800 GPUs, and 500k steps in Stage 1.

Hyperparameters (codebook sizes, GRL for prosody, orthogonality regularization) are tuned to balance disentanglement and information retention.
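For reference, a standard gradient-reversal layer of the kind applied to the prosody stream can be written as below; where it is inserted and how its scale is scheduled are not specified here and would be training choices:

```python
import torch


class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated (and scaled)
    gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.scale * grad_output, None


def grad_reverse(x, scale=1.0):
    return GradReverse.apply(x, scale)

# Usage idea: feeding grad_reverse(h_p) into a speaker classifier lets the classifier
# learn speaker identity while the prosody encoder is pushed to discard it.
```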

5. Experimental Evaluation and Results

Experiments span codec reconstruction, voice conversion, and zero-shot prosody/voice control. Key metrics and benchmarks include:

| Evaluation Task | Metrics & Results | Comparative Baselines |
|---|---|---|
| Codec reconstruction | WER↓ 3.44%, STOI↑ 0.88, PESQ-WB↑ 2.31, UTMOS↑ 4.10, SSIM↑ 0.81 | Top single-stream codecs |
| Voice conversion | UTMOS↑ 3.98, SSIM↑ 61.1%, F0 corr.↑ 0.58 (1,680 pairs) | SeedVC, Vevo |
| Prosody control (AB preference) | Timbre: 45.3–51.5% vs. 21.3–40.2% (Vevo); prosody: 48.9–50.6% vs. 20.0–36.7% (Vevo) | Vevo, IndexTTS 2 |
| Voice cloning (EN/ZH) | WER 3.08%, SSIM 58.8% (EN); CER 1.64%, SSIM 68.1% (ZH) | Spark-TTS, multistage baselines |

DisCo-Speech consistently demonstrates high speaker similarity (SSIM), intelligibility (WER/CER), and prosody preservation, while outperforming strong baselines in zero-shot prosody control and disentanglement. Subjective AB preference also favors DisCo-Speech for timbre and prosody transfer in style/emotion tasks.

6. Significance, Limitations, and Implications

By explicitly factorizing speech into content, prosody, and timbre at the codec level and fusing only content–prosody for language modeling, DisCo-Speech resolves the entanglement–reconstruction trade-off that has hindered previous continuation-based TTS architectures. This clean separation enables flexible, zero-shot prosody continuation and robust voice cloning, establishing a foundation for controllable TTS systems. A plausible implication is enhanced downstream adaptability in scenarios requiring fine-grained speech parameter control.

No claims or results suggest fundamental limitations within the described evaluation regime; however, as with all TTS systems, scaling to new languages and domains may require adaptation of encoders or quantization strategies.

7. Availability and Reproducibility

Audio samples, code, and model weights for DisCo-Speech are available at https://github.com/disco-speech/DisCo-Speech (Li et al., 15 Dec 2025). The reproducibility of results is facilitated by the release of code and pretrained weights alongside full objective definitions and architecture details.
