DisCo-Speech: Zero-Shot Disentangled TTS

Updated 17 December 2025

The paper introduces DisCo-Speech, a zero-shot TTS system that disentangles linguistic content, prosody, and speaker timbre at the codec level.
It employs parallel encoders with fine-grained quantization and soft orthogonality losses to achieve clean separation and high-fidelity waveform reconstruction.
The model integrates an autoregressive Transformer LM with BigVGAN-style synthesis for flexible prosody continuation and robust voice cloning.

DisCo-Speech is a zero-shot, controllable text-to-speech (TTS) architecture that addresses the challenge of disentangling linguistic content, prosody, and speaker timbre at the codec level. The approach introduces a disentangled speech codec (DisCodec), paired with an autoregressive Transformer LLM (LM), achieving independent prosody control and voice cloning in a unified framework. This separation of prosody and timbre is accomplished through parallel encoding, quantization, and carefully constructed loss objectives, establishing state-of-the-art flexibility in zero-shot speech synthesis and voice conversion (Li et al., 15 Dec 2025).

1. Disentangled Speech Codec Architecture

DisCodec factorizes an input waveform $x$ into three distinct streams: content ( $c$ ), prosody ( $p$ ), and timbre ( $t$ ), utilizing parallel encoders and finite scalar quantization (FSQ).

Content Encoder ( $E_c$ ): Employs convolutional blocks inspired by DAC, downsampling $x$ into frame-level latent $h_c$ , quantized to discrete content tokens $q_c = Q_c(h_c)$ . Supervision leverages a phone recognizer (fine-tuned Wav2Vec) via cross-entropy loss, ensuring $q_c$ encodes purely linguistic information.
Prosody Encoder ( $E_p$ ): Utilizes dilated causal convolutions (Wavenet architecture) to produce latent $h_p$ , quantized by two-stage residual FSQ into $q_{p1}$ , $q_{p2}$ , summed as $q_p = q_{p1} + q_{p2}$ . Supervision comprises a frame-level F0 (pitch) regression loss, a correlation loss controlling overlap between $q_{p1}$ and $q_{p2}$ (with target similarity $\alpha=0.2$ ), and a gradient-reversal layer (GRL) to minimize timbre leakage.
Timbre Encoder ( $E_t$ ): Based on ECAPA-TDNN applied to Mel-spectrograms, with global aggregation via cross-attention using learnable queries. The resulting global timbre vector $g_t$ is quantized to 48-dimensional token $q_t$ and optimized with a speaker-classification loss to encode only speaker identity.

Tri-factor disentanglement losses use soft orthogonality regularization: $\mathcal{L}_{soft}^{p,c} = \Bigl(\frac{1}{BL} \sum_{b,l} \bigl|\cos(\ell_p^{(b,l)}, \ell_c^{(b,l)})\bigr| - \beta_c \Bigr)^2$

$\mathcal{L}_{soft}^{p,t} = \Bigl(\frac{1}{BL} \sum_{b,l} \bigl|\cos(\ell_p^{(b,l)}, q_t^{(b)})\bigr| - \beta_t \Bigr)^2$

with $\beta_c=0.01$ , $\beta_t=0.0001$ , balancing decoupling with information retention.

The first-stage DAC-style decoder recombines $(q_c, q_p, q_t)$ to reconstruct the waveform, supervised by multi-scale Mel-spectrogram and waveform reconstruction objectives.

2. Fusion for Language Modeling and Token Reconstruction

In the second stage, DisCo-Speech fuses content and prosody into a single token sequence $z_{cp}$ , optimized for joint prediction by the LLM. Encoders are frozen, and a new decoder $D_2$ (built with Transformer blocks and a BigVGANv2 generator) takes tokens $(z_{cp}, q_t)$ to reconstruct the waveform. Fusion is performed as: $z_{cp} = Q_{cp}(\mathrm{Dequantize}(q_c) + \mathrm{Dequantize}(q_p))$ where $Q_{cp}$ is an FSQ quantizer (codebook size 65,536), and the reconstruction objective is a combination of adversarial, feature-matching, and multi-scale spectrogram losses as in BigVGAN.

This staged design resolves the entanglement-reconstruction trade-off by optimizing $\mathcal{L}_{rec}^{(1)}$ for disentangled representation, then $\mathcal{L}_{rec}^{(2)}$ for fidelity, keeping disentangled features fixed.

3. Transformer LM Integration and Inference Pipeline

A standard autoregressive Transformer LM (initialized from Qwen2.5-1.5B) models the conditional distribution over text tokens $t_c$ and fused content–prosody tokens $z_{cp}$ . The model is trained using next-token cross-entropy with input sequences of the form $[S, t_c, T, z_{cp}, E]$ , where $S$ , $T$ , $E$ denote start, turn, and end tokens respectively.

During inference, the procedure is:

Style prompt: Given a reference speech sample $x'$ and its transcript $t_c'$ , extract $(z_{cp}', q_t')$ .
Context construction: Build sequence $[S, t_c', T, z_{cp}', T, t_c^{sys}]$ where $t_c^{sys}$ is the new target text.
Prosodic continuation: The LM autoregressively generates $z_{cp}^{sys}$ , extending the prosody pattern of the prompt onto the target text.
Timbre injection: The decoder $D_2$ synthesizes the waveform from $(z_{cp}^{sys}, q_t^{target})$ , injecting the desired timbre.

This design ensures clean separation of prosody continuation (via LM) and timbre injection (via the decoder), supporting zero-shot prosody transfer with arbitrary style prompts.

4. Training Objectives, Regularization, and Optimization

Training proceeds in two stages:

Stage 1 (Disentanglement): Joint optimization of reconstruction, phone, F0, correlation, soft orthogonality, and speaker losses: $\mathcal{L}^{(1)} = \mathcal{L}_{rec}^{(1)} + \lambda_{pho}\mathcal{L}_{pho} + \lambda_{f0}\mathcal{L}_{f0} + \lambda_{cor}\mathcal{L}_{cor} + \lambda_{sc}\mathcal{L}_{soft}^{p,c} + \lambda_{st}\mathcal{L}_{soft}^{p,t} + \lambda_{spk}\mathcal{L}_{spk}.$
Stage 2 (Fusion and Fidelity): Only the fusion decoder $D_2$ is trained (with encoders frozen) on the BigVGAN-style loss.
LLM: Pre-training and fine-tuning on 120,000 hours of speech data, using Adam optimizer (with learning rates: stages 1/2: $1\times10^{-4}$ , LM: $2\times10^{-4}$ , batch size 176 on eight A800 GPUs, 500k steps in stage 1).

Hyperparameters (codebook sizes, GRL for prosody, orthogonality regularization) are tuned to balance disentanglement and information retention.

5. Experimental Evaluation and Results

Experiments span codec reconstruction, voice conversion, and zero-shot prosody/voice control. Key metrics and benchmarks include:

Evaluation Task	Metrics & Results	Comparative Baselines
Codec Reconstruction	WER↓=3.44%, STOI↑=0.88, PESQ WB↑=2.31, UTMOS↑=4.10, SSIM↑=0.81	Top single-stream codecs
Voice Conversion	UTMOS↑=3.98, SSIM↑=61.1%, F0 corr↑=0.58 (1,680 pairs)	SeedVC, Vevo
Prosody Control (AB)	Timbre: 45.3–51.5% vs. Vevo (21.3–40.2%); Prosody: 48.9–50.6% vs. Vevo (20.0–36.7%)	Vevo, IndexTTS 2
Voice Cloning (EN/ZH)	WER=3.08%, SSIM=58.8% (EN); CER=1.64%, SSIM=68.1% (ZH)	Spark-TTS, multistage

DisCo-Speech consistently demonstrates high speaker similarity (SSIM), intelligibility (WER/CER), and prosody preservation, while outperforming strong baselines in zero-shot prosody control and disentanglement. Subjective AB preference also favors DisCo-Speech for timbre and prosody transfer in style/emotion tasks.

6. Significance, Limitations, and Implications

By explicitly factorizing speech into content, prosody, and timbre at the codec level and fusing only content–prosody for language modeling, DisCo-Speech resolves the entanglement–reconstruction trade-off that has hindered previous continuation-based TTS architectures. This clean separation enables flexible, zero-shot prosody continuation and robust voice cloning, establishing a foundation for controllable TTS systems. A plausible implication is enhanced downstream adaptability in scenarios requiring fine-grained speech parameter control.

No claims or results suggest fundamental limitations within the described evaluation regime; however, as with all TTS systems, scaling to new languages and domains may require adaptation of encoders or quantization strategies.

7. Availability and Reproducibility

Audio samples, code, and model weights for DisCo-Speech are available at https://github.com/disco-speech/DisCo-Speech (Li et al., 15 Dec 2025). The reproducibility of results is facilitated by the release of code and pretrained weights alongside full objective definitions and architecture details.

Markdown Upgrade to Chat

References (1)

DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to DisCo-Speech.