
Transcript-Preserving Speaker Transfer (TPST)

Updated 16 November 2025
  • TPST is a method that converts speech into speaker-invariant discrete units to preserve linguistic content while transferring speaker identity.
  • It integrates a UnitEncoder and a diffusion-based TTS backbone to enable both text-to-speech and any-to-any voice conversion without explicit transcripts.
  • Empirical results demonstrate high transcript fidelity and competitive speaker similarity, though slightly higher errors are observed for voice conversion.

Transcript-Preserving Speaker Transfer (TPST) is a technique that enables the adaptation of synthetic speech to arbitrary new speakers using minimal, untranscribed adaptation data, while strictly maintaining the original linguistic transcript. In the UnitSpeech framework, TPST is accomplished by integrating self-supervised, speaker-invariant discrete units as pseudo-transcripts into a diffusion-based text-to-speech (TTS) system. The method allows both text-to-speech and any-to-any voice conversion without explicit transcripts or extensive retraining, ensuring high content preservation and faithful transfer of the target speaker's identity across synthesis tasks (Kim et al., 2023).

1. Self-Supervised Discrete Unit Representation

TPST in UnitSpeech begins by converting untranscribed source speech into discrete unit sequences using a self-supervised pipeline. Each speech frame $x_i$ is first embedded via a pre-trained HuBERT model, yielding hidden states $h_i = \mathrm{HuBERT}(x_i)$. These are clustered with $K$-means to produce cluster centroids $\{\mu_k\}$, and each frame is assigned a unit label:

$$u_i = \arg\min_{k} \|h_i - \mu_k\|$$

The resulting sequence $u = (u_1, \ldots, u_{L_u})$ represents the entire utterance as discrete units, abstracting away speaker characteristics and retaining only linguistic and phonetic content. The unit sequence is then aligned with the temporal resolution of the mel-spectrogram frames by converting runs of repeated indices into explicit durations $d_u$ and upsampling accordingly. Both $u$ and $d_u$ are fed to a UnitEncoder, which shares its architecture with the TTS text encoder and outputs a continuous embedding sequence $c_u = \mathrm{UnitEncoder}(u, d_u) \in \mathbb{R}^{T \times D}$. Because the only difference lies in the input modality, $c_u$ and the phoneme-based embedding $c_y$ are trained to occupy a shared "content space".
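The Python sketch below illustrates this unit-extraction step under stated assumptions: HuBERT hidden states are taken from the Hugging Face `facebook/hubert-base-ls960` checkpoint (the actual UnitSpeech pipeline may use a different checkpoint or layer), and the $K$-means centroids are assumed to be already trained (e.g., with $K = 200$, mirroring the ablation in Section 5).

```python
import numpy as np
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# Assumed checkpoint; UnitSpeech may use a different HuBERT variant or layer.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def speech_to_units(waveform: np.ndarray, centroids: np.ndarray, sr: int = 16000):
    """Map a mono waveform to a deduplicated unit sequence u and durations d_u.

    centroids: (K, D) pre-trained K-means centroids over HuBERT hidden states.
    """
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        h = hubert(inputs.input_values).last_hidden_state[0].numpy()  # (T, D)
    # u_i = argmin_k || h_i - mu_k ||
    dists = np.linalg.norm(h[:, None, :] - centroids[None, :, :], axis=-1)
    frame_units = dists.argmin(axis=1)                                # (T,)
    # Collapse runs of repeated indices into (unit, duration) pairs.
    units, durations = [], []
    for u in frame_units.tolist():
        if units and units[-1] == u:
            durations[-1] += 1
        else:
            units.append(u)
            durations.append(1)
    return units, durations
```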

2. Diffusion-Based Synthesis Backbone

Speech synthesis and transfer are performed with a diffusion-based decoder, specifically a multi-speaker Grad-TTS backbone. Given a target mel-spectrogram $X_0 \in \mathbb{R}^{T \times M}$ and a speaker embedding $e_S \in \mathbb{R}^{E}$, the forward process follows a continuous-time stochastic differential equation (SDE):

$$dX_t = -\tfrac{1}{2}\,\beta_t\,X_t\,dt + \sqrt{\beta_t}\,dW_t, \quad t \in [0, 1]$$

The reverse process is realized via discretized Euler–Maruyama sampling. The decoder is trained with a score-matching loss:

$$L_{\mathrm{grad}} = \mathbb{E}_{t, X_0, \varepsilon_t}\Bigl[\bigl\|\sqrt{\lambda_t}\,s_\theta(X_t, t \mid c, e_S) + \varepsilon_t\bigr\|_2^2\Bigr]$$

where $c$ may be $c_y$ (text) or $c_u$ (unit), and an additional encoder alignment loss $L_{\mathrm{enc}} = \|c - X_0\|_2^2$ encourages the content embedding to align with the spectrogram and be speaker-independent. During inference, the decoder generates speech conditioned either on phonemes (TTS) or units (VC).
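As a concrete illustration of the reverse process, the sketch below implements discretized Euler–Maruyama sampling for the VP-SDE above. It is a minimal sketch, not the UnitSpeech implementation: `score_fn` stands in for $s_\theta(X_t, t \mid c, e_S)$ with the conditioning already bound, the linear $\beta$ schedule and step count are assumptions, and a standard-normal terminal prior is used.

```python
import torch

def beta_linear(t, beta_min=0.05, beta_max=20.0):
    # Assumed Grad-TTS-style linear noise schedule on t in [0, 1].
    return beta_min + t * (beta_max - beta_min)

@torch.no_grad()
def reverse_diffusion(score_fn, x_T, n_steps=50, beta=beta_linear):
    """Euler-Maruyama sampler for the reverse of dX_t = -1/2 beta_t X_t dt + sqrt(beta_t) dW_t.

    score_fn(x, t) approximates the conditional score s_theta(x, t | c, e_S).
    """
    x = x_T                          # start from the terminal prior, X_1 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        b = beta(t).view(-1, *([1] * (x.dim() - 1)))
        # Reverse-time drift: 1/2 beta_t x + beta_t * score
        x = x + (0.5 * b * x + b * score_fn(x, t)) * dt
        if i > 1:                    # no noise injection on the final step
            x = x + torch.sqrt(b * dt) * torch.randn_like(x)
    return x
```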

3. Training Protocol and Fine-Tuning Regime

UnitSpeech's training consists of supervised pre-training followed by two unsupervised stages: unit-encoder training and per-speaker adaptation.

  • Pre-training: On transcribed data, the text encoder and diffusion decoder are jointly optimized to minimize

$$L_{\mathrm{pre}} = L_{\mathrm{grad}}(c_y, e_S) + \alpha\,L_{\mathrm{enc}}(c_y, X_0)$$

learning a mapping between phoneme content and acoustic realization.

  • Unit encoder training: The text encoder and diffusion decoder are frozen. The UnitEncoder is integrated and trained alone, using only untranscribed data to minimize

$$L_{\mathrm{unit}} = L_{\mathrm{grad}}(c_u, e_S) + \alpha\,L_{\mathrm{enc}}(c_u, X_0)$$

ensuring that discrete units map to the shared content space.

  • Speaker adaptation: For TPST, given a reference pair $\langle u', d_{u'} \rangle$ and its mel-spectrogram $X_0'$, only the decoder $s_\theta$ is fine-tuned (typically $M = 500$ steps, learning rate $2 \times 10^{-5}$) to minimize

$$L_{\mathrm{adapt}} = \mathbb{E}_{t,\varepsilon}\bigl[\|\sqrt{\lambda_t}\,s_\theta(X_t, t \mid c_{u'}, e_S') + \varepsilon\|_2^2\bigr]$$

The UnitEncoder remains fixed, and no transcript is needed; a minimal sketch of this adaptation loop follows the list.
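The sketch below assumes hypothetical interfaces: `perturb(x0, t)` returns the diffused sample $X_t$, the noise $\varepsilon$, and the weight $\lambda_t$ for the forward SDE of Section 2, and `decoder(x_t, t, c, spk)` returns the score estimate. Names, signatures, and the optimizer choice are illustrative, not the UnitSpeech API.

```python
import torch

def adapt_to_speaker(decoder, unit_encoder, perturb, ref_mel, ref_units, ref_durs,
                     spk_emb, n_steps=500, lr=2e-5):
    """Fine-tune only the diffusion decoder s_theta on one untranscribed reference."""
    unit_encoder.eval()
    for p in unit_encoder.parameters():      # UnitEncoder stays frozen
        p.requires_grad_(False)

    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    with torch.no_grad():
        c_u = unit_encoder(ref_units, ref_durs)                 # content embedding c_{u'}

    for _ in range(n_steps):                                    # M = 500 in the paper's setup
        t = torch.rand(ref_mel.shape[0], device=ref_mel.device) # t ~ U(0, 1)
        x_t, eps, lam = perturb(ref_mel, t)                     # forward-SDE perturbation
        score = decoder(x_t, t, c_u, spk_emb)
        loss = ((lam.sqrt() * score + eps) ** 2).mean()         # L_adapt
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder
```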

Further, classifier-free guidance is applied during generation to sharpen pronunciation by interpolating between the conditional and unconditional model outputs:

$$\hat{s}(X_t, t \mid c, e_S) = s_\theta(X_t, t \mid c, e_S) + \gamma\,\bigl[s_\theta(X_t, t \mid c, e_S) - s_\theta(X_t, t \mid c_{\varnothing}, e_S)\bigr]$$

with typical $\gamma$ values of $1.0$ (TTS) and $1.5$ (VC).
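A minimal sketch of this guidance rule, with `score_fn` standing in for $s_\theta$ and `c_null` for the unconditional embedding $c_{\varnothing}$ (names are illustrative):

```python
def guided_score(score_fn, x_t, t, c, c_null, spk_emb, gamma=1.0):
    """Classifier-free guidance: interpolate conditional and unconditional scores.

    gamma = 1.0 (TTS) or 1.5 (VC) per the text above.
    """
    s_cond = score_fn(x_t, t, c, spk_emb)          # conditional score
    s_uncond = score_fn(x_t, t, c_null, spk_emb)   # unconditional score
    return s_cond + gamma * (s_cond - s_uncond)    # hat{s} from the guidance formula
```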

4. Transcript Preservation via Discrete Units

TPST’s foundation is the conversion of speech to discrete, speaker-invariant units $u$ that function as strictly linguistic pseudo-transcripts. As these units derive from self-supervised HuBERT features, essential phonetic distinctions are preserved (e.g., vowels, consonants, word boundaries). The diffusion decoder is always conditioned on the unit sequence during adaptation and inference, ensuring the generated output reproduces the lexical content of the original utterance. Content fidelity is objectively validated via Character Error Rate (CER), computed with a CTC-Conformer ASR model. UnitSpeech achieves a CER of $\sim 1.75\%$ on personalized TTS and $\sim 3.55\%$ for any-to-any voice conversion, indicating high transcript fidelity compared to baselines.
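For reference, CER reduces to a character-level Levenshtein distance between the ASR hypothesis and the ground-truth transcript, normalized by the reference length. The sketch below shows only the metric; the ASR front end (a CTC-Conformer in the paper) is assumed to supply the hypothesis.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + insertions + deletions) / len(reference)."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. character_error_rate("hello world", "helo world") -> ~0.09
```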

5. Empirical Performance and Ablation

Comprehensive experimental results demonstrate the efficacy of TPST within UnitSpeech:

Task             | MOS (5-scale) | CER (%) | Speaker MOS | Cosine Sim. (SECS)
Personalized TTS | 4.13 ± 0.10   | 1.75    | 3.90 ± 0.13 | 0.935
Any-to-Any VC    | 4.26 ± 0.09   | 3.55    | 3.83 ± 0.13 | 0.923
  • For Personalized TTS (LibriTTS unseen speakers), performance is comparable to Guided-TTS 2 (MOS 4.16, CER 0.84%) and outperforms YourTTS.
  • For VC, UnitSpeech surpasses DiffVC, YourTTS, and BNE-PPG-VC in naturalness and speaker similarity (reported above as speaker-embedding cosine similarity, SECS; a minimal sketch of this metric follows).
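The SECS numbers above are cosine similarities between speaker embeddings of the generated and reference utterances. A minimal sketch, assuming a pre-trained speaker-verification encoder produces the embeddings:

```python
import numpy as np

def secs(emb_generated: np.ndarray, emb_reference: np.ndarray) -> float:
    """Speaker Embedding Cosine Similarity between two speaker-embedding vectors.

    The speaker encoder that produces the embeddings is assumed (e.g., a
    pre-trained speaker-verification model); values near 1.0 indicate a close match.
    """
    a = emb_generated / np.linalg.norm(emb_generated)
    b = emb_reference / np.linalg.norm(emb_reference)
    return float(np.dot(a, b))
```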

Ablation results establish key factors:

  • Optimum number of units for VC: $K = 200$; TTS is robust to the unit count.
  • Fine-tuning duration: speaker similarity saturates at $\sim 500$ steps; overfitting degrades CER beyond $\sim 2000$ steps.
  • Reference utterance duration: even 5 seconds of untranscribed speech suffice for strong adaptation (CER $\sim 1.96\%$, SECS $\sim 0.92$).
  • Raising the guidance scale $\gamma$ lowers CER but slightly impacts speaker similarity; the optimum is $\gamma = 1.0$ (TTS) and $\gamma = 1.5$ (VC).

6. Limitations and Extensions

The primary limitation of TPST in UnitSpeech lies in the slightly higher CER for VC relative to TTS (3–4% vs. $\sim$1–2%), attributable to residual mismatches between the unit clustering and fine-grained phonetic structure. Adaptation to exotic or highly accented voices may require increasing $K$ (the number of units) or supplying more adaptation data. Plausible extensions include joint fine-tuning of a prosody encoder, the addition of tokens for emotion or speaking style, or extension to multilingual TPST via language-universal units.

7. Practical Implications and Significance

TPST in UnitSpeech enables the adaptation of a single diffusion-TTS model to new speakers with only one untranscribed reference utterance, without requiring transcripts or retraining for every task. The resulting system supports both TTS and VC in an open-set, “any-to-any” fashion. This represents an efficient and flexible approach to personalized speech synthesis, achieving low transcript error rates and faithful speaker identity transfer for applications across languages, voice conversion, and voice customization with minimal supervision.
