Transcript-Preserving Speaker Transfer (TPST)
- TPST is a method that converts speech into speaker-invariant discrete units to preserve linguistic content while transferring speaker identity.
- It integrates a UnitEncoder and a diffusion-based TTS backbone to enable both text-to-speech and any-to-any voice conversion without explicit transcripts.
- Empirical results demonstrate high transcript fidelity and competitive speaker similarity, though slightly higher errors are observed for voice conversion.
Transcript-Preserving Speaker Transfer (TPST) is a technique that enables the adaptation of synthetic speech to arbitrary new speakers using minimal, untranscribed adaptation data, while strictly maintaining the original linguistic transcript. In the UnitSpeech framework, TPST is accomplished by integrating self-supervised, speaker-invariant discrete units as pseudo-transcripts into a diffusion-based text-to-speech (TTS) system. The method supports both text-to-speech and any-to-any voice conversion without explicit transcripts or extensive retraining, ensuring high content preservation and faithful transfer of the target speaker's identity across synthesis tasks (Kim et al., 2023).
1. Self-Supervised Discrete Unit Representation
TPST in UnitSpeech begins by converting untranscribed source speech into discrete unit sequences using a self-supervised pipeline. Each speech frame is first embedded via a pre-trained HuBERT model, yielding hidden states $h_{1:T}$. These are clustered with $K$-means to produce cluster centroids $\{c_1, \dots, c_K\}$, and each frame is assigned a unit label:

$$u_t = \arg\min_{k} \lVert h_t - c_k \rVert_2$$

The resulting sequence $u = (u_1, \dots, u_T)$ represents the entire utterance as discrete units, abstracting away speaker characteristics and retaining only linguistic and phonetic content. This unit sequence is further processed to match the temporality of the mel-spectrogram frames: repeated indices are collapsed into a deduplicated sequence $u'$ with explicit durations $d$, which drive upsampling back to frame rate. Both $u'$ and $d$ are fed to a UnitEncoder, which shares its architecture with the TTS text encoder and outputs a continuous embedding sequence $c_u$. Because the only difference lies in the input modality, the unit embedding $c_u$ and the phoneme-based embedding $c_y$ are trained to occupy a shared "content space".
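The extraction pipeline can be sketched in a few lines of PyTorch. The snippet below is illustrative rather than UnitSpeech's released code: the torchaudio HuBERT checkpoint, the choice of feature layer, and the pre-computed codebook file `kmeans_centroids.npy` are all assumptions standing in for whatever the authors actually used.

```python
# Sketch: speech -> speaker-invariant discrete units (HuBERT + K-means).
import numpy as np
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE              # pre-trained HuBERT (16 kHz)
hubert = bundle.get_model().eval()
# Hypothetical pre-computed K-means codebook of shape (K, 768).
centroids = torch.from_numpy(np.load("kmeans_centroids.npy")).float()

def speech_to_units(wav_path: str, layer: int = 6):
    wav, sr = torchaudio.load(wav_path)                # (1, time)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.no_grad():
        feats, _ = hubert.extract_features(wav, num_layers=layer)
    h = feats[-1].squeeze(0)                           # (T, 768) frame features h_t
    # u_t = argmin_k ||h_t - c_k||_2  (nearest-centroid assignment)
    units = torch.cdist(h, centroids).argmin(dim=-1)   # (T,)
    # Collapse repeats into a deduplicated sequence u' with durations d.
    dedup, durations = torch.unique_consecutive(units, return_counts=True)
    return dedup, durations
```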
2. Diffusion-Based Synthesis Backbone
Speech synthesis and transfer are performed with a diffusion-based decoder, specifically a multi-speaker Grad-TTS backbone. Given a target mel-spectrogram $X_0$ and a speaker embedding $e_S$, the forward process follows a continuous-time stochastic differential equation (SDE):

$$dX_t = \tfrac{1}{2}\,\beta_t\,(\mu - X_t)\,dt + \sqrt{\beta_t}\,dW_t, \qquad t \in [0, 1],$$

where $\mu$ is the frame-aligned prior mean predicted from the content encoder and $\beta_t$ is the noise schedule. The reverse process is realized via discretized Euler–Maruyama sampling. The decoder is trained with a score-matching loss:

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t,\,X_0,\,\epsilon}\Big[\big\lVert \sqrt{\lambda_t}\, s_\theta(X_t, t \mid c, e_S) + \epsilon \big\rVert_2^2\Big],$$

where $X_t$ is the noised spectrogram with noise variance $\lambda_t$, and the content condition $c$ may be $c_y$ (text) or $c_u$ (unit). An additional encoder alignment loss $\mathcal{L}_{\mathrm{enc}} = \lVert \mu - X_0 \rVert_2^2$ encourages the content embedding to align with the spectrogram and be speaker-independent. During inference, the decoder generates speech conditioned either on phonemes (TTS) or units (VC).
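A minimal sketch of the forward-noising step and the score-matching loss follows, assuming a linear schedule $\beta_t = \beta_0 + (\beta_1 - \beta_0)t$ (so its integral has a closed form) and an illustrative `score_model(x_t, t, cond, spk)` interface; the schedule endpoints are placeholders, not values from the paper.

```python
# Sketch: Grad-TTS-style forward noising and score-matching loss.
import torch

BETA0, BETA1 = 0.05, 20.0  # illustrative linear-schedule endpoints

def noise_stats(t: torch.Tensor):
    # Closed-form integral of the linear schedule: int_0^t beta_s ds
    integral = BETA0 * t + 0.5 * (BETA1 - BETA0) * t**2
    alpha = torch.exp(-0.5 * integral)   # decay of the initial condition X_0
    lam = 1.0 - torch.exp(-integral)     # noise variance lambda_t
    return alpha, lam

def diffusion_loss(score_model, x0, mu, cond, spk):
    """x0: clean mel (B, F, T); mu: prior mean; cond: c_y or c_u; spk: e_S."""
    t = torch.rand(x0.size(0), device=x0.device).clamp(1e-5, 1.0)
    alpha, lam = noise_stats(t)
    alpha, lam = alpha.view(-1, 1, 1), lam.view(-1, 1, 1)
    eps = torch.randn_like(x0)
    # The forward SDE drifts the mean from x0 toward the prior mean mu.
    x_t = alpha * x0 + (1.0 - alpha) * mu + lam.sqrt() * eps
    s = score_model(x_t, t, cond, spk)   # estimate of grad log p(x_t)
    # || sqrt(lambda_t) * s_theta + eps ||^2
    return ((lam.sqrt() * s + eps) ** 2).mean()
```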
3. Training Protocol and Fine-Tuning Regime
UnitSpeech's training consists of three stages: supervised pre-training, unit-encoder training, and unsupervised speaker adaptation.
- Pre-training: On transcribed data, the text encoder and diffusion decoder are jointly optimized to minimize $\mathcal{L}_{\mathrm{diff}} + \mathcal{L}_{\mathrm{enc}}$ with the text condition $c_y$, learning a mapping between phoneme content and acoustic realization.
- Unit encoder training: The text encoder and diffusion decoder are frozen. The UnitEncoder is integrated and trained alone, using only untranscribed data to minimize $\mathcal{L}_{\mathrm{diff}} + \mathcal{L}_{\mathrm{enc}}$ with the unit condition $c_u$, ensuring that discrete units map to the shared content space.
- Speaker adaptation: For TPST, given a reference utterance's unit sequence $u$ and its mel-spectrogram $X_0$, only the decoder is fine-tuned, for a small number of steps at a low learning rate, to minimize $\mathcal{L}_{\mathrm{diff}}$ conditioned on $c_u$ and the speaker embedding $e_S$. The UnitEncoder remains fixed, and no transcript is needed (a minimal fine-tuning loop is sketched after this list).
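Continuing the earlier sketch, the adaptation stage reduces to a short fine-tuning loop over a single reference utterance. The `num_steps` and `lr` defaults below are placeholders, not values confirmed by the source, and `diffusion_loss` is the function from the previous sketch.

```python
# Sketch: transcript-free speaker adaptation. Only the diffusion decoder is
# updated; the UnitEncoder (pseudo-transcript pathway) stays frozen.
import torch

def adapt_to_speaker(decoder, unit_encoder, ref_mel, ref_units, ref_durs,
                     spk_emb, num_steps=500, lr=2e-5):  # placeholder hyperparams
    unit_encoder.eval()
    for p in unit_encoder.parameters():
        p.requires_grad_(False)                  # keep the content space fixed
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(num_steps):
        with torch.no_grad():
            # Content condition c_u and prior mean mu from the frozen encoder.
            c_u, mu = unit_encoder(ref_units, ref_durs)
        loss = diffusion_loss(decoder, ref_mel, mu, c_u, spk_emb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder
```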
Further, classifier-free guidance is applied during generation to sharpen pronunciation by interpolating between the conditional and unconditional model outputs:

$$\hat{s}_\theta(X_t, t \mid c, e_S) = s_\theta(X_t, t \mid \varnothing, e_S) + \gamma\,\big(s_\theta(X_t, t \mid c, e_S) - s_\theta(X_t, t \mid \varnothing, e_S)\big),$$

with typical values of $\gamma = 1.0$ (TTS) and $\gamma = 1.5$ (VC).
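The guidance rule and one reverse Euler–Maruyama step might look as follows. This reuses the illustrative linear schedule from the earlier sketch; the `null_cond` embedding for the unconditional branch is an assumed interface.

```python
# Sketch: one reverse Euler-Maruyama step with classifier-free guidance.
import torch

BETA0, BETA1 = 0.05, 20.0  # same illustrative schedule as the earlier sketch

def guided_score(score_model, x_t, t, cond, null_cond, spk, gamma):
    s_c = score_model(x_t, t, cond, spk)         # conditional score
    s_u = score_model(x_t, t, null_cond, spk)    # unconditional score
    return s_u + gamma * (s_c - s_u)             # gamma = 1.0 (TTS), 1.5 (VC)

def reverse_step(score_model, x_t, t, dt, mu, cond, null_cond, spk, gamma):
    beta = BETA0 + (BETA1 - BETA0) * t
    s = guided_score(score_model, x_t, t, cond, null_cond, spk, gamma)
    # Reverse-time SDE drift: forward drift minus beta * score.
    drift = 0.5 * beta * (mu - x_t) - beta * s
    # Integrate backward from t to t - dt, injecting fresh noise.
    return x_t - drift * dt + (beta * dt) ** 0.5 * torch.randn_like(x_t)
```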
4. Transcript Preservation via Discrete Units
TPST’s foundation is the conversion of speech to discrete, speaker-invariant units that function as strictly linguistic pseudo-transcripts. As these units derive from self-supervised HuBERT features, essential phonetic distinctions are preserved (e.g., vowels, consonants, word boundaries). The diffusion decoder is always conditioned on the unit sequence during adaptation and inference, ensuring the generated output reproduces the lexical content of the original utterance. Content fidelity is objectively validated via Character Error Rate (CER), computed using a CTC-Conformer ASR model. UnitSpeech achieves a CER of 1.75% on personalized TTS and 3.55% for any-to-any voice conversion, indicating high transcript fidelity compared to baselines.
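As an illustration of this evaluation protocol, CER can be computed with the `jiwer` package; the `asr_transcribe` callable below is a stand-in for any ASR system (the source used a CTC-Conformer model), not a specific API.

```python
# Sketch: objective content-preservation check via Character Error Rate.
import jiwer

def content_fidelity(reference_texts, synthesized_wavs, asr_transcribe):
    """Transcribe synthesized audio and score it against the intended text."""
    hyps = [asr_transcribe(wav) for wav in synthesized_wavs]
    return jiwer.cer(reference_texts, hyps)  # lower is better

# e.g. a return value of 0.0175 corresponds to the reported 1.75% TTS CER.
```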
5. Empirical Performance and Ablation
Comprehensive experimental results demonstrate the efficacy of TPST within UnitSpeech:
| Task | Naturalness MOS (5-point) | CER (%) | Speaker Similarity MOS | SECS (cosine sim.) |
|---|---|---|---|---|
| Personalized TTS | 4.13 ± 0.10 | 1.75 | 3.90 ± 0.13 | 0.935 |
| Any-to-Any VC | 4.26 ± 0.09 | 3.55 | 3.83 ± 0.13 | 0.923 |
- For Personalized TTS (LibriTTS unseen speakers), performance is comparable to Guided-TTS 2 (MOS 4.16, CER 0.84%) and outperforms YourTTS.
- For VC, UnitSpeech surpasses DiffVC, YourTTS, and BNE-PPG-VC in naturalness and speaker similarity.
Ablation results establish key factors:
- Number of units: VC quality is sensitive to the size of the $K$-means codebook, whereas TTS is robust to the unit count.
- Fine-tuning duration: speaker similarity saturates after a modest number of decoder fine-tuning steps; fine-tuning well beyond that point overfits and degrades CER.
- Reference utterance duration: even 5 seconds of untranscribed speech suffice for strong adaptation.
- Raising the guidance scale $\gamma$ lowers CER but slightly reduces speaker similarity; the reported optima are $\gamma = 1.0$ (TTS) and $\gamma = 1.5$ (VC).
6. Limitations and Extensions
The primary limitation of TPST in UnitSpeech lies in the slightly higher CER for VC relative to TTS (3.55% vs. 1.75%), attributable to residual mismatches between the unit clustering and fine-grained phonetic structure. Adaptation to heavily accented or otherwise atypical voices may require increasing $K$ (the number of units) or supplying more adaptation data. Plausible extensions include joint fine-tuning of a prosody encoder, the addition of tokens for emotion or speaking style, or extension to multilingual TPST via language-universal units.
7. Practical Implications and Significance
TPST in UnitSpeech enables the adaptation of a single diffusion-TTS model to new speakers with only one untranscribed reference utterance, without requiring transcripts or retraining for every task. The resulting system supports both TTS and VC in an open-set, “any-to-any” fashion. This represents an efficient and flexible approach to personalized speech synthesis, achieving low transcript error rates and seamless speaker identity transfer for applications across languages, voice conversion, and voice customization with minimal supervision.