
Transcript-Preserving Speaker Transfer

Updated 9 November 2025
  • TPST is defined by its strict invariance constraint, ensuring the original transcript remains unchanged while only speaker labels are altered.
  • It employs diverse methodologies including LLM-based diarization correction, neural TTS adaptation, and concatenative synthesis for robust speaker transfer.
  • Empirical results demonstrate significant improvements in metrics like cpWER, MOS, and EER across various applications such as speech synthesis and speaker verification.

Transcript-Preserving Speaker Transfer (TPST) refers to a family of algorithms and training paradigms for systematically transferring the speaker identity of an audio segment or transcript while strictly preserving the sequence of discrete content units—which may be words, phones, or self-supervised tokens. This approach is motivated by diverse applications including speaker-adaptive speech synthesis, speaker verification, data augmentation, voice conversion, and post-processed automatic speech recognition (ASR) diarization. Across tasks, TPST enforces an invariance constraint: the output transcript or speech maintains identical content to the input, and only the attribute of speaker identity is modified or reassigned. Methodological variants are instantiated in speaker-adaptive speech synthesis via neural models that operate on either text or untranscribed data with pseudo-transcripts, in concatenative data augmentation, and in LLM-based label relabeling for diarization correction.

1. Formal Problem Statement

In its general form, TPST operates on an input consisting of content units and associated (potentially incorrect or to-be-altered) speaker labels. The goal is to produce an output in which the content units are preserved verbatim and only the speaker identity mapping is changed, maximally aligning with a target speaker, set of speaker labels, or distribution over speakers.

A precise formulation in the context of speaker diarization correction (Efstathiadis et al., 7 Jun 2024) is as follows:

  • Input: An ASR transcript $T_{\text{ASR}} = [(w_1, s_1), \ldots, (w_N, s_N)]$, with each word $w_i$ assigned a label $s_i \in \{\text{spk}_A, \text{spk}_B\}$.
  • Reference: $T_{\text{ref}}$ with correct speaker labels (segmentation may differ).
  • Output: $T_{\text{corr}} = [(w_1, s_1'), \ldots, (w_N, s_N')]$, where the labels $\{s_i'\}$ are reassigned to agree as closely as possible with $T_{\text{ref}}$ while preserving $\{w_i\}$.

A critical constraint is that the output sequence $W = (w_1, \ldots, w_N)$ (or the equivalent phonetic/unit sequence) remains strictly unaltered.
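
To make the constraint concrete, the following minimal sketch (hypothetical types and function names, not drawn from any of the cited papers) validates that a candidate output satisfies transcript invariance:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabeledWord:
    word: str      # content unit (word, phone, or self-supervised token)
    speaker: str   # speaker label, e.g. "spk_A" or "spk_B"

def check_tpst_invariance(t_asr: list[LabeledWord],
                          t_corr: list[LabeledWord]) -> bool:
    """A TPST output is valid iff the content sequence is unchanged;
    only the speaker labels may differ."""
    if len(t_asr) != len(t_corr):
        return False
    return all(a.word == b.word for a, b in zip(t_asr, t_corr))
```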

2. Algorithmic Family and Exemplars

a) LLM-based Diarization Correction

In (Efstathiadis et al., 7 Jun 2024), TPST is operationalized as an alignment and correction workflow centered around LLM fine-tuning:

  1. Align the ASR transcript to the reference transcript using word-level dynamic programming (e.g., longest common subsequence).
  2. Construct oracle pairs mapping each word in $T_{\text{ASR}}$ to its true label, forming a training pair $(\text{ASR transcript}, \text{oracle labels})$.
  3. Fine-tune an LLM (Mistral-7B-Instruct, QLoRA adapters, FlashAttention) for each ASR condition, with the prompt being the original transcript and the completion the oracle-labeled transcript.
  4. At inference, prompt the LLM with chunked ASR input and parse only the speaker label stream, ensuring no alteration to word content (see the sketch after this list).
  5. For generalization, merge the adapters from the ASR-specific models with TIES-Merging, using coefficients $(0.34, 0.33, 0.33)$, yielding an ASR-agnostic correction model.
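
A minimal sketch of steps 2 and 4, assuming a simple bracketed label template (the paper's exact prompt format is not reproduced here, so the formatting below is illustrative):

```python
def build_training_pair(asr_words, asr_labels, oracle_labels):
    """Form a (prompt, completion) pair for LLM fine-tuning: the prompt
    carries the ASR-labeled transcript, the completion repeats the same
    words with oracle speaker labels."""
    prompt = " ".join(f"[{s}] {w}" for w, s in zip(asr_words, asr_labels))
    completion = " ".join(f"[{s}] {w}" for w, s in zip(asr_words, oracle_labels))
    return prompt, completion

def parse_label_stream(completion, asr_words):
    """Recover only the speaker labels from the LLM completion,
    discarding any word-level edits so the transcript stays intact."""
    labels = [t.strip("[]") for t in completion.split() if t.startswith("[")]
    # Enforce invariance: reject completions whose label stream
    # does not cover the original word sequence one-to-one.
    assert len(labels) == len(asr_words), "label stream length mismatch"
    return labels
```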

b) Neural Speech Synthesis Adaptation (Transcribed/Untranscribed)

(Luong et al., 2019) realizes TPST in a modular neural text-to-speech system configured for both transcribed and untranscribed speaker adaptation:

  • The model splits into a speaker-independent linguistic encoder, an auxiliary acoustic encoder (generating pseudo-transcripts from mel-spectrograms), and a speaker-adaptive acoustic decoder.
  • During adaptation, the system is fine-tuned using one of two loss regimes:
    • Supervised: Backpropagate loss between decoder output and ground-truth using the linguistic encoder as input.
    • Unsupervised: Backpropagate loss using only acoustic input through the auxiliary encoder, enabling adaptation without transcripts.
  • The decoder's speaker-specific parameters (codes or full weights) are updated to minimize $L_{\text{sup}}$ or $L_{\text{unsup}}$, ensuring that the content features (linguistic or pseudo-transcript latents) are mapped to the target speaker's voice (both regimes are sketched below).
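
A sketch of the two adaptation regimes, assuming PyTorch-style modules; `decoder.speaker_parameters()` is a hypothetical accessor for $\theta_{\text{spk}}$, and the module names follow the description above rather than the paper's actual code:

```python
import itertools
import torch

def adapt_speaker(decoder, ling_enc, aux_enc, batches, supervised,
                  steps=500, lr=1e-4):
    """Fine-tune only the decoder's speaker-specific parameters on a
    small adaptation set, with or without transcripts."""
    opt = torch.optim.Adam(decoder.speaker_parameters(), lr=lr)
    data = itertools.cycle(batches)  # small adaptation set, reused
    for _ in range(steps):
        text_feats, mel = next(data)
        if supervised:
            content = ling_enc(text_feats)   # linguistic encoder on transcript
        else:
            content = aux_enc(mel)           # pseudo-transcript from audio only
        loss = torch.mean((decoder(content) - mel) ** 2)  # L_sup or L_unsup
        opt.zero_grad()
        loss.backward()
        opt.step()
```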

c) Self-Supervised Unit-Based TPST in Diffusion Models

(Kim et al., 2023) implements TPST in UnitSpeech by introducing a HuBERT- and KMeans-based unit encoder:

  • A pretrained diffusion TTS (Grad-TTS) is extended to accept either text or unit sequence representations as conditional input.
  • The unit encoder processes raw speech into discrete tokens and durations, producing a speaker-independent content embedding (unit extraction is sketched after this list).
  • For adaptation, a single untranscribed utterance is used to derive new speaker embeddings and content units; only the decoder is fine-tuned for ~500 steps while encoders are frozen.
  • At synthesis, feeding new content (text or units) with the adapted speaker embedding produces transcript-preserving, speaker-transferred speech.
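
A sketch of the unit-extraction step, assuming a preloaded HuBERT feature extractor and a fitted scikit-learn KMeans model (wrapper names illustrative):

```python
import torch

def extract_units(wav, hubert, kmeans):
    """Map raw speech to deduplicated discrete units plus durations.
    `hubert` returns (T, D) frame-level features on CPU; `kmeans` is a
    fitted clustering model over the same feature space."""
    with torch.no_grad():
        feats = hubert(wav)                 # (T, D) frame-level features
    ids = kmeans.predict(feats.numpy())     # (T,) cluster index per frame
    units, durations = [], []
    for u in ids:                           # run-length encode repeats
        if units and units[-1] == u:
            durations[-1] += 1
        else:
            units.append(int(u))
            durations.append(1)
    return units, durations
```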

d) Concatenative Synthesis for Data Augmentation

In (Huang et al., 2021), a non-neural, segmental unit-selection synthesis algorithm is used:

  • For each speaker, text-independent data is force-aligned to phones. Segments containing exactly one phone are extracted.
  • To synthesize a fixed phrase (e.g., for text-dependent speaker verification, TD-SV), the phrase is mapped to its required phone sequence, and for each phone a random segment is drawn from the speaker's pool.
  • Waveform segments are concatenated in order without join optimization, constructing "synthetic" fixed-phrase utterances that preserve transcript and speaker identity by construction (see the sketch after this list).
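
A minimal sketch of this unit-selection procedure (data structures hypothetical):

```python
import random
import numpy as np

def synthesize_phrase(phone_seq, phone_pool):
    """Concatenate one randomly drawn single-phone waveform segment per
    required phone, with no join optimization. `phone_pool` maps
    phone -> list of waveform arrays extracted for one speaker."""
    segments = []
    for phone in phone_seq:
        if not phone_pool.get(phone):
            # Coverage limitation: a missing phone blocks synthesis.
            raise ValueError(f"no segment for phone '{phone}'")
        segments.append(random.choice(phone_pool[phone]))
    return np.concatenate(segments)
```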

3. Mathematical Formulation and Pseudocode Examples

Alignment in LLM-based TPST

Given an oracle reference $(w_1, \sigma_1), \ldots, (w_M, \sigma_M)$ and an ASR transcript $(w_1, s_1), \ldots, (w_N, s_N)$, define

$$\mathrm{TPST}: \left[ (w_i, \sigma_i)_{i=1}^{M},\ (w_j, s_j)_{j=1}^{N} \right] \longrightarrow (w_j, \sigma_j')_{j=1}^{N}$$

where alignment is performed via dynamic programming. For each $j$, find the best-matched reference word $i^*$ and set $\sigma_j' = \sigma_{i^*}$.
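
A pseudocode realization of this alignment as a longest-common-subsequence dynamic program (a sketch, not the paper's exact implementation):

```python
def transfer_labels(ref_words, ref_labels, asr_words, asr_labels):
    """Align ASR words to reference words via LCS, then copy the aligned
    reference label; unmatched ASR words keep their original label."""
    M, N = len(ref_words), len(asr_words)
    # dp[i][j] = LCS length of ref_words[:i] and asr_words[:j]
    dp = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            if ref_words[i - 1] == asr_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrace, assigning sigma'_j = sigma_{i*} at matched positions.
    out = list(asr_labels)
    i, j = M, N
    while i > 0 and j > 0:
        if ref_words[i - 1] == asr_words[j - 1]:
            out[j - 1] = ref_labels[i - 1]
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out
```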

Adaptation Loss in Neural Synthesis

For untranscribed adaptation (Luong et al., 2019):

$$L_{\text{unsup}} = \left\| f_{\text{dec}}\big(\mathrm{AUX}(y);\ \theta_{\text{core}}, \theta_{\text{spk}}\big) - y \right\|^2$$

Update $\theta_{\text{spk}}$ (and optionally $\theta_{\text{core}}$) by minimizing $L_{\text{unsup}}$ with Adam.

TPST in Diffusion TTS (UnitSpeech)

During adaptation:

$$L_{\text{adapt}} = \mathbb{E}_{t, X_0, \epsilon} \left[ \left\| \sqrt{\lambda_t}\, s_\theta(X_t, t \mid c_{u'}, e_S) + \epsilon \right\|^2 \right]$$

where $X_t = \sqrt{1 - \lambda_t}\, X_0 + \sqrt{\lambda_t}\, \epsilon$, with $X_0$ the mel-spectrogram of the reference utterance, $c_{u'}$ the content embedding derived from pseudo-units, and $e_S$ the speaker embedding.
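
A sketch of one adaptation step under this objective, assuming a PyTorch score network `score_model` standing in for $s_\theta$ and a placeholder noise-schedule function `lambda_t_fn`:

```python
import torch

def adapt_step(score_model, optimizer, x0, content_emb, spk_emb, lambda_t_fn):
    """One fine-tuning step on L_adapt for a diffusion TTS decoder.
    x0 is a (B, n_mel, T) reference mel-spectrogram; the model signature
    is illustrative, not UnitSpeech's actual API."""
    t = torch.rand(x0.shape[0], device=x0.device)   # t ~ U(0, 1)
    lam = lambda_t_fn(t).view(-1, 1, 1)             # noise level lambda_t
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(1 - lam) * x0 + torch.sqrt(lam) * eps
    score = score_model(x_t, t, content_emb, spk_emb)
    loss = torch.mean((torch.sqrt(lam) * score + eps) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```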

4. Evaluation Metrics and Empirical Results

Metrics for assessing TPST methods vary by application:

| Task | Metric(s) | Highlights |
|------|-----------|------------|
| Diarization correction | cpWER, SA-WER (ΔCP, ΔSA) | Ensemble ΔCP −32% to −57% and ΔSA −14% to −39% relative to baseline (Efstathiadis et al., 7 Jun 2024) |
| Adaptive speech synthesis | MOS, CER, SMOS, SECS | MOS 4.13–4.26, SMOS 3.83–3.90, SECS > 0.92 (Kim et al., 2023) |
| Speaker verification | EER, minDCF | Augmentation cuts EER by 40–60%; a further 20–30% gain from combining synthesized and real data (Huang et al., 2021) |

For neural TTS adaptation, subjective MOS and similarity ratings are reported (e.g., UnitSpeech: MOS = 4.13 ± 0.10; SMOS = 3.90 ± 0.13).

In diarization correction, speaker-attributed WER (SA-WER) and concatenated minimum-permutation WER (cpWER) quantify agreement between hypothesized and reference speaker labels. For LLM-based correction, ensemble models outperform individual ASR-specific models in ASR-agnostic settings, showing up to a 57% relative reduction in ΔCP (Efstathiadis et al., 7 Jun 2024).
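
For reference, cpWER can be sketched as the best speaker permutation over per-speaker concatenated word streams; the implementation below is illustrative and assumes equal speaker counts in reference and hypothesis:

```python
from itertools import permutations

def word_errors(ref, hyp):
    """Levenshtein distance between two word lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)]

def cp_wer(ref_by_spk, hyp_by_spk):
    """Concatenated minimum-permutation WER: score per-speaker word
    streams under the best speaker mapping. Dicts map speaker id ->
    concatenated word list."""
    refs = list(ref_by_spk.values())
    total = sum(len(r) for r in refs)
    best = min(
        sum(word_errors(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyp_by_spk.values())
    )
    return best / max(total, 1)
```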

5. Constraints, Limitations, and Extensions

TPST implementations are governed by several domain and methodological factors:

  • Transcript invariance is explicit and strict; no model or synthesis phase alters word/phone/unit identity or sequence.
  • In LLM-based workflows, parsing completions to enforce immutability of the content sequence is essential (Efstathiadis et al., 7 Jun 2024).
  • Neural adaptation models depend on latent disentanglement of speaker and content, and only the speaker components are permitted to adapt at transfer time.
  • Concatenative approaches guarantee preservation structurally, but offer no mechanism for acoustic smoothing or seamlessness, and no join cost or re-ranking machinery is deployed (Huang et al., 2021).

Empirical evaluations reveal:

  • ASR-to-ASR domain shift poses challenges for diarization correction; merging strategies such as TIES-Merging moderately alleviate the specificity issue.
  • Unsupervised (unit/pseudo-transcript-based) adaptation is as effective as, or superior to, supervised when sufficient data is available (Luong et al., 2019, Kim et al., 2023).
  • In concatenative augmentation, transcript coverage limits the speakers and phrases that can be synthesized; any missing phone in the target phrase prevents synthesis for that speaker.
  • Cross-domain generalization, scaling to more than two speakers, and handling of diverse languages or prosodic/expressive variation remain underexplored or untested in these frameworks.

Proposed extensions include incorporation of contextual task metadata, multimodal acoustic embeddings, permutation-aware loss structures for multi-speaker generalization, and adversarial or flow-based modeling to enrich latent invariance and output fidelity.

6. Summary and Implications

Transcript-Preserving Speaker Transfer constitutes a principled constraint for data augmentation, adaptive synthesis, post-hoc correction, and identity-controlled generative modeling in speech and transcript-domain tasks. Methods span from data-driven segmental concatenation (Huang et al., 2021) to modern neural adaptation with or without transcripts (Luong et al., 2019, Kim et al., 2023), and LLM-based label sequence correction (Efstathiadis et al., 7 Jun 2024). Empirical evidence demonstrates that TPST can deliver substantial improvements in both speaker-dependent and speaker-independent settings, with competitive or superior metrics relative to baselines that do not enforce transcript preservation. The invariance constraint, while restrictive, confers both interpretability and robustness, and is increasingly operationalized in ASR, TTS, and speaker analysis pipelines. Future research is needed on generalization to complex, multi-party, and multi-style environments, as well as more expressive forms of TPST leveraging enriched latent content or cross-modal adaptation.
