GenTSE: Hierarchical Generative TSE Model
- GenTSE is a two-stage decoder-only generative model for target speaker extraction that explicitly disentangles semantic and acoustic features.
- It employs a hierarchical coarse-to-fine strategy leveraging continuous speech SSL and codec embeddings to enhance generation stability and perceptual quality.
- The model integrates Frozen-LM Conditioning and Direct Preference Optimization to mitigate exposure bias and align outputs with human perceptual preferences, setting new performance benchmarks on Libri2Mix.
GenTSE is a two-stage, decoder-only, generative LLM specifically designed for target speaker extraction (TSE) from speech mixtures. It introduces a hierarchical coarse-to-fine strategy in which semantic and acoustic structures are explicitly disentangled, leveraging continuous speech self-supervised learning (SSL) and codec embeddings for improved generation stability and perceptual quality. GenTSE integrates novel training strategies—Frozen-LM Conditioning (FLC) and Direct Preference Optimization (DPO)—to address exposure bias and align model outputs with human perceptual preferences, respectively. Empirical results on Libri2Mix demonstrate state-of-the-art performance relative to previous LM-based and discriminative TSE baselines, setting new benchmarks for quality, intelligibility, and speaker consistency (Li et al., 24 Dec 2025).
1. Hierarchical Architecture and Joint Generation Formulation
GenTSE factorizes the conditional target speech generation process into two sequential, autoregressive, decoder-only transformer stages: a semantic stage producing discrete semantic tokens, and an acoustic stage generating fine-resolution codec tokens. The model is conditioned on both the reference and the mixture speech inputs, each represented via rich, frame-level continuous embeddings.
Given reference and mixture waveforms $x^{\mathrm{ref}}$ and $x^{\mathrm{mix}}$, GenTSE explicitly models

$$P(\mathbf{s}, \mathbf{a} \mid x^{\mathrm{ref}}, x^{\mathrm{mix}}) \;=\; P(\mathbf{s} \mid x^{\mathrm{ref}}, x^{\mathrm{mix}};\, \theta_{\mathrm{sem}}) \; P(\mathbf{a} \mid \mathbf{s}, x^{\mathrm{ref}}, x^{\mathrm{mix}};\, \theta_{\mathrm{ac}}),$$

where $\mathbf{s}$ and $\mathbf{a}$ are the semantic and acoustic token sequences, and $\theta_{\mathrm{sem}}$ and $\theta_{\mathrm{ac}}$ denote the transformer parameters at the semantic and acoustic stages, respectively.
Stage-1: Semantic Extraction
Inputs are continuous SSL embeddings (from the 6th layer of WavLM). The model autoregressively produces the discrete semantic token sequence $\mathbf{s}$.
Stage-2: Acoustic Generation
Inputs comprise continuous codec embeddings (from a DAC encoder), together with the semantic tokens $\mathbf{s}$ from Stage 1. The output sequence $\mathbf{a}$ consists of fine-grained codec tokens produced by a single-codebook neural audio codec (“SimCodec”).
This hierarchical decomposition stabilizes generation and ensures semantic alignment between extracted speech and the reference (Li et al., 24 Dec 2025).
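To make the coarse-to-fine flow concrete, the following is a minimal inference sketch. All module names and call signatures (`semantic_lm.generate`, `dac_encode`, `sim_codec_decode`, etc.) are hypothetical placeholders standing in for the components described above, not the authors' actual implementation.

```python
# Minimal sketch of GenTSE-style two-stage inference (hypothetical interfaces).
import torch

def extract_target_speech(ref_wav: torch.Tensor,
                          mix_wav: torch.Tensor,
                          semantic_lm,       # Stage-1 decoder-only transformer
                          acoustic_lm,       # Stage-2 decoder-only transformer
                          wavlm_layer6,      # continuous SSL embeddings (semantic conditioning)
                          dac_encode,        # continuous codec embeddings (acoustic conditioning)
                          sim_codec_decode   # single-codebook codec decoder
                          ) -> torch.Tensor:
    # Stage 1: condition on continuous SSL embeddings of reference + mixture
    # and autoregressively decode discrete semantic tokens.
    sem_ctx = torch.cat([wavlm_layer6(ref_wav), wavlm_layer6(mix_wav)], dim=1)
    semantic_tokens = semantic_lm.generate(context=sem_ctx)

    # Stage 2: condition on continuous codec embeddings plus the Stage-1
    # semantic tokens and decode fine-grained codec tokens.
    ac_ctx = torch.cat([dac_encode(ref_wav), dac_encode(mix_wav)], dim=1)
    codec_tokens = acoustic_lm.generate(context=ac_ctx, semantic=semantic_tokens)

    # Reconstruct the extracted target-speaker waveform.
    return sim_codec_decode(codec_tokens)
```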
2. Input Representations and Conditioning Mechanisms
GenTSE employs continuous, high-dimensional embeddings for conditioning at both stages, bypassing the need for projection layers:
- Semantic embeddings: Framewise outputs from WavLM (layer 6), providing robust context signals for semantic content.
- Acoustic embeddings: Framewise DAC encoder outputs, capturing low-level acoustic properties.
- The decoder-only transformers attend directly to these embeddings, integrating information from both the reference utterance and the input mixture.
This design delivers a richer context than previous discretized prompt methods or projection-based schemes and underpins the model’s ability to capture speaker- and content-specific characteristics (Li et al., 24 Dec 2025).
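As an illustration of the semantic conditioning pathway, the snippet below extracts frame-level layer-6 features with the Hugging Face `WavLMModel`. The checkpoint name and preprocessing are assumptions for the sketch; GenTSE's exact feature pipeline may differ.

```python
# Hedged sketch: frame-level WavLM layer-6 embeddings for semantic conditioning.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

def semantic_embeddings(waveform_16k: torch.Tensor) -> torch.Tensor:
    """waveform_16k: mono audio at 16 kHz, shape (num_samples,)."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        out = wavlm(**inputs, output_hidden_states=True)
    # hidden_states[0] is the pre-transformer feature projection;
    # hidden_states[6] is the output of the 6th transformer layer (frames x dim).
    return out.hidden_states[6].squeeze(0)
```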
3. Training Methodology: Reducing Exposure Bias and Optimizing Perceptual Quality
3.1 Baseline Training with Teacher Forcing
Cross-entropy losses are computed separately at each stage, with ground-truth token prefixes supplied to the decoder (teacher forcing).
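A standard way to write these per-stage objectives, with notation assumed here rather than quoted from the paper, is

$$\mathcal{L}_{\mathrm{sem}} = -\sum_{t} \log P\!\left(s_t \mid s_{<t}, c_{\mathrm{sem}};\, \theta_{\mathrm{sem}}\right), \qquad \mathcal{L}_{\mathrm{ac}} = -\sum_{t} \log P\!\left(a_t \mid a_{<t}, \mathbf{s}, c_{\mathrm{ac}};\, \theta_{\mathrm{ac}}\right),$$

where $s_{<t}$ and $a_{<t}$ are ground-truth prefixes and $c_{\mathrm{sem}}$, $c_{\mathrm{ac}}$ denote the continuous conditioning embeddings derived from the reference and mixture.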
3.2 Frozen-LM Conditioning (FLC) for Exposure Bias Mitigation
FLC addresses the mismatch between teacher-forcing during training and autoregressive inference (exposure bias):
- Train the baseline semantic and acoustic LMs ($\theta_{\mathrm{sem}}$, $\theta_{\mathrm{ac}}$) under teacher forcing.
- Clone their parameters to ($\theta'_{\mathrm{sem}}$, $\theta'_{\mathrm{ac}}$) and freeze the originals.
- Generate token histories ($\hat{\mathbf{s}}$, $\hat{\mathbf{a}}$) autoregressively with the frozen models.
- Train the clones ($\theta'_{\mathrm{sem}}$, $\theta'_{\mathrm{ac}}$) on these self-generated histories instead of ground-truth prefixes, as sketched below.
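A minimal sketch of one FLC update is given below, assuming a generic decoder-only LM interface (`generate`, plus a forward call returning per-position logits). Since this summary does not specify whether the training target is the ground-truth token or the frozen model's own prediction, the sketch uses ground-truth targets conditioned on frozen-model-generated prefixes.

```python
# Hedged sketch of a Frozen-LM Conditioning (FLC) training step.
import torch
import torch.nn.functional as F

def flc_step(frozen_lm, clone_lm, conditioning, gt_tokens, optimizer):
    with torch.no_grad():
        # Autoregressive history produced by the frozen baseline, mimicking
        # the prefixes the model will actually see at inference time.
        history = frozen_lm.generate(context=conditioning,
                                     max_len=gt_tokens.size(1))
    # The clone predicts the next token from the frozen model's history,
    # reducing the train/inference mismatch (exposure bias).
    logits = clone_lm(context=conditioning, tokens=history[:, :-1])  # (B, T-1, V)
    loss = F.cross_entropy(logits.transpose(1, 2), gt_tokens[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```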
3.3 Direct Preference Optimization (DPO) for Human-Aligned Generation
To directly improve perceptual quality, GenTSE applies DPO to fine-tune the acoustic LM:
- For each conditioning context, multiple candidate outputs are drawn via top-$k$ sampling and scored by a pretrained UTMOS MOS predictor.
- For each preference pair $(a^{+}, a^{-})$, with $a^{+}$ the higher-scoring candidate, the DPO loss is

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_{\theta}(a^{+} \mid c)}{\pi_{\mathrm{ref}}(a^{+} \mid c)} \;-\; \beta \log \frac{\pi_{\theta}(a^{-} \mid c)}{\pi_{\mathrm{ref}}(a^{-} \mid c)}\right),$$

where $\pi_{\theta}$ is the LM under optimization, $\pi_{\mathrm{ref}}$ is a frozen reference copy, and $\beta$ modulates the sharpness of the preference margin.
This preference-integrated training directly steers generation towards human-valued speech characteristics (Li et al., 24 Dec 2025).
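The pairwise objective above is the standard DPO loss; a compact sketch over precomputed sequence log-likelihoods (argument names assumed) is shown below.

```python
# Hedged sketch of the DPO objective for preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(logp_pref_policy: torch.Tensor, logp_rej_policy: torch.Tensor,
             logp_pref_ref: torch.Tensor, logp_rej_ref: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Margin between the policy/reference log-ratios of the preferred
    # (higher-UTMOS) and rejected candidates.
    margin = (logp_pref_policy - logp_pref_ref) - (logp_rej_policy - logp_rej_ref)
    return -F.logsigmoid(beta * margin).mean()
```

The value beta = 0.1 is a placeholder; the setting used in GenTSE is not stated in this summary.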
4. Decoding and Inference Procedure
The hierarchical decoding pipeline comprises:
- Semantic Decoding: Greedy or beam search over the semantic LM to select the semantic token sequence.
- Acoustic Decoding: Autoregressive top-$k$ sampling of codec tokens from the acoustic LM, producing multiple candidate token sequences.
- Candidate Selection: For non-DPO models, the best candidate is chosen by UTMOS score (see the sketch below); for DPO-finetuned models, the single best sample is used.
- Reconstruction: The final selected sequence is decoded into a waveform via SimCodec.
This modular, sampling-based approach in both stages enables faithful, high-quality speech synthesis while preserving computational efficiency (Li et al., 24 Dec 2025).
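A minimal sketch of the UTMOS-based candidate selection used by the non-DPO configuration follows; all component names (`acoustic_lm`, `sim_codec_decode`, `utmos_score`) and the candidate count are assumed placeholders.

```python
# Hedged sketch: sample several acoustic token sequences, decode each,
# and keep the candidate with the highest predicted MOS.

def select_best_candidate(acoustic_lm, context, semantic_tokens,
                          sim_codec_decode, utmos_score, num_candidates=5):
    best_wav, best_mos = None, float("-inf")
    for _ in range(num_candidates):
        tokens = acoustic_lm.generate(context=context,
                                      semantic=semantic_tokens,
                                      do_sample=True)   # top-k sampling
        wav = sim_codec_decode(tokens)
        mos = utmos_score(wav)                           # predicted MOS (e.g., UTMOS)
        if mos > best_mos:
            best_wav, best_mos = wav, mos
    return best_wav
```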
5. Experimental Results and Quantitative Comparison
GenTSE’s performance is evaluated on Libri2Mix (clean) against both discriminative (X-TF-GridNet, USEF-SepFormer) and generative LM-based (TSELM-L, LLaSE-G1, Metis) baselines. Evaluation encompasses speech quality, intelligibility, and speaker consistency using DNSMOS (SIG, BAK, OVRL), UTMOS, NISQA, dWER (Whisper-base), SpeechBERT, and SECS (Resemblyzer cosine similarity):
| Model | SIG | BAK | OVRL | UTMOS | NISQA | SECS | dWER↓ | SpeechBERT |
|---|---|---|---|---|---|---|---|---|
| GenTSE | 3.656 | 4.135 | 3.399 | 4.296 | 3.976 | 0.928 | 0.177 | 0.920 |
| Metis (G) | 3.588 | 3.980 | 3.265 | 3.882 | 3.869 | 0.879 | 0.180 | 0.890 |
| LLaSE-G1 (G) | 3.531 | 4.015 | 3.226 | 3.228 | 3.638 | 0.839 | 0.476 | 0.825 |
| TSELM-L (G) | 3.478 | 4.035 | 3.198 | 3.556 | 3.509 | 0.651 | 0.263 | 0.832 |
| USEF-SepFormer (D) | 3.324 | 3.698 | 2.927 | 3.492 | 2.880 | 0.806 | 0.156 | 0.830 |
| Mixture | 3.383 | 3.098 | 2.653 | 1.519 | 2.251 | 0.754 | 0.821 | 0.655 |
GenTSE achieves superior overall performance, with improvements over Metis in OVRL (+0.134), SECS (+0.049), and dWER (0.177 vs. 0.180), signifying advances in both linguistic and speaker-specific fidelity (Li et al., 24 Dec 2025).
6. Qualitative Insights, Ablations, and Methodological Contributions
Hierarchical Design: Ablation studies that remove the semantic stage raise dWER from 0.217 to 0.284, empirically validating the coarse-to-fine arrangement.
Frozen-LM Conditioning: Training with FLC closes the accuracy gap between teacher forcing and autoregressive inference, indicating effective mitigation of exposure bias (see Fig. 2 in (Li et al., 24 Dec 2025)).
Perceptual Alignment via DPO: DPO fine-tuning raises UTMOS from 4.125 to 4.384 after 400 steps, aligning outputs with human judgments. This improvement comes with minimal trade-off in log-likelihood-based metrics.
Integrated Generative Pipeline: The unified LM-based strategy—leveraging continuous conditioning, FLC, and DPO—surpasses previous LM-based TSE models without introducing extra inference-time modules, yielding more natural, intelligible, and speaker-consistent extracted speech (Li et al., 24 Dec 2025).
A plausible implication is that the hierarchical, preference-aligned generative modeling paradigm found in GenTSE may generalize to related sequence-generation tasks within speech and multimodal domains.