GenTSE: Hierarchical Generative TSE Model

Updated 25 December 2025
  • GenTSE is a two-stage decoder-only generative model for target speaker extraction that explicitly disentangles semantic and acoustic features.
  • It employs a hierarchical coarse-to-fine strategy leveraging continuous speech SSL and codec embeddings to enhance generation stability and perceptual quality.
  • The model integrates Frozen-LM Conditioning and Direct Preference Optimization to mitigate exposure bias and align outputs with human perceptual preferences, setting new performance benchmarks on Libri2Mix.

GenTSE is a two-stage, decoder-only, generative LLM specifically designed for target speaker extraction (TSE) from speech mixtures. It introduces a hierarchical coarse-to-fine strategy in which semantic and acoustic structures are explicitly disentangled, leveraging continuous speech self-supervised learning (SSL) and codec embeddings for improved generation stability and perceptual quality. GenTSE integrates novel training strategies—Frozen-LM Conditioning (FLC) and Direct Preference Optimization (DPO)—to address exposure bias and align model outputs with human perceptual preferences, respectively. Empirical results on Libri2Mix demonstrate state-of-the-art performance relative to previous LM-based and discriminative TSE baselines, setting new benchmarks for quality, intelligibility, and speaker consistency (Li et al., 24 Dec 2025).

1. Hierarchical Architecture and Joint Generation Formulation

GenTSE factorizes the conditional target speech generation process into two sequential, autoregressive, decoder-only transformer stages: a semantic stage producing discrete semantic tokens, and an acoustic stage generating fine-resolution codec tokens. The model is conditioned on both reference and mixture speech inputs, both represented via rich, frame-level continuous embeddings.

Given $X=(x_{\rm ref}, x_{\rm mix})$ (reference and mixture waveforms), GenTSE explicitly models

$$P(S, A \mid X) = \underbrace{\prod_{t=1}^{T} P_\theta\!\left(s_t \mid s_{<t},\, E_s(x_{\rm ref}),\, E_s(x_{\rm mix})\right)}_{\text{Semantic stage: tokens } S=(s_1,\dots,s_T)} \times \underbrace{\prod_{n=1}^{N} P_\phi\!\left(a_n \mid a_{<n},\, S,\, E_a(x_{\rm ref}),\, E_a(x_{\rm mix})\right)}_{\text{Acoustic stage: tokens } A=(a_1,\dots,a_N)}$$

where $\theta$ and $\phi$ denote the transformer parameters of the semantic and acoustic stages, respectively.

Stage-1: Semantic Extraction

Inputs are continuous SSL embeddings $E_s(x) = \mathrm{WavLM}(x) \in \mathbb{R}^{T \times H}$, taken from the 6th layer of WavLM. The model autoregressively produces $T$ discrete semantic tokens $\bar S = [\bar s_1, \dots, \bar s_T]$.

Stage-2: Acoustic Generation

Inputs comprise continuous codec embeddings $E_a(x) = \mathrm{DAC}(x) \in \mathbb{R}^{T \times H}$ (from a DAC encoder), together with the semantic tokens $S$. The output sequence $\bar A = [\bar a_1, \dots, \bar a_N]$ consists of fine-grained codec tokens of a single-codebook neural audio codec ("SimCodec").

This hierarchical decomposition stabilizes generation and ensures semantic alignment between extracted speech and the reference (Li et al., 24 Dec 2025).
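
The PyTorch sketch below illustrates one way the two-stage, decoder-only structure could be organized. It is a minimal sketch, not the paper's implementation: layer counts, widths, and the `StageLM` interface are assumptions, and the conditioning embeddings are assumed to already match the model width (consistent with the paper's statement that no projection layers are needed).

```python
import torch
import torch.nn as nn

class StageLM(nn.Module):
    """Decoder-only transformer for one GenTSE stage: it attends to continuous
    conditioning embeddings (reference + mixture) prepended to the discrete
    token sequence and predicts the next token autoregressively."""
    def __init__(self, vocab_size: int, d_model: int = 1024,
                 n_layers: int = 12, n_heads: int = 16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        # An encoder stack driven with a causal mask behaves as a decoder-only LM.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, cond: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # cond:   (B, T_c, d_model) continuous embeddings, e.g. [E(x_ref); E(x_mix)]
        # tokens: (B, T)            previous (or teacher-forced) discrete tokens
        x = torch.cat([cond, self.tok_emb(tokens)], dim=1)
        # Causal mask: -inf above the diagonal so each position sees only its past.
        mask = torch.full((x.size(1), x.size(1)), float("-inf"),
                          device=x.device).triu(1)
        h = self.blocks(x, mask=mask)
        return self.head(h[:, cond.size(1):])  # logits at the token positions only

class GenTSESketch(nn.Module):
    """Hierarchical coarse-to-fine pair of stage LMs (semantic, then acoustic)."""
    def __init__(self, sem_vocab: int, aco_vocab: int, d_model: int = 1024):
        super().__init__()
        self.semantic_lm = StageLM(sem_vocab, d_model)  # P_theta(s_t | s_<t, E_s)
        self.acoustic_lm = StageLM(aco_vocab, d_model)  # P_phi(a_n | a_<n, S, E_a)
        # For stage 2, embedded semantic tokens S would be concatenated into `cond`.
        self.sem_cond_emb = nn.Embedding(sem_vocab, d_model)
```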

2. Input Representations and Conditioning Mechanisms

GenTSE employs continuous, high-dimensional embeddings for conditioning at both stages, bypassing the need for projection layers:

  • Semantic embeddings $E_s(x)$: framewise outputs from WavLM (layer 6), providing robust context signals for semantic content.
  • Acoustic embeddings $E_a(x)$: framewise DAC encoder outputs, capturing low-level acoustic properties.
  • The decoder-only transformers attend directly to these embeddings, integrating information from both the reference utterance and the input mixture.

This design delivers a richer context than previous discretized prompt methods or projection-based schemes and underpins the model’s ability to capture speaker- and content-specific characteristics (Li et al., 24 Dec 2025).
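
As a concrete illustration of the semantic conditioning path, the sketch below extracts frame-level layer-6 WavLM features with the Hugging Face `transformers` API. The specific checkpoint is an assumption (the paper only specifies WavLM, layer 6), and the acoustic path would analogously use DAC encoder outputs, omitted here rather than guessing that library's interface.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Checkpoint choice is an assumption for illustration only.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus",
                                   output_hidden_states=True).eval()

@torch.no_grad()
def semantic_embeddings(waveform_16k) -> torch.Tensor:
    """E_s(x): framewise continuous SSL embeddings from the 6th WavLM layer.
    `waveform_16k` is a 1-D numpy array of 16 kHz samples."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    out = wavlm(**inputs)
    # hidden_states[0] is the convolutional front-end output; index 6 is layer 6.
    return out.hidden_states[6]  # shape (1, T, H)
```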

3. Training Methodology: Reducing Exposure Bias and Optimizing Perceptual Quality

3.1 Baseline Training with Teacher Forcing

Cross-entropy losses are computed separately at each stage under teacher forcing:

$$\mathcal{L}_{\rm CE,sem} = -\sum_{t=1}^{T} \log P_\theta\!\left(s_t \mid s_{<t},\, E_s(x_{\rm ref}),\, E_s(x_{\rm mix})\right)$$

$$\mathcal{L}_{\rm CE,aco} = -\sum_{n=1}^{N} \log P_\phi\!\left(a_n \mid a_{<n},\, S,\, E_a(x_{\rm ref}),\, E_a(x_{\rm mix})\right)$$
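
A minimal sketch of the teacher-forced objective for one stage, assuming the `StageLM` interface from the architecture sketch above and a BOS token prepended to the target sequence (an implementation assumption):

```python
import torch.nn.functional as F

def stage_ce_loss(stage_lm, cond, tokens):
    """Teacher-forced cross-entropy for a single stage.
    cond:   (B, T_c, H) continuous conditioning embeddings
    tokens: (B, T)      ground-truth discrete tokens, with a BOS token at position 0
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t from tokens < t
    logits = stage_lm(cond, inputs)                   # (B, T-1, V)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Total baseline loss: semantic term + acoustic term, e.g.
# loss = stage_ce_loss(semantic_lm, sem_cond, S) + stage_ce_loss(acoustic_lm, aco_cond, A)
```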

3.2 Frozen-LM Conditioning (FLC) for Exposure Bias Mitigation

FLC addresses the mismatch between teacher-forcing during training and autoregressive inference (exposure bias):

  1. Train baseline models $\theta$, $\phi$ under teacher forcing.
  2. Clone the parameters, $\theta' \leftarrow \theta$ and $\phi' \leftarrow \phi$, and freeze the originals.
  3. Generate token histories ($\bar S$, $\bar A$) using the frozen models.
  4. Train the clones ($\theta'$, $\phi'$) on their own predictions:

$$\mathcal{L}'_{\rm CE,sem} = -\sum_{t=1}^{T} \log P_{\theta'}\!\left(\bar s'_t \mid \bar s_{<t},\, E_s(x_{\rm ref}),\, E_s(x_{\rm mix})\right)$$

$$\mathcal{L}'_{\rm CE,aco} = -\sum_{n=1}^{N} \log P_{\phi'}\!\left(\bar a'_n \mid \bar a_{<n},\, S,\, E_a(x_{\rm ref}),\, E_a(x_{\rm mix})\right)$$
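
The sketch below shows one plausible reading of an FLC update under the interfaces assumed above: the frozen copy rolls out a token history autoregressively, and the trainable clone is then optimized on that self-generated history rather than on teacher-forced ground truth. The greedy rollout and the exact target construction are assumptions, not the paper's verbatim procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rollout(lm, cond, bos_id, length):
    """Greedy autoregressive rollout of a token history from a (frozen) stage LM."""
    toks = torch.full((cond.size(0), 1), bos_id, dtype=torch.long, device=cond.device)
    for _ in range(length):
        nxt = lm(cond, toks)[:, -1].argmax(-1, keepdim=True)
        toks = torch.cat([toks, nxt], dim=1)
    return toks

def flc_step(frozen_lm, clone_lm, optimizer, cond, bos_id, length):
    """One FLC update: train the clone on a history generated by the frozen model.
    clone_lm is initialized as a deep copy of frozen_lm before the original is frozen."""
    frozen_lm.eval()
    hist = rollout(frozen_lm, cond, bos_id, length)   # \bar S (or \bar A)
    inputs, targets = hist[:, :-1], hist[:, 1:]
    logits = clone_lm(cond, inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```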

3.3 Direct Preference Optimization (DPO) for Human-Aligned Generation

To directly improve perceptual quality, GenTSE applies DPO to fine-tune the acoustic LM (parameters $\psi$):

  • For context $y = [S, E_a(x_{\rm ref}), E_a(x_{\rm mix})]$, top-$k$ samples ($M$ in total) are scored by a pretrained UTMOS MOS predictor.
  • For each pair $(A^+, A^-)$ with $A^+$ preferred, the DPO loss is:

$$\mathcal{L}_{\rm DPO} = -\mathbb{E}_{(A^+, A^-)}\left[\log\sigma\!\left(\beta \log \frac{\pi_\psi(A^+ \mid y)\,\pi_{\rm ref}(A^- \mid y)}{\pi_{\rm ref}(A^+ \mid y)\,\pi_\psi(A^- \mid y)}\right)\right]$$

where $\pi_\psi$ is the LM under optimization, $\pi_{\rm ref}$ is a frozen reference model, and $\beta$ modulates the sharpness of the preference margin.
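
A minimal sketch of this objective over sequence log-likelihoods, assuming the `StageLM` interface from the earlier sketches; the default `beta` value is a placeholder, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(lm, cond, tokens):
    """Sum of per-token log-probabilities of a candidate sequence under a stage LM."""
    logits = lm(cond, tokens[:, :-1])
    logp = logits.log_softmax(-1)
    return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)  # (B,)

def dpo_loss(policy, reference, cond, a_pos, a_neg, beta=0.1):
    """DPO loss on preferred (a_pos) vs. dispreferred (a_neg) codec-token sequences.
    `reference` is the frozen reference model pi_ref; beta sets the margin sharpness."""
    with torch.no_grad():
        ref_pos = sequence_logprob(reference, cond, a_pos)
        ref_neg = sequence_logprob(reference, cond, a_neg)
    pol_pos = sequence_logprob(policy, cond, a_pos)
    pol_neg = sequence_logprob(policy, cond, a_neg)
    margin = beta * ((pol_pos - ref_pos) - (pol_neg - ref_neg))
    return -F.logsigmoid(margin).mean()
```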

This preference-integrated training directly steers generation towards human-valued speech characteristics (Li et al., 24 Dec 2025).

4. Decoding and Inference Procedure

The hierarchical decoding pipeline comprises:

  • Semantic Decoding: Greedy or beam search selects the semantic token sequence $\bar S$ from $P_\theta$.
  • Acoustic Decoding: Autoregressive top-$k$ sampling ($k=16$) of $\bar A$ from $P_\psi$, producing $M=32$ candidate token sequences.
  • Candidate Selection: For non-DPO models, the best candidate is chosen by UTMOS score; for DPO-finetuned models, a single sample is used directly.
  • Reconstruction: The selected sequence $\bar A$ is decoded into the waveform $\hat x(t)$ via SimCodec.

This modular, sampling-based approach in both stages enables faithful, high-quality speech synthesis while preserving computational efficiency (Li et al., 24 Dec 2025).
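
A sketch of the acoustic-stage sampling loop under the assumed `StageLM` interface: $M$ candidates are drawn with top-$k$ sampling, and (for non-DPO models) the candidate with the highest UTMOS score would then be decoded by the codec. The UTMOS scorer and codec decoder are referenced only as placeholders in the comments.

```python
import torch

@torch.no_grad()
def sample_acoustic_candidates(acoustic_lm, cond, bos_id, length, k=16, m=32):
    """Draw M candidate codec-token sequences via autoregressive top-k sampling."""
    candidates = []
    for _ in range(m):
        toks = torch.full((1, 1), bos_id, dtype=torch.long, device=cond.device)
        for _ in range(length):
            logits = acoustic_lm(cond, toks)[:, -1]        # (1, V) next-token logits
            top_vals, top_idx = logits.topk(k, dim=-1)     # restrict to the top-k tokens
            probs = top_vals.softmax(-1)
            nxt = top_idx.gather(-1, torch.multinomial(probs, 1))
            toks = torch.cat([toks, nxt], dim=1)
        candidates.append(toks[:, 1:])                     # drop the BOS token
    return candidates

# Selection (non-DPO model): decode each candidate with the codec and keep the one
# with the highest UTMOS score; a DPO-finetuned model uses a single sample directly.
```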

5. Experimental Results and Quantitative Comparison

GenTSE’s performance is evaluated on Libri2Mix (clean) against both discriminative (X-TF-GridNet, USEF-SepFormer) and generative LM-based (TSELM-L, LLaSE-G1, Metis) baselines. Evaluation encompasses speech quality, intelligibility, and speaker consistency using DNSMOS (SIG, BAK, OVRL), UTMOS, NISQA, dWER (Whisper-base), SpeechBERT, and SECS (Resemblyzer cosine similarity):

| Model | SIG | BAK | OVRL | UTMOS | NISQA | SECS | dWER ↓ | SpeechBERT |
|---|---|---|---|---|---|---|---|---|
| GenTSE | 3.656 | 4.135 | 3.399 | 4.296 | 3.976 | 0.928 | 0.177 | 0.920 |
| Metis (G) | 3.588 | 3.980 | 3.265 | 3.882 | 3.869 | 0.879 | 0.180 | 0.890 |
| LLaSE-G1 (G) | 3.531 | 4.015 | 3.226 | 3.228 | 3.638 | 0.839 | 0.476 | 0.825 |
| TSELM-L (G) | 3.478 | 4.035 | 3.198 | 3.556 | 3.509 | 0.651 | 0.263 | 0.832 |
| USEF-SepFormer (D) | 3.324 | 3.698 | 2.927 | 3.492 | 2.880 | 0.806 | 0.156 | 0.830 |
| Mixture | 3.383 | 3.098 | 2.653 | 1.519 | 2.251 | 0.754 | 0.821 | 0.655 |

(G) denotes generative and (D) discriminative baselines.

GenTSE achieves the best overall performance, with improvements over Metis in OVRL (+0.134) and SECS (+0.049) and a lower dWER (0.177 vs. 0.180), signifying advances in both linguistic and speaker-specific fidelity (Li et al., 24 Dec 2025).

6. Qualitative Insights, Ablations, and Methodological Contributions

Hierarchical Design: Ablation studies removing the semantic stage result in higher dWER (0.284 without vs. 0.217 with the semantic stage), empirically validating the coarse-to-fine arrangement.

Frozen-LM Conditioning: Training with FLC closes the accuracy gap between teacher forcing and autoregressive inference, indicating effective mitigation of exposure bias (see Fig. 2 in (Li et al., 24 Dec 2025)).

Perceptual Alignment via DPO: DPO fine-tuning increases UTMOS by approximately 0.259 (4.384 vs. 4.125) after 400 steps, aligning outputs with human judgments. This improvement comes with minimal trade-off in log-likelihood-based metrics.

Integrated Generative Pipeline: The unified LM-based strategy—leveraging continuous conditioning, FLC, and DPO—surpasses previous LM-based TSE models without introducing extra inference-time modules, yielding more natural, intelligible, and speaker-consistent extracted speech (Li et al., 24 Dec 2025).

A plausible implication is that the hierarchical, preference-aligned generative modeling paradigm found in GenTSE may generalize to related sequence-generation tasks within speech and multimodal domains.

References (1)
