LauraTSE: Target Speaker Extraction

Updated 13 January 2026
  • LauraTSE is a generative framework for target speaker extraction that employs auto-regressive decoders and neural audio codec representations.
  • The system integrates a Conformer-based feature encoder and an encoder-only vocoder LM to reconstruct high-quality speech from mixed inputs.
  • A two-stage discriminative–generative pipeline refines coarse estimates, balancing perceptual enhancement with semantic fidelity, as validated on the Libri2Mix benchmark.

LauraTSE is a generative framework for target speaker extraction (TSE), employing an auto-regressive (AR) decoder-only LLM in conjunction with neural audio codec representations. It is designed to separate and reconstruct, with high fidelity, the speech of a specific target speaker from a mixed audio input, given a short enrollment utterance of that speaker. In its most advanced configuration, LauraTSE is integrated into a two-stage discriminative–generative pipeline (USEF-Laura-TSE), which combines the control and robustness of a discriminative front-end with the perceptual quality enhancement capabilities of a generative back-end (Zeng et al., 9 Jan 2026; Tang et al., 10 Apr 2025).

1. System Architecture

The core LauraTSE system consists of three main components:

(a) Feature Encoder (Conformer):

A Conformer-based encoder processes both the enrollment utterance r and the mixed speech input m, generating continuous embeddings E_r and E_m from log-mel spectrogram inputs:

E_m = \mathcal{C}(M_m), \quad E_r = \mathcal{C}(M_r).
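
As a concrete illustration, a minimal PyTorch sketch of this step is shown below, using torchaudio's MelSpectrogram and Conformer as stand-ins for the actual front-end; the hyperparameters (80 mel bins, four Conformer layers, 10 ms hop) are illustrative assumptions rather than values reported for LauraTSE.

```python
import torch
import torchaudio

# Illustrative hyperparameters (assumptions, not the paper's settings).
N_MELS, SAMPLE_RATE = 80, 16000

# Log-mel front-end shared by the mixture and the enrollment utterance.
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=512, hop_length=160, n_mels=N_MELS
)

# Shared Conformer encoder C(.).
conformer = torchaudio.models.Conformer(
    input_dim=N_MELS, num_heads=4, ffn_dim=1024,
    num_layers=4, depthwise_conv_kernel_size=31,
)

def encode(wav: torch.Tensor) -> torch.Tensor:
    """wav: (batch, samples) -> continuous embeddings (batch, frames, N_MELS)."""
    mel = melspec(wav)                              # (batch, n_mels, frames)
    logmel = torch.log(mel + 1e-6).transpose(1, 2)  # (batch, frames, n_mels)
    lengths = torch.full((wav.size(0),), logmel.size(1))
    out, _ = conformer(logmel, lengths)
    return out

m = torch.randn(1, SAMPLE_RATE * 4)   # 4 s mixture
r = torch.randn(1, SAMPLE_RATE * 2)   # 2 s enrollment utterance
E_m, E_r = encode(m), encode(r)       # E_m = C(M_m), E_r = C(M_r)
```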

(b) AR Decoder-Only LLM (Decoder-only LM):

Conditioned on the prefix [\mathrm{bos}; E_r; \mathrm{sep}; E_m; \mathrm{tse}], this component autoregressively predicts the discrete tokens for the first n residual vector quantization (RVQ) codec layers of the target speech:

P_\theta(\hat{D}_n | E_m, E_r) = \prod_{i=1}^T P_\theta(\hat{D}_n^{(i)} | \hat{D}_n^{(1:i-1)}, E_m, E_r).
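
The conditioning layout and autoregressive factorization can be sketched as follows for the single-layer case (n = 1); the Transformer configuration, the special-token handling, and the greedy decoding loop are assumptions made for illustration, not the published implementation.

```python
import torch
import torch.nn as nn

class ARDecoderLM(nn.Module):
    """Minimal decoder-only LM sketch: consumes the prefix [bos; E_r; sep; E_m; tse]
    plus previously generated codec tokens and predicts the next coarse token.
    Layer sizes and vocab are assumptions; multiple RVQ layers would need extra heads."""

    def __init__(self, d_model=256, vocab=1024, n_layers=6, n_heads=4):
        super().__init__()
        self.special = nn.Embedding(3, d_model)            # bos, sep, tse markers
        self.tok_emb = nn.Embedding(vocab, d_model)        # codec-token embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, E_r, E_m, prev_tokens):
        B = E_r.size(0)
        bos, sep, tse = (self.special.weight[i].expand(B, 1, -1) for i in range(3))
        prefix = torch.cat([bos, E_r, sep, E_m, tse], dim=1)
        x = torch.cat([prefix, self.tok_emb(prev_tokens)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        return self.head(h[:, -1])                         # logits for the next token

@torch.no_grad()
def greedy_decode(lm, E_r, E_m, steps):
    """Each step conditions on E_m, E_r and the tokens D_hat_n^(1:i-1) generated so far."""
    tokens = torch.zeros(E_r.size(0), 0, dtype=torch.long)
    for _ in range(steps):
        logits = lm(E_r, E_m, tokens)
        tokens = torch.cat([tokens, logits.argmax(-1, keepdim=True)], dim=1)
    return tokens

# Dummy conditioning embeddings (in practice from the Conformer encoder, projected
# to d_model; random tensors are used here to keep the sketch self-contained).
E_r, E_m = torch.randn(1, 50, 256), torch.randn(1, 200, 256)
coarse_tokens = greedy_decode(ARDecoderLM().eval(), E_r, E_m, steps=200)
```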

(c) Encoder-Only Vocoder LLM (Vocoder LM):

This one-step Transformer takes [E_r, E_m, \hat{D}_n] and outputs a fine-grained embedding \hat{E}_s representing the sum of all RVQ layers, which is then passed through a frozen neural audio codec decoder to produce the final waveform.
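
A minimal sketch of this one-step prediction is given below; the layer sizes, the use of a plain Transformer encoder, and the stand-in codec decoder are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VocoderLM(nn.Module):
    """One-step (non-autoregressive) encoder sketch: maps [E_r, E_m, embedded
    D_hat_n] to the fine-grained embedding E_hat_s in a single forward pass."""

    def __init__(self, d_model=256, vocab=1024, n_layers=4, n_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)        # embeds coarse codec tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_model)             # projects to codec-embedding space

    def forward(self, E_r, E_m, coarse_tokens):
        D_n = self.tok_emb(coarse_tokens)                  # embedded D_hat_n
        x = torch.cat([E_r, E_m, D_n], dim=1)              # [E_r, E_m, D_hat_n]
        h = self.encoder(x)                                # bidirectional, single step
        return self.out(h[:, -coarse_tokens.size(1):])     # frames aligned with D_hat_n

E_r, E_m = torch.randn(1, 50, 256), torch.randn(1, 200, 256)
coarse_tokens = torch.randint(0, 1024, (1, 200))
E_hat_s = VocoderLM().eval()(E_r, E_m, coarse_tokens)      # (1, 200, 256)
# waveform = frozen_codec_decoder(E_hat_s)   # frozen neural codec decoder (not shown)
```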

Two-Stage Discriminative–Generative Extension:

In the USEF-Laura-TSE framework, a discriminative front-end (USEF-TFGridNet) first extracts a coarse target speech estimate D_o = \mathcal{D}(m, r) through a sequence of operations: STFT, 2-D convolutional encoding, cross multi-head attention, concatenation, TF-GridNet processing, and decoding (transposed-conv + iSTFT). The generative LauraTSE back-end \mathcal{G} then refines this intermediate representation:

G_o = \mathcal{G}(D_o),

treating D_o as the conditional context in place of E_m, E_r to reconstruct the final high-fidelity waveform (Zeng et al., 9 Jan 2026).
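
Functionally, the two-stage pipeline reduces to the composition sketched below; the callables front_end and laura_tse are hypothetical stand-ins, since the section only specifies the mapping G_o = \mathcal{G}(\mathcal{D}(m, r)).

```python
import torch

def two_stage_tse(m: torch.Tensor, r: torch.Tensor, front_end, laura_tse):
    """Hypothetical composition of the USEF-Laura-TSE pipeline.

    front_end : discriminative module D(.), e.g. a USEF-TFGridNet-style network
    laura_tse : generative back-end G(.), conditioned on D_o instead of (E_m, E_r)
    """
    D_o = front_end(m, r)        # coarse target estimate, D_o = D(m, r)
    G_o = laura_tse(D_o)         # generative refinement, G_o = G(D_o)
    return G_o
```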

2. Mathematical Formulations and Training Losses

Generative Module

  • AR LM (Decoder-only):

Standard cross-entropy loss over predicted codec tokens at each time step and across the n coarse layers (a combined code sketch of both generative losses follows after this list):

\mathcal{L}_{AR} = - \sum_{t=1}^T \sum_{l=1}^n \log P_\theta(\mathrm{token}_{\mathrm{GT}}^{(t,l)} \mid \mathrm{token}^{(<t,l)}, E_m, E_r)

  • Encoder-Only Vocoder LM:

Supervised with ground-truth summed embeddings, using an L_1 + L_2 loss:

\mathcal{L}_{voc} = \| \hat{E}_s - E_s \|_1 + \| \hat{E}_s - E_s \|_2^2
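
A minimal sketch of both generative losses, assuming AR logits of shape (batch, T, n, vocab) and embedding tensors of shape (batch, frames, dim); PyTorch's default mean reduction replaces the explicit sums.

```python
import torch
import torch.nn.functional as F

# AR loss: cross-entropy over predicted codec tokens for each time step t
# and each of the n coarse RVQ layers (shapes are illustrative assumptions).
B, T, n, V = 2, 200, 1, 1024
logits = torch.randn(B, T, n, V)             # model outputs
targets = torch.randint(0, V, (B, T, n))     # ground-truth codec tokens
loss_ar = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))

# Vocoder loss: L1 + L2 between predicted and ground-truth summed RVQ embeddings.
E_hat_s, E_s = torch.randn(B, 200, 256), torch.randn(B, 200, 256)
loss_voc = F.l1_loss(E_hat_s, E_s) + F.mse_loss(E_hat_s, E_s)

loss_gen = loss_ar + loss_voc                # aggregated generative objective L_gen
```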

Discriminative Front-End

  • Initially trained with complex-spectral MSE; during joint training, optionally includes an SI-SDR loss:

\mathcal{L}_{SI\text{-}SDR} = -10 \log_{10}\left( \frac{\|s_{proj}\|^2}{\|s_{err}\|^2} \right)

where s_{proj} = \frac{\langle \hat{S}, S \rangle}{\|S\|^2} S and s_{err} = \hat{S} - s_{proj}, with \hat{S} the estimated and S the reference target signal.
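
This loss can be implemented directly as below; the zero-mean normalization is the standard SI-SDR convention and is assumed here, as it is not stated in the section.

```python
import torch

def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR between estimate and reference waveforms of shape (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)   # zero-mean (standard convention; assumed)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # s_proj = (<est, ref> / ||ref||^2) * ref ; s_err = est - s_proj
    scale = (est * ref).sum(-1, keepdim=True) / ((ref ** 2).sum(-1, keepdim=True) + eps)
    s_proj = scale * ref
    s_err = est - s_proj
    ratio = (s_proj ** 2).sum(-1) / ((s_err ** 2).sum(-1) + eps)
    return (-10.0 * torch.log10(ratio + eps)).mean()

loss = si_sdr_loss(torch.randn(2, 16000), torch.randn(2, 16000))
```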

Combined Objective

  • For joint training with an unfrozen front-end:

\mathcal{L}_{total} = \mathcal{L}_{gen} + \lambda_{SI\text{-}SDR} \cdot \mathcal{L}_{SI\text{-}SDR}

where \mathcal{L}_{gen} aggregates the AR and vocoder losses.
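
The joint objective can be assembled as in the following sketch, reusing si_sdr_loss from above; the interfaces front_end and laura_tse.generative_loss and the weight value are hypothetical, shown only to indicate where \lambda_{SI\text{-}SDR} enters and how gradients reach an unfrozen front-end.

```python
import torch

lambda_si_sdr = 0.1                              # illustrative weight (assumption)

def joint_training_step(batch, front_end, laura_tse, optimizer):
    """One optimization step with an unfrozen front-end (interfaces are hypothetical)."""
    m, r, s = batch                              # mixture, enrollment, clean target
    D_o = front_end(m, r)                        # gradients flow back into the front-end
    loss_gen = laura_tse.generative_loss(D_o, s) # L_AR + L_voc, conditioned on D_o
    loss = loss_gen + lambda_si_sdr * si_sdr_loss(D_o, s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```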

3. Inference Procedures and Collaborative Training

  • Pipeline:
  1. Extract log-mel spectrograms for r and m.
  2. Encode with the shared Conformer to obtain E_r, E_m.
  3. Run the AR LM for coarse token prediction.
  4. Map each token to its embedding and sum across layers to obtain \hat{D}_n(t).
  5. Invoke the encoder-only LM to compute \hat{E}_s.
  6. Decode via the neural audio codec to recover the time-domain waveform.
  • Two-Stage Mode:

At inference, the AR LM’s free-running output may be partially or completely substituted with the discriminative front-end’s own codec-encoded tokens at an injection ratio R \in [0, 1], trading off between fully autoregressive generation (R = 0) and total reliance on front-end pseudo-labels (R = 1); see the sketch after this list.

  • Front-End Training Variation:
    • Freeze: Discriminative module weights fixed; generative back-end learns to refine the fixed output.
    • Unfreeze: Gradients flow from the generative losses into the discriminative module, enhancing consistency at the cost of more complex loss balancing, mediated through \lambda_{SI\text{-}SDR}.
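
The token-injection mechanism of the two-stage mode can be sketched as follows; which positions are substituted at a given R is an assumption, since the section specifies only the overall ratio.

```python
import torch

def inject_front_end_tokens(ar_tokens: torch.Tensor,
                            fe_tokens: torch.Tensor,
                            R: float) -> torch.Tensor:
    """Substitute a fraction R of the AR LM's free-running tokens with the
    front-end's codec-encoded tokens. Both inputs are (batch, T) token ids.
    Here a random subset of positions is replaced (the exact schedule is an
    assumption). R = 0 keeps pure AR generation; R = 1 uses only front-end tokens."""
    mask = torch.rand_like(ar_tokens, dtype=torch.float) < R
    return torch.where(mask, fe_tokens, ar_tokens)

ar_tokens = torch.randint(0, 1024, (1, 200))   # free-running AR output
fe_tokens = torch.randint(0, 1024, (1, 200))   # codec-encoded front-end estimate
mixed = inject_front_end_tokens(ar_tokens, fe_tokens, R=0.5)
```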

4. Experimental Results and Evaluation Metrics

Evaluation has been conducted primarily on the Libri2Mix clean test set, using the following metrics:

  • DNSMOS (SIG, BAK, OVRL): Perceptual quality and noise suppression.
  • NISQA: Overall non-intrusive speech quality assessment.
  • SpeechBERT Score: Semantic consistency in self-supervised embedding space.
  • dWER: Differential word error rate via Whisper ASR (semantic intelligibility).
  • Speaker Similarity: Cosine similarity in WavLM-SV and WeSpeaker spaces.

Comparative results:

  • The purely generative LauraTSE achieves OVRL ≈ 3.34, NISQA ≈ 4.33, dWER ≈ 0.159, speaker sim ≈ 0.97 (Zeng et al., 9 Jan 2026), showing high perceptual quality but weaker intelligibility and fidelity compared to discriminative models.
  • The discriminative baseline (USEF-TFGridNet-L) has dWER ≈ 0.075 and high speaker similarity (>0.98) but lower OVRL ≈ 3.27, NISQA ≈ 4.32.
  • The two-stage USEF-Laura-TSE-L (front-end unfrozen, SI-SDR loss, multiset training) achieves OVRL ≈ 3.32, NISQA ≈ 4.45, dWER ≈ 0.117, speaker sim ≈ 0.98, representing a balance between semantic consistency and perceptual enhancement.
| Model | OVRL | NISQA | dWER | Speaker Sim |
|---|---|---|---|---|
| USEF-TFGridNet-L (D) | 3.27 | 4.32 | 0.075 | >0.98 |
| LauraTSE (AR+vocoder) (G) | 3.34 | 4.33 | 0.159 | 0.97 |
| USEF-Laura-TSE-L (D+G) | 3.32 | 4.45 | 0.117 | 0.98 |

D: Discriminative, G: Generative, D+G: Discriminative–Generative two-stage.

5. Ablation Studies and Architectural Insights

  • RVQ Layer Depth (n_q):

Varying n_q from 1 to 3 in the AR LM results in minimal differences in SIG/OVRL metrics (≤0.03), indicating that even a minimal coarse representation suffices for downstream performance (Tang et al., 10 Apr 2025).

  • Input Modality:

A fully discrete I/O variant, in which the first (AR) stage operates purely on tokens rather than continuous embeddings, underperforms the continuous-feature input modes. Replacing the Conformer with a frozen ASR Conformer or with WavLM embeddings degrades speaker-related scores.

  • Data Efficiency:

LauraTSE trained on 460 h of LibriSpeech matches or outperforms other generative models trained on 5000 h of multi-task data (e.g., AnyEnhance) in speaker and semantic similarity metrics.

  • Training Regime:

In joint training, fine-tuning the discriminative front-end (as opposed to freezing it) and applying the SI-SDR loss with a tuned \lambda_{SI\text{-}SDR} improves semantic fidelity without perceptual degradation.

6. Limitations, Extensions, and Outlook

While LauraTSE demonstrates strong data efficiency in single-task settings and performs well with coarse-to-fine generation, several limitations are reported:

  • Absence of explicit large-scale data scaling studies; current findings suggest data efficiency, but further quantification is needed.
  • The two-stage coarse-to-fine structure increases system complexity and latency, suggesting a potential benefit for unified end-to-end sequence modeling.
  • The neural codec is trained on clean speech only; adapting or jointly training the codec in noisy or mixture conditions may yield further robustness.
  • Extending the backbone to multimodal or multi-task setups, such as noise suppression or dereverberation, is highlighted as a future direction (Tang et al., 10 Apr 2025).
