LauraTSE: Target Speaker Extraction
- LauraTSE is a generative framework for target speaker extraction that employs an auto-regressive decoder-only LLM and neural audio codec representations.
- The system integrates a Conformer-based feature encoder and an encoder-only vocoder LM to reconstruct high-quality speech from mixed inputs.
- A two-stage discriminative–generative pipeline refines coarse estimates, balancing perceptual enhancement with semantic fidelity as validated on Libri2Mix metrics.
LauraTSE is a generative framework for target speaker extraction (TSE) that employs an auto-regressive (AR) decoder-only LLM in conjunction with neural audio codec representations. Given a short enrollment utterance of the target speaker, it separates and reconstructs that speaker's speech at high fidelity from a mixed audio input. In its most advanced configuration, LauraTSE is integrated into a two-stage discriminative–generative pipeline (USEF-Laura-TSE), which combines the control and robustness of a discriminative front-end with the perceptual quality enhancement capabilities of a generative back-end (Zeng et al., 9 Jan 2026, Tang et al., 10 Apr 2025).
1. System Architecture
The core LauraTSE system consists of three main components:
(a) Feature Encoder (Conformer):
A Conformer-based encoder processes both the enrollment utterance $x^{\mathrm{ref}}$ and the mixed speech input $x^{\mathrm{mix}}$, generating continuous embeddings $E^{\mathrm{ref}}$ and $E^{\mathrm{mix}}$ from log-mel spectrogram inputs:
$$E^{\mathrm{ref}} = \mathrm{Enc}\big(\mathrm{LogMel}(x^{\mathrm{ref}})\big), \qquad E^{\mathrm{mix}} = \mathrm{Enc}\big(\mathrm{LogMel}(x^{\mathrm{mix}})\big).$$
(b) AR Decoder-Only LLM (Decoder-only LM):
Conditioned on $E^{\mathrm{ref}}$ and $E^{\mathrm{mix}}$, this component autoregressively predicts the discrete tokens of the first $n_q$ residual vector quantization (RVQ) codec layers of the target speech:
$$P\big(c_{1:T}\big) = \prod_{t=1}^{T} P\big(c_{t,1:n_q} \mid c_{<t,1:n_q},\, E^{\mathrm{ref}},\, E^{\mathrm{mix}}\big).$$
(c) Encoder-Only Vocoder LLM (Vocoder LM):
This one-step Transformer takes the summed embeddings of the predicted coarse tokens, together with the encoder context, and outputs a fine-grained embedding representing the sum of all RVQ layers, which is then passed through a frozen neural audio codec decoder to produce the final waveform.
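The following minimal PyTorch sketch shows how the three components could fit together. It is illustrative only: the dimensions, layer counts, and the vanilla Transformer encoder standing in for the Conformer are assumptions rather than the published configuration, and the frozen codec decoder is omitted.

```python
# Structural sketch of the three LauraTSE components (illustrative only).
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Shared encoder over log-mel features of enrollment and mixture."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # Conformer stand-in

    def forward(self, logmel):                    # (B, T, n_mels)
        return self.encoder(self.proj(logmel))    # (B, T, d_model)

class ARDecoderLM(nn.Module):
    """Decoder-only LM predicting the first n_q RVQ token streams."""
    def __init__(self, d_model=256, vocab=1024, n_q=1):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_q)])

    def forward(self, context, prev_tokens):      # context (B, Tc, d), tokens (B, Tt)
        x = torch.cat([context, self.embed(prev_tokens)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.backbone(x, mask=causal)[:, context.size(1):]  # keep token positions
        return [head(h) for head in self.heads]   # list of (B, Tt, vocab) logits

class VocoderLM(nn.Module):
    """One-step Transformer mapping coarse token embeddings to summed RVQ embeddings."""
    def __init__(self, d_model=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, coarse_emb, context):       # both (B, *, d_model)
        out = self.encoder(torch.cat([context, coarse_emb], dim=1))
        return out[:, context.size(1):]           # fine-grained embedding per frame
```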
Two-Stage Discriminative–Generative Extension:
In the USEF-Laura-TSE framework, a discriminative front-end (USEF-TFGridNet) first extracts a coarse target speech estimate $\hat{s}^{\mathrm{coarse}}$ through a sequence of operations: STFT, 2-D convolutional encoding, cross multi-head attention, concatenation, TF-GridNet processing, and decoding (transposed convolution + iSTFT). The generative LauraTSE back-end then refines this intermediate representation, treating $\hat{s}^{\mathrm{coarse}}$ as the conditional context in place of $x^{\mathrm{mix}}$ to reconstruct the final high-fidelity waveform (Zeng et al., 9 Jan 2026).
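As a rough illustration of the two-stage flow, the sketch below wires a generic discriminative front-end into the generative back-end; the callable names (`front_end`, `logmel_fn`, `laura_backend`) are placeholders, not the papers' module names.

```python
# Two-stage USEF-Laura-TSE flow (sketch; callable names are placeholders).
def two_stage_extract(front_end, logmel_fn, encoder, laura_backend, x_mix, x_ref):
    s_coarse = front_end(x_mix, x_ref)          # discriminative coarse estimate
    e_ref = encoder(logmel_fn(x_ref))           # enrollment context
    e_coarse = encoder(logmel_fn(s_coarse))     # replaces the mixture context
    return laura_backend(e_ref, e_coarse)       # generative refinement to waveform
```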
2. Mathematical Formulations and Training Losses
Generative Module
- AR LM (Decoder-only):
Standard cross-entropy loss over the predicted codec tokens at each time step $t$ and across the $n_q$ layers:
$$\mathcal{L}_{\mathrm{AR}} = -\sum_{t=1}^{T}\sum_{k=1}^{n_q} \log P\big(c_{t,k} \mid c_{<t,1:n_q},\, E^{\mathrm{ref}},\, E^{\mathrm{mix}}\big).$$
- Encoder-Only Vocoder LM:
Supervised with the ground-truth summed embeddings, using an $L_1$ loss:
$$\mathcal{L}_{\mathrm{voc}} = \big\lVert \hat{E}^{\mathrm{sum}} - E^{\mathrm{sum}} \big\rVert_1.$$
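A compact sketch of the two generative losses follows. It assumes the AR logits arrive as a list of per-RVQ-layer predictions and that the vocoder regression uses an $L_1$ distance; both layouts are assumptions consistent with the formulation above.

```python
# Generative training losses (sketch; tensor layouts are assumptions).
import torch.nn.functional as F

def ar_loss(logits_per_layer, targets_per_layer):
    """Cross-entropy summed over the n_q predicted RVQ token streams.
    logits: list of (B, T, vocab); targets: list of (B, T) integer codes."""
    return sum(F.cross_entropy(l.transpose(1, 2), t)
               for l, t in zip(logits_per_layer, targets_per_layer))

def vocoder_loss(pred_sum_emb, gt_sum_emb):
    """L1 regression between predicted and ground-truth summed RVQ embeddings."""
    return F.l1_loss(pred_sum_emb, gt_sum_emb)
```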
Discriminative Front-End
- Initially trained with a complex-spectral MSE loss; during joint training, an SI-SDR loss is optionally added:
$$\mathcal{L}_{\mathrm{SI\text{-}SDR}} = -10\log_{10}\frac{\lVert s_{\mathrm{target}}\rVert^2}{\lVert e_{\mathrm{noise}}\rVert^2},$$
where $s_{\mathrm{target}} = \dfrac{\langle \hat{s}, s\rangle}{\lVert s\rVert^2}\, s$ and $e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}}$.
Combined Objective
- For joint training with an unfrozen front-end:
$$\mathcal{L} = \mathcal{L}_{\mathrm{gen}} + \lambda\, \mathcal{L}_{\mathrm{SI\text{-}SDR}},$$
where $\mathcal{L}_{\mathrm{gen}}$ aggregates the AR and vocoder losses and $\lambda$ weights the SI-SDR term.
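The SI-SDR term and the combined objective can be written out as below; the weight name `lam` and its default value are illustrative, as the tuned value is a hyperparameter.

```python
# SI-SDR loss and combined joint-training objective (sketch).
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimated and reference waveforms (B, N)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    s_target = alpha * ref                      # scaled projection onto the reference
    e_noise = est - s_target
    ratio = s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps)
    return -10 * torch.log10(ratio + eps).mean()

def total_loss(l_ar, l_voc, l_sisdr, lam=0.1):
    """L_gen (= AR + vocoder losses) plus lambda-weighted SI-SDR on the front-end."""
    return l_ar + l_voc + lam * l_sisdr
```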
3. Inference Procedures and Collaborative Training
- Pipeline (a step-by-step sketch follows this list):
- Extract log-mel spectrograms for $x^{\mathrm{ref}}$ and $x^{\mathrm{mix}}$.
- Encode with the shared Conformer to obtain $E^{\mathrm{ref}}$ and $E^{\mathrm{mix}}$.
- Run the AR LM for coarse token prediction.
- Map each token to its embedding and sum across the $n_q$ layers to obtain the coarse embedding.
- Invoke the encoder-only vocoder LM to compute the fine-grained summed embedding $\hat{E}^{\mathrm{sum}}$.
- Decode via the frozen neural audio codec to recover the time-domain waveform.
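The sketch below walks through these steps in order; greedy decoding, a single RVQ stream, and the helper names (`logmel`, `codec_embed`, `codec_decode`, `bos_id`, `eos_id`) are illustrative assumptions rather than the published interface.

```python
# Greedy inference sketch for the LauraTSE pipeline (helper names are assumptions).
import torch

@torch.no_grad()
def infer(encoder, ar_lm, vocoder_lm, logmel, codec_embed, codec_decode,
          x_mix, x_ref, bos_id=0, eos_id=1, max_steps=1000):
    e_ref = encoder(logmel(x_ref))                         # steps 1-2: features + encoding
    e_mix = encoder(logmel(x_mix))
    context = torch.cat([e_ref, e_mix], dim=1)

    tokens = torch.full((x_mix.size(0), 1), bos_id,
                        dtype=torch.long, device=x_mix.device)
    for _ in range(max_steps):                             # step 3: coarse AR decoding
        logits = ar_lm(context, tokens)[0][:, -1]          # first RVQ stream, last position
        nxt = logits.argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
        if (nxt == eos_id).all():
            break

    coarse_emb = codec_embed(tokens[:, 1:])                # step 4: tokens -> summed embedding
    fine_emb = vocoder_lm(coarse_emb, context)             # step 5: one-step refinement
    return codec_decode(fine_emb)                          # step 6: frozen codec decoder
```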
- Two-Stage Mode:
At inference, the AR LM's free-running output may be partially or completely substituted with the discriminative front-end's own codec-encoded tokens at a tunable injection ratio, trading off fully autoregressive generation against total reliance on the front-end's pseudo-labels (see the sketch after this list).
- Front-End Training Variation:
- Freeze: Discriminative module weights fixed; generative back-end learns to refine the fixed output.
- Unfreeze: Gradients flow from the generative losses into the discriminative module, enhancing consistency at the cost of more complex loss balancing, mediated through the SI-SDR weight $\lambda$.
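The two control knobs described above, token injection from the front-end and front-end freezing, could be realized roughly as follows. The leading-fraction substitution is only one possible reading of the injection ratio and is an assumption, not the papers' exact schedule.

```python
# Control knobs for the two-stage mode (sketch; substitution schedule is assumed).
import torch

def inject_frontend_tokens(ar_tokens, frontend_tokens, ratio):
    """Replace a fraction `ratio` of AR-decoded tokens with the front-end's
    codec-encoded pseudo-labels (here: the leading fraction, as an illustration)."""
    T = min(ar_tokens.size(1), frontend_tokens.size(1))
    k = int(round(ratio * T))
    mixed = ar_tokens[:, :T].clone()
    mixed[:, :k] = frontend_tokens[:, :k]
    return mixed

def set_frontend_trainable(front_end, unfreeze: bool):
    """Freeze keeps the discriminative module fixed; unfreeze lets the generative
    losses (balanced via the SI-SDR weight) back-propagate into it."""
    for p in front_end.parameters():
        p.requires_grad_(unfreeze)
```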
4. Experimental Results and Evaluation Metrics
Evaluation has been conducted primarily on the Libri2Mix clean test set, using the following metrics:
- DNSMOS (SIG, BAK, OVRL): Perceptual quality and noise suppression.
- NISQA: Overall non-intrusive speech quality assessment.
- SpeechBERT Score: Semantic consistency in self-supervised embedding space.
- dWER: Differential word error rate via Whisper ASR (semantic intelligibility).
- Speaker Similarity: Cosine similarity in WavLM-SV and WeSpeaker spaces.
Comparative results:
- The purely generative LauraTSE achieves OVRL ≈ 3.34, NISQA ≈ 4.33, dWER ≈ 0.159, speaker sim ≈ 0.97 (Zeng et al., 9 Jan 2026), showing high perceptual quality but weaker intelligibility and fidelity compared to discriminative models.
- The discriminative baseline (USEF-TFGridNet-L) has dWER ≈ 0.075 and high speaker similarity (>0.98) but lower OVRL ≈ 3.27, NISQA ≈ 4.32.
- The two-stage USEF-Laura-TSE-L (front-end unfrozen, SI-SDR loss, multiset training) achieves OVRL ≈ 3.32, NISQA ≈ 4.45, dWER ≈ 0.117, speaker sim ≈ 0.98, representing a balance between semantic consistency and perceptual enhancement.
| Model | OVRL | NISQA | dWER | Speaker Sim |
|---|---|---|---|---|
| USEF-TFGridNet-L (D) | 3.27 | 4.32 | 0.075 | >0.98 |
| LauraTSE (AR+vocoder) (G) | 3.34 | 4.33 | 0.159 | 0.97 |
| USEF-Laura-TSE-L (D+G) | 3.32 | 4.45 | 0.117 | 0.98 |
D: Discriminative, G: Generative, D+G: Discriminative–Generative two-stage.
5. Ablation Studies and Architectural Insights
- RVQ Layer Depth ($n_q$):
Varying $n_q$ from 1 to 3 in the AR LM results in minimal differences in SIG/OVRL metrics (≤0.03), indicating that even minimal coarse content suffices for downstream performance (Tang et al., 10 Apr 2025).
- Input Modality:
Discrete I/O, where Stage I operates purely on tokens rather than continuous embeddings, underperforms continuous-feature input modes. Replacing the Conformer with a frozen ASR Conformer or WavLM embeddings degrades speaker-related scores.
- Data Efficiency:
LauraTSE trained on 460 h LibriSpeech matches or outperforms other generative models trained on 5000 h multi-task data (e.g., AnyEnhance) in speaker and semantic similarity metrics.
- Training Regime:
In joint training, fine-tuning the discriminative front-end (rather than freezing it) and applying the SI-SDR loss with a tuned weight $\lambda$ improves semantic fidelity without perceptual degradation.
6. Limitations, Extensions, and Outlook
While LauraTSE demonstrates strong data efficiency in single-task settings and performs well with coarse-to-fine generation, several limitations are reported:
- Absence of explicit large-scale data scaling studies; current findings suggest data efficiency, but further quantification is needed.
- The two-stage coarse-to-fine structure increases system complexity and latency, suggesting a potential benefit for unified end-to-end sequence modeling.
- The neural codec is trained on clean speech only; adapting or jointly training the codec in noisy or mixture conditions may yield further robustness.
- Extending the backbone to multimodal or multi-task setups, such as noise suppression or dereverberation, is highlighted as a future direction (Tang et al., 10 Apr 2025).
References
- "Discriminative-Generative Target Speaker Extraction with Decoder-Only LLMs" (Zeng et al., 9 Jan 2026).
- "LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only LLMs" (Tang et al., 10 Apr 2025).