LauraTSE: Target Speaker Extraction
- LauraTSE is a generative framework for target speaker extraction that pairs an auto-regressive decoder-only language model with neural audio codec representations.
- The system integrates a Conformer-based feature encoder and an encoder-only vocoder LM to reconstruct high-quality speech from mixed inputs.
- A two-stage discriminative–generative pipeline refines coarse estimates, balancing perceptual enhancement with semantic fidelity, as evaluated on Libri2Mix.
LauraTSE is a generative framework for target speaker extraction (TSE) that employs an auto-regressive (AR) decoder-only LLM in conjunction with neural audio codec representations. Given a short enrollment utterance of the target speaker, it separates that speaker's speech from a mixed audio input and reconstructs it at high fidelity. In its most advanced configuration, LauraTSE is integrated into a two-stage discriminative–generative pipeline (USEF-Laura-TSE), which combines the control and robustness of a discriminative front-end with the perceptual quality enhancement of a generative back-end (Zeng et al., 9 Jan 2026, Tang et al., 10 Apr 2025).
1. System Architecture
The core LauraTSE system consists of three main components:
(a) Feature Encoder (Conformer):
A Conformer-based encoder processes both the enrollment utterance ($x^{\mathrm{ref}}$) and the mixed speech input ($x^{\mathrm{mix}}$), generating continuous embeddings $E^{\mathrm{ref}}$ and $E^{\mathrm{mix}}$ from log-mel spectrogram inputs:

$$E^{\mathrm{ref}} = \mathrm{Enc}\big(\mathrm{Mel}(x^{\mathrm{ref}})\big), \qquad E^{\mathrm{mix}} = \mathrm{Enc}\big(\mathrm{Mel}(x^{\mathrm{mix}})\big)$$
(b) AR Decoder-Only LLM (Decoder-only LM):
Conditioned on $[E^{\mathrm{ref}}; E^{\mathrm{mix}}]$, this component autoregressively predicts the discrete tokens $c_{t,1:K}$ of the first $K$ residual vector quantization (RVQ) codec layers of the target speech:

$$P\big(c_{1:T,\,1:K} \mid E^{\mathrm{ref}}, E^{\mathrm{mix}}\big) = \prod_{t=1}^{T} P\big(c_{t,1:K} \mid c_{<t,1:K}, E^{\mathrm{ref}}, E^{\mathrm{mix}}\big)$$
(c) Encoder-Only Vocoder LLM (Vocoder LM):
This one-step Transformer takes the summed coarse token embeddings $E^{\mathrm{coarse}}$ and outputs a fine-grained embedding $\hat{E}$ representing the sum of all RVQ layers, which is then passed through a frozen neural audio codec decoder to produce the final waveform.
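The following minimal PyTorch sketch illustrates this three-component layout. It is a structural illustration under simplifying assumptions: a plain Transformer encoder stands in for the Conformer, a single coarse RVQ layer is assumed, the frozen codec decoder is omitted, and all names, dimensions, and depths are hypothetical rather than taken from the released implementation.

```python
import torch
import torch.nn as nn

class LauraTSESketch(nn.Module):
    """Structural sketch of LauraTSE: shared feature encoder, AR decoder-only LM
    over coarse RVQ tokens, and a one-step (encoder-only) vocoder LM.
    All hyperparameters are illustrative."""

    def __init__(self, n_mels=80, d_model=512, n_codes=1024):
        super().__init__()
        # (a) Feature encoder: a vanilla Transformer encoder stands in for the Conformer.
        self.mel_proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # (b) AR decoder-only LM predicting discrete coarse codec tokens.
        self.token_emb = nn.Embedding(n_codes, d_model)
        ar_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.ar_lm = nn.TransformerEncoder(ar_layer, num_layers=6)  # causal mask applied in forward
        self.ar_head = nn.Linear(d_model, n_codes)
        # (c) Encoder-only vocoder LM mapping coarse embeddings to the summed RVQ embedding.
        voc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.vocoder_lm = nn.TransformerEncoder(voc_layer, num_layers=6)
        self.voc_head = nn.Linear(d_model, d_model)

    def encode(self, mel_ref, mel_cond):
        # Shared encoder over enrollment and mixture (or front-end estimate) log-mels.
        e_ref = self.encoder(self.mel_proj(mel_ref))
        e_cond = self.encoder(self.mel_proj(mel_cond))
        return torch.cat([e_ref, e_cond], dim=1)  # conditional context [E_ref; E_mix]

    def ar_step(self, context, prev_tokens):
        # One AR step: condition on the context and previously generated coarse tokens.
        seq = torch.cat([context, self.token_emb(prev_tokens)], dim=1)
        T = seq.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=seq.device), diagonal=1)
        h = self.ar_lm(seq, mask=causal)
        return self.ar_head(h[:, -1])  # logits for the next coarse token

    def vocode(self, coarse_emb, context):
        # One-step refinement of coarse token embeddings into the summed RVQ embedding E_hat,
        # which a frozen neural codec decoder (not shown) converts to a waveform.
        h = self.vocoder_lm(torch.cat([context, coarse_emb], dim=1))
        return self.voc_head(h[:, context.size(1):])
```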
Two-Stage Discriminative–Generative Extension:
In the USEF-Laura-TSE framework, a discriminative front-end (USEF-TFGridNet) first extracts a coarse target speech estimate $\hat{s}_{\mathrm{D}}$ through a sequence of operations: STFT, 2-D convolutional encoding, cross multi-head attention, concatenation, TF-GridNet processing, and decoding (transposed-conv + iSTFT). The generative LauraTSE back-end then refines this intermediate representation:

$$\hat{s} = \mathrm{LauraTSE}\big(x^{\mathrm{ref}}, \hat{s}_{\mathrm{D}}\big)$$

treating $\hat{s}_{\mathrm{D}}$ as the conditional context in place of $x^{\mathrm{mix}}$ to reconstruct the final high-fidelity waveform (Zeng et al., 9 Jan 2026).
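A short sketch of this composition, assuming the `LauraTSESketch` interface above and hypothetical `usef_tfgridnet` and `log_mel` callables; it only illustrates that the coarse front-end estimate is swapped in for the mixture as the back-end's condition.

```python
import torch

def two_stage_condition(backend, usef_tfgridnet, log_mel, x_mix, x_ref):
    """Hypothetical two-stage hookup: the discriminative front-end's coarse estimate
    s_D replaces the raw mixture x_mix as the generative back-end's conditional input."""
    with torch.no_grad():
        s_coarse = usef_tfgridnet(x_mix, x_ref)  # STFT -> TF-GridNet -> iSTFT (stand-in call)
    # Back-end conditioning on (reference, coarse estimate) instead of (reference, mixture).
    return backend.encode(log_mel(x_ref), log_mel(s_coarse))
```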
2. Mathematical Formulations and Training Losses
Generative Module
- AR LM (Decoder-only):
Standard cross-entropy loss over the predicted codec tokens at each time step $t$ and across the $K$ coarse RVQ layers:

$$\mathcal{L}_{\mathrm{AR}} = -\sum_{t=1}^{T}\sum_{k=1}^{K} \log P\big(c_{t,k} \mid c_{<t,1:K}, E^{\mathrm{ref}}, E^{\mathrm{mix}}\big)$$
- Encoder-Only Vocoder LM:
Supervised with the ground-truth summed embeddings $E$, using an $L_1$ loss (both generative losses are sketched in code after this list):

$$\mathcal{L}_{\mathrm{voc}} = \big\| \hat{E} - E \big\|_1$$
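A minimal sketch of the two generative losses, assuming AR logits of shape (batch, time, layers, vocab) and (batch, time, dim) embeddings for the vocoder LM; the tensor layout is an assumption, not the papers' exact convention.

```python
import torch
import torch.nn.functional as F

def ar_token_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over predicted codec tokens, averaged over time steps and
    the K coarse RVQ layers.
    logits:  (B, T, K, V) scores over the codec vocabulary
    targets: (B, T, K)    ground-truth token indices
    """
    B, T, K, V = logits.shape
    return F.cross_entropy(logits.reshape(B * T * K, V), targets.reshape(B * T * K))

def vocoder_l1_loss(e_pred: torch.Tensor, e_true: torch.Tensor) -> torch.Tensor:
    """L1 regression of the predicted fine-grained embedding E_hat against the
    ground-truth sum of all RVQ-layer embeddings; both (B, T, D)."""
    return F.l1_loss(e_pred, e_true)
```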
Discriminative Front-End
- Initially trained with a complex-spectral MSE loss; during joint training, an SI-SDR loss is optionally added:

$$\mathcal{L}_{\mathrm{SI\text{-}SDR}} = -10\log_{10}\frac{\|\alpha s\|^2}{\|\hat{s}_{\mathrm{D}} - \alpha s\|^2}$$

where $\alpha = \langle \hat{s}_{\mathrm{D}}, s\rangle / \|s\|^2$ and $s$ denotes the clean target waveform.
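A standard negative SI-SDR implementation matching the definition above; this is a generic sketch, not code from either paper.

```python
import torch

def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR (to be minimized). est, ref: (B, N) waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Optimal scaling alpha = <est, ref> / ||ref||^2 projects the estimate onto the target.
    alpha = (est * ref).sum(dim=-1, keepdim=True) / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    si_sdr = 10 * torch.log10((target.pow(2).sum(dim=-1) + eps) / (noise.pow(2).sum(dim=-1) + eps))
    return -si_sdr.mean()
```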
Combined Objective
- For joint training with an unfrozen front-end:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{G}} + \lambda\,\mathcal{L}_{\mathrm{SI\text{-}SDR}}$$

where $\mathcal{L}_{\mathrm{G}} = \mathcal{L}_{\mathrm{AR}} + \mathcal{L}_{\mathrm{voc}}$ aggregates the AR and vocoder losses and $\lambda$ balances the discriminative term.
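Tying the pieces together, joint training with an unfrozen front-end can be expressed with the sketch losses above; the weight value is purely illustrative.

```python
def joint_loss(ar_logits, ar_targets, e_pred, e_true, est_wav, ref_wav, lam=0.1):
    """Total objective L_G + lambda * L_SI-SDR, using ar_token_loss, vocoder_l1_loss,
    and si_sdr_loss from the sketches above; lam is the tuned balancing weight."""
    l_g = ar_token_loss(ar_logits, ar_targets) + vocoder_l1_loss(e_pred, e_true)
    return l_g + lam * si_sdr_loss(est_wav, ref_wav)
```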
3. Inference Procedures and Collaborative Training
- Pipeline (summarized in the code sketch after this list):
- Extract log-mel spectrograms for $x^{\mathrm{ref}}$ and $x^{\mathrm{mix}}$.
- Encode with the shared Conformer to obtain $E^{\mathrm{ref}}$ and $E^{\mathrm{mix}}$.
- Run the AR LM for coarse token prediction.
- Map each token to its embedding and sum across the $K$ layers to form $E^{\mathrm{coarse}}$.
- Invoke the encoder-only LM to compute $\hat{E}$.
- Decode via the neural audio codec to recover the time-domain waveform.
- Two-Stage Mode:
At inference, the AR LM's free-running output may be partially or completely substituted with the discriminative front-end's own codec-encoded tokens at an injection ratio $\rho$, trading between fully autoregressive generation ($\rho = 0$) and total reliance on front-end pseudo-labels ($\rho = 1$).
- Front-End Training Variation:
- Freeze: Discriminative module weights fixed; generative back-end learns to refine the fixed output.
- Unfreeze: Gradients flow from the generative losses into the discriminative module, enhancing consistency at the cost of more complex balancing, mediated through the weight $\lambda$.
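The following hedged sketch walks through the inference pipeline and the two-stage token-injection mechanism, reusing the `LauraTSESketch` interface from Section 1; `codec.decode` and the token layout (a single coarse RVQ layer, greedy decoding, no end-of-sequence handling) are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def extract_target(model, codec, mel_ref, mel_cond, frontend_tokens=None, rho=0.0, max_steps=500):
    """Inference sketch for LauraTSE / USEF-Laura-TSE.
    model:           LauraTSESketch-style module (encoder, AR LM, vocoder LM)
    codec:           frozen neural audio codec with an assumed decode() method
    mel_cond:        log-mel of the mixture (single-stage) or of the front-end estimate (two-stage)
    frontend_tokens: codec-encoded tokens of the front-end estimate, used for injection
    rho:             injection ratio; 0.0 = fully autoregressive, 1.0 = all front-end pseudo-labels
    """
    context = model.encode(mel_ref, mel_cond)

    tokens = torch.zeros(1, 0, dtype=torch.long)  # generated coarse-token sequence
    steps = frontend_tokens.size(1) if frontend_tokens is not None else max_steps
    for t in range(steps):
        logits = model.ar_step(context, tokens)            # next-token distribution (greedy here)
        next_tok = logits.argmax(dim=-1, keepdim=True)
        # Two-stage mode: with probability rho, substitute the front-end's own token.
        if frontend_tokens is not None and torch.rand(1).item() < rho:
            next_tok = frontend_tokens[:, t:t + 1]
        tokens = torch.cat([tokens, next_tok], dim=1)

    coarse_emb = model.token_emb(tokens)                    # token embeddings, summed over layers
    e_hat = model.vocode(coarse_emb, context)               # fine-grained embedding E_hat
    return codec.decode(e_hat)                              # frozen codec decoder -> waveform
```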
4. Experimental Results and Evaluation Metrics
Evaluation has been conducted primarily on the Libri2Mix clean test set, using the following metrics:
- DNSMOS (SIG, BAK, OVRL): Perceptual quality and noise suppression.
- NISQA: Overall non-intrusive speech quality assessment.
- SpeechBERT Score: Semantic consistency in self-supervised embedding space.
- dWER: Differential word error rate via Whisper ASR (semantic intelligibility); see the sketch after this list.
- Speaker Similarity: Cosine similarity in WavLM-SV and WeSpeaker spaces.
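As an example of the dWER metric, the sketch below scores a Whisper transcript of the extracted speech against a Whisper transcript of the clean target; the `openai-whisper` and `jiwer` packages and the model size are assumptions, since the papers do not specify the exact recipe.

```python
# pip install openai-whisper jiwer  (assumed tooling, not specified by the papers)
import whisper
import jiwer

def dwer(clean_wav: str, estimate_wav: str, model_size: str = "base") -> float:
    """Differential WER: the ASR transcript of the clean target serves as the
    reference, and the transcript of the extracted speech is scored against it."""
    model = whisper.load_model(model_size)
    ref_text = model.transcribe(clean_wav)["text"]
    hyp_text = model.transcribe(estimate_wav)["text"]
    return jiwer.wer(ref_text, hyp_text)
```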
Comparative results:
- The purely generative LauraTSE achieves OVRL ≈ 3.34, NISQA ≈ 4.33, dWER ≈ 0.159, speaker sim ≈ 0.97 (Zeng et al., 9 Jan 2026), showing high perceptual quality but weaker intelligibility and fidelity compared to discriminative models.
- The discriminative baseline (USEF-TFGridNet-L) has dWER ≈ 0.075 and high speaker similarity (>0.98) but lower OVRL ≈ 3.27, NISQA ≈ 4.32.
- The two-stage USEF-Laura-TSE-L (front-end unfrozen, SI-SDR loss, multiset training) achieves OVRL ≈ 3.32, NISQA ≈ 4.45, dWER ≈ 0.117, speaker sim ≈ 0.98, representing a balance between semantic consistency and perceptual enhancement.
| Model | OVRL | NISQA | dWER | Speaker Sim |
|---|---|---|---|---|
| USEF-TFGridNet-L (D) | 3.27 | 4.32 | 0.075 | >0.98 |
| LauraTSE (AR+vocoder) (G) | 3.34 | 4.33 | 0.159 | 0.97 |
| USEF-Laura-TSE-L (D+G) | 3.32 | 4.45 | 0.117 | 0.98 |
D: Discriminative, G: Generative, D+G: Discriminative–Generative two-stage.
5. Ablation Studies and Architectural Insights
- RVQ Layer Depth ($K$):
Varying $K$ from 1 to 3 in the AR LM results in minimal differences in SIG/OVRL metrics (≤0.03), indicating that even minimal coarse content suffices for downstream performance (Tang et al., 10 Apr 2025).
- Input Modality:
Discrete I/O, where Stage I operates purely on tokens rather than continuous embeddings, underperforms continuous-feature input modes. Replacing the Conformer with a frozen ASR Conformer or WavLM embeddings degrades speaker-related scores.
- Data Efficiency:
LauraTSE trained on 460 h LibriSpeech matches or outperforms other generative models trained on 5000 h multi-task data (e.g., AnyEnhance) in speaker and semantic similarity metrics.
- Training Regime:
In joint training, fine-tuning the discriminative front-end (as opposed to freezing it) and applying the SI-SDR loss with a tuned $\lambda$ improves semantic fidelity without perceptual degradation.
6. Limitations, Extensions, and Outlook
While LauraTSE demonstrates strong data efficiency in single-task settings and performs effectively with coarse-to-fine generation, several limitations are reported:
- Absence of explicit large-scale data scaling studies; current findings suggest data efficiency, but further quantification is needed.
- The two-stage coarse-to-fine structure increases system complexity and latency, suggesting a potential benefit for unified end-to-end sequence modeling.
- The neural codec is trained on clean speech only; adapting or jointly training the codec in noisy or mixture conditions may yield further robustness.
- Extending the backbone to multimodal or multi-task setups, such as noise suppression or dereverberation, is highlighted as a future direction (Tang et al., 10 Apr 2025).
References
- "Discriminative-Generative Target Speaker Extraction with Decoder-Only LLMs" (Zeng et al., 9 Jan 2026).
- "LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only LLMs" (Tang et al., 10 Apr 2025).