LauraTSE: Target Speaker Extraction

Updated 13 January 2026
  • LauraTSE is a generative framework for target speaker extraction that employs auto-regressive decoders and neural audio codec representations.
  • The system integrates a Conformer-based feature encoder and an encoder-only vocoder LM to reconstruct high-quality speech from mixed inputs.
  • A two-stage discriminative–generative pipeline refines coarse estimates, balancing perceptual enhancement with semantic fidelity, as validated on the Libri2Mix benchmark.

LauraTSE is a generative framework for target speaker extraction (TSE), employing an auto-regressive (AR) decoder-only LLM in conjunction with neural audio codec representations. It is designed to separate and reconstruct, with high fidelity, the speech of a specific target speaker from a mixed audio input, given a short enrollment utterance of that speaker. In its most advanced configuration, LauraTSE is integrated into a two-stage discriminative–generative pipeline (USEF-Laura-TSE), which combines the control and robustness of a discriminative front-end with the perceptual quality enhancement capabilities of a generative back-end (Zeng et al., 9 Jan 2026; Tang et al., 10 Apr 2025).

1. System Architecture

The core LauraTSE system consists of three main components:

(a) Feature Encoder (Conformer):

A Conformer-based encoder processes both the enrollment utterance r and the mixed speech input m, generating continuous embeddings E_r and E_m from log-mel spectrogram inputs:

E_m = \mathcal{C}(M_m), \quad E_r = \mathcal{C}(M_r).
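
As a concrete illustration, a minimal PyTorch sketch of this step is shown below, using torchaudio's MelSpectrogram and Conformer as stand-ins for the actual front-end; the hyperparameters (80 mel bins, four Conformer layers, 10 ms hop) are illustrative assumptions rather than values reported for LauraTSE.

```python
import torch
import torchaudio

# Illustrative hyperparameters (assumptions, not the paper's settings).
N_MELS, SAMPLE_RATE = 80, 16000

# Log-mel front-end shared by the mixture and the enrollment utterance.
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=512, hop_length=160, n_mels=N_MELS
)

# Shared Conformer encoder C(.).
conformer = torchaudio.models.Conformer(
    input_dim=N_MELS, num_heads=4, ffn_dim=1024,
    num_layers=4, depthwise_conv_kernel_size=31,
)

def encode(wav: torch.Tensor) -> torch.Tensor:
    """wav: (batch, samples) -> continuous embeddings (batch, frames, N_MELS)."""
    mel = melspec(wav)                              # (batch, n_mels, frames)
    logmel = torch.log(mel + 1e-6).transpose(1, 2)  # (batch, frames, n_mels)
    lengths = torch.full((wav.size(0),), logmel.size(1))
    out, _ = conformer(logmel, lengths)
    return out

m = torch.randn(1, SAMPLE_RATE * 4)   # 4 s mixture
r = torch.randn(1, SAMPLE_RATE * 2)   # 2 s enrollment utterance
E_m, E_r = encode(m), encode(r)       # E_m = C(M_m), E_r = C(M_r)
```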

(b) AR Decoder-Only LLM (Decoder-only LM):

Conditioned on the prefix [\mathrm{bos}; E_r; \mathrm{sep}; E_m; \mathrm{tse}], this component autoregressively predicts the discrete tokens for the first n residual vector quantization (RVQ) codec layers of the target speech:

P_\theta(\hat{D}_n | E_m, E_r) = \prod_{i=1}^T P_\theta(\hat{D}_n^{(i)} | \hat{D}_n^{(1:i-1)}, E_m, E_r).
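
The conditioning layout and autoregressive factorization can be sketched as follows for the single-layer case (n = 1); the Transformer configuration, the special-token handling, and the greedy decoding loop are assumptions made for illustration, not the published implementation.

```python
import torch
import torch.nn as nn

class ARDecoderLM(nn.Module):
    """Minimal decoder-only LM sketch: consumes the prefix [bos; E_r; sep; E_m; tse]
    plus previously generated codec tokens and predicts the next coarse token.
    Layer sizes and vocab are assumptions; multiple RVQ layers would need extra heads."""

    def __init__(self, d_model=256, vocab=1024, n_layers=6, n_heads=4):
        super().__init__()
        self.special = nn.Embedding(3, d_model)            # bos, sep, tse markers
        self.tok_emb = nn.Embedding(vocab, d_model)        # codec-token embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, E_r, E_m, prev_tokens):
        B = E_r.size(0)
        bos, sep, tse = (self.special.weight[i].expand(B, 1, -1) for i in range(3))
        prefix = torch.cat([bos, E_r, sep, E_m, tse], dim=1)
        x = torch.cat([prefix, self.tok_emb(prev_tokens)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        return self.head(h[:, -1])                         # logits for the next token

@torch.no_grad()
def greedy_decode(lm, E_r, E_m, steps):
    """Each step conditions on E_m, E_r and the tokens D_hat_n^(1:i-1) generated so far."""
    tokens = torch.zeros(E_r.size(0), 0, dtype=torch.long)
    for _ in range(steps):
        logits = lm(E_r, E_m, tokens)
        tokens = torch.cat([tokens, logits.argmax(-1, keepdim=True)], dim=1)
    return tokens

# Dummy conditioning embeddings (in practice from the Conformer encoder, projected
# to d_model; random tensors are used here to keep the sketch self-contained).
E_r, E_m = torch.randn(1, 50, 256), torch.randn(1, 200, 256)
coarse_tokens = greedy_decode(ARDecoderLM().eval(), E_r, E_m, steps=200)
```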

(c) Encoder-Only Vocoder LLM (Vocoder LM):

This one-step Transformer takes [E_r, E_m, \hat{D}_n] and outputs a fine-grained embedding \hat{E}_s representing the sum of all RVQ layers, which is then passed through a frozen neural audio codec decoder to produce the final waveform.
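
A minimal sketch of this one-step prediction is given below; the layer sizes, the use of a plain Transformer encoder, and the stand-in codec decoder are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VocoderLM(nn.Module):
    """One-step (non-autoregressive) encoder sketch: maps [E_r, E_m, embedded
    D_hat_n] to the fine-grained embedding E_hat_s in a single forward pass."""

    def __init__(self, d_model=256, vocab=1024, n_layers=4, n_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)        # embeds coarse codec tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_model)             # projects to codec-embedding space

    def forward(self, E_r, E_m, coarse_tokens):
        D_n = self.tok_emb(coarse_tokens)                  # embedded D_hat_n
        x = torch.cat([E_r, E_m, D_n], dim=1)              # [E_r, E_m, D_hat_n]
        h = self.encoder(x)                                # bidirectional, single step
        return self.out(h[:, -coarse_tokens.size(1):])     # frames aligned with D_hat_n

E_r, E_m = torch.randn(1, 50, 256), torch.randn(1, 200, 256)
coarse_tokens = torch.randint(0, 1024, (1, 200))
E_hat_s = VocoderLM().eval()(E_r, E_m, coarse_tokens)      # (1, 200, 256)
# waveform = frozen_codec_decoder(E_hat_s)   # frozen neural codec decoder (not shown)
```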

Two-Stage Discriminative–Generative Extension:

In the USEF-Laura-TSE framework, a discriminative front-end (USEF-TFGridNet) first extracts a coarse target speech estimate D_o = \mathcal{D}(m, r) through a sequence of operations: STFT, 2-D convolutional encoding, cross multi-head attention, concatenation, TF-GridNet processing, and decoding (transposed-conv + iSTFT). The generative LauraTSE back-end \mathcal{G} then refines this intermediate representation:

G_o = \mathcal{G}(D_o),

treating D_o as the conditional context in place of E_m, E_r to reconstruct the final high-fidelity waveform (Zeng et al., 9 Jan 2026).
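
Functionally, the two-stage pipeline reduces to the composition sketched below; the callables front_end and laura_tse are hypothetical stand-ins, since the section only specifies the mapping G_o = \mathcal{G}(\mathcal{D}(m, r)).

```python
import torch

def two_stage_tse(m: torch.Tensor, r: torch.Tensor, front_end, laura_tse):
    """Hypothetical composition of the USEF-Laura-TSE pipeline.

    front_end : discriminative module D(.), e.g. a USEF-TFGridNet-style network
    laura_tse : generative back-end G(.), conditioned on D_o instead of (E_m, E_r)
    """
    D_o = front_end(m, r)        # coarse target estimate, D_o = D(m, r)
    G_o = laura_tse(D_o)         # generative refinement, G_o = G(D_o)
    return G_o
```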

2. Mathematical Formulations and Training Losses

Generative Module

  • AR LM (Decoder-only):

Standard cross-entropy loss over predicted codec tokens at each time step and across the n coarse layers (a combined code sketch of both generative losses follows after this list):

\mathcal{L}_{AR} = - \sum_{t=1}^T \sum_{l=1}^n \log P_\theta(\mathrm{token}_{\mathrm{GT}}^{(t,l)} \mid \mathrm{token}^{(<t,l)}, E_m, E_r)

  • Encoder-Only Vocoder LM:

Supervised with ground-truth summed embeddings, using an L_1 + L_2 loss:

\mathcal{L}_{voc} = \| \hat{E}_s - E_s \|_1 + \| \hat{E}_s - E_s \|_2^2
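
A minimal sketch of both generative losses, assuming AR logits of shape (batch, T, n, vocab) and embedding tensors of shape (batch, frames, dim); PyTorch's default mean reduction replaces the explicit sums.

```python
import torch
import torch.nn.functional as F

# AR loss: cross-entropy over predicted codec tokens for each time step t
# and each of the n coarse RVQ layers (shapes are illustrative assumptions).
B, T, n, V = 2, 200, 1, 1024
logits = torch.randn(B, T, n, V)             # model outputs
targets = torch.randint(0, V, (B, T, n))     # ground-truth codec tokens
loss_ar = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))

# Vocoder loss: L1 + L2 between predicted and ground-truth summed RVQ embeddings.
E_hat_s, E_s = torch.randn(B, 200, 256), torch.randn(B, 200, 256)
loss_voc = F.l1_loss(E_hat_s, E_s) + F.mse_loss(E_hat_s, E_s)

loss_gen = loss_ar + loss_voc                # aggregated generative objective L_gen
```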

Discriminative Front-End

  • Initially trained with complex-spectral MSE; during joint training, optionally includes an SI-SDR loss:

\mathcal{L}_{SI\text{-}SDR} = -10 \log_{10}\left( \frac{\|s_{proj}\|^2}{\|s_{err}\|^2} \right)

where s_{proj} = \frac{\langle \hat{S}, S \rangle}{\|S\|^2} S and s_{err} = \hat{S} - s_{proj}, with \hat{S} the estimated and S the reference target signal.
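
This loss can be implemented directly as below; the zero-mean normalization is the standard SI-SDR convention and is assumed here, as it is not stated in the section.

```python
import torch

def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR between estimate and reference waveforms of shape (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)   # zero-mean (standard convention; assumed)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # s_proj = (<est, ref> / ||ref||^2) * ref ; s_err = est - s_proj
    scale = (est * ref).sum(-1, keepdim=True) / ((ref ** 2).sum(-1, keepdim=True) + eps)
    s_proj = scale * ref
    s_err = est - s_proj
    ratio = (s_proj ** 2).sum(-1) / ((s_err ** 2).sum(-1) + eps)
    return (-10.0 * torch.log10(ratio + eps)).mean()

loss = si_sdr_loss(torch.randn(2, 16000), torch.randn(2, 16000))
```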

Combined Objective

  • For joint training with an unfrozen front-end:

\mathcal{L}_{total} = \mathcal{L}_{gen} + \lambda_{SI\text{-}SDR} \cdot \mathcal{L}_{SI\text{-}SDR}

where \mathcal{L}_{gen} aggregates the AR and vocoder losses.
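
The joint objective can be assembled as in the following sketch, reusing si_sdr_loss from above; the interfaces front_end and laura_tse.generative_loss and the weight value are hypothetical, shown only to indicate where \lambda_{SI\text{-}SDR} enters and how gradients reach an unfrozen front-end.

```python
import torch

lambda_si_sdr = 0.1                              # illustrative weight (assumption)

def joint_training_step(batch, front_end, laura_tse, optimizer):
    """One optimization step with an unfrozen front-end (interfaces are hypothetical)."""
    m, r, s = batch                              # mixture, enrollment, clean target
    D_o = front_end(m, r)                        # gradients flow back into the front-end
    loss_gen = laura_tse.generative_loss(D_o, s) # L_AR + L_voc, conditioned on D_o
    loss = loss_gen + lambda_si_sdr * si_sdr_loss(D_o, s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```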

3. Inference Procedures and Collaborative Training

  • Pipeline:
  1. Extract log-mel spectrograms for r and m.
  2. Encode with the shared Conformer to obtain E_r, E_m.
  3. Run the AR LM for coarse token prediction.
  4. Map each token to its embedding and sum across layers to obtain \hat{D}_n(t).
  5. Invoke the encoder-only LM to compute \hat{E}_s.
  6. Decode via the neural audio codec to recover the time-domain waveform.
  • Two-Stage Mode:

At inference, the AR LM’s free-running output may be partially or completely substituted with the discriminative front-end’s own codec-encoded tokens at an injection ratio R \in [0, 1], trading off between fully autoregressive generation (R = 0) and total reliance on front-end pseudo-labels (R = 1); see the sketch after this list.

  • Front-End Training Variation:
    • Freeze: Discriminative module weights fixed; generative back-end learns to refine the fixed output.
    • Unfreeze: Gradients flow from the generative losses into the discriminative module, enhancing consistency at the cost of more complex loss balancing, mediated through \lambda_{SI\text{-}SDR}.
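
The token-injection mechanism of the two-stage mode can be sketched as follows; which positions are substituted at a given R is an assumption, since the section specifies only the overall ratio.

```python
import torch

def inject_front_end_tokens(ar_tokens: torch.Tensor,
                            fe_tokens: torch.Tensor,
                            R: float) -> torch.Tensor:
    """Substitute a fraction R of the AR LM's free-running tokens with the
    front-end's codec-encoded tokens. Both inputs are (batch, T) token ids.
    Here a random subset of positions is replaced (the exact schedule is an
    assumption). R = 0 keeps pure AR generation; R = 1 uses only front-end tokens."""
    mask = torch.rand_like(ar_tokens, dtype=torch.float) < R
    return torch.where(mask, fe_tokens, ar_tokens)

ar_tokens = torch.randint(0, 1024, (1, 200))   # free-running AR output
fe_tokens = torch.randint(0, 1024, (1, 200))   # codec-encoded front-end estimate
mixed = inject_front_end_tokens(ar_tokens, fe_tokens, R=0.5)
```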

4. Experimental Results and Evaluation Metrics

Evaluation has been conducted primarily on the Libri2Mix clean test set, using the following metrics:

  • DNSMOS (SIG, BAK, OVRL): Perceptual quality and noise suppression.
  • NISQA: Overall non-intrusive speech quality assessment.
  • SpeechBERT Score: Semantic consistency in self-supervised embedding space.
  • dWER: Differential word error rate via Whisper ASR (semantic intelligibility).
  • Speaker Similarity: Cosine similarity in WavLM-SV and WeSpeaker spaces.

Comparative results:

  • The purely generative LauraTSE achieves OVRL ≈ 3.34, NISQA ≈ 4.33, dWER ≈ 0.159, speaker sim ≈ 0.97 (Zeng et al., 9 Jan 2026), showing high perceptual quality but weaker intelligibility and fidelity compared to discriminative models.
  • The discriminative baseline (USEF-TFGridNet-L) has dWER ≈ 0.075 and high speaker similarity (>0.98) but lower OVRL ≈ 3.27, NISQA ≈ 4.32.
  • The two-stage USEF-Laura-TSE-L (front-end unfrozen, SI-SDR loss, multiset training) achieves OVRL ≈ 3.32, NISQA ≈ 4.45, dWER ≈ 0.117, speaker sim ≈ 0.98, representing a balance between semantic consistency and perceptual enhancement.
| Model | OVRL | NISQA | dWER | Speaker Sim |
|---|---|---|---|---|
| USEF-TFGridNet-L (D) | 3.27 | 4.32 | 0.075 | >0.98 |
| LauraTSE (AR+vocoder) (G) | 3.34 | 4.33 | 0.159 | 0.97 |
| USEF-Laura-TSE-L (D+G) | 3.32 | 4.45 | 0.117 | 0.98 |

D: Discriminative, G: Generative, D+G: Discriminative–Generative two-stage.

5. Ablation Studies and Architectural Insights

  • RVQ Layer Depth (n_q):

Varying n_q from 1 to 3 in the AR LM results in minimal differences in SIG/OVRL metrics (≤0.03), indicating that even a minimal coarse representation suffices for downstream performance (Tang et al., 10 Apr 2025).

  • Input Modality:

A fully discrete I/O variant, in which the first (AR) stage operates purely on tokens rather than continuous embeddings, underperforms the continuous-feature input modes. Replacing the Conformer with a frozen ASR Conformer or with WavLM embeddings degrades speaker-related scores.

  • Data Efficiency:

LauraTSE trained on 460 h of LibriSpeech matches or outperforms other generative models trained on 5000 h of multi-task data (e.g., AnyEnhance) in speaker and semantic similarity metrics.

  • Training Regime:

In joint training, fine-tuning the discriminative front-end (as opposed to freezing it) and applying the SI-SDR loss with a tuned \lambda_{SI\text{-}SDR} improves semantic fidelity without perceptual degradation.

6. Limitations, Extensions, and Outlook

While LauraTSE demonstrates strong data efficiency in single-task settings and performs well with coarse-to-fine generation, several limitations are reported:

  • Absence of explicit large-scale data scaling studies; current findings suggest data efficiency, but further quantification is needed.
  • The two-stage coarse-to-fine structure increases system complexity and latency, suggesting a potential benefit for unified end-to-end sequence modeling.
  • The neural codec is trained on clean speech only; adapting or jointly training the codec in noisy or mixture conditions may yield further robustness.
  • Extending the backbone to multimodal or multi-task setups, such as noise suppression or dereverberation, is highlighted as a future direction (Tang et al., 10 Apr 2025).
