LGTSE: Lightweight Speech Enhancement Guided Target Speech Extraction
- The paper introduces a novel lightweight TSE framework that leverages a GTCRN denoiser, noise-agnostic context interaction, and an embedding-free backbone for effective target speech extraction.
- The methodology employs distortion-aware training and a TripleC universal strategy, resulting in significant improvements in SI-SDR, PESQ, and STOI metrics.
- A two-stage training process with offline augmentation ensures robustness and state-of-the-art performance with minimal computational overhead.
Lightweight Speech Enhancement Guided Target Speech Extraction (LGTSE) is a neural framework designed to significantly improve target speech extraction (TSE) performance in challenging real-world scenarios, particularly those involving multiple speakers and/or ambient noise. LGTSE leverages a lightweight speech enhancement module (GTCRN) to provide noise-agnostic guidance within an embedding-free TSE backbone, enabling robust extraction across diverse and complex audio mixtures. The approach introduces noise-agnostic context interaction and distortion-aware training to reduce noise contamination during speaker extraction, and demonstrates substantial gains in standard speech intelligibility and quality metrics, with only marginal increases in computational burden (Huang, 4 Dec 2025, Huang et al., 27 Aug 2025).
1. Architecture of the LGTSE Framework
LGTSE comprises three principal components: a front-end denoiser, noise-agnostic context interaction, and an embedding-free backbone extractor. The front-end denoiser is a GTCRN (Grouped Temporal Convolutional Recurrent Network) that processes the STFT-domain mixture and yields a denoised feature $\hat{Y}$. Both the noisy mixture $y$ and the enrollment utterance $e$ are transformed via STFT into complex spectra $Y$ and $E$, to which dynamic range compression with exponent $p$ is applied: $Y^{c} = |Y|^{p} e^{j\theta_Y}$ and $E^{c} = |E|^{p} e^{j\theta_E}$. The GTCRN then denoises $Y^{c}$ to produce $\hat{Y}$. Noise-agnostic context interaction computes cross-attention between the compressed enrollment $E^{c}$ and the denoised mixture $\hat{Y}$; the resulting context features are concatenated with $\hat{Y}$ and propagated through a SEF-PNet-style backbone extractor, which estimates the target spectrum. The overall model is highly lightweight: GTCRN has ≈50K parameters and ≈0.03G MACs, and the full LGTSE+SEF-PNet system remains computationally modest (Huang et al., 27 Aug 2025).
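The data flow above can be sketched numerically. The following NumPy snippet is a minimal illustration only: the shapes, the compression exponent, and the single-head attention without learned projections are stand-in assumptions, whereas in LGTSE the denoiser and attention are learned networks.

```python
import numpy as np

def compress(spec, p=0.3):
    """Dynamic range compression |X|^p with the original phase kept.
    The exponent value 0.3 is illustrative, not the paper's setting."""
    return np.abs(spec) ** p * np.exp(1j * np.angle(spec))

def cross_attention(query, key, value):
    """Plain scaled dot-product attention (single head, no learned projections)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ value

rng = np.random.default_rng(0)
F, T_mix, T_enr = 64, 100, 80                            # freq bins, mixture / enrollment frames
Y = rng.standard_normal((T_mix, F)) + 1j * rng.standard_normal((T_mix, F))
E = rng.standard_normal((T_enr, F)) + 1j * rng.standard_normal((T_enr, F))

Y_hat = np.abs(compress(Y))      # stand-in for the GTCRN-denoised mixture features
E_feat = np.abs(compress(E))     # compressed enrollment features

# Noise-agnostic context interaction: the denoised mixture queries the enrollment.
context = cross_attention(Y_hat, E_feat, E_feat)         # (T_mix, F)
guided = np.concatenate([Y_hat, context], axis=-1)       # input to the backbone extractor
print(guided.shape)  # (100, 128)
```

Because the attention queries come from the *denoised* mixture rather than the raw noisy one, the enrollment-conditioned context is less contaminated by noise, which is the core of the noise-agnostic guidance idea.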
2. Training Procedures and Loss Functions
LGTSE employs a two-stage training paradigm:
- Stage 1: Separate pre-training of GTCRN (denoiser) and the TSE backbone (extractor).
- Stage 2: End-to-end fine-tuning of the entire model.
The core training objective is the scale-invariant signal-to-distortion ratio (SI-SDR) loss, applied to the time-domain output after inverse STFT:

$$\mathcal{L}_{\text{SI-SDR}} = -10 \log_{10} \frac{\lVert \alpha s \rVert^2}{\lVert \hat{s} - \alpha s \rVert^2}, \qquad \alpha = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2},$$

where $\hat{s}$ is the enhanced model output and $s$ is the clean reference.
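The SI-SDR metric is simple to compute directly; here is a small NumPy sketch (the signals are random stand-ins). The training loss is the negative of the value returned below.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the reference,
    then compare the target component against the residual distortion."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

rng = np.random.default_rng(0)
s = rng.standard_normal(8000)      # clean reference (1 s at 8 kHz)
n = rng.standard_normal(8000)      # interference

print(round(si_sdr(s + 0.1 * n, s), 1))          # mildly corrupted estimate
print(round(si_sdr(2.0 * (s + 0.1 * n), s), 1))  # same value: scale-invariant
```

Rescaling the estimate leaves the score unchanged, which is why SI-SDR is preferred over plain SNR for separation models whose output gain is arbitrary.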
In the D-LGTSE extension, distortion-aware augmentation exposes the extraction backbone to both noisy ($y$) and denoised ($\hat{y}$) variants during training. This is achieved via three augmentation strategies: channel-wise concatenation, on-the-fly batch enlargement, and offline pre-generation/shuffling of denoised signals. The joint fine-tuning loss combines SI-SDR terms on the denoising and extraction tasks, $\mathcal{L} = \mathcal{L}_{\text{SI-SDR}}^{\text{denoise}} + \mathcal{L}_{\text{SI-SDR}}^{\text{extract}}$. This distortion-aware approach improves robustness to both residual artifacts and real-world mixture distortions (Huang et al., 27 Aug 2025).
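The three augmentation strategies differ only in how noisy and denoised variants are presented to the backbone. The sketch below uses random arrays and a trivial stand-in "denoiser" purely to show the batch shapes each strategy produces; the real denoised signals come from GTCRN.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L = 4, 8000
noisy = rng.standard_normal((B, L))                    # original mixtures y
denoised = noisy * 0.9                                 # stand-in for GTCRN outputs y_hat

# 1) Channel-wise concatenation: each example carries both variants as channels.
chan = np.stack([noisy, denoised], axis=1)             # (B, 2, L)

# 2) On-the-fly batch enlargement: the batch is doubled at every step.
enlarged = np.concatenate([noisy, denoised], axis=0)   # (2B, L)

# 3) Offline pre-generation: denoised copies are computed once, pooled with the
#    originals, and shuffled across epochs, preserving more variability per batch.
pool = np.concatenate([noisy, denoised], axis=0)
perm = rng.permutation(len(pool))
offline_batch = pool[perm][:B]                         # (B, L)

print(chan.shape, enlarged.shape, offline_batch.shape)
```

The offline strategy decouples augmentation from the training step, so a shuffled batch can mix noisy and denoised examples freely rather than always pairing them, which matches the variability argument made for it below.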
3. Cross-Condition Consistency and TripleC Universal Training
To generalize LGTSE to more diverse scenarios—including one-speaker-plus-noise, two-speaker (clean), and two-speaker-plus-noise mixtures—a novel Cross-Condition Consistency learning strategy, termed TripleC, is introduced (Huang, 4 Dec 2025). The motivation is to enforce output alignment across easy (single-speaker+noise) and hard (multi-speaker+noise) cases for the same target.
Given parallel inputs $y_{\text{easy}}$ and $y_{\text{hard}}$ (sharing the same enrollment $e$ and clean reference $s$), the model produces outputs $\hat{s}_{\text{easy}}$ and $\hat{s}_{\text{hard}}$, and a consistency term $\mathcal{L}_{\mathrm{C}}(\hat{s}_{\text{easy}}, \hat{s}_{\text{hard}})$ penalizes disagreement between them.
Combined with the SI-SDR objectives, the total loss takes the form $\mathcal{L} = \mathcal{L}_{\text{SI-SDR}}(\hat{s}_{\text{easy}}, s) + \mathcal{L}_{\text{SI-SDR}}(\hat{s}_{\text{hard}}, s) + \lambda\,\mathcal{L}_{\mathrm{C}}$. For the most general parallel universal training, each batch consists of three mixture types (all with the same enrollment): single-speaker+noise, two-speaker (clean), and two-speaker+noise. The SI-SDR loss applies to all outputs, and the TripleC consistency loss applies between the two noisy conditions. This scheme promotes a unified feature space, with easier cases aiding harder ones.
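An illustrative composition of the objective is sketched below. The consistency term here (negative SI-SDR between the two noisy-condition outputs) and the weight `lam` are assumptions chosen to make the sketch concrete; the paper's exact consistency formulation is not reproduced here.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def triplec_loss(out_easy, out_hard, out_clean, reference, lam=0.5):
    """Illustrative TripleC objective: negative SI-SDR on all three condition
    outputs, plus a consistency term between the two noisy-condition outputs.
    The consistency form and `lam` are assumptions, not the paper's definition."""
    extraction = -(si_sdr(out_easy, reference)
                   + si_sdr(out_hard, reference)
                   + si_sdr(out_clean, reference))
    consistency = -si_sdr(out_hard, out_easy)   # align hard output with easy output
    return extraction + lam * consistency

rng = np.random.default_rng(0)
s = rng.standard_normal(8000)                                # shared clean reference
outs = [s + 0.2 * rng.standard_normal(8000) for _ in range(3)]  # toy model outputs
print(round(triplec_loss(*outs, reference=s), 2))
```

Note that the consistency term requires no extra labels: it only needs the three parallel mixtures in the batch to share the same enrollment and target.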
The GTCRN module adaptively denoises or passes through clean mixtures, obviating the need for explicit condition labels or selection mechanisms.
4. Experimental Evaluations and Comparative Performance
Experiments utilize the Libri2Mix dataset (min mode, 8 kHz) with three personalized enhancement conditions: 1-speaker+noise (mix_single), 2-speaker (clean) (mix_clean), and 2-speaker+noise (mix_both). The evaluation applies SI-SDR (dB), PESQ, and STOI (%) as primary metrics, comparing LGTSE, D-LGTSE, and various baselines (Huang, 4 Dec 2025, Huang et al., 27 Aug 2025).
Performance of LGTSE-based models, including advanced parallel universal TripleC systems, is summarized below:
| System / Setting | 1spk+noise SI-SDR (dB) | 2spk (clean) SI-SDR (dB) | 2spk+noise SI-SDR (dB) |
|---|---|---|---|
| SEF-PNet (baseline) | 14.50 | 13.00 | 7.43 |
| LGTSE | 14.50 | 13.18 | 7.88 |
| LGTSE+TripleC (2 cond) | 14.25 | 11.87 | 8.41 |
| TripleC-universal | 14.28 | 13.33 | 8.58 |
Additional results for D-LGTSE (on noisy 2-speaker settings, SEF-PNet backbone) are:
| Method | SI-SDR (dB) | PESQ | STOI (%) |
|---|---|---|---|
| SEF-PNet (baseline) | 7.43 | 2.14 | 80.31 |
| LGTSE | 7.88 | 2.21 | 81.27 |
| D-LGTSE (Offline) | 8.32 | 2.30 | 82.28 |
With a more complex backbone (CIE-mDPTNet), D-LGTSE yields a further SI-SDR improvement of +0.83 dB over the mDPTNet baseline.
On Libri2Mix-100 at 16 kHz, TripleC-parallel models outperform DB-BSRNN and diffusion-based NCSN++ models, despite using half the data and no additional speaker embeddings.
5. Impact of Noise-Agnostic Guidance and Distortion-Aware Training
The noise-agnostic context interaction of LGTSE—specifically, using denoised mixture representations to calculate cross-attention with enrollment speech—substantially mitigates noise contamination. This yields cleaner speaker representations that the extraction backbone can exploit more effectively.
Distortion-aware training via D-LGTSE augments model invariance to residual artifacts and speech distortions. Offline augmentation, in which denoised signals are precomputed and intermixed with originals, preserves greater variability and regularizes the model, outperforming channel-wise concatenation and on-the-fly strategies.
The two-stage training (pre-training, then end-to-end fine-tuning) further aligns both denoising and extraction representations, optimizing the coupling between GTCRN and the extraction backbone.
6. Generalization, Limitations, and Future Directions
Adoption of TripleC universal training enables robust operation across unseen mixture types and complex interferences by aligning model outputs for the same target speaker across multiple conditions. However, enforcing strict output consistency can degrade performance on the easiest tasks: a −0.25 dB drop on 1spk+noise is observed when TripleC is applied to only two conditions. Further, all LGTSE variants currently operate in an offline, full-context regime, lacking support for real-time or low-latency deployment.
Future research includes extending the framework to new datasets and noise/reverberation profiles, and exploring causal and low-latency architectures suitable for deployment in devices and next-generation speech communication systems.
7. Context within the Field
LGTSE and its successors represent a departure from prior embedding-based TSE systems, achieving strong performance without speaker embeddings or condition labels. The consistent gains in SI-SDR, speech intelligibility (STOI), and perceptual quality (PESQ) validate the effectiveness of lightweight denoising front-ends and noise-agnostic context interaction for TSE in realistic multi-condition environments. Results indicate that LGTSE+TripleC-parallel models define a new state-of-the-art in universal, embedding-free target speech extraction (Huang, 4 Dec 2025, Huang et al., 27 Aug 2025).