
USEF-TSE: Universal Extraction & Classification

Updated 5 December 2025
  • In its DCASE 2025 instantiation, USEF-TSE integrates a multi-stage pipeline combining USS, SC, and TSE blocks with iterative refinement to autonomously guide target sound extraction and classification.
  • USEF-TSE is defined as a universal extraction framework that leverages deep attention and physics-informed operators to overcome traditional bottlenecks like explicit embeddings.
  • Empirical results demonstrate significant improvements in CA-SDRi and classification accuracy across diverse domains such as DCASE sound scenes, speaker extraction, and MRI super-resolution.

USEF-TSE denotes several distinct technical frameworks in recent literature, all sharing the acronym but operating across diverse domains spanning acoustic signal extraction/classification, speaker extraction, and magnetic resonance imaging (MRI) super-resolution. The most prominent and recent usage refers to a multi-stage self-guided sound extraction and classification pipeline developed for the DCASE 2025 Task 4 challenge, but other notable frameworks in speech and MRI domains share this acronym. Each instantiation integrates domain-specific architectures and learning objectives, but the unifying theme is the removal of informational bottlenecks such as explicit embeddings or preselection: all are “universal,” leveraging deep attention or physics-informed joint optimization techniques to extract or reconstruct target signals without external guidance.

1. Multi-Stage Self-Guided Target Sound Extraction and Classification (DCASE 2025)

USEF-TSE, as introduced by the DCASE 2025 Task 4 winner, is a modular framework for spatial semantic segmentation of sound scenes that integrates Universal Sound Separation (USS), Single-label Classification (SC), and Target Sound Extraction (TSE) blocks within a tightly coupled iterative refinement loop (Kwon et al., 17 Sep 2025). The architecture decomposes an audio mixture X into object-level source waveforms, classifies these waveforms, then uses the combination of each waveform and its predicted class as a self-generated "clue" for further target extraction. This self-guided loop differentiates USEF-TSE from systems requiring external extraction targets.

Pipeline:

  1. USS (DeFT-Mamba-USS): Decomposes the mixture into M separated sources {ŝ_m}, encompassing foregrounds, interferences, and background.
  2. SC (M2D-SC): Assigns a class label ŷ_m (or silence) to each source via transformer-based audio encoding and classification heads utilizing ArcFace and energy-based losses.
  3. TSE (DeFT-Mamba-TSE): Refines source estimates by conditioning on separated waveforms and their predicted class clues via concatenated spectrograms and class embeddings injected through Residual FiLM.
  4. Iterative refinement: The process is looped by reclassifying newly extracted sources, updating the clues for one or more further TSE passes.

The system autonomously infers extraction targets by propagating its own separations and class assignments, providing end-to-end self-direction.
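The four steps above can be sketched as a short loop. This is a minimal illustration with stub modules standing in for DeFT-Mamba-USS, M2D-SC, and DeFT-Mamba-TSE; the function names, the silence threshold, and the stub behaviors are assumptions for exposition, not the actual implementation.

```python
import numpy as np

def uss(mixture, num_sources=3):
    """Stub separator: split the mixture into crude per-source estimates."""
    return [mixture / num_sources for _ in range(num_sources)]

def sc(source):
    """Stub classifier: assign a class label (or 'silence') to a source."""
    return "silence" if np.max(np.abs(source)) < 1e-3 else "event"

def tse(mixture, source, label):
    """Stub extractor: refine a source conditioned on its class clue."""
    return source if label == "silence" else 0.5 * (source + mixture / 3)

def usef_tse(mixture, refinement_passes=2):
    sources = uss(mixture)                        # 1. universal separation
    labels = [sc(s) for s in sources]             # 2. initial classification
    for _ in range(refinement_passes):            # 3-4. self-guided refinement
        sources = [tse(mixture, s, y) for s, y in zip(sources, labels)]
        labels = [sc(s) for s in sources]         # reclassify, update clues
    return sources, labels

mix = np.random.randn(16000)                      # 1 s mock waveform at 16 kHz
est_sources, est_labels = usef_tse(mix)
```

The key structural point is that the clues consumed by `tse` are produced entirely inside the loop, never supplied externally.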

2. Mathematical Objectives and Optimization Criteria

Each module in USEF-TSE is trained on composite, task-aligned loss functions:

  • USS: Multi-task losses, targeting negative source-aggregated SDR (SA-SDR) for foreground+interference and SI-SNR for background noise, weighted via λ = 0.01. Class decoding uses cross-entropy, Kullback-Leibler divergence for silences, and BCE gating.
  • SC: ArcFace loss ensures angular margin between classes; energy-based hinge loss encourages confident isolation of active/silence events, with KL divergence regularization.
  • TSE: Masked SNR loss applies only over active segments to sharpen separation.
  • Overall: The joint training objective L_total balances USS, cross-entropy, KL, and silence penalties to optimize both separation and recognition.
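The separation terms can be made concrete with standard definitions of SI-SNR and SA-SDR. The sketch below is a simplified numerical version: the masked/silence terms are omitted, `uss_loss` is a hypothetical name, and the assignment of λ to the background term is an assumption.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB) between an estimate and a reference."""
    ref = ref - ref.mean()
    est = est - est.mean()
    proj = (est @ ref) / (ref @ ref + eps) * ref   # projection onto reference
    noise = est - proj
    return 10 * np.log10((proj @ proj) / (noise @ noise + eps) + eps)

def sa_sdr(ests, refs, eps=1e-8):
    """Source-aggregated SDR (dB): energies pooled across all sources."""
    num = sum(r @ r for r in refs)
    den = sum((r - e) @ (r - e) for r, e in zip(refs, ests))
    return 10 * np.log10(num / (den + eps) + eps)

def uss_loss(fg_ests, fg_refs, bg_est, bg_ref, lam=0.01):
    """Weighted USS objective: -SA-SDR on foregrounds, -SI-SNR on background."""
    return -sa_sdr(fg_ests, fg_refs) - lam * si_snr(bg_est, bg_ref)

t = np.linspace(0, 1, 1000)
clean = np.sin(2 * np.pi * 5 * t)
perfect = uss_loss([clean], [clean], clean, clean)  # strongly negative (good)
```

Note that SA-SDR pools energy across sources before taking the ratio, so it does not require a per-source permutation search the way per-source SDR does.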

For model evaluation, the primary metric is class-aware SDR improvement (CA-SDRi), defined as:

\text{CA-SDRi} = \frac{1}{|C \cup \hat{C}|} \sum_{k \in C \cup \hat{C}} P_k

where P_k is the SDRi for true positives and zero for false alarms and misses.
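The metric follows directly from the definition: average P_k over the union of true and predicted class sets, with zero contributed by false alarms and misses. In this sketch the per-class SDRi values are passed in as a precomputed dictionary (a simplification; in practice they come from the separated waveforms).

```python
def ca_sdri(sdri_by_class, true_classes, pred_classes):
    """Class-aware SDRi: mean of P_k over the union of true and predicted
    classes, where P_k is the SDR improvement for true positives and 0
    for false alarms / misses."""
    union = set(true_classes) | set(pred_classes)
    tp = set(true_classes) & set(pred_classes)
    total = sum(sdri_by_class.get(k, 0.0) for k in union if k in tp)
    return total / len(union)

# Two true positives (12 dB and 8 dB), one miss ("car"), one false alarm
# ("alarm") -> a union of four classes, two of which contribute zero.
score = ca_sdri({"dog": 12.0, "siren": 8.0},
                ["dog", "siren", "car"],
                ["dog", "siren", "alarm"])  # (12 + 8 + 0 + 0) / 4 = 5.0
```

Because misses and false alarms enlarge the denominator while contributing nothing to the numerator, CA-SDRi penalizes classification errors as well as poor separation.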

3. Iterative Self-Guided Refinement

After an initial USS-SC-TSE pass, the system recursively feeds the extracted and classified sources back, updating class clues and conditioning subsequent TSE cycles. Empirical results indicate that two passes (i.e., one refinement cycle) suffice for convergence, with performance gains saturating beyond this point. Convergence is determined empirically, as no explicit analytic stopping criterion is enforced.

Empirical ablations demonstrate incremental CA-SDRi and accuracy improvements through each cascade:

| Stage Variant | CA-SDRi (dB) | Accuracy (%) |
|---|---|---|
| USS + SC only (no TSE) | 10.8 | 73.2 |
| USS + SC (post-update) | 12.7 | 81.8 |
| +1 TSE (old class) | 14.6 | — |
| +1 TSE (new class) | 14.7 | 83.4 |
| +2 TSE passes | 14.9 | 84.5 |

4. Network Architectures and Implementation

The backbone for both the USS and TSE modules is the DeFT-Mamba architecture, employing stacked frequency- and time-hybrid Mamba blocks, each replacing the traditional transformer FFN with a Mamba-FFN for efficient modeling of complex spectrograms. SC is implemented with a four-layer transformer (M2D-SC) on mel-spectrograms, fine-tuned on the last two layers and the output head.

Conditioning in DeFT-Mamba-TSE involves raw waveform concatenation and the injection of learned class-conditional parameters via Residual FiLM. Training utilizes AdamW with a learning rate of 4 × 10⁻⁴ for the DeFT-Mamba modules.
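A minimal sketch of the Residual FiLM conditioning idea: a class-clue embedding is projected to per-channel scale (γ) and shift (β) parameters, applied as a residual modulation of the hidden features. The exact placement and parameterization inside DeFT-Mamba-TSE are assumptions here; this only illustrates the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

class ResidualFiLM:
    """Residual feature-wise linear modulation: h -> h + gamma(c)*h + beta(c),
    where gamma and beta are linear projections of the class embedding c."""
    def __init__(self, embed_dim, channels):
        self.w_gamma = rng.standard_normal((embed_dim, channels)) * 0.01
        self.w_beta = rng.standard_normal((embed_dim, channels)) * 0.01

    def __call__(self, h, class_embed):
        gamma = class_embed @ self.w_gamma   # per-channel scale, (channels,)
        beta = class_embed @ self.w_beta     # per-channel shift, (channels,)
        return h + gamma * h + beta          # residual modulation

film = ResidualFiLM(embed_dim=16, channels=64)
hidden = rng.standard_normal((100, 64))      # (frames, channels) features
clue = rng.standard_normal(16)               # class-clue embedding
out = film(hidden, clue)
```

The residual form means a zero clue embedding leaves the features untouched, which keeps the extractor well-behaved when a source is classified as silence.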

5. Empirical Validation and Benchmark Results

USEF-TSE, evaluated on DCASE 2025 Task 4 mixtures (augmented with VCTK and percussive samples), achieves a CA-SDRi of 11.00 dB (evaluation set) and 14.94 dB (test set), with mixture-level classification accuracy of 55.8% and 61.8%, respectively. This represents a state-of-the-art improvement over the ResUNetK baseline (+4.4 dB SDR, +4.3% accuracy), and was confirmed by ablation studies to result from both TSE refinement and the self-guided integration of class clues (Kwon et al., 17 Sep 2025).

A. Speaker Extraction USEF-TSE (Universal Speaker Embedding-Free Target Speaker Extraction) in (Zeng et al., 4 Sep 2024) removes the need for explicit speaker embeddings by leveraging a frame-level cross-multi-head attention block, generating temporally aligned target speaker features directly from the enrollment utterance and mixture. The architecture is modular, able to wrap around both time-domain and time-frequency separators (e.g., SepFormer, TF-GridNet), and is optimized for SI-SDR. SOTA results are attained on WSJ0-2mix (SI-SDRi = 23.3 dB with USEF-TFGridNet). The same philosophy informs USEF-TP (Zeng et al., 7 Jan 2025), which couples TSE with personal VAD via multi-task learning, outperforming embedding-based approaches on LibriMix.
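The frame-level cross-attention idea can be illustrated with a single-head, projection-free sketch: mixture frames act as queries against enrollment-utterance frames, yielding a target-speaker feature per mixture frame. The actual block is multi-head with learned projections; this stripped-down version only shows the temporal alignment mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(mix_feats, enroll_feats):
    """Scaled dot-product cross-attention: each mixture frame attends over
    all enrollment frames, producing temporally aligned speaker features
    without any pooled speaker embedding."""
    d = mix_feats.shape[-1]
    scores = mix_feats @ enroll_feats.T / np.sqrt(d)  # (T_mix, T_enroll)
    return softmax(scores, axis=-1) @ enroll_feats    # (T_mix, d)

rng = np.random.default_rng(1)
mix = rng.standard_normal((200, 32))       # mixture frames (T_mix, d)
enroll = rng.standard_normal((150, 32))    # enrollment frames (T_enroll, d)
target_feats = cross_attention(mix, enroll)
```

Because the output retains the mixture's time axis, the conditioning signal varies frame by frame, which is precisely what a pooled speaker embedding discards.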

B. MRI Super-Resolution

In MRI, USEF-TSE denotes joint optimization of the turbo spin echo (TSE) MR sequence and a CNN for super-resolution (Dang et al., 2023). Here, a differentiable Bloch/EPG physics simulation is embedded in the training loop, allowing end-to-end tuning of both the TSE RF pulse train and the downstream CNN. The approach yields images with higher PSNR and SSIM than purely learning-based SR methods, both in simulation and in vivo, suggesting that physics-informed operator learning, in concert with deep models, can maximize image fidelity within clinically constrained acquisition times.

7. Concluding Remarks and Cross-Domain Significance

USEF-TSE frameworks are characterized by universal, embedding-free, or operator-integrated architectures that eliminate reliance on external bottlenecks (e.g., pre-trained speaker or object embeddings, or external target selection). Across sound scene analysis, speaker extraction, and imaging, the methodological core is the use of self-guided, attention-based, or physics-informed operators to maximize task-relevant signal extraction. USEF-TSE paradigms have demonstrated state-of-the-art performance on DCASE, speech separation, and MR imaging benchmarks, and motivate further investigation into modular, self-sufficient architectures adaptable to complex real-world scenarios (Kwon et al., 17 Sep 2025; Zeng et al., 4 Sep 2024; Zeng et al., 7 Jan 2025; Dang et al., 2023).
