Target Sound Extraction (TSE)
- Target Sound Extraction (TSE) is a conditional source separation technique that isolates designated sound components from mixtures using auxiliary clues such as audio examples, class labels, or textual queries.
- It employs diverse conditioning strategies and neural architectures—including convolutional, transformer, and latent-space generative models—to enhance extraction accuracy and robustness.
- Applications span smart hearing interfaces, ASR pipelines, teleconferencing, and audio post-production, while current research tackles challenges like scalability, latency, and generalization.
Target Sound Extraction (TSE) refers to the problem of isolating from an audio mixture only the signal components corresponding to one or more specified target sound event classes. The formal objective is to estimate the time-domain (or time–frequency) source contribution for a designated class, given access to a reference clue—be it an example waveform, a class label, temporal boundaries, or even a language query—while suppressing all other sources and interferers. TSE has become a central technical paradigm for the development of smart hearing interfaces, auditory scene analysis models, and robust ASR pipelines, offering clarity and selectivity beyond what classical blind separation or denoising pipelines provide.
1. Conceptual Foundations and Problem Formulations
Target Sound Extraction is a conditional source separation task: the system receives a mixed signal and a clue for the target class (e.g., sound event label, enrollment audio, text description, timestamp), and must estimate $\hat{s}_{\mathrm{tgt}}$, the audio waveform (or spectrogram) corresponding only to the target class. Formally, for mixture $y$, target class set $\mathcal{T}$, and mixture model $y = \sum_{n=1}^{N} s_n + v$ (where each $s_n$ is a source signal and $v$ is noise), the objective is:

$$\hat{s}_{\mathrm{tgt}} = f_\theta(y, c) \approx \sum_{n \in \mathcal{T}} s_n,$$

where $c$ denotes the conditioning clue and $f_\theta$ the extraction network.
Several clue modalities are common:
- Class labels: one-hot or n-hot vector specifying event(s) of interest.
- Enrollment audio: reference waveform(s) of the target class/event.
- Timestamp or region-of-activity: temporal onset/offset or detection map narrowing the relevant portions in time.
- Natural language queries: textual description of the target class/event.
The problem is closely related to, but distinct from, universal sound separation and source separation, as TSE leverages auxiliary cues to resolve ambiguities and avoid over-segmentation.
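To make the formulation above concrete, the toy sketch below builds a mixture from per-class sources plus noise and forms the oracle extraction target for a chosen class set; the signal shapes and helper names are illustrative assumptions, not drawn from any cited paper.

```python
# Minimal NumPy sketch of the TSE problem setup (illustrative toy signals only).
import numpy as np

rng = np.random.default_rng(0)
sr, dur = 16000, 1.0
n_samples = int(sr * dur)

# Toy "sources": one per sound event class (random signals as stand-ins).
sources = {
    "dog_bark": rng.standard_normal(n_samples) * 0.1,
    "siren":    rng.standard_normal(n_samples) * 0.1,
    "speech":   rng.standard_normal(n_samples) * 0.1,
}
noise = rng.standard_normal(n_samples) * 0.01

# Mixture model: y = sum_n s_n + v
y = sum(sources.values()) + noise

# The clue designates the target class set T; the oracle extraction target is the
# sum of the corresponding source signals, which a conditional model f_theta(y, clue)
# should approximate while suppressing all other sources.
target_classes = {"siren"}
s_target = sum(sources[c] for c in target_classes)

print(y.shape, s_target.shape)
```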
2. Conditioning Strategies and Neural Architectures
TSE systems use various architectures and conditioning mechanisms to inject clue information into the separation pipeline:
| Strategy | Typical Clue | Conditioning Mechanism |
|---|---|---|
| Class label-based | One-hot/n-hot code | Embedding lookup; fused via FiLM, elementwise multiplication, or concatenation |
| Enrollment-based | Audio waveform | Reference encoded to an embedding; fused via cross-attention or elementwise multiplication |
| Timestamp/activity-based | Onset/offset | Detector output (detection scores) modulates the loss or network activations |
| Language-queried | Text/caption | Pre-trained audio-language model (e.g., CLAP) produces an embedding, fused as above |
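As a rough illustration of the first two rows of this table, the sketch below encodes a class-label clue via an embedding lookup and an enrollment-audio clue via a small convolutional encoder with mean pooling; the module names and layer sizes are assumptions for illustration only.

```python
# Sketch of two common clue encoders: label embedding lookup and enrollment encoder.
import torch
import torch.nn as nn

class ClueEncoders(nn.Module):
    def __init__(self, n_classes: int = 50, emb_dim: int = 128):
        super().__init__()
        # One-hot / n-hot class clue -> learned embedding (n-hot clues handled by
        # averaging the selected embedding rows).
        self.label_emb = nn.Embedding(n_classes, emb_dim)
        # Enrollment waveform -> frame features -> mean-pooled clue embedding.
        self.enroll_enc = nn.Sequential(
            nn.Conv1d(1, emb_dim, kernel_size=400, stride=160),
            nn.ReLU(),
            nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1),
        )

    def from_label(self, class_ids: torch.Tensor) -> torch.Tensor:
        # class_ids: (batch, k) indices of the k target classes.
        return self.label_emb(class_ids).mean(dim=1)        # (batch, emb_dim)

    def from_enrollment(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) reference audio of the target class.
        feats = self.enroll_enc(wav.unsqueeze(1))            # (batch, emb_dim, frames)
        return feats.mean(dim=-1)                            # (batch, emb_dim)

enc = ClueEncoders()
e_label = enc.from_label(torch.tensor([[3, 7]]))             # n-hot clue (two classes)
e_enroll = enc.from_enrollment(torch.randn(1, 16000))        # 1 s of enrollment audio
print(e_label.shape, e_enroll.shape)                         # both (1, 128)
```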
Architectural building blocks include:
- Time-domain convolutional models: Conv-TasNet and its variants (SoundBeam (Delcroix et al., 2022), SpeakerBeam-SS (Sato et al., 1 Jul 2024)) for causal/real-time settings.
- Self-attention/transformer models: for handling multi-clue input and long-range temporal dependencies.
- Latent-space generative models: Diffusion or flow-matching based (DPM-TSE (Hai et al., 2023), FlowTSE (Navon et al., 20 May 2025), SoloSpeech (Wang et al., 25 May 2025)) for sharper generative modeling.
- Multi-modal encoders: cross-attention or multi-head attention modules for fusing audio, video, and/or textual clues (Li et al., 2023).
Conditioning mechanisms may use simple elementwise multiplication, Feature-wise Linear Modulation (FiLM), cross-attention, or explicit concatenation and normalization. Mutual learning between clue detector and extractor networks has been proposed for better synergy (Wang et al., 2022).
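A minimal sketch of FiLM-style conditioning as described above, in which the clue embedding predicts per-channel scale and shift terms applied to the separator's intermediate features (dimensions and names are illustrative):

```python
# FiLM conditioning: clue embedding -> per-channel (gamma, beta) modulation.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, clue_dim: int, n_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(clue_dim, n_channels)
        self.to_beta = nn.Linear(clue_dim, n_channels)

    def forward(self, feats: torch.Tensor, clue: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, frames); clue: (batch, clue_dim)
        gamma = self.to_gamma(clue).unsqueeze(-1)   # (batch, channels, 1)
        beta = self.to_beta(clue).unsqueeze(-1)
        return gamma * feats + beta                 # broadcast over time

film = FiLM(clue_dim=128, n_channels=256)
modulated = film(torch.randn(2, 256, 100), torch.randn(2, 128))
print(modulated.shape)  # torch.Size([2, 256, 100])
```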
3. Learning, Losses, and Mutual Reinforcement
Most neural TSE models are trained in a supervised manner, minimizing a loss function over a dataset of mixtures, target clues, and reference signals. Common objectives include:
- Time/frequency domain mean square error (MSE) between estimate and isolated target.
- Scale-Invariant Signal-to-Distortion Ratio improvement (SI-SDRi):

  $$\mathrm{SI\text{-}SDR}(\hat{s}, s) = 10 \log_{10} \frac{\|\alpha s\|^2}{\|\hat{s} - \alpha s\|^2}, \qquad \alpha = \frac{\hat{s}^{\top} s}{\|s\|^2},$$

  with optimal scaling $\alpha$; the improvement SI-SDRi is $\mathrm{SI\text{-}SDR}(\hat{s}, s) - \mathrm{SI\text{-}SDR}(y, s)$, i.e., the gain over using the unprocessed mixture as the estimate.
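A NumPy sketch of SI-SDR and SI-SDRi following the definition above (the helper names are illustrative and not taken from any particular toolkit):

```python
# SI-SDR / SI-SDRi computation on toy signals.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB, with the optimal scaling alpha applied to the reference."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10(np.sum(target**2) / np.sum(residual**2))

def si_sdri(estimate, reference, mixture) -> float:
    """Improvement of the estimate over simply using the mixture as the estimate."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
y = s + 0.5 * rng.standard_normal(16000)       # noisy mixture
s_hat = s + 0.1 * rng.standard_normal(16000)   # extraction estimate
print(f"SI-SDRi: {si_sdri(s_hat, s, y):.2f} dB")
```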
Innovations specific to TSE include:
- Target-weighted loss: weight frames or regions by detection probability (from a Target Sound Detection network) to focus on active target times, formalized as $\mathcal{L}_{\mathrm{tw}} = \mathcal{L}(\hat{s}, s) + \lambda\,\mathcal{L}_{\mathrm{act}}(\hat{s}, s)$, where $\mathcal{L}_{\mathrm{act}}$ computes losses only over the active region (Wang et al., 2022); a code sketch follows at the end of this section.
- Mutual learning frameworks: alternately update detection and extraction subnets, letting each benefit from the other’s predictions (Wang et al., 2022).
- Multi-task objectives: combine separation and clue classification (context inference) losses to encourage implicit or explicit modeling of scene context (Baligar et al., 21 Mar 2024).
In generative models, diffusion or flow-matching pipelines require objectives that align the generated distribution with the clean target, sometimes adapting noise schedules or prediction parametrization for improved silence/purity (Hai et al., 2023).
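The sketch below shows one plausible reading of a detection-weighted extraction loss, where per-frame reconstruction errors are weighted by Target Sound Detection probabilities; it is an illustrative approximation, not the exact objective of (Wang et al., 2022).

```python
# Detection-weighted (target-weighted) extraction loss sketch.
import torch

def target_weighted_loss(est: torch.Tensor,
                         ref: torch.Tensor,
                         p_active: torch.Tensor,
                         lam: float = 1.0) -> torch.Tensor:
    """est, ref: (batch, frames, feat) spectrogram-like tensors;
    p_active: (batch, frames) detection probabilities from a TSD network."""
    per_frame = ((est - ref) ** 2).mean(dim=-1)          # per-frame MSE, (batch, frames)
    global_term = per_frame.mean()                       # plain reconstruction loss
    # Weight frames by detection probability so errors in active target regions dominate.
    active_term = (p_active * per_frame).sum() / p_active.sum().clamp(min=1e-8)
    return global_term + lam * active_term

est = torch.randn(4, 200, 257)
ref = torch.randn(4, 200, 257)
p = torch.rand(4, 200)
print(target_weighted_loss(est, ref, p).item())
```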
4. Robustness and Generalization: Handling Query and Environment Variability
TSE research has identified several generalization challenges and solutions:
- Inactive target classes (the “inactive speaker” problem): TSE models may erroneously produce non-silent outputs when the queried source is not present in the mixture. Solutions include augmenting training with inactive samples and detection submodules (TSE-V and TSE-IS schemes (Delcroix et al., 2022)), or context-aware query refinement that filters inactive classes out of the query at inference using a joint classifier (Sato et al., 10 Sep 2025).
- Out-of-domain and new-class adaptation: Models leveraging few-shot adaptation (averaging embeddings from sparse enrollments and fine-tuning) can extend to unseen target classes with minimal data (Delcroix et al., 2022), while architectures that share an embedding space between class and enrollment clues (SoundBeam) support such continual learning.
- Language-queried and audio-only training: To leverage unpaired data, retrieval-augmented training (matching audio embeddings to cached text embeddings) or embedding dropout/noise injection can close the modality gap, enabling effective text-queried extraction (Ma et al., 14 Sep 2024, Saijo et al., 20 Sep 2024); a minimal sketch follows this list.
- Cross-modal and multi-cue fusion: Transformer-based architectures that accept arbitrary combinations of clues (audio, text, video) are robust to degraded or missing modalities and flexible in user interaction (Li et al., 2023).
- Pitch and spatial cues: Use of conditional pitch extraction via FiLM and learnable Gammatone filterbanks improves reverberant scene robustness (Wang et al., 13 Jun 2024); multichannel frameworks with spatio-temporal clue injection preserve spatial fidelity in the extracted signals (Choi et al., 19 Sep 2024).
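As a rough illustration of the embedding dropout/noise injection idea referenced above for closing the audio-text modality gap, the sketch below perturbs a CLAP-style audio conditioning embedding during training; the noise scale and dropout rate are arbitrary illustrative choices, not values from the cited papers.

```python
# Embedding noise/dropout for training a text-queried TSE model on audio-only data.
import torch
import torch.nn.functional as F

def condition_embedding(audio_emb: torch.Tensor,
                        noise_scale: float = 0.2,
                        drop_prob: float = 0.1,
                        training: bool = True) -> torch.Tensor:
    """audio_emb: (batch, dim) audio embedding of the target source from a shared
    audio-text space. During training it is perturbed so the extractor does not
    overfit to audio-specific structure; at inference a text embedding from the
    same space can be substituted."""
    emb = F.normalize(audio_emb, dim=-1)
    if training:
        emb = emb + noise_scale * torch.randn_like(emb)      # Gaussian noise injection
        keep = (torch.rand(emb.shape[0], 1) > drop_prob).float()
        emb = emb * keep                                     # occasional embedding dropout
        emb = F.normalize(emb, dim=-1)
    return emb

print(condition_embedding(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```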
5. Performance Benchmarks and Evaluation Criteria
TSE methods are evaluated using a variety of metrics:
| Metric | Assesses | Typical Context |
|---|---|---|
| SI-SDR(i) | Extraction quality relative to the mixture | Speech/sound TSE |
| PESQ | Perceptual quality | Speech TSE |
| ESTOI | Intelligibility | Speech TSE |
| SNRi | SNR improvement in the target region | General TSE |
| Segment/event F1 | Detection accuracy | TSD modules |
| DNSMOS | Non-intrusive perceptual quality | Speech TSE/hearables |
| WER | ASR-based intelligibility | Speech TSE |
| SIM | Speaker similarity | Speaker TSE |
| Spatial error | Spatial clue preservation | Multichannel TSE |
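Several of these metrics are available in common open-source packages; the sketch below assumes the third-party `pystoi` package for ESTOI and reuses the SI-SDR definition from Section 3 on synthetic stand-in signals.

```python
# Illustrative metric computation on synthetic stand-in signals.
import numpy as np
from pystoi import stoi  # (extended) short-time objective intelligibility

def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    return 10 * np.log10(np.sum((alpha * ref) ** 2) / np.sum((est - alpha * ref) ** 2))

fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(3 * fs)                 # 3 s "clean target" stand-in
est = ref + 0.1 * rng.standard_normal(3 * fs)     # imperfect extraction estimate
mix = ref + 0.5 * rng.standard_normal(3 * fs)     # unprocessed mixture

print("SI-SDRi:", si_sdr(est, ref) - si_sdr(mix, ref), "dB")
print("ESTOI  :", stoi(ref, est, fs, extended=True))
```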
Strong numerical improvements are reported when using timestamp-weighted objectives, mutual learning (Wang et al., 2022), or rich foundation models like M2D (Hernandez-Olivan et al., 19 Sep 2024). For example, SI-SDRi gains of 1–2 dB are observed with mutual loss weighting and multi-stage refinement.
6. Applications and Practical Deployments
TSE underpins a broad range of applications:
- Hearing aids and augmented hearing: Targeted sound event enhancement in live or streaming environments; low-latency/causal deployment is addressed by architectures such as SpeakerBeam-SS and CATSE (Sato et al., 1 Jul 2024, Baligar et al., 21 Mar 2024).
- Teleconferencing and voice communications: Extraction of a speaker-of-interest for clarity and noise suppression.
- Audio post-production: Selective extraction and manipulation of sound sources for editing, post-mixing, or content analysis.
- Smart home and surveillance: Monitoring or alerting based on specific sound events.
- Continuous and cross-domain learning: Models such as SoundBeam and its M2D-enhanced variant facilitate adaptation to new sound classes and diverse acoustic domains.
Real-world deployments prioritize not only extraction accuracy but computational efficiency, latency (e.g., real-time factors <1 as in SpeakerBeam-SS), and robustness to practical query mismatches (PMQ, FUQ conditions).
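The sketch below illustrates block-wise causal inference and a real-time factor (RTF) measurement with a toy causal convolutional extractor; it is a stand-in for illustration, not the SpeakerBeam-SS or CATSE architectures, and per-block state carryover is omitted for brevity.

```python
# Block-wise causal inference and real-time factor measurement (toy model).
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalExtractor(nn.Module):
    def __init__(self, channels: int = 64, kernel: int = 5):
        super().__init__()
        self.pad = kernel - 1                      # left-only padding => no future context
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel),
            nn.ReLU(),
            nn.Conv1d(channels, 1, 1),
        )

    def forward(self, block: torch.Tensor) -> torch.Tensor:
        x = F.pad(block, (self.pad, 0))            # causal padding
        return self.net(x)

sr, block_ms = 16000, 20
block_len = sr * block_ms // 1000                  # 20 ms blocks
model = TinyCausalExtractor().eval()
mixture = torch.randn(1, 1, sr * 5)                # 5 s of audio

start = time.perf_counter()
with torch.no_grad():
    outputs = [model(mixture[..., i:i + block_len])
               for i in range(0, mixture.shape[-1], block_len)]
elapsed = time.perf_counter() - start
rtf = elapsed / 5.0                                # processing time / audio duration
print(f"RTF: {rtf:.3f} (real-time if < 1)")
```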
7. Limitations and Future Research Directions
Despite algorithmic advances, several challenges persist:
- Scalability to open-domain, multilingual, and highly polyphonic scenes.
- Integration and unification of generative (diffusion/flow) and discriminative pipelines for best perceptual and intelligibility performance (Hai et al., 2023, Navon et al., 20 May 2025, Wang et al., 25 May 2025).
- Automatic query refinement and multi-modal interaction, including leveraging visual or spatio-temporal context robustly in complex scenes (Sato et al., 10 Sep 2025, Li et al., 2023, Choi et al., 19 Sep 2024).
- Reducing model size and latency without sacrificing quality for edge/wearable real-time use (Sato et al., 1 Jul 2024, He et al., 2023).
- Development of domain-invariant and highly generalizable speaker/event representations, possibly using multi-level, cross-attentive, or foundation model approaches (Zhang et al., 21 Oct 2024, Hernandez-Olivan et al., 19 Sep 2024).
- Improved evaluation criteria coupling perceptual scores, downstream ASR performance, and user-focused intelligibility metrics.
Continued research is driven by both technical requirements (latency, accuracy, deployment constraints) and the fundamental scientific question of selective hearing in both humans and machines.
References:
All factual claims, formulas, metrics, and framework names are derived directly from the cited arXiv papers, including Wang et al. (2022), Delcroix et al. (2022), Liu et al. (2023), Zmolikova et al. (2023), Li et al. (2023), Lin et al. (2023), Hai et al. (2023), He et al. (2023), Baligar et al. (21 Mar 2024), Wang et al. (13 Jun 2024), Sato et al. (1 Jul 2024), Ma et al. (14 Sep 2024), Choi et al. (19 Sep 2024), Hernandez-Olivan et al. (19 Sep 2024), Saijo et al. (20 Sep 2024), Zhang et al. (21 Oct 2024), Navon et al. (20 May 2025), Wang et al. (25 May 2025), and Sato et al. (10 Sep 2025).