
Language-Queried Audio Source Separation (LASS)

Updated 15 April 2026
  • Language-Queried Audio Source Separation (LASS) is a paradigm that uses free-form text queries to extract semantically matching audio from multi-source mixtures.
  • It employs multimodal neural architectures combining text encoders (e.g., BERT, CLAP) with audio backbones like ResUNet and attention-based fusion for precise source isolation.
  • LASS enables open-domain, fine-grained audio extraction, with applications in auditory scene analysis and sound event detection, while addressing challenges like computational cost and domain biases.

Language-Queried Audio Source Separation (LASS) is a paradigm in computational auditory scene analysis in which an audio source is separated from a multi-source mixture conditioned on a free-form natural language query specifying the target. The system receives a time-domain waveform containing an arbitrary number of sources and a text query (e.g., "a male singer with guitar in the background") and returns an estimate of the component of the waveform semantically corresponding to the description, while suppressing unrelated sources. Unlike classical source separation by class or instrument label, LASS enables open-domain, fine-grained, and flexible audio extraction compatible with real-world application scenarios (Liu et al., 2022, Liu et al., 2023, Xiao et al., 2024).

1. Formal Problem Definition and Modeling

Let $x \in \mathbb{R}^T$ denote the observed time-domain mixture of $N$ unknown sources, $x(t) = \sum_{i=1}^N s_i(t)$, and let $c$ be a natural language text description specifying the target to extract. The LASS task is to learn a mapping $F$ (typically a parameterized neural network) such that $\hat{s} = F(x, c)$ isolates the audio content semantically aligned with $c$.

Early approaches, such as LASS-Net, use a ResUNet backbone operating in the STFT magnitude domain with skip connections and explicit text-audio fusion. The text query is encoded with a pretrained language model (e.g., a compact BERT), and conditioning is realized through transformations such as FiLM (Feature-wise Linear Modulation), in which the query embedding modulates the convolutional activations at each layer (Liu et al., 2022). The estimated source magnitude is obtained by applying a sigmoid mask to the mixture magnitude, and the time-domain output is rendered using the inverse STFT.
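The FiLM conditioning and sigmoid masking described above can be illustrated with a minimal NumPy sketch. It is not the LASS-Net implementation: the projection matrices `w_gamma` and `w_beta`, the tensor shapes, and the channel-averaging step that stands in for the decoder's output layer are all assumptions made for illustration, and random arrays replace real features and embeddings.

```python
import numpy as np

def film(features, query_emb, w_gamma, w_beta):
    """Scale and shift each channel of a (channels, freq, time) feature map
    using parameters predicted from the query embedding."""
    gamma = w_gamma @ query_emb  # per-channel scale, shape (channels,)
    beta = w_beta @ query_emb    # per-channel shift, shape (channels,)
    return gamma[:, None, None] * features + beta[:, None, None]

def apply_sigmoid_mask(mask_logits, mixture_mag):
    """Turn logits into a mask in (0, 1) and apply it to the mixture magnitude."""
    mask = 1.0 / (1.0 + np.exp(-mask_logits))
    return mask * mixture_mag

rng = np.random.default_rng(0)
channels, freq_bins, frames, emb_dim = 4, 8, 5, 16

features = rng.standard_normal((channels, freq_bins, frames))  # stand-in activations
query_emb = rng.standard_normal(emb_dim)                       # stand-in text embedding
w_gamma = rng.standard_normal((channels, emb_dim))             # hypothetical projections
w_beta = rng.standard_normal((channels, emb_dim))

modulated = film(features, query_emb, w_gamma, w_beta)

# Assumption: averaging over channels replaces the network's real output layer.
mask_logits = modulated.mean(axis=0)
mixture_mag = np.abs(rng.standard_normal((freq_bins, frames)))
estimated_mag = apply_sigmoid_mask(mask_logits, mixture_mag)
```

Because the mask lies in (0, 1), the estimated magnitude never exceeds the mixture magnitude in any time-frequency bin. The phase used for the inverse STFT is not specified in the text above and is left out of this sketch.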

Recent models generalize to cross-attentional and fusion-based variants, replacing the BERT encoder with large foundation text encoders such as CLAP (Contrastive Language-Audio Pretraining) or FLAN-T5, and incorporating more structured fusions, e.g., cross-attention, multi-head attention, or multi-level semantic modeling (Liu et al., 2023, Yin et al., 27 May 2025, Mahmud et al., 2024).
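As a rough illustration of cross-attention fusion, the sketch below lets audio frames act as queries over text token embeddings using single-head scaled dot-product attention. The dimensions, the projection matrices `w_q`, `w_k`, `w_v`, and the residual connection are assumptions chosen for illustration and do not reproduce any specific cited model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(audio_feats, text_tokens, w_q, w_k, w_v):
    """Fuse text into audio: each audio frame attends over all text tokens.

    audio_feats: (frames, audio_dim), text_tokens: (tokens, text_dim).
    Returns the fused features and the (frames, tokens) attention weights.
    """
    q = audio_feats @ w_q                      # (frames, attn_dim)
    k = text_tokens @ w_k                      # (tokens, attn_dim)
    v = text_tokens @ w_v                      # (tokens, audio_dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    weights = softmax(scores, axis=-1)         # each row sums to 1 over tokens
    fused = audio_feats + weights @ v          # residual connection (assumed)
    return fused, weights

rng = np.random.default_rng(0)
frames, tokens, audio_dim, text_dim, attn_dim = 6, 4, 8, 12, 8

audio_feats = rng.standard_normal((frames, audio_dim))  # stand-in audio features
text_tokens = rng.standard_normal((tokens, text_dim))   # stand-in token embeddings
w_q = rng.standard_normal((audio_dim, attn_dim))
w_k = rng.standard_normal((text_dim, attn_dim))
w_v = rng.standard_normal((text_dim, audio_dim))

fused, weights = cross_attention(audio_feats, text_tokens, w_q, w_k, w_v)
```

Unlike FiLM, which applies one scale and shift per channel for the whole query, this form lets each audio frame weight individual query tokens differently.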

2. Model Architectures and Conditioning Mechanisms

LASS systems are typified by their multimodal structure.

3. Training Strategies, Loss Functions, and Data

LASS models require paired (audio, text) data, typically synthesized mixtures with human- or machine-generated captions (e.g., AudioCaps (Liu et al., 2022), Clotho, FSD50K, WavCaps, DCASE24-T9). The training regime usually employs:

4. Evaluation Metrics and Benchmarking

Traditional separation metrics include SI-SDR, SDR, SIR, and SAR, all requiring ground-truth reference sources (Liu et al., 2022, Liu et al., 2023). These metrics, however, do not capture semantic alignment with the query and are infeasible for real-world deployments lacking reference tracks.
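For reference, SI-SDR can be computed as in the following NumPy sketch, which follows the commonly used definition (projection of the estimate onto the reference after mean removal). The test signals are synthetic and only illustrate the metric's behaviour; evaluation toolkits may differ in details such as the value of `eps`.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
reference = np.sin(2 * np.pi * 440.0 * t)                   # synthetic "ground truth"
clean_estimate = reference + 0.1 * rng.standard_normal(t.size)
noisy_estimate = reference + 1.0 * rng.standard_normal(t.size)

good = si_sdr(clean_estimate, reference)
bad = si_sdr(noisy_estimate, reference)
rescaled = si_sdr(0.5 * clean_estimate, reference)  # same quality, different gain
```

The rescaled estimate scores the same as the original one, which is the property that distinguishes SI-SDR from plain SDR. The function requires `reference`, which is exactly what is unavailable in the reference-free setting discussed above.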

  • CLAPScore: A reference-free, semantic similarity metric based on the CLAP model. For a separated signal $\hat{s}$ and text query $c$, CLAPScore computes the normalized cosine similarity between the respective CLAP audio and text embeddings. Values close to 1 indicate strong semantic alignment (Xiao et al., 2024).
  • CLAPScore-i: Improvement over the mixture, i.e., $\text{CLAPScore}(\hat{s}, c) - \text{CLAPScore}(x, c)$, which quantifies the separation gain relative to the original mixture.
  • RefCLAPScore: Harmonic mean of CLAPScore for the separated output and (if available) the ground-truth, enabling more nuanced evaluation when references exist.
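The first two metrics in the list above reduce to cosine similarities between embeddings, as the following sketch shows. It assumes the CLAP audio and text embeddings have already been computed; the random vectors below are stand-ins, no CLAP model is loaded, and the function names are hypothetical. RefCLAPScore is omitted because its exact inputs are not pinned down in the description above.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def clap_score(audio_emb, text_emb):
    """CLAPScore: similarity between an audio embedding and the query embedding."""
    return cosine_similarity(audio_emb, text_emb)

def clap_score_improvement(separated_emb, mixture_emb, text_emb):
    """CLAPScore-i: gain of the separated output over the unprocessed mixture."""
    return clap_score(separated_emb, text_emb) - clap_score(mixture_emb, text_emb)

rng = np.random.default_rng(0)
dim = 512  # embedding size chosen for illustration only

text_emb = rng.standard_normal(dim)
# Stand-ins: the separated output sits closer to the query than the mixture does.
separated_emb = text_emb + 0.3 * rng.standard_normal(dim)
mixture_emb = text_emb + 2.0 * rng.standard_normal(dim)

score = clap_score(separated_emb, text_emb)
gain = clap_score_improvement(separated_emb, mixture_emb, text_emb)
```

A positive `gain` means the separated signal is semantically closer to the query than the input mixture was, without any ground-truth source being needed.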

Correlations between CLAPScore and SI-SDR are moderate (PCC ≈ 0.27–0.29 on DCASE24-T9) (Xiao et al., 2024). For speech-centric applications, speaker purity (measured by LLM-as-judge and human rating) and BERTScore for transcribed summaries are also used (Okocha et al., 21 Oct 2025).

5. Advances, Variations, and Application Scenarios

Recent advances address open-vocabulary generalization, robustness, and domain adaptation:

  • Open World and Multi-Source Separation: OpenSep leverages textual inversion (audio captioning + LLM source parsing) and knowledge-based LLM prompting (few-shot source description) to enable open, variable-source separation (Mahmud et al., 2024).
  • Hierarchical and Multi-Stage Models: HSM-TSS decomposes alignment (global: Q-Audio/FLAN-T5; local: AudioMAE) and reconstruction (TF-Codec), supporting bidirectional (extract/remove) operations and instruction parsing (Yin et al., 27 May 2025).
  • Generative and Diffusion Approaches: FlowSep applies rectified flow matching in latent space for efficient, high-fidelity separation, outperforming diffusion-based baselines in speed and quality (Yuan et al., 2024). PromptSep extends to multimodal prompting (text, vocal imitation), sound removal, and general-purpose operator control (Wen et al., 6 Nov 2025). DGMO demonstrates zero-shot separation using pretrained diffusion priors with mask optimization (Lee et al., 3 Jun 2025).
  • Weakly/Semi-Supervised Training: Bi-modal similarity approaches reduce reliance on labeled singles by learning from mix-and-separate and CLAP space, reaching >97% of supervised performance in some settings (Mahmud et al., 2024).
  • Augmentation and Data Efficiency: Caption augmentation using tailored LLM prompting (e.g., WavCaps prompts with SVO structure and entity anonymization) yields notable SDR/SI-SDR gains (Δ ≈ +1.7 dB) (Lee et al., 2024).
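To give a feel for the rectified flow matching mentioned in the list above, the sketch below integrates a velocity field with a few Euler steps, moving a noise sample toward a target latent along a straight path. It is a toy under strong assumptions and not FlowSep itself: the velocity field here is an oracle that knows the target, whereas a real model would use a trained network conditioned on the mixture and the text query, and the latent is a plain random vector.

```python
import numpy as np

def euler_sample(velocity, z0, num_steps=10):
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with fixed Euler steps."""
    z = z0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * velocity(z, t)
    return z

rng = np.random.default_rng(0)
latent_dim = 16

z0 = rng.standard_normal(latent_dim)  # noise sample at t = 0
z1 = rng.standard_normal(latent_dim)  # stand-in for the target latent at t = 1

def oracle_velocity(z, t):
    """Velocity of the straight path z_t = (1 - t) * z0 + t * z1.

    Assumption: uses the known target z1 in place of a learned network.
    """
    return (z1 - z) / (1.0 - t)

z_few = euler_sample(oracle_velocity, z0, num_steps=2)
z_many = euler_sample(oracle_velocity, z0, num_steps=50)
```

Because the path is a straight line with constant velocity, two Euler steps reach the same endpoint as fifty. Paths of this kind are what allow flow-based samplers to use few integration steps, in contrast to procedures that need many iterations per sample.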

Applications span computational auditory scene analysis, robust sound event detection in noise (with co-training, event counting, and LASS-augmented SED) (Yin et al., 2024, Chen et al., 10 Aug 2025, Yin et al., 2024), audio-visual separation (e.g., leveraging CLIP for trimodal alignment) (Tan et al., 2023), and clinical scenarios (speaker extraction for child speech processing) (Okocha et al., 21 Oct 2025).

6. Limitations and Future Directions

Notwithstanding strong progress, several limitations persist:

  • Domain Bias and Embedding Alignment: LASS systems depend on the generalization of their language encoders (CLAP, FLAN-T5) to out-of-domain queries, languages, and rare sounds. CLAPScore may misrepresent rare classes or descriptions outside the pretraining corpus (Xiao et al., 2024, Yin et al., 27 May 2025).
  • Multiple Valid Queries: Metrics such as CLAPScore only evaluate proximity to the supplied text, not alternative valid descriptions.
  • Scalability to Real Scenes: While hierarchical and generative approaches improve complexity handling, dense polyphony and highly variable mixtures remain challenging.
  • Inference and Computational Cost: Generative approaches, especially diffusion-based models, entail substantial test-time cost (e.g., DGMO requires >500 optimization epochs per sample), though FlowSep and other RFM methods offer significant speedup (Yuan et al., 2024, Lee et al., 3 Jun 2025).
  • Training Data Limitations: The absence of sufficient diverse and high-quality cross-modal data remains a bottleneck for enhancing real-world robustness (Yin et al., 27 May 2025, Lee et al., 2024).

Key avenues for future work include end-to-end multimodal prompt tuning, full-bandwidth and high-frequency-aware pretraining (especially for SSL and CLAP), integration of multimodal (text, vocal imitation, vision) conditioning, jointly learned query embeddings, hierarchical/multi-query separation, and scalable, efficient generative architectures (Yin et al., 27 May 2025, Wen et al., 6 Nov 2025, Feng et al., 20 Jun 2025, Mahmud et al., 2024).

7. Benchmark Results and Comparative Summary

The following condensed table summarizes key results for major LASS models, as reported in the literature.

Method | Dataset | Reported metrics (SDR dB, SI-SDR dB, CLAPScore) | Notable Features
AudioSep | DCASE24-T9 | SDR 8.19, SI-SDR 6.68, CLAPScore 0.261 | Large ResUNet, CLAP, FiLM, open-domain
FlowSep | AudioCaps | 21.9 (FAD: 2.86) | Generative, RFM in VAE latents
HSM-TSS | 3Sets | 0.436 (CLAP), 0.752 (AFSim) | Hierarchical global/local separation
OpenSep | MUSIC | 9.56 | LLM-based caption parsing, multi-level
Hybrid-Sep | DE-S | 8.82, 27.6% | SSL fusion, adversarial+diffusion loss
PromptSep | ASFX | SDRi 5.65 | Diffusion (DAC VAE), vocal imitation
Weakly Sup | MUSIC (unsup) | 7.9 | CLAP bi-modal loss, no singles

All results are as reported in the respective sources. For context, classical masking-based UNet or LASS-Net approaches yield 5.4–5.9 dB SDR on LASS-Test (Liu et al., 2022), while later systems employing open-domain architectures and text-conditional fusion surpass 8 dB (Liu et al., 2023, Xiao et al., 2024).


For code, pretrained models, and additional quantitative and qualitative results, refer to the repositories cited in (Liu et al., 2023, Xiao et al., 2024, Mahmud et al., 2024) and (Wen et al., 6 Nov 2025).
