Context-Aware Query Refinement for Target Sound Extraction: Handling Partially Matched Queries (2509.08292v1)

Published 10 Sep 2025 in eess.AS and cs.SD

Abstract: Target sound extraction (TSE) is the task of extracting a target sound specified by a query from an audio mixture. Much prior research has focused on the problem setting under the Fully Matched Query (FMQ) condition, where the query specifies only active sounds present in the mixture. However, in real-world scenarios, queries may include inactive sounds that are not present in the mixture. This leads to scenarios such as the Fully Unmatched Query (FUQ) condition, where only inactive sounds are specified in the query, and the Partially Matched Query (PMQ) condition, where both active and inactive sounds are specified. Among these conditions, the performance degradation under the PMQ condition has been largely overlooked. To achieve robust TSE under the PMQ condition, we propose context-aware query refinement. This method eliminates inactive classes from the query during inference based on the estimated sound class activity. Experimental results demonstrate that while conventional methods suffer from performance degradation under the PMQ condition, the proposed method effectively mitigates this degradation and achieves high robustness under diverse query conditions.

Summary

The paper proposes a query refinement method that enhances target sound extraction by removing inactive query classes.
It utilizes a multi-task architecture combining convolutional layers and a BiGRU classifier to jointly perform extraction and sound class estimation.
Experimental results demonstrate maintained SNR improvement under partially matched queries with minimal computational overhead.

Context-Aware Query Refinement for Target Sound Extraction under Partially Matched Queries

Introduction

The paper addresses a critical gap in Target Sound Extraction (TSE) research: the handling of Partially Matched Query (PMQ) conditions, where user-specified queries may include both active and inactive sound classes. While previous work has focused on Fully Matched Query (FMQ) and Fully Unmatched Query (FUQ) scenarios, the PMQ condition reflects realistic user behavior in applications such as hearables and environmental monitoring, where users may not perfectly identify all active sources. The authors propose a context-aware query refinement method that estimates sound class activity in the mixture and refines the query by removing inactive classes, thereby mitigating performance degradation under PMQ conditions.

Figure 1: Overall architecture of the proposed query refinement method, illustrating the joint TSE and sound class estimation pipeline and example queries for FMQ, PMQ, and FUQ conditions.

TSE systems typically operate under the assumption that queries specify only active sources (FMQ). However, in practice, queries may include inactive sources, leading to PMQ and FUQ conditions. Existing approaches for FUQ, such as training with inactive samples (IS), can suppress non-target extraction but degrade FMQ performance. Methods that replace output with silence based on target detection are limited to single-class extraction and cannot address PMQ scenarios, where selective extraction of active classes is required.

The PMQ condition introduces a distinct challenge: when the query includes inactive classes, the system can erroneously extract non-target sounds. According to the paper, this degradation has been largely overlooked; the authors analyze its severity and propose a solution that generalizes across FMQ, PMQ, and FUQ conditions.

Proposed Method: Context-Aware Query Refinement

The core of the proposed method is a multi-task architecture that jointly performs TSE and sound class estimation. The system comprises:

Encoder: 1-D convolutional layers transform the input mixture into feature representations.
Shared Feature Extractor: Stacks of 1-D convolutional blocks (Conv-TasNet style) extract features for both TSE and classification.
Mask Estimator: Conditions shared features on the query embedding to estimate extraction masks.
Decoder: Applies the mask to reconstruct the target sound in the time domain.
Classifier: BiGRU-based module estimates the existence probability of each sound class in the mixture.

During inference, the classifier's output is used to refine the query: classes with estimated probability below a threshold $\theta$ are removed. This process is implemented via element-wise multiplication of the binarized classifier output and the original query vector.
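The refinement step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `refine_query`, the multi-hot list representation of the query, and the default `theta=0.5` are assumptions made for the example.

```python
def refine_query(query, class_probs, theta=0.5):
    """Zero out query entries whose estimated class probability is below theta.

    query       : multi-hot list (1 = class requested), hypothetical format
    class_probs : per-class existence probabilities from the classifier
    theta       : decision threshold (0.5 is an assumed default)
    """
    if len(query) != len(class_probs):
        raise ValueError("query and class_probs must have the same length")
    # Binarize the classifier output: 1 if the class is judged active, else 0.
    activity = [1 if p >= theta else 0 for p in class_probs]
    # Element-wise product keeps only requested classes that are judged active.
    return [q * a for q, a in zip(query, activity)]


# Hypothetical 4-class PMQ example: the user requests classes 0 and 2,
# but the classifier judges only classes 0 and 1 to be present.
class_probs = [0.91, 0.75, 0.08, 0.02]
pmq_refined = refine_query([1, 0, 1, 0], class_probs)
print(pmq_refined)  # [1, 0, 0, 0]

# Hypothetical FUQ example: every requested class is judged inactive,
# so the refined query collapses to a zero vector.
fuq_refined = refine_query([0, 0, 1, 1], class_probs)
print(fuq_refined)  # [0, 0, 0, 0]
```

Because the refinement is a single element-wise product, it leaves the extraction network itself unchanged; only the conditioning vector is modified.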

Figure 2: Example of performance degradation under the PMQ condition using baseline 1, showing erroneous extraction when an inactive class is included in the query.

The training regime uses only FMQ conditions, optimizing a weighted sum of TSE loss (negative thresholded SNR) and classification loss (binary cross-entropy). This design maximizes extraction performance while enabling robust query refinement at inference.
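As a rough sketch of this objective, the weighted sum can be written as follows. The notation is an assumption for illustration: the summary does not give the exact formulas, so the weight $\lambda$, the soft threshold $\tau$, the placement of the weight on the classification term, and the averaging over $C$ classes are hypothetical choices. The thresholded SNR is shown in the form commonly used in the separation literature, which may differ in detail from the paper's.

$$
\mathcal{L} = \mathcal{L}_{\text{TSE}} + \lambda\,\mathcal{L}_{\text{cls}}
$$

$$
\mathcal{L}_{\text{TSE}} = -10\log_{10}\frac{\lVert s\rVert^{2}}{\lVert s-\hat{s}\rVert^{2} + \tau\lVert s\rVert^{2}},
\qquad
\mathcal{L}_{\text{cls}} = -\frac{1}{C}\sum_{c=1}^{C}\Big[\,y_c\log\hat{y}_c + (1-y_c)\log(1-\hat{y}_c)\Big]
$$

Here $s$ is the target signal, $\hat{s}$ the extracted signal, $y_c \in \{0,1\}$ the ground-truth activity of class $c$, and $\hat{y}_c$ the classifier's estimated existence probability. The term $\tau\lVert s\rVert^{2}$ caps the achievable SNR so that already well-separated examples do not dominate the gradient.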

Experimental Results

Experiments were conducted using a dataset with class label-based queries, evaluating performance under FMQ, PMQ, and FUQ conditions. The main metrics are SNR improvement (SNRi) for extraction and attenuation ratio for silence approximation.
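For readers unfamiliar with SNRi, the sketch below computes it under the standard definition: the SNR of the extracted signal minus the SNR of the unprocessed mixture, both measured against the target. The function names, the `eps` stabilizer, and the toy signals are assumptions for illustration; the paper's evaluation code may differ. The attenuation ratio is not sketched here because the summary does not define it.

```python
import math


def snr_db(reference, estimate, eps=1e-12):
    """Signal-to-noise ratio in dB of an estimate with respect to a reference."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps (an assumed stabilizer) avoids division by zero and log of zero.
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


def snr_improvement_db(target, mixture, estimate):
    """SNRi: SNR of the extracted signal minus SNR of the raw mixture."""
    return snr_db(target, estimate) - snr_db(target, mixture)


# Toy signals (hypothetical): a target plus an interfering source.
target = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = [t + i for t, i in zip(target, interference)]

# Pretend the extractor removed 80% of the interference amplitude.
estimate = [t + 0.2 * i for t, i in zip(target, interference)]

print(round(snr_improvement_db(target, mixture, estimate), 2))  # 13.98
```

Returning the mixture unchanged gives an SNRi of 0 dB, which is why negative values indicate that the system made the output worse than doing nothing.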

Key findings:

FMQ Condition: The proposed method matches baseline performance, indicating that multi-task learning does not impair TSE.
PMQ Condition: Conventional methods suffer significant SNRi degradation as the number of inactive classes increases. The proposed query refinement method maintains high SNRi, demonstrating robustness to query noise.
FUQ Condition: Training with IS yields the best silence approximation, but query refinement also improves performance: when the classifier flags every queried class as inactive, the refined query becomes a zero vector.

Figure 3: Relationship between $n_{\text{inactive}}$ and SNRi under PMQ conditions, showing the effectiveness of query refinement in maintaining extraction performance as query noise increases.

The classifier achieves a Macro F1 score of ~0.65, and the additional computational cost is modest (4.7% increase in MACs, 4.6% in parameters). The trade-off is that false negatives in classification can degrade FMQ performance by erroneously removing active target classes from the query. Lowering the threshold $\theta$ mitigates this risk.
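The effect of the threshold can be illustrated with a toy sweep. The class probabilities, labels, and helper names below are invented for the example and are not results from the paper; the sketch only shows the mechanism, namely that a lower $\theta$ removes fewer classes, which protects active targets but eventually lets inactive ones through.

```python
def refine_query(query, class_probs, theta):
    """Keep a requested class only if its estimated probability is >= theta."""
    return [q if p >= theta else 0 for q, p in zip(query, class_probs)]


def count_refinement_errors(query, class_probs, active, theta):
    """Return (active classes wrongly removed, inactive classes wrongly kept)."""
    refined = refine_query(query, class_probs, theta)
    wrongly_removed = sum(
        1 for q, r, a in zip(query, refined, active) if q == 1 and r == 0 and a == 1
    )
    wrongly_kept = sum(1 for r, a in zip(refined, active) if r == 1 and a == 0)
    return wrongly_removed, wrongly_kept


# Hypothetical PMQ case: classes 0 and 1 are active; the query asks for 0, 1, and 3.
active = [1, 1, 0, 0]
query = [1, 1, 0, 1]
class_probs = [0.90, 0.35, 0.10, 0.20]  # class 1 is active but scored low

for theta in (0.5, 0.3, 0.1):
    print(theta, count_refinement_errors(query, class_probs, active, theta))
# 0.5 (1, 0)  -> active class 1 is wrongly removed (a false negative)
# 0.3 (0, 0)  -> both errors avoided
# 0.1 (0, 1)  -> inactive class 3 now survives refinement
```

In this toy case an intermediate threshold avoids both error types, but in general the best value depends on the classifier's score distribution and would need to be tuned on held-out data.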

Trade-offs, Limitations, and Implications

The main trade-off is between robustness to PMQ/FUQ conditions and optimal FMQ performance. The classifier's accuracy, particularly its false negative rate, is critical; improvements in sound event detection could further enhance query refinement. The added classifier is lightweight (roughly 4.7% more MACs and 4.6% more parameters), so the overhead remains small for real-time applications.

Practically, the method enables more user-friendly TSE systems that tolerate imperfect queries, a necessity for deployment in consumer devices. Theoretically, it establishes a framework for integrating context-aware inference in query-conditioned source separation.

Future Directions

Potential future work includes:

Leveraging temporal information from sound event detection for frame-level query refinement.
Addressing intra-clip PMQ conditions where class activity varies within an audio segment.
Exploring more sophisticated classifier architectures or pre-trained audio foundation models to improve query refinement accuracy.
Extending the approach to multi-modal queries (e.g., text, audio, and visual cues).

Conclusion

The paper presents a context-aware query refinement method for TSE, addressing the overlooked PMQ condition and demonstrating robust extraction performance under diverse query scenarios. The approach is efficient, generalizes across FMQ, PMQ, and FUQ conditions, and is readily applicable to real-world systems. The main limitation is the dependency on classifier accuracy, but the framework provides a solid foundation for future advances in context-aware source separation.