Sound Event Bounding Boxes (2406.04212v1)

Published 6 Jun 2024 in eess.AS and cs.SD

Abstract: Sound event detection is the task of recognizing sounds and determining their extent (onset/offset times) within an audio clip. Existing systems commonly predict sound presence confidence in short time frames. Then, thresholding produces binary frame-level presence decisions, with the extent of individual events determined by merging consecutive positive frames. In this paper, we show that frame-level thresholding degrades the prediction of the event extent by coupling it with the system's sound presence confidence. We propose to decouple the prediction of event extent and confidence by introducing SEBBs, which format each sound event prediction as a tuple of a class type, extent, and overall confidence. We also propose a change-detection-based algorithm to convert legacy frame-level outputs into SEBBs. We find the algorithm significantly improves the performance of DCASE 2023 Challenge systems, boosting the state of the art from .644 to .686 PSDS1.

Summary

The paper introduces SEBBs that decouple temporal extent and confidence predictions, significantly enhancing sound event detection accuracy.
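It proposes a change-detection-based algorithm for converting frame-level outputs into SEBBs, together with a hybrid variant that combines it with conventional thresholding. Applied to DCASE 2023 Challenge systems, the approach raises PSDS1 from 0.644 to 0.686 and reaches an F-score of 0.734.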
The experimental improvements suggest practical benefits for applications like surveillance and autonomous driving, reinforcing the shift from frame-level to event-level detection.

Sound Event Bounding Boxes

The paper "Sound Event Bounding Boxes", by Janek Ebbers, François G. Germain, Gordon Wichern, and Jonathan Le Roux of Mitsubishi Electric Research Laboratories, challenges the frame-level thresholding conventionally used in Sound Event Detection (SED). It proposes Sound Event Bounding Boxes (SEBBs), an output format that decouples the prediction of an event's extent from the prediction of its confidence, and reports improved SED accuracy as a result.

Core Contributions

The central contribution of the paper is SEBBs as a new output format for SED systems. Traditional SED methods predict frame-level sound presence confidences, threshold them into binary frame-level decisions, and merge consecutive positive frames into event predictions. This couples the predicted event extent to the confidence threshold: raising the threshold not only demands higher confidence but also shortens or splits the detected events, so the onset and offset times change with the operating point. SEBBs decouple the two predictions by representing each sound event as a tuple of class type, temporal extent, and a single confidence score.
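The coupling described above can be illustrated with a minimal sketch of the conventional pipeline. This is not code from the paper; the function name `threshold_events`, the `frame_hop` parameter, and the toy scores are assumptions made for illustration.

```python
import numpy as np

def threshold_events(scores, threshold, frame_hop=1.0):
    """Merge runs of consecutive frames with score > threshold into (onset, offset) pairs.

    Hypothetical helper: offsets are exclusive, times are frame index * frame_hop.
    """
    active = np.asarray(scores) > threshold
    # Pad with inactive frames so every run has a rising and a falling edge.
    padded = np.concatenate(([False], active, [False])).astype(int)
    edges = np.diff(padded)
    onsets = np.flatnonzero(edges == 1)    # first active frame of each run
    offsets = np.flatnonzero(edges == -1)  # first inactive frame after each run
    return [(float(on * frame_hop), float(off * frame_hop))
            for on, off in zip(onsets, offsets)]

# Toy frame-level confidences for one class (assumed values).
scores = [0.1, 0.3, 0.6, 0.9, 0.9, 0.6, 0.3, 0.1]
print(threshold_events(scores, 0.2))  # [(1.0, 7.0)]
print(threshold_events(scores, 0.8))  # [(3.0, 5.0)]
```

The same event is reported with an extent of six frames at a threshold of 0.2 and two frames at 0.8, which is the dependence of extent on the confidence threshold that the paper sets out to remove.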

Methodological Advancements

SEBBs:
- SEBBs format sound events as a tuple $(c_j, t_{\text{on}, j}, t_{\text{off}, j}, \overline{y}_j)$, where $c_j$ denotes the class of the event, $t_{\text{on}, j}$ and $t_{\text{off}, j}$ denote the onset and offset times, and $\overline{y}_j$ represents the aggregated confidence score.
- This structure separates the temporal extent and confidence predictions, making event confidence thresholding independent of temporal extent predictions.
Change-Detection Algorithm:
- To convert the frame-level outputs of legacy systems into SEBBs, the authors propose a change-detection-based algorithm. It derives delta scores from the frame-level scores using ideal step filters and uses them to locate event onsets and offsets more robustly than frame-level thresholding.
- A hybrid method that combines the threshold-based and change-detection approaches is also introduced to make SEBB prediction more flexible.
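The two ideas above can be sketched together in a few lines. This is a simplified illustration under stated assumptions, not the authors' algorithm: the names `SEBB`, `delta_scores`, `local_peaks`, and `scores_to_sebbs`, the parameters `half_len`, `min_delta`, and `frame_hop`, the zero padding at clip edges, the onset/offset pairing rule, and the use of the mean score as the aggregated confidence are all choices made here for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SEBB:
    """One event as (class, onset, offset, confidence); field names are assumed."""
    label: str
    onset: float
    offset: float
    confidence: float

def delta_scores(scores, half_len):
    """Step-filter response at each frame boundary n = 0..N:
    mean of the next half_len frames minus mean of the previous half_len frames."""
    padded = np.pad(scores, half_len)  # zero padding: silence assumed outside the clip
    return np.array([
        padded[n + half_len:n + 2 * half_len].mean() - padded[n:n + half_len].mean()
        for n in range(len(scores) + 1)
    ])

def local_peaks(x, min_height):
    """Indices of local maxima of x that exceed min_height (last index of a plateau)."""
    padded = np.concatenate(([-np.inf], x, [-np.inf]))
    mid = padded[1:-1]
    return np.flatnonzero((mid >= padded[:-2]) & (mid > padded[2:]) & (mid > min_height))

def scores_to_sebbs(scores, label, half_len=2, min_delta=0.2, frame_hop=1.0):
    scores = np.asarray(scores, dtype=float)
    delta = delta_scores(scores, half_len)
    onsets = local_peaks(delta, min_delta)    # strong rises in the score
    offsets = local_peaks(-delta, min_delta)  # strong falls in the score
    sebbs, last_off = [], 0
    for on in onsets:
        later = offsets[offsets > on]
        if on < last_off or later.size == 0:
            continue  # onset inside the previous box, or no matching offset
        off = int(later[0])
        confidence = float(scores[on:off].mean())  # aggregation by mean is an assumption
        sebbs.append(SEBB(label, float(on * frame_hop), float(off * frame_hop), confidence))
        last_off = off
    return sebbs

scores = [0.1, 0.1, 0.6, 0.9, 0.9, 0.6, 0.1, 0.1]
boxes = scores_to_sebbs(scores, "speech")
for thr in (0.5, 0.7, 0.8):
    # Event-level thresholding: a box is kept or dropped, its extent never changes.
    print(thr, [(b.onset, b.offset) for b in boxes if b.confidence > thr])
```

In this toy example the single box spans frames 2 to 6 with a confidence of about 0.75, so it is kept unchanged at thresholds 0.5 and 0.7 and dropped entirely at 0.8. The extent is fixed by the score changes, and the threshold only decides whether the event is reported.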

Experimental Results

The proposed methods were evaluated on the DCASE 2023 Challenge Task 4 dataset and improved on conventional thresholding:

- The best-performing system under the conventional thresholding approach scored a PSDS1 (Polyphonic Sound Detection Score, scenario 1) of 0.644. Applying the proposed SEBB methods raised this to a PSDS1 of 0.686, alongside an F-score of 0.734.
- Across systems, adopting SEBBs improved PSDS by 4.1 percentage points on average and F-scores by 3.4 percentage points.

Practical and Theoretical Implications

The implications of this research are substantial for both practical deployments of SED systems and their theoretical underpinnings:

Practical Benefits: In real-world applications such as surveillance, wildlife monitoring, and autonomous driving, the improved accuracy and robustness of SEBBs can directly translate into better system performance and reliability.
Theoretical Insights: Decoupling event extent from confidence removes a flaw in the traditional SED pipeline, in which the decision threshold also determined the event boundaries. The work also underlines the need to move beyond frame-level methods toward event-level predictions in SED research.

Future Prospects

The paper opens several avenues for future work:

- Enhancing SEBB prediction with advanced deep learning techniques and exploring end-to-end SEBB training strategies.
- Extending the SEBB framework to other audio event detection tasks and evaluating it across diverse datasets.
- Improving the robustness of the change-detection algorithm and the hybrid method through more sophisticated statistical and machine learning techniques.

In conclusion, Sound Event Bounding Boxes improve the accuracy and reliability of Sound Event Detection systems by decoupling the prediction of event extent from detection confidence. Because the change-detection-based conversion operates on existing frame-level outputs, the format can be adopted by legacy systems, and the reported gains on DCASE 2023 Challenge systems support a shift from frame-level to event-level prediction in SED.
