
A Framework for the Robust Evaluation of Sound Event Detection (1910.08440v2)

Published 18 Oct 2019 in eess.AS and cs.SD

Abstract: This work defines a new framework for performance evaluation of polyphonic sound event detection (SED) systems, which overcomes the limitations of the conventional collar-based event decisions, event F-scores and event error rates. The proposed framework introduces a definition of event detection that is more robust against labelling subjectivity. It also resorts to polyphonic receiver operating characteristic (ROC) curves to deliver more global insight into system performance than F1-scores, and proposes a reduction of these curves into a single polyphonic sound detection score (PSDS), which allows system comparison independently from operating points (OPs). The presented method also delivers better insight into data biases and classification stability across sound classes. Furthermore, it can be tuned to varying applications in order to match a variety of user experience requirements. The benefits of the proposed approach are demonstrated by re-evaluating the baseline and two of the top-performing systems from DCASE 2019 Task 4.

Citations (154)

Summary

  • The paper presents a novel framework that overcomes limitations of collar-based metrics by using ROC curves for robust sound event detection evaluation.
  • It employs detection, ground truth intersection, and cross-trigger tolerance criteria to accurately assess performance across diverse sound classes.
  • The framework’s PSDS metric facilitates fair system comparisons across varying operating points, revealing detailed insights into performance trade-offs and data biases.

A Comprehensive Framework for Sound Event Detection Evaluation

The paper presents a framework for evaluating polyphonic sound event detection (SED) systems that addresses the limitations of collar-based metrics such as event F-scores and event error rates. The framework introduces a definition of event detection that is more resilient to subjective labeling. To offer a global view of system performance, it employs polyphonic receiver operating characteristic (ROC) curves, which capture behavior across operating points rather than at the single point summarized by an F1-score. It then reduces these ROC curves to a single metric, the Polyphonic Sound Detection Score (PSDS), which enables system comparison independently of operating points (OPs), characterizes classification stability across sound classes, and exposes potential data biases. Finally, the evaluation parameters can be tuned to match the user-experience requirements of different applications.
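The reduction from a polyphonic ROC curve to a single score can be sketched concretely. Assuming the PSD-ROC has already been gathered as arrays of effective false-positive rates (per hour) and effective true-positive ratios across operating points, the PSDS is the area under that curve up to a maximum FP rate, normalized by that maximum. The function name, the grid resolution, and the toy curve below are illustrative, not the paper's reference implementation:

```python
import numpy as np

def psds_from_curve(efpr, etpr, e_max=100.0):
    """Normalized area under a PSD-ROC curve up to e_max FPs per hour.

    efpr: effective false-positive rates per hour, ascending.
    etpr: effective true-positive ratios in [0, 1] at each rate.
    """
    efpr = np.asarray(efpr, dtype=float)
    etpr = np.asarray(etpr, dtype=float)
    # Sample the curve on a fine grid clipped at e_max,
    # interpolating linearly between measured operating points.
    grid = np.linspace(0.0, e_max, 1001)
    tpr_on_grid = np.interp(grid, efpr, etpr)
    # PSDS = (1 / e_max) * integral from 0 to e_max of TPR(e) de
    return np.trapz(tpr_on_grid, grid) / e_max

# A system whose TPR saturates at 0.8 almost immediately scores close to 0.8.
curve_fpr = [0.0, 1.0, 100.0]
curve_tpr = [0.0, 0.8, 0.8]
print(round(psds_from_curve(curve_fpr, curve_tpr), 3))  # 0.796
```

Because the score integrates over operating points, two systems are compared by the whole shape of their curves rather than by whichever single threshold each team happened to submit.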

Motivations and Methodology

The paper primarily addresses the challenges inherent in SED evaluations, particularly those associated with selecting and consistently applying evaluation criteria across multiple systems. Historical metrics, such as event-wise and segment-wise error rates, often conflate sound event modeling evaluation with OP tuning evaluation. This conflation stems from the pronounced dependency of these metrics on varying OPs. To resolve this, the proposed framework diminishes the impact of OP choices by incorporating ROC curves that provide a classification performance overview over multiple OPs, reminiscent of practices in keyword spotting and speaker recognition.

Additionally, existing methods rely on collars, i.e. strict timing tolerances around the onset and offset of each detected event, which fail to account for the subjectivity of sound event labeling: human annotators often disagree on exact temporal boundaries. By defining true positives (TPs) and false positives (FPs) through an intersection-based criterion rather than boundary collars, the framework remains robust to this labeling variability.
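The contrast can be made concrete with a small sketch. Consider a detection whose onset is 400 ms late relative to the label: a 200 ms collar rejects it outright, while an intersection criterion accepts it because most of the detection still overlaps the ground truth. The times, the collar width, and the 50% intersection threshold below are illustrative choices, not values prescribed by the paper:

```python
def collar_match(det, gt, collar=0.2):
    """Collar-based decision: onset and offset must each fall within +/- collar seconds."""
    return abs(det[0] - gt[0]) <= collar and abs(det[1] - gt[1]) <= collar

def intersection_match(det, gt, min_ratio=0.5):
    """Intersection-based decision: enough of the detection must overlap the label."""
    overlap = max(0.0, min(det[1], gt[1]) - max(det[0], gt[0]))
    return overlap / (det[1] - det[0]) >= min_ratio

gt = (0.0, 3.0)   # labelled event: 0 s to 3 s
det = (0.4, 3.1)  # detection with a 400 ms late onset
print(collar_match(det, gt))        # False: the onset misses the 200 ms collar
print(intersection_match(det, gt))  # True: ~96% of the detection overlaps the label
```

A small disagreement about where the event "really" starts flips the collar-based verdict, whereas the intersection ratio barely moves.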

Key Aspects of the Proposed Framework

The new evaluation framework encompasses three pivotal criteria for defining TPs, FPs, and cross-triggers (CTs):

  1. Detection Tolerance Criterion (DTC): a detection is considered valid only if a sufficient fraction of its duration intersects ground truth events of the same class, filtering out detections with little relevance to the labels.
  2. Ground Truth Intersection Criterion (GTC): a ground truth event counts as detected (a TP) only if DTC-valid detections cover a sufficient fraction of its duration, yielding a robust TP count.
  3. Cross-Trigger Tolerance Criterion (CTTC): identifies CTs, the subset of FPs that substantially overlap events of another sound class, providing insight into dataset biases and class confusions.
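The three criteria above can be sketched as simple predicates over (start, end) intervals. This is a simplified illustration under stated assumptions: the threshold values are placeholders, overlaps with multiple ground truth events are summed without handling the case where those events themselves overlap, and the function names are invented for this example:

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def dtc_valid(det, same_class_gts, rho_dtc=0.5):
    """DTC: a detection is valid if enough of it intersects same-class ground truth."""
    covered = sum(overlap(det, gt) for gt in same_class_gts)
    return covered / (det[1] - det[0]) >= rho_dtc

def gtc_detected(gt, valid_dets, rho_gtc=0.5):
    """GTC: a ground truth event is a TP if DTC-valid detections cover enough of it."""
    covered = sum(overlap(gt, det) for det in valid_dets)
    return covered / (gt[1] - gt[0]) >= rho_gtc

def cttc_cross_trigger(fp_det, other_class_gts, rho_cttc=0.3):
    """CTTC: an FP is a cross-trigger if it sufficiently overlaps another class's events."""
    covered = sum(overlap(fp_det, gt) for gt in other_class_gts)
    return covered / (fp_det[1] - fp_det[0]) >= rho_cttc

dog_gts = [(0.0, 2.0)]
cat_gts = [(5.0, 6.0)]
det = (0.5, 2.5)  # a "dog" detection, 75% of which lies inside the dog event
print(dtc_valid(det, dog_gts))             # True: 1.5 s of a 2.0 s detection overlaps
print(gtc_detected(dog_gts[0], [det]))     # True: 1.5 s of the 2.0 s event is covered
stray = (5.2, 5.8)  # a "dog" FP falling entirely inside a cat event
print(cttc_cross_trigger(stray, cat_gts))  # True: it counts as a cross-trigger
```

Counting TPs via the GTC (rather than via the detections themselves) is what makes the TP count insensitive to a system splitting one long event into several shorter detections.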

This methodological enhancement enables the delineation of a more accurate, comprehensive picture of system capabilities, beyond the confines of subjective temporal labeling.

Implications and Experimental Validation

The implications of this framework transcend theoretical refinement; they extend to practical evaluation improvements of SED systems in diverse application areas, notably smart home devices and hearing aids. The adaptability of evaluation parameters also empowers users to tailor assessments to various interface experiences, enhancing utility across different domains.

The paper validates the framework by re-evaluating the baseline and two of the top-performing systems from DCASE 2019 Task 4. The results show that the PSDS offers more nuanced insight into performance trade-offs than traditional F1-scores. Notably, the framework reveals similarities in performance between systems that older metrics obscure, illustrating the value of the PSDS for fair, OP-independent system comparison.

Future Directions

This framework lays the groundwork for future research directions focusing on fine-tuning detection criteria, further aligning evaluation methods with practical applications. Additionally, exploring extensions to incorporate emerging sound event datasets or consider more diverse soundscapes merits investigation. These developments could further refine the predictive reliability and application robustness of SED systems.

In conclusion, the paper presents a robust, flexible framework for evaluating SED systems, offering substantive improvements over traditional metrics and enabling a more faithful assessment of real-world performance. The framework encourages further refinement and adaptation to meet the evolving needs of sound event detection technology.
