FSD50K: An Open Dataset of Human-Labeled Sound Events (2010.00475v2)

Published 1 Oct 2020 in cs.SD, cs.LG, eess.AS, and stat.ML

Abstract: Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2M tracks from YouTube videos and encompassing over 500 sound classes. However, AudioSet is not an open dataset as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube videos gradually disappearing and usage rights issues. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.

Citations (395)

Summary

  • The paper presents FSD50K as a comprehensive, open dataset for sound event recognition using 51,197 audio clips from Freesound.
  • It details a rigorous annotation process combining crowdsourcing and expert validation to minimize label noise across 200 sound classes.
  • Experimental evaluations reveal that audio-specific models outperform generic architectures, highlighting the benefits of tailored SER methodologies.

An Overview of "FSD50K: An Open Dataset of Human-Labeled Sound Events"

The paper presents FSD50K, an open dataset designed to support research in sound event recognition (SER). It is a comprehensive collection of 51,197 audio clips taken from the Freesound platform, encompassing a wide range of everyday sounds. The primary motivation for developing FSD50K was to address the limitations of existing datasets in the SER domain, particularly regarding their accessibility and licensing issues. FSD50K is distinguished by its open nature, leveraging Creative Commons-licensed audio, which allows free distribution of the dataset.

Dataset Characteristics

FSD50K covers 200 classes drawn from the AudioSet Ontology, an established hierarchical taxonomy for categorizing sound events. The dataset's development involved a meticulous process of data acquisition, validation, and refinement, and the authors underscore the importance of adapting the creation methodology to the specific characteristics of the source data. Freesound's diverse sound clips, combined with the ontology's detailed structure, facilitated the construction of a rich dataset that balances broad coverage with class specificity.
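In practice, the released ground truth takes the form of CSV files (for example a dev.csv) in which each row pairs a clip's filename with its comma-separated labels and AudioSet machine IDs. As a rough sketch of working with that multi-label format, here is a parser over an invented two-row sample; the real files ship with the dataset and may differ in detail:

```python
import csv
import io

# Hypothetical two-row sample in the style of FSD50K's ground-truth CSVs
# (fname, comma-separated labels, AudioSet MIDs, split). Invented for
# illustration; consult the dataset's own documentation for the real schema.
sample = '''fname,labels,mids,split
64760,"Electric_guitar,Guitar,Music","/m/02sgy,/m/0342h,/m/04rlf",train
16399,"Dog,Bark","/m/0bt9lr,/m/05tny_",val
'''

def load_ground_truth(fileobj):
    """Parse a multi-label ground-truth CSV into {fname: set(labels)}."""
    reader = csv.DictReader(fileobj)
    return {row["fname"]: set(row["labels"].split(",")) for row in reader}

labels = load_ground_truth(io.StringIO(sample))
print(sorted(labels["64760"]))  # ['Electric_guitar', 'Guitar', 'Music']
```

Keeping labels as sets per clip makes the multi-label nature of the task explicit: a clip may carry any number of co-occurring classes.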

Annotation Process and Data Quality

Annotation of the sound events in FSD50K was a pivotal task, involving both crowdsourcing and expert validation to ensure annotation quality. The paper emphasizes the complexity of annotating audio data, particularly given a diverse and large vocabulary. Measures such as manual validation and inter-rater agreement mechanisms were integral to minimizing label noise, although some inherent noise remains due to the challenges of exhaustive annotation.

The dataset is intended to be comprehensive and varied, with balanced representation across classes where feasible. The authors conducted quality assessments to measure potential label noise, identifying and addressing issues such as missing or incorrect labels. This assessment is crucial as it directly affects the reliability of evaluations performed using the dataset.
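The validation stage rests on aggregating multiple human judgments per candidate annotation. As a toy illustration of that idea (not the authors' exact procedure, whose response scale and thresholds differ), a majority-vote filter over rater responses could look like:

```python
from collections import Counter

def aggregate_votes(votes, min_votes=2, min_agreement=0.66):
    """Keep a (clip, label) pair only if enough raters agree it is present.

    votes: dict mapping (clip, label) -> list of rater responses such as
    "present" / "not_present" / "unsure". Purely illustrative; FSD50K's
    validation tool uses a finer response scale and its own thresholds.
    """
    accepted = set()
    for key, responses in votes.items():
        counts = Counter(responses)
        n = len(responses)
        if n >= min_votes and counts["present"] / n >= min_agreement:
            accepted.add(key)
    return accepted

votes = {
    ("clip1", "Dog"): ["present", "present", "unsure"],
    ("clip2", "Dog"): ["present", "not_present", "not_present"],
}
print(aggregate_votes(votes))  # {('clip1', 'Dog')}
```

The thresholds trade off label precision against coverage: raising them reduces label noise but discards more candidate annotations.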

Experimental Evaluation

The authors benchmarked multiple sound event tagging models using the FSD50K dataset. The results highlight that smaller, audio-informed models often outperformed larger, conventional computer vision architectures. This observation suggests that audio-specific considerations and architectural choices are critical in achieving optimal performance for SER tasks.
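Multi-label SER systems of this kind are commonly scored with macro-averaged average precision (mAP), among the metrics reported for the FSD50K baselines. A self-contained sketch of that computation:

```python
def average_precision(y_true, y_score):
    """AP for one class: mean of the precision values at each true
    positive, with predictions ranked by descending score."""
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(Y_true, Y_score):
    """Macro average of per-class AP; rows are clips, columns are classes."""
    n_classes = len(Y_true[0])
    aps = [average_precision([row[c] for row in Y_true],
                             [row[c] for row in Y_score])
           for c in range(n_classes)]
    return sum(aps) / n_classes

# One class, four clips: positives ranked 1st and 3rd.
print(round(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]), 3))  # 0.833
```

Macro averaging weights every class equally, which matters for a vocabulary as imbalanced as 200 everyday sound classes.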

FSD50K also offers insights into dataset contamination and its effect on evaluation. Experiments quantifying within-class and between-class contamination provide guidance on splitting Freesound audio data for SER, since clips derived from the same original recording or uploader can leak information across partitions.
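One way to guard against such leakage, sketched below under the assumption that each clip is tagged with its uploader, is to assign whole uploaders to a single partition. This is illustrative only, not the authors' exact splitting algorithm:

```python
import random

def split_by_uploader(clips, test_fraction=0.2, seed=0):
    """Assign entire uploaders to train or test so no uploader's clips
    straddle the split (illustrative sketch, not the paper's procedure).

    clips: list of (clip_id, uploader) pairs.
    """
    uploaders = sorted({u for _, u in clips})
    rng = random.Random(seed)
    rng.shuffle(uploaders)
    n_test = max(1, int(len(uploaders) * test_fraction))
    test_uploaders = set(uploaders[:n_test])
    train = [c for c, u in clips if u not in test_uploaders]
    test = [c for c, u in clips if u in test_uploaders]
    return train, test

clips = [("a", "u1"), ("b", "u1"), ("c", "u2"), ("d", "u3")]
train, test = split_by_uploader(clips)
# Every uploader's clips land entirely on one side of the split.
```

Grouped splits like this cost some control over exact split sizes, but they are what makes held-out performance a trustworthy estimate when many clips share a source.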

Practical Implications and Future Directions

The FSD50K dataset is designed to serve as a robust benchmark for SER research. Its openness, comprehensive annotations, and compatibility with the AudioSet Ontology position it as a valuable resource for the community. The authors highlight potential applications, including multi-label sound event classification, domain adaptation, and hierarchical classification studies.

Looking ahead, opportunities for dataset expansion and vocabulary enhancement are anticipated. The authors foresee the application of semi-automatic methods and models trained on FSD50K to further scale the dataset efficiently, addressing ongoing challenges in audio annotation and dataset curation.

In summary, FSD50K stands as a noteworthy contribution to the sound event recognition field, offering an open, human-labeled dataset that addresses the specific challenges of accessibility, completeness, and quality assurance in sound datasets. It sets a substantive groundwork for future advancements in machine listening research.