The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms (2009.11644v1)

Published 24 Sep 2020 in cs.SD and eess.AS

Abstract: Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. However, there is currently no validated database of cough sounds with which to train such ML models. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. First, we filtered the dataset using our open-sourced cough detection algorithm. Second, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. Finally, we ensured that coughs labeled as symptomatic and COVID-19 originate from countries with high infection rates, and that their expert labels are consistent. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world's most urgent health crises.

Authors (3)
  1. Lara Orlandic (7 papers)
  2. Tomas Teijeiro (23 papers)
  3. David Atienza (63 papers)
Citations (236)

Summary

  • The paper introduces a dataset of over 20,000 cough recordings that supports robust ML model development for respiratory diagnosis.
  • It employs a filtering mechanism using 68 audio features and pulmonologist annotations to accurately distinguish cough sounds and label respiratory conditions.
  • The dataset’s diverse, globally-sourced recordings and rigorous COVID-19 validation pave the way for scalable, mobile diagnostic applications.

Analysis and Impact of the COUGHVID Crowdsourcing Dataset for Cough Analysis Algorithms

The paper "The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms" presents a dataset intended to advance ML methods for cough audio signal classification, particularly for diagnosing respiratory conditions such as COVID-19. With over 20,000 cough recordings and accompanying metadata on subject age, gender, geographic location, and COVID-19 status, it supports training models for COVID-19 classification and for detecting other respiratory abnormalities.

Dataset Composition and Methodology

The dataset was assembled through a purpose-built crowdsourcing application designed to keep the recording process simple for contributors. The recordings were then filtered with the authors' open-sourced cough detection model, a trained ML classifier that separates cough sounds from non-cough sounds. This model uses a set of 68 audio features derived from frequency and energy envelope peak detection analyses.
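The filtering step described above can be sketched as a simple threshold on a classifier score. This is a minimal illustration, not the authors' implementation: the `Recording` type, the `cough_probability` field, and the 0.8 threshold are all assumptions made for the example, and the scores are invented.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    uuid: str
    # Hypothetical output of a cough detection classifier, in [0, 1].
    cough_probability: float


def filter_coughs(recordings, threshold=0.8):
    """Keep recordings whose cough probability meets the threshold.

    The default threshold is an illustrative assumption, not a value
    taken from the paper.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return [r for r in recordings if r.cough_probability >= threshold]


# Invented example scores, for illustration only.
recordings = [
    Recording("a", 0.95),
    Recording("b", 0.10),
    Recording("c", 0.80),
]
kept = filter_coughs(recordings)
print([r.uuid for r in kept])
```

Raising the threshold trades recall for precision: fewer non-cough recordings slip through, but more genuine coughs are discarded.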

Further distinguishing the COUGHVID dataset is the contribution of pulmonologists who annotated over 2,000 recordings with labels indicating specific respiratory anomalies such as dyspnea and wheezing, as well as diagnostic impressions like COVID-19, asthma, and healthy status. This expert annotation provides a rich source of ground truth for developing models with medical relevance.

Significance of the Dataset

The COUGHVID dataset stands out for its scale and diversity, comprising recordings from subjects of many ages, genders, and geographic locations. This diversity matters for building ML models that generalize across populations when diagnosing respiratory ailments. For applications such as COVID-19 screening, it also helps ensure that models are trained on data closer to what they would encounter in real-world use.

A key methodological step is validating that coughs labeled as symptomatic or COVID-19 originated from countries with high infection rates at the time of collection, and that their expert labels are consistent. This check makes it more plausible that the labels reflect actual COVID-19 cases, which in turn strengthens confidence in ML models trained on the dataset.
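One way to picture the geographic check described above is to compare each COVID-19-labeled recording against an incidence table keyed by country and date. The sketch below is an assumption-laden illustration: the `LabeledCough` type, the `incidence` lookup, the `min_incidence` cutoff, and all sample values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LabeledCough:
    uuid: str
    country: str
    recorded_on: date
    status: str  # e.g. "COVID-19", "symptomatic", "healthy"


def check_covid_labels(coughs, incidence, min_incidence):
    """Split COVID-19-labeled coughs into plausible and questionable.

    `incidence` maps (country, date) to a hypothetical infection rate.
    A label is plausible when the rate is known and >= min_incidence.
    """
    plausible, questionable = [], []
    for c in coughs:
        if c.status != "COVID-19":
            continue  # only COVID-19 labels are checked in this sketch
        rate = incidence.get((c.country, c.recorded_on))
        if rate is not None and rate >= min_incidence:
            plausible.append(c.uuid)
        else:
            questionable.append(c.uuid)
    return plausible, questionable


# Invented sample data, for illustration only.
day = date(2020, 4, 15)
coughs = [
    LabeledCough("r1", "AA", day, "COVID-19"),
    LabeledCough("r2", "BB", day, "COVID-19"),
    LabeledCough("r3", "AA", day, "healthy"),
]
incidence = {("AA", day): 120.0, ("BB", day): 3.0}
plausible, questionable = check_covid_labels(coughs, incidence, min_incidence=50.0)
print(plausible, questionable)
```

Recordings with no incidence entry are treated as questionable rather than dropped, so they stay visible for manual review.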

Implications and Future Directions

The dataset's impact extends beyond immediate COVID-19 diagnosis tools. It offers foundational data for developing models that can discern among multiple respiratory conditions, potentially leading to a unified diagnostic tool that considers various pathologies. Additionally, the availability of detailed metadata facilitates research into the interplay of demographic factors with respiratory sound characteristics.

Looking forward, the establishment of a private test set underscores a commitment to rigorous evaluation and reproducibility in research outcomes. Such protocols ensure that continued advances stemming from this dataset remain robust and verifiable.

From a practical perspective, the COUGHVID dataset paves the way for mobile-based diagnostic applications, democratizing access to respiratory health assessment tools, particularly in resource-constrained settings. The potential to integrate these models into telemedicine services could significantly enhance remote monitoring and diagnosis capabilities.

Conclusion

Overall, the COUGHVID dataset is a valuable resource for the machine learning community, offering both scale and expert-labeled depth. COVID-19 screening is its most immediate application, but its usefulness for respiratory health diagnostics more broadly gives it lasting relevance. The data collection, filtering, and annotation procedures described in the paper support both research in audio signal processing and practical improvements in healthcare delivery.
