
COUGHVID: Crowdsourced Cough Audio Dataset

Updated 2 September 2025
  • COUGHVID is a comprehensive, crowdsourced dataset of over 20,000 cough recordings enriched with detailed demographic and expert medical annotations.
  • It utilizes advanced signal processing and a dual-stage filtering approach—combining automated detection with expert review—to ensure high-quality data.
  • The dataset underpins ML model benchmarking in respiratory diagnostics, demonstrating robust performance in COVID-19 screening and other respiratory conditions.

The COUGHVID crowdsourced dataset is one of the most widely used open-access corpora for the study of cough-based audio biomarkers, particularly in the context of data-driven respiratory diagnostics and COVID-19 screening. Established in 2020 at EPFL, COUGHVID is a large-scale, demographically diverse repository of cough sound recordings, enriched with user metadata and expert medical annotations, and designed to support machine learning (ML) research on automated acoustic analysis of coughs for clinical and epidemiological applications (Orlandic et al., 2020). The corpus has become foundational for developing and evaluating signal processing and ML algorithms for COVID-19 and other respiratory conditions.

1. Dataset Scope and Demographics

COUGHVID contains over 20,000 crowdsourced cough audio recordings, including more than 1,000 samples where users claimed a COVID-19 diagnosis (Orlandic et al., 2020). Its demographic metadata includes:

  • Age, with an average of 34.4 years (σ = 12.8), and gender distribution (65.5% male, 33.8% female).
  • Self-reported geographic location with precision-limited geolocation (rounded to 0.1°).
  • Health status (healthy, symptomatic, or COVID-positive) and history of pre-existing respiratory conditions.

The recording protocol was designed for global accessibility, utilizing a web interface that enabled rapid, one-click submissions from a range of devices and locations.

2. Collection Methodology and Data Quality Control

Crowdsourcing via an online interface minimized collection steps and provided explicit instructions to ensure user safety. To improve dataset integrity, COUGHVID employs a two-stage filtering approach:

  • Automated Cough Detection: All submissions undergo preprocessing (lowpass filtering at 6 kHz and downsampling to 12 kHz), followed by extraction of 68 audio features: 40 features described by Pramono et al., 19 Energy Envelope Peak Detection features (Chatrzarrin et al.), plus signal length and selected power spectral density (PSD) features (Orlandic et al., 2020). An eXtreme Gradient Boosting (XGB) classifier, hyperparameter-tuned with Tree-structured Parzen Estimators (TPE) and evaluated with 10-fold cross-validation, outputs a "cough_detected" probability in [0, 1]; samples with probability ≥ 0.8 are accepted as valid coughs.
  • Expert Annotation: Over 2,000 recordings were independently labeled by three pulmonologists. Each annotation includes overall sound quality, cough type (wet/dry/unknown), presence of symptoms such as dyspnea, wheezing, stridor, and a clinical impression (e.g., respiratory tract infection, obstructive disease, COVID-19, healthy), along with severity grading (pseudocough, mild, severe, unknown).
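The first filtering stage described above can be sketched as follows. This is a minimal illustration, not the released COUGHVID code: the paper specifies the 6 kHz cutoff, the 12 kHz target rate, and the 0.8 acceptance threshold, while the filter order (4th-order Butterworth) and the synthetic test signal are assumptions for the example.

```python
import numpy as np
from math import gcd
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess(audio, fs, cutoff_hz=6000, target_fs=12000):
    """Lowpass at 6 kHz, then downsample to 12 kHz, as in the COUGHVID pipeline.
    The 4th-order Butterworth filter is an assumption; the dataset paper
    specifies only the cutoff frequency and target sampling rate."""
    sos = butter(4, cutoff_hz / (fs / 2), btype="low", output="sos")
    filtered = sosfiltfilt(sos, audio)
    g = gcd(int(fs), int(target_fs))  # rational resampling fs -> target_fs
    return resample_poly(filtered, target_fs // g, fs // g)

def accept_cough(cough_detected, threshold=0.8):
    """Stage-one gate: keep samples whose classifier probability is >= 0.8."""
    return cough_detected >= threshold

fs = 48000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s stand-in recording
y = preprocess(x, fs)
print(len(y))  # 12000 samples after resampling to 12 kHz
```

In the real pipeline, `cough_detected` comes from the XGB classifier trained on the 68-feature vector; here it is simply a probability passed in.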

Inter-rater agreement was analyzed via stratified overlap (15% of recordings were jointly annotated), supporting rigorous downstream analysis and highlighting the importance of annotation consistency and potential for mislabeling (Orlandic et al., 2022).

3. Signal Processing and Feature Engineering

Key technical details in COUGHVID's pipeline include:

  • Use of signal preprocessing (lowpass filtering, downsampling) to suppress high-frequency noise and variability, increasing the signal-to-noise ratio for cough identification.
  • Feature vector composition spans multi-domain representations (time, frequency, energy envelope, and PSD bands) as well as spectral features, supporting discussions in both baseline (Orlandic et al., 2020) and advanced modeling works (Haritaoglu et al., 2022).
  • For downstream ML applications, features such as mel-frequency cepstral coefficients (MFCCs), spectral contrast, and derived moments provide input for shallow (e.g., logistic regression, random forest) and deep (convolutional neural network, transformer) architectures.
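A toy version of such a multi-domain feature vector is sketched below. The band edges and feature names are illustrative assumptions, not the exact COUGHVID feature definitions; the point is only that time-domain, energy, and PSD-band features are combined into one vector.

```python
import numpy as np
from scipy.signal import welch

def extract_features(audio, fs=12000):
    """Toy multi-domain feature vector: time, energy, and PSD-band features.
    Band edges here are illustrative, not the published COUGHVID definitions."""
    feats = {}
    feats["length_s"] = len(audio) / fs                 # signal length (time domain)
    feats["rms"] = float(np.sqrt(np.mean(audio ** 2)))  # energy proxy
    f, psd = welch(audio, fs=fs, nperseg=1024)          # power spectral density
    for lo, hi in [(0, 500), (500, 1500), (1500, 3000), (3000, 6000)]:
        band = psd[(f >= lo) & (f < hi)]
        feats[f"psd_{lo}_{hi}"] = float(np.sum(band))   # per-band power
    return feats

fs = 12000
tone = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)    # 1 kHz stand-in signal
fv = extract_features(tone, fs)
print(sorted(fv))  # feature names in the vector
```

For a 1 kHz tone, nearly all spectral power lands in the 500–1500 Hz band, which makes the per-band features easy to sanity-check.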

The automated filtering and feature engineering pipeline ensures that the publicly released corpus is suitable for reproducible acoustic research and ML analysis.

4. Machine Learning Benchmarking and Impact

COUGHVID has supported the development and benchmarking of a wide array of ML models, as demonstrated in works such as:

  • The Virufy paper (Chaudhari et al., 2020), leveraging ensemble neural networks (MFCCs, CNN-extracted spectrograms, and clinical feature predictors) and reporting ROC-AUC of 77.1% for COVID-19 detection on the combined Coswara/COUGHVID set.
  • Large-scale deep learning (Haritaoglu et al., 2022), where self-supervised transformer models and CNNs are trained on aggregated sets with COUGHVID as a core resource, reporting ROC-AUCs of 0.807 (SSL) and 0.802 (CNN) for COVID-19 classification.
  • Demonstrations of baseline performance, e.g., random forest and SVM models based on hand-crafted features, achieving moderate but robust results.
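A baseline of the third kind can be sketched in a few lines. The synthetic features below are a stand-in, not real COUGHVID data; the 68-dimensional shape mirrors the dataset's feature count, but everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for hand-crafted cough features (NOT real COUGHVID data).
X, y = make_classification(n_samples=600, n_features=68, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Random-forest baseline scored by ROC-AUC, the metric used throughout
# the COVID-19 screening literature cited above.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"ROC-AUC on synthetic features: {auc:.2f}")
```

The same evaluation scaffold applies unchanged when the synthetic matrix is replaced by real extracted features and self-reported or expert labels.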

Emergent from these studies is the pivotal role of dataset scale and diversity—performance and generalizability increase as more varied and demographically comprehensive data are included. Performance consistently degrades when training set size is artificially reduced, indicating the necessity of large-scale corpus aggregation (Haritaoglu et al., 2022).

5. Semi-Supervised Re-labeling and Dataset Consistency

COUGHVID's utility is further enhanced by efforts to mitigate imperfect labeling and expert disagreement. A semi-supervised learning (SSL) approach leverages multiple expert annotations:

  • Individual expert models are trained and used to propagate high-confidence pseudo-labels to the unlabeled or ambiguous samples.
  • Aggregation schemes (universal, expert, majority) are evaluated; the majority agreement yields the optimal compromise between coverage and label consistency (Orlandic et al., 2022).
  • The SSL pipeline results in a threefold improvement in class separability and distinct spectral differences, statistically significant in the 1–1.5 kHz band (p = 1.2 × 10⁻⁶⁴), between COVID-19 and healthy coughs, as measured by power spectral density analysis.
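The majority-agreement aggregation step can be sketched as a simple vote over per-expert pseudo-labels. The function and label names below are hypothetical illustrations; the published pipeline operates on expert-model outputs rather than raw strings.

```python
from collections import Counter

def majority_label(expert_labels, min_agreement=2):
    """Aggregate per-expert pseudo-labels by majority vote.
    Returns None when no label reaches the agreement threshold,
    so ambiguous samples stay unlabeled rather than mislabeled."""
    votes = Counter(label for label in expert_labels if label is not None)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_agreement else None

# Three hypothetical expert models voting on four samples.
samples = [
    ["covid", "covid", "healthy"],    # 2-of-3 agreement -> kept
    ["healthy", "healthy", "healthy"],  # unanimous -> kept
    ["covid", "healthy", None],       # tie -> discarded
    [None, None, None],               # no confident votes -> discarded
]
print([majority_label(s) for s in samples])
# -> ['covid', 'healthy', None, None]
```

This captures the trade-off the study reports: majority voting keeps fewer samples than the "universal" scheme but yields more consistent labels than any single expert.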

Training classifiers on the SSL-re-labeled subset leads to a 29.5% improvement in AUC over models using only user-provided labels. This methodology supports the creation of higher-quality training data for diagnostic ML models, with general applicability to other medical sound classification tasks.

6. Comparative Assessment and Future Prospects

Relative to other datasets (e.g., Coswara, Sound-Dr, UK COVID-19 Vocal Audio), COUGHVID is primarily cough-centric with less extensive per-recording metadata, shorter average recording durations, and a lower sampling rate (typically 22,050 Hz) (Hoang et al., 2022). However, its demographic breadth and expert-labeled subset have made it central to benchmarking in the field.

Improvements suggested for future releases and related efforts include:

  • Expansion and refinement of medical expert annotations to standardize label quality.
  • Collection of multi-modal data (e.g., breathing, speech) and richer clinical metadata, as exemplified by Sound-Dr and the UK COVID-19 Vocal Audio Dataset (Budd et al., 2022).
  • Integration of uncertainty estimation, cost-sensitive learning, and deep ensemble approaches to address data imbalance and prediction reliability (Chang et al., 2022).

As evidenced by incremental transfer learning studies (Vhaduri et al., 2023), even models initially trained on healthy coughs can be rapidly adapted to COVID-19 detection scenarios as annotated patient data become available, reducing the barrier to rapid deployment in emergent situations.

7. Significance and Limitations

The COUGHVID crowdsourced dataset is foundational for research in audio-based respiratory health diagnostics. Its global scale, real-world diversity, and inclusion of expert-labeled subsets enable robust training and evaluation of ML models for cough analysis and disease classification. Nonetheless, it is constrained by the limitations of self-reported labels and annotation discordance, a challenge partially mitigated by SSL relabeling and advances in feature analysis.

COUGHVID's open-access model catalyzed international research and underpinned the development of numerous algorithms, tools, and benchmarks in the ongoing challenge of COVID-19 and respiratory disease identification. Continuing efforts to expand data quality, coverage, and annotation consistency will further consolidate its role in the bioacoustic and ML research communities.