COUGHVID: Cough Audio Dataset
- COUGHVID is a comprehensive cough audio dataset featuring 20,000+ recordings with detailed metadata and expert annotations for respiratory diagnostics.
- It employs an automated filtering pipeline using an XGB classifier and 68 audio features to ensure high-quality cough detection.
- Expert annotations and epidemiological cross-validation enhance its utility for machine learning research, COVID-19 screening, and broader respiratory studies.
The COUGHVID dataset is a large-scale, crowdsourced collection of cough audio recordings designed to support research in the automatic analysis and classification of cough sounds, with particular emphasis on COVID-19 detection and respiratory disease screening. Developed during the COVID-19 pandemic, the dataset and its accompanying annotation pipeline address the need for a diverse, validated, and extensible cough sound corpus, facilitating advances in respiratory acoustics and in machine learning and deep learning methods for digital health applications.
1. Dataset Construction and Demographic Characteristics
The COUGHVID dataset comprises over 20,000 cough recordings collected globally via a web application. The data spans a broad demographic: reported ages average approximately 34.4 years (standard deviation 12.8), and the gender split is 65.5% male and 33.8% female. Subjects are distributed across a wide geographic range; self-reported geolocation information is available (rounded to one decimal place to preserve privacy), and cross-referencing with contemporaneous COVID-19 infection statistics confirms that COVID-19–labeled samples predominantly originate from countries experiencing high recent infection rates (Orlandic et al., 2020).
Recordings capture up to 10 seconds of audio using smartphone microphones. Metadata collection is integral, with users prompted to record age, gender, self-reported health status (e.g., COVID-19, symptomatic, healthy), and, optionally, geolocation. The dataset reflects both the technical heterogeneity typical of crowdsourced data (microphone type, recording environment) and substantial class imbalance, with relatively few COVID-19–positive cases versus healthy controls.
2. Collection Workflow, Filtering, and Data Quality
The web application implements a “one recording, one click” protocol to minimize friction for user submissions. To ensure audio validity and minimize contamination by speech, background noise, or other non-cough artifacts, the dataset employs an automated filtering pipeline based on an open-source cough detection algorithm. This detector uses an eXtreme Gradient Boosting (XGB) classifier trained on both cough and non-cough exemplars; recordings are first low-pass filtered (6 kHz cutoff) and downsampled to 12 kHz. A comprehensive set of 68 features, spanning duration, spectral energy, and energy envelope peaks (following methods by Pramono et al. and Chatrzarrin et al.), is extracted and used for classification. Recordings yielding a cough detection score below 0.8 are generally excluded from the main analysis set (Orlandic et al., 2020).
The classifier outputs a probabilistic estimate of the likelihood that a sample contains a true cough, with ROC analysis validating its reliability. This automated selection is critical for scaling expert annotation and downstream machine learning tasks.
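The following is a minimal sketch of this detection flow, not the project's actual implementation: the file names and model file are placeholders, and the short feature vector computed here merely stands in for the full 68-dimensional set described above.

```python
import numpy as np
import librosa
import xgboost as xgb
from scipy.signal import butter, sosfilt, welch, find_peaks

CUTOFF_HZ = 6_000   # low-pass cutoff applied before downsampling
TARGET_SR = 12_000  # sample rate used by the detector
THRESHOLD = 0.8     # recordings scoring below this are excluded

def preprocess(path: str) -> np.ndarray:
    """Load audio, low-pass filter at 6 kHz, resample to 12 kHz."""
    y, sr = librosa.load(path, sr=None, mono=True)  # assumes sr > 12 kHz
    sos = butter(10, CUTOFF_HZ, btype="low", fs=sr, output="sos")
    return librosa.resample(sosfilt(sos, y), orig_sr=sr, target_sr=TARGET_SR)

def extract_features(y: np.ndarray) -> np.ndarray:
    """Illustrative stand-in for the 68 detector features:
    duration, envelope peak count, and power in 8 bands up to Nyquist."""
    duration = len(y) / TARGET_SR
    freqs, psd = welch(y, fs=TARGET_SR, nperseg=1024)
    # 8 equal-width 750 Hz bands (the real bands are custom).
    band_powers = [psd[(freqs >= lo) & (freqs < lo + 750)].sum()
                   for lo in range(0, 6000, 750)]
    envelope = np.abs(y)
    peaks, _ = find_peaks(envelope, height=0.1 * envelope.max())
    return np.array([duration, len(peaks), *band_powers])

# Placeholder model file, assumed trained on this same feature layout.
model = xgb.XGBClassifier()
model.load_model("cough_detector.json")

y = preprocess("recording.webm")
score = model.predict_proba(extract_features(y).reshape(1, -1))[0, 1]
keep = score >= THRESHOLD  # retain only confident cough detections
```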
3. Annotation and Labeling by Clinical Experts
A subset of more than 2,000 recordings was manually annotated by three experienced pulmonologists. The annotation protocol, implemented as an online spreadsheet with integrated audio playback, required experts to rate several dimensions (a minimal schema sketch follows this list):
- Signal quality (Good, Ok, Poor, No cough present)
- Cough type (Dry, Wet, Can’t tell)
- Presence of acoustic markers: dyspnea, wheezing, stridor, choking, nasal congestion
- Overall diagnostic impression (URTI, LRTI, obstructive lung disease, COVID-19, healthy) and severity
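As a rough illustration only, one expert annotation record could be modeled as below; the field names and value sets are hypothetical stand-ins mirroring the dimensions above, not the released dataset's own column names.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value sets mirroring the annotation protocol above.
QUALITY = {"good", "ok", "poor", "no_cough"}
COUGH_TYPE = {"dry", "wet", "cant_tell"}
DIAGNOSIS = {"URTI", "LRTI", "obstructive_disease", "COVID-19", "healthy"}

@dataclass
class ExpertAnnotation:
    recording_id: str
    quality: str              # one of QUALITY
    cough_type: str           # one of COUGH_TYPE
    dyspnea: bool             # acoustic markers rated individually
    wheezing: bool
    stridor: bool
    choking: bool
    nasal_congestion: bool
    diagnosis: Optional[str]  # one of DIAGNOSIS, None if undecidable
    severity: Optional[str]   # e.g., "mild" / "severe"

ann = ExpertAnnotation("uuid-0001", "good", "dry",
                       False, False, False, False, True,
                       "URTI", "mild")
```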
Recordings for labeling were selected using stratified sampling by self-reported health status (25% COVID-19, 35% symptomatic, 25% healthy, 15% unknown). To evaluate inter-rater reliability, 15% of the annotated subset was rated by all three experts. Fleiss' kappa was used to quantify agreement, with moderate reliability observed for some features (e.g., nasal congestion) and lower agreement for diagnosis, highlighting the intrinsic difficulty of audio-only diagnosis (Orlandic et al., 2020).
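For reference, a sketch of how such agreement can be computed with statsmodels; the ratings below are toy values, not the published annotations.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy example: 6 recordings rated by 3 experts on cough type,
# encoded as 0 = dry, 1 = wet, 2 = can't tell.
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [0, 1, 0],
    [2, 2, 2],
    [1, 1, 1],
    [0, 2, 1],
])

# aggregate_raters turns (subjects x raters) labels into the
# (subjects x categories) count table that fleiss_kappa expects.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```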
4. Dataset Validation and Epidemiological Consistency
To validate self-reported COVID-19 (and symptomatic) labels and minimize mislabeling, geolocation metadata was cross-referenced with contemporaneous WHO COVID-19 infection rates, normalized by UN population data. At the time of collection, 94.4% of COVID-19–labeled and 91.3% of symptomatic recordings were confirmed to originate from regions with >20 daily new cases per million in the prior two weeks. This epidemiological triangulation increases the likelihood that positive instances in the dataset correspond to genuine COVID-19 cases rather than spurious entries (Orlandic et al., 2020).
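A minimal sketch of this plausibility check, assuming hypothetical column names for the WHO case counts and UN population figures:

```python
import pandas as pd

def plausible_covid_label(country: str, date: pd.Timestamp,
                          cases: pd.DataFrame, population: dict) -> bool:
    """True if `country` averaged >20 daily new cases per million
    over the two weeks preceding `date`."""
    window = cases[(cases.country == country)
                   & (cases.date > date - pd.Timedelta(days=14))
                   & (cases.date <= date)]
    per_million = window.new_cases.mean() / (population[country] / 1e6)
    return per_million > 20

# Toy inputs standing in for the WHO and UN data.
cases = pd.DataFrame({
    "country": ["CH"] * 14,
    "date": pd.date_range("2020-04-01", periods=14),
    "new_cases": [900] * 14,
})
population = {"CH": 8_600_000}
# 900 / 8.6 ≈ 105 daily cases per million, so the label is plausible.
print(plausible_covid_label("CH", pd.Timestamp("2020-04-14"), cases, population))
```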
Furthermore, the consistency of expert annotations with self-reported, epidemiologically plausible labels provides a degree of cross-validation, though nontrivial disagreement remains a known challenge.
5. Technical Specifications and Data Format
Each audio file is distributed in either OGG or WEBM format (Opus codec, 48 kHz sample rate), accompanied by a JSON sidecar file encoding metadata. Preprocessing steps during detection include low-pass filtering and resampling; feature extraction encompasses 40 features following Pramono et al., 19 energy envelope features, signal duration, and power spectral densities in 8 custom frequency bands (68 features in total). The label data structure supports both categorical and continuous features, facilitating multi-label and regression analyses (Orlandic et al., 2020).
| Property | Value(s) | Notes |
|---|---|---|
| Audio format | OGG/WEBM (Opus codec), 48 kHz | WEBM/OGG conversion for public release |
| Features | 68 dimensions per sample | Combines spectral, temporal, and envelope-based features |
| Labels | Self-reported, expert-annotated, automated | Includes cough type, quality, diagnosis |
| Metadata | Age, gender, location, health status | Geolocation rounded, privacy-preserving |
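To illustrate the distribution format, a sketch that pairs a recording with its JSON sidecar; the file name and metadata keys are hypothetical, and decoding Opus-coded OGG/WEBM via librosa requires an ffmpeg-backed audioread install.

```python
import json
import librosa

uuid = "0001-example"  # hypothetical recording identifier

# JSON sidecar: age, gender, rounded geolocation, self-reported status.
with open(f"{uuid}.json") as f:
    meta = json.load(f)

# Opus-coded audio at 48 kHz; librosa decodes via audioread/ffmpeg.
y, sr = librosa.load(f"{uuid}.webm", sr=48_000, mono=True)

print(meta.get("age"), meta.get("gender"), meta.get("status"))
print(f"{len(y) / sr:.2f} s at {sr} Hz")
```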
6. Applications, Impact, and Research Use-Cases
The COUGHVID dataset is a foundational resource for machine learning analysis of cough audio in respiratory condition screening. By combining automated filtering, comprehensive metadata, and expert-annotated diagnostic classes, COUGHVID serves as a primary benchmark for both traditional classifiers (logistic regression, random forests) and state-of-the-art deep neural networks (CNNs, transformers) (Orlandic et al., 2020). The dataset enables:
- Multi-class discrimination (COVID-19, symptomatic, healthy)
- Cough type and severity estimation
- Feature engineering and machine learning pipeline comparisons (see the baseline sketch after this list)
- Large-scale population and epidemiologically relevant analyses
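As one concrete starting point, a hedged baseline sketch: a traditional classifier trained on 68-dimensional feature vectors against binarized self-reported status. The random matrices below are synthetic stand-ins for where the real features and labels would be loaded.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for the real data: X would be the 68-dim feature matrix,
# y the binarized status (1 = COVID-19, 0 = healthy).
X = rng.normal(size=(1000, 68))
y = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# class_weight="balanced" because real labels are heavily skewed.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"ROC-AUC: {auc:.3f}")  # ~0.5 on this random stand-in data
```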
Its open-source tooling and transparent curation support extensions, as demonstrated in later research utilizing transfer learning, semi-supervised label refinement, and multi-modal fusion with clinical metadata. COUGHVID supports not only COVID-19 screening but also more general research into robust, low-cost, and scalable respiratory diagnostics delivered via smartphones and telehealth platforms.
7. Limitations and Future Directions
The COUGHVID dataset’s reliance on self-reported labels, crowdsourced metadata, and browser-based user recruitment introduces known sources of label noise, demographic imbalance, and variability in acoustic conditions. Inter-expert reliability remains moderate at best for many annotation dimensions, and the prevalence of asymptomatic or ambiguous coughs further complicates robust classification boundaries.
As field standards and benchmarking practices evolve, several directions for future development are clear:
- Expansion with more clinically validated, PCR-confirmed labels to increase ground truth reliability
- Adoption of advanced semi-supervised learning and expert agreement protocols to harmonize annotations and minimize mislabeling
- Integration with multi-modal data (e.g., breathing, speech, clinical metadata)
- Support for continual, active learning frameworks as new variants and conditions emerge
Despite these challenges, the dataset remains an essential resource for algorithmic research into cough analysis and acoustics-driven respiratory disease screening on a global scale.