Automated Cough Detection
- Automated cough detection is an interdisciplinary field that employs biosensors, signal processing, and machine learning to identify cough events accurately.
- It integrates acoustic and inertial sensor data with preprocessing steps and advanced feature extraction to achieve high sensitivity and specificity.
- This technology enables efficient, on-device respiratory monitoring for clinical applications, public health surveillance, and personalized medicine.
Automated cough detection refers to the algorithmic identification and segmentation of cough events from continuous sensor data streams. This is a multidisciplinary field intersecting biomedical signal processing, embedded systems, and machine learning. Its primary motivation arises from the clinical importance of objective cough monitoring for respiratory disease management, public health surveillance, and personalized medicine.
1. Sensor Modalities and Signal Acquisition
Modern automated cough detection systems leverage a broad spectrum of biosensor types for signal acquisition. The dominant modality remains acoustic—microphones (MEMS, condenser, lapel, throat-contact) are deployed in smartphones, wearables (earbuds, smartwatches), ambulatory monitors, or hospital/bedside recorders (Drugman et al., 2020, Jaiswal et al., 2024, Pahar et al., 2021, Ijaz et al., 2023). Microphone-based approaches offer high-fidelity cough signatures and straightforward integration with commercial hardware. However, privacy concerns, non-stationary ambient noise, and power/compute constraints have spurred interest in alternative and complementary sensors.
Kinematic sensing via inertial measurement units (IMUs: accelerometers, gyroscopes) enables energy-efficient, privacy-preserving “cough activator” pipelines running on commercial wearables and non-invasive room sensors (Zhang et al., 2021, Zhang et al., 2021, Albini et al., 2024, Pahar et al., 2021). Ancillary biosignals (ECG, thermistors for airflow, respiratory chest belts, contact microphones) provide further discriminatory power, especially in clinical or ICU settings (Drugman et al., 2019, Soliński et al., 2019). Hybrid multimodal architectures, such as Cough-E, combine microphones with IMUs to jointly optimize power, specificity, and real-time latency (Albini et al., 2024).
Recording protocols critically affect data diversity and quality. Controlled clinical and laboratory studies (e.g., instructions to cough, annotated non-coughs) contrast with unconstrained “free-living” or crowdsourced recordings, which offer ecological validity but introduce substantial class imbalance and annotation complexity (Jaiswal et al., 2024, Chaudhari et al., 2020, Leamy et al., 2019).
2. Preprocessing and Feature Extraction
Canonical preprocessing for audio encompasses amplitude normalization, bandpass filtering (typically 300 Hz to 3 kHz), framing into short windows (10–40 ms) with 50% overlap, energy-based voice activity detection, and manual or semi-automatic event segmentation (Drugman et al., 2020, Ijaz et al., 2023). Data augmentation (additive noise, pitch/time shifts, or advanced synthesis) is deployed both for class balancing and for robustness to real-world acoustic variability (Jaiswal et al., 2024). Similar pipeline elements are employed for IMU streams: moving-average filtering, Butterworth high-pass filtering for drift removal, and segmentation into short windows (e.g., 0.4 s at 50 Hz) (Zhang et al., 2021, Zhang et al., 2021).
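The framing and energy-gating steps described above can be illustrated with a minimal sketch in pure Python. The function names, the 20 ms frame length, and the -30 dB threshold are illustrative assumptions rather than values taken from the cited systems, and bandpass filtering is omitted for brevity.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sequence into overlapping frames; a trailing partial frame is dropped."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def peak_normalize(samples):
    """Scale the signal so that its largest absolute value is 1."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0 else list(samples)

def frame_energy_db(frame, floor=1e-12):
    """Mean-square energy of one frame in dB, floored to avoid log(0)."""
    energy = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(energy, floor))

def active_frames(samples, sample_rate, frame_ms=20, overlap=0.5, threshold_db=-30.0):
    """Return one boolean per frame: True where energy exceeds the threshold.

    frame_ms, overlap and threshold_db are assumed defaults for illustration.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = max(1, int(frame_len * (1.0 - overlap)))
    frames = frame_signal(peak_normalize(samples), frame_len, hop)
    return [frame_energy_db(f) > threshold_db for f in frames]
```

Frames flagged as active would then be passed on to feature extraction, while silent frames are discarded early to save computation.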
Feature engineering remains central. Widely adopted features for audio include Mel-frequency cepstral coefficients (MFCC, static + delta + delta-delta), log-Mel and log-energy spectrograms, zero-crossing rate, spectral centroid, flux, roll-off, harmonic-to-noise ratio, and Bark/Gammatone energies (Drugman et al., 2020, Leamy et al., 2021, Ijaz et al., 2023, Chaudhari et al., 2020). For kinematic data, salient features encompass root mean square, crest factor, kurtosis, zero-crossing rate, spectral coefficients via short-time Fourier transform, and various time-domain statistics (Albini et al., 2024, Pahar et al., 2021). Airflow signal features for spirometry-integrated systems comprise counts of post-peak flow spikes, crossings at fractional PEF (peak expiratory flow), and pronounced local maxima (Soliński et al., 2019).
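Several of the time-domain kinematic features named above have compact definitions. The following sketch computes them for a single window in pure Python; the function names are illustrative, and the kurtosis shown is the population (non-excess) form, which is one of several conventions and may differ from the one used in the cited work.

```python
import math

def rms(x):
    """Root mean square of a window."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def crest_factor(x):
    """Peak absolute amplitude divided by RMS (0.0 for an all-zero window)."""
    r = rms(x)
    return max(abs(v) for v in x) / r if r > 0 else 0.0

def kurtosis(x):
    """Population kurtosis m4 / m2^2 (a Gaussian gives about 3)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2) if m2 > 0 else 0.0

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(x) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(x) - 1)

def window_features(x):
    """Collect the features of one window into a dictionary."""
    return {
        "rms": rms(x),
        "crest_factor": crest_factor(x),
        "kurtosis": kurtosis(x),
        "zcr": zero_crossing_rate(x),
    }
```

A 0.4 s accelerometer window sampled at 50 Hz holds 20 samples, so `window_features` would be called once per 20-sample axis segment.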
Advanced systems apply feature selection via mutual information criteria or recursive feature elimination to reduce redundancy and computational burden while maximizing discriminability (Drugman et al., 2020, Albini et al., 2024). Time–frequency decompositions (STFT, wavelet, spectral subspaces) and deep-learned representations (e.g., through CNN front-ends) are increasingly prevalent (Leamy et al., 2021, Jaiswal et al., 2024, Vüren et al., 11 Mar 2026).
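As a rough illustration of mutual-information feature ranking, the sketch below estimates the mutual information between a discretized feature and the class label from empirical counts, then orders features by that score. It assumes features have already been binned into discrete values; the function names and the toy data are hypothetical and do not reproduce the selection procedures of the cited systems.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def rank_features(columns, labels):
    """Return feature names sorted from most to least informative."""
    scores = {name: mutual_information(vals, labels)
              for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: "a" mirrors the label, "b" is independent of it.
labels = [0, 0, 1, 1]
columns = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
ranking = rank_features(columns, labels)
```

Keeping only the top-ranked features reduces the per-frame computation on the device, at the cost of ignoring redundancy between features, which criteria such as recursive elimination address differently.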
3. Machine Learning Architectures
The field exhibits a progression from classical machine learning (k-NN, SVM, GMM, HMM, logistic regression, random forest) to deep neural paradigms (CNN, LSTM, RNN, hybrid CNN+LSTM, transformer-based models). Frame-level classifiers using engineered features and lightweight architectures (single-layer ANN, GMM with diagonal covariances) remain highly energy- and compute-efficient, achieving event-level sensitivity and specificity >94% in controlled studies (Drugman et al., 2020, Drugman et al., 2019). Hidden Markov Models exploit the temporal structure of cough (explosive-intermediate-voiced phases), with multivariate energy-band features outperforming univariate summary energy (Teyhouee et al., 2019).
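To make the frame-level GMM approach concrete, the following sketch scores one feature vector under two diagonal-covariance Gaussian mixtures (cough and non-cough) and thresholds the log-likelihood ratio. It is a minimal illustration under assumed, hand-picked parameters; training (for example by expectation-maximization) is not shown, and the function names are not taken from the cited work.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a mixture, using log-sum-exp for stability."""
    terms = [math.log(w) + diag_gaussian_logpdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def is_cough_frame(x, cough_gmm, other_gmm, threshold=0.0):
    """Classify a frame by the log-likelihood ratio of the two mixtures."""
    llr = gmm_loglik(x, *cough_gmm) - gmm_loglik(x, *other_gmm)
    return llr > threshold

# Hypothetical one-dimensional models: (weights, means, variances).
cough_gmm = ([0.5, 0.5], [[2.0], [3.0]], [[0.5], [0.5]])
other_gmm = ([1.0], [[0.0]], [[1.0]])
```

With diagonal covariances the per-frame cost grows linearly with the feature dimension, which is one reason such models remain attractive on constrained hardware. Raising `threshold` trades sensitivity for specificity.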
Convolutional architectures leverage spectrogram/MFCC “images” as inputs, with compact models (≤200k parameters) suitable for on-device inference on mobile SoCs (Bales et al., 2020, Jaiswal et al., 2024). Recurrent and hybrid models (LSTM, CNN+LSTM) capture cough-burst dynamics; transformers (e.g., XLS-R with reduced layers) deliver state-of-the-art performance on boundary segmentation and large-scale audio event detection, including in highly noisy clinical settings (Vüren et al., 11 Mar 2026).
Template matching with dynamic time warping (DTW) underpins a class of interpretable, low-power classifiers for IMU and accelerometer time-series, as exemplified by CoughTrigger and self-tuning multi-centroid classifiers. These approaches enable real-time operation with controllable sensitivity (via post-hoc thresholding or cluster template count) and sub-ms MCU inference latency (Zhang et al., 2021, Zhang et al., 2021).
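The idea of DTW template matching can be sketched as follows: a window is labeled a cough candidate when its DTW distance to the nearest stored template falls below a threshold. This is a generic textbook formulation with an absolute-difference local cost, not the CoughTrigger implementation; the templates and threshold in the usage lines are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences.

    Uses two rolling rows, so memory is O(len(b)).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]

def is_cough(window, templates, threshold):
    """True if the window is within `threshold` of any stored template."""
    return min(dtw_distance(window, t) for t in templates) <= threshold

# Hypothetical cough-like template and two candidate windows.
templates = [[0, 1, 3, 1, 0]]
stretched = [0, 1, 1, 3, 1, 0]   # same shape, slightly slower
flat = [0, 0, 0, 0, 0]           # no motion burst
```

Lowering the threshold makes the detector more conservative, and adding templates widens the range of cough shapes it accepts, which corresponds to the two sensitivity controls mentioned above.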
Multimodal and cooperative-trigger systems (Cough-E, CoughTrigger) layer lightweight IMU classifiers as always-on monitors, activating higher-power audio branches only when needed, thus achieving >70% energy savings compared to monolithic audio pipelines with minimal F1 penalty (Zhang et al., 2021, Albini et al., 2024).
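The cooperative-trigger pattern can be sketched as a two-stage cascade in which the audio classifier runs only on windows that pass a cheap IMU gate. The callables, the per-window cost units, and the energy model below are assumptions made for illustration; real savings depend on hardware duty cycling that this sketch does not model.

```python
def run_cascade(windows, imu_gate, audio_classifier):
    """Run the audio classifier only on windows accepted by the IMU gate.

    `windows` is an iterable of (imu_window, audio_window) pairs.
    Returns per-window detections and the number of audio activations.
    """
    detections = []
    audio_calls = 0
    for imu_window, audio_window in windows:
        if not imu_gate(imu_window):
            detections.append(False)  # rejected cheaply, audio stays off
            continue
        audio_calls += 1
        detections.append(bool(audio_classifier(audio_window)))
    return detections, audio_calls

def relative_energy_saving(n_windows, audio_calls, imu_cost, audio_cost):
    """Fraction of energy saved versus running audio on every window.

    Costs are in arbitrary but consistent units per window (assumed values).
    """
    baseline = n_windows * audio_cost
    cascade = n_windows * imu_cost + audio_calls * audio_cost
    return 1.0 - cascade / baseline

# Toy run: 10 windows, the gate fires on two, audio confirms one.
windows = [(i, i) for i in range(10)]
detections, audio_calls = run_cascade(
    windows,
    imu_gate=lambda w: w in (3, 7),
    audio_classifier=lambda w: w == 3,
)
saving = relative_energy_saving(10, audio_calls, imu_cost=1.0, audio_cost=10.0)
```

In this toy setting the cascade spends 30 cost units instead of 100. Any cough the gate misses is never seen by the audio branch, so the gate's sensitivity bounds that of the whole system.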
4. Evaluation Metrics and Frameworks
Event-based evaluation is widely recognized as necessary for clinical relevance (Orlandic et al., 2024). Standard sample- or window-based metrics (accuracy, specificity, ROC-AUC, precision, recall) are misleading in highly imbalanced datasets, obscuring the true performance in cough occurrence detection and temporal resolution. Event-based metrics operate on predicted and reference (onset, offset) pairs with defined temporal tolerances (e.g., ±0.25 s), reporting event-precision, event-recall, F1, and false positives per hour:
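- Event-precision: $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$, where TP is the number of predicted events matched to a reference event within the tolerance and FP is the number of unmatched predictions.
- Event-recall: $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, where FN is the number of reference events with no matching prediction.
- Event-F1: $F_1 = 2PR / (P + R)$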
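The event-level scores listed above can be sketched with a simple one-to-one matcher. The matching rule used here (both onset and offset within the tolerance, greedy assignment in onset order) is an assumption chosen for clarity; it is not the algorithm of the SzCORE library, and other overlap-based rules are equally plausible.

```python
def match_events(reference, predicted, tolerance=0.25):
    """Greedily match predicted (onset, offset) pairs to reference events.

    A prediction matches when both its onset and offset lie within
    `tolerance` seconds of an unmatched reference event.
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(reference)
    tp = 0
    for p_on, p_off in sorted(predicted):
        for ref in unmatched:
            r_on, r_off = ref
            if abs(p_on - r_on) <= tolerance and abs(p_off - r_off) <= tolerance:
                unmatched.remove(ref)  # each reference event is used once
                tp += 1
                break
    return tp, len(predicted) - tp, len(unmatched)

def event_scores(reference, predicted, duration_hours, tolerance=0.25):
    """Event-precision, event-recall, F1 and false positives per hour."""
    tp, fp, fn = match_events(reference, predicted, tolerance)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "fp_per_hour": fp / duration_hours}

# Toy half-hour recording: three true coughs, four detections.
reference = [(1.0, 1.4), (5.0, 5.5), (9.0, 9.3)]
predicted = [(1.1, 1.5), (5.0, 5.6), (7.0, 7.3), (20.0, 20.2)]
scores = event_scores(reference, predicted, duration_hours=0.5)
```

Here two detections match, two are spurious, and one cough is missed, giving a precision of 0.5, a recall of 2/3, and 4 false positives per hour.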
Event-counting and segmentation methods have been codified into open-source frameworks such as the SzCORE time-scoring library, facilitating standardized benchmarking across algorithms and datasets (Orlandic et al., 2024).
Cough detection systems are further evaluated by downstream clinical outcomes, such as the Area Under ROC Curve (AUC) for disease classification tasks (e.g., COVID-19, tuberculosis), with robust systems maintaining AUCs >0.7 under substantial noise and device variability (Chaudhari et al., 2020, Vüren et al., 11 Mar 2026).
5. Real-World Deployments and Embedded Implementation
Power and computational constraints are primary design drivers for wearable and edge-AI cough detectors. Systems such as CoughTrigger (Galaxy Buds2) and Cough-E (ARM Cortex-M33) demonstrate that through modality selection (IMU activation of audio), feature pruning, and model quantization (e.g., INT8), multi-day, privacy-preserving continuous deployment is feasible with <1% battery penalty and event-level F1 ≈0.78–0.88 (Zhang et al., 2021, Albini et al., 2024). Audio-only systems running framewise GMM or single-layer ANN at ≤100 MFLOPS are sufficient for 100 fps real-time detection (Drugman et al., 2020, Drugman et al., 2019).
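As an illustration of INT8 quantization, the sketch below applies symmetric per-tensor quantization to a list of weights and reconstructs approximate values from the stored integers. This is one common scheme, shown under the assumption of a single scale per tensor; the cited systems may use different schemes (for example per-channel scales or zero points), and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns the integer codes and the scale needed to reconstruct values.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

# Toy weight vector; the rounding error is bounded by half a step.
weights = [-1.0, 0.0, 0.3, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight then occupies one byte instead of four, and the worst-case reconstruction error is about `scale / 2`, which is the accuracy cost traded for the smaller memory footprint and integer arithmetic.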
Edge deployment entails pipeline optimizations ensuring that all feature computation, inference, and raw data discard are confined to the embedded device, thereby realizing on-device privacy guarantees, a non-negotiable criterion for medical acceptability (Albini et al., 2024). Continuous cough monitors must also be robust to confounders (laughter, speech, forced expiration) and remain reliable at the artifact rates encountered in situ.
Scaling to global, population-level monitoring crucially depends on open, diverse datasets and platform-agnostic model designs. Ecosystem-wide standards for annotation, device calibration, and metric reporting remain an active area of development, as public corpora such as AMI, COUGHVID, and Coswara are increasingly leveraged in both academic and translational contexts (Leamy et al., 2019, Chaudhari et al., 2020, Ijaz et al., 2023).
6. Current Limitations and Research Trends
While deep models have demonstrated near-human accuracy in controlled and moderately noisy environments, persistent challenges include generalization to unseen subjects and contexts, high-SNR confounders, and short-cough detection (<100 ms duration). IMU-only modalities suffer a significant drop in sensitivity when cough motion is subtle or obscured by motion artifacts (Zhang et al., 2021, Zhang et al., 2021). Acoustic-based algorithms remain susceptible to dense background sound unless equipped with advanced denoising and robust feature sets (Chaudhari et al., 2020, Vüren et al., 11 Mar 2026).
Continued progress involves: (1) multimodal sensor fusion, (2) federated/self-supervised learning for privacy-preserving model adaptation, (3) interpretable ML for clinical decision support, (4) edge-aware model compression and adaptation, (5) standardization of event-based benchmarking, and (6) clinical integration with telemedicine workflows (Albini et al., 2024, Ijaz et al., 2023, Orlandic et al., 2024).
Integration of large-scale pretrained models (e.g., truncated XLS-R) enables leveraging audio representations distilled from massive corpora for robust boundary detection with state-of-the-art precision and recall, now feasible on modern mobile hardware (Vüren et al., 11 Mar 2026). Concurrently, low-power IMU-centric and IMU-triggered pipelines pave the way for battery-efficient, always-on cough monitors in consumer wearables.
7. Reference Datasets and Open Tools
Robust automated cough detector development depends on open, diverse, and expertly annotated datasets. The re-annotated AMI corpus provides high-precision event boundary labels for over 1,300 cough events in naturalistic meeting environments (Leamy et al., 2019). COUGHVID and Coswara offer large, crowdsourced collections with PCR-confirmed COVID-19 tags (Chaudhari et al., 2020, Ijaz et al., 2023). Toolkits for event-based evaluation (SzCORE), model deployment (open-source C/C++/Python on edge MCUs), and annotation (MATLAB GUI for segmentation) are publicly released to facilitate transparent benchmarking and reproducibility (Orlandic et al., 2024, Albini et al., 2024, Leamy et al., 2019).
Best practices now emphasize event-based metrics, open architectures, parameter-efficient models, and on-device privacy-by-design, establishing rigorous new standards for clinical and consumer respiratory monitoring solutions.