- The paper introduces a novel five-class cough classification system that fine-tunes Whisper and employs dual-encoder cross-attention fusion.
- It leverages active-frame QKV attention pooling and advanced training strategies, such as Mixup, FiLM, and supervised contrastive loss, to mitigate class imbalance and acoustic ambiguity.
- The proposed model achieves up to 85.4% balanced accuracy with low-latency server-side inference, enabling real-time multi-condition respiratory screening on mobile devices.
CoughSense: Multi-Class Respiratory Disease Recognition via Whisper and Dual-Encoder Fusion
Introduction
This paper introduces CoughSense, a five-class cough sound classification framework that fine-tunes a speech foundation model (OpenAI Whisper), pairs it with the OPERA-CT respiratory encoder in a dual-encoder architecture, and uses training procedures designed to address class imbalance and distributional shift. The goal is robust, low-latency, multi-condition respiratory screening from short (1–4 s) smartphone cough recordings, suitable for real-time use on mobile devices. The methodology targets three core challenges: severe inter-class acoustic ambiguity, pronounced class imbalance, and cross-domain variability arising from the aggregation of four large-scale public datasets.
Dataset and Taxonomy
CoughSense aggregates 18,301 cough recordings sourced from the Coswara, CoughVID, Virufy, and West China Hospital Pediatric Cough datasets. The class taxonomy consists of five categories: healthy, COVID-19, asthma/respiratory condition, bronchitis, and pneumonia. The bronchitis and pneumonia minority classes (n = 91 and 82 raw recordings, respectively) come exclusively from pediatric clinical recordings; to address this bottleneck, an eight-fold structured augmentation protocol is employed. Severe class imbalance nonetheless remains (healthy:pneumonia is about 19:1), which is why explicit balancing strategies are needed during training.
Model Architecture
Whisper Encoder Utilization
For the first time in cough disease classification, Whisper's speech encoder, pretrained on 680k hours of speech, is fine-tuned for the respiratory diagnosis domain. The rationale is that speech and cough share production mechanisms (laryngeal airflow, glottal excitation, broadband resonances), so representations learned from speech are expected to transfer to cough audio. The encoder is trained in two phases: head-only training first, then full encoder fine-tuning with differential learning rates and cosine annealing.
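The two-phase schedule can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' training script: the `encoder` and `head` modules are small stand-ins, and the learning rates and `T_max` are hypothetical values chosen only to show the structure.

```python
import torch
from torch import nn

# Stand-in modules (hypothetical); the real encoder would be Whisper's.
encoder = nn.Sequential(nn.Linear(384, 384), nn.GELU())
head = nn.Linear(384, 5)  # five cough classes


def set_requires_grad(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag


# Phase 1: train only the classification head; the encoder stays frozen.
set_requires_grad(encoder, False)
phase1_opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Phase 2: unfreeze the encoder and give it a smaller learning rate
# than the head (differential learning rates via parameter groups).
set_requires_grad(encoder, True)
phase2_opt = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 1e-5},  # assumed value
    {"params": head.parameters(), "lr": 1e-4},     # assumed value
])

# Cosine annealing decays each group's learning rate over T_max steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(phase2_opt, T_max=100)
```

Using a lower rate for the pretrained encoder than for the freshly initialized head is a common way to limit drift away from the pretrained representation during full fine-tuning.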
Active-Frame QKV Attention Pooling
A significant technical contribution is active-frame QKV attention pooling, which restricts attention to the first 200 tokens of Whisper's 1500-token output. Because the encoder emits 1500 tokens for its fixed 30 s input window (50 tokens per second), those 200 tokens cover roughly 4 s of audio. Whisper pads every input to 30 s, so for a 1–4 s cough most of the output tokens correspond to silence; naïve mean pooling over all 1500 tokens would let that silence dilute the signal. In the ablation study, active-frame pooling yields a 5.1 percentage point gain in balanced accuracy, the largest of any single system component.
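A minimal PyTorch sketch of the idea is shown below. It assumes a single learnable query attending over the first 200 encoder frames through `nn.MultiheadAttention`; the paper's exact pooling head is not specified here, so the class name `ActiveFramePool`, the head count, and the single-query design are illustrative assumptions. The width of 384 matches Whisper-tiny's hidden size.

```python
import torch
from torch import nn


class ActiveFramePool(nn.Module):
    """Attention pooling over only the first `active_frames` encoder tokens."""

    def __init__(self, dim: int = 384, num_heads: int = 4,
                 active_frames: int = 200):
        super().__init__()
        self.active_frames = active_frames
        # One learnable query vector that summarizes the clip (assumption).
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 1500, dim), encoder output for a padded 30 s window.
        active = frames[:, : self.active_frames]       # (batch, 200, dim)
        q = self.query.expand(frames.size(0), -1, -1)  # (batch, 1, dim)
        pooled, _ = self.attn(q, active, active)       # keys/values = frames
        return pooled.squeeze(1)                       # (batch, dim)


pool = ActiveFramePool()
x = torch.randn(2, 1500, 384)
out = pool(x)  # shape: (2, 384)
```

Because the slice happens before attention, tokens beyond index 200 cannot influence the pooled vector, which is the property the section describes.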
Training Protocol and Regularization
Training incorporates several mechanisms to handle label scarcity, class imbalance, and domain shift:
- WeightedRandomSampler to ensure balanced batch composition.
- SpecAugment for robust feature-level regularization.
- Balanced Mixup with tailored minority-majority pairing.
- Supervised Contrastive Loss (SupCon) as an auxiliary objective on in-batch labels.
- FiLM symptom conditioning on external clinical symptom vectors (e.g., anosmia), applied via feature-wise affine modulation.
- Gradient Reversal Layer (GRL) domain-adversarial branch to encourage domain-invariant features between clinical and crowdsourced settings.
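Two of the mechanisms listed above, FiLM conditioning and the gradient reversal layer, can be sketched in a few lines of PyTorch. This is an illustrative sketch under stated assumptions rather than the paper's implementation: the symptom vector size (8), the feature width (384), the reversal strength `lam`, and the two-way domain head are all hypothetical.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient is returned for the scalar `lam`.
        return -ctx.lam * grad_output, None


class FiLM(nn.Module):
    """Feature-wise affine modulation: gamma(symptoms) * feats + beta(symptoms)."""

    def __init__(self, symptom_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(symptom_dim, 2 * feat_dim)

    def forward(self, feats: torch.Tensor, symptoms: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(symptoms).chunk(2, dim=-1)
        return gamma * feats + beta


feats = torch.randn(4, 384)                      # pooled audio features
symptoms = torch.randint(0, 2, (4, 8)).float()   # binary symptom flags
conditioned = FiLM(symptom_dim=8, feat_dim=384)(feats, symptoms)

# Domain branch: the classifier learns to tell clinical from crowdsourced
# audio, while the reversed gradient pushes features to hide that cue.
domain_head = nn.Linear(384, 2)
domain_logits = domain_head(GradReverse.apply(conditioned, 1.0))
```

The domain head is trained normally on its own loss; only the gradient flowing back into the shared features is negated, which is what encourages domain-invariant representations.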
Dual-Encoder Cross-Attention Fusion
CoughSense further introduces a dual-encoder hybrid that fuses Whisper's speech representations with OPERA-CT, a ViT-based respiratory foundation model. The cross-attention block uses Whisper features as queries and OPERA embeddings as keys/values; in this stage, only the fusion and classification heads are trained. This configuration yields the highest reported performance in the study.
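The fusion step can be sketched as a single cross-attention block. The sketch below is an assumption-laden illustration: the class name `CrossAttentionFusion`, the shared width of 256, the residual-plus-LayerNorm layout, mean pooling before the classifier, and the OPERA embedding size of 768 are all hypothetical choices. The Whisper width of 512 matches Whisper-base.

```python
import torch
from torch import nn


class CrossAttentionFusion(nn.Module):
    """Whisper tokens query OPERA tokens; a linear head classifies the result."""

    def __init__(self, whisper_dim: int = 512, opera_dim: int = 768,
                 dim: int = 256, num_heads: int = 4, num_classes: int = 5):
        super().__init__()
        self.q_proj = nn.Linear(whisper_dim, dim)   # project queries
        self.kv_proj = nn.Linear(opera_dim, dim)    # project keys/values
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, whisper_feats: torch.Tensor,
                opera_feats: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(whisper_feats)     # (batch, Tw, dim)
        kv = self.kv_proj(opera_feats)     # (batch, To, dim)
        fused, _ = self.attn(q, kv, kv)    # queries: Whisper; keys/values: OPERA
        fused = self.norm(q + fused)       # residual connection (assumption)
        return self.classifier(fused.mean(dim=1))  # (batch, num_classes)


# Features would come from the two encoders, which are not trained in
# this stage; random tensors stand in for them here.
whisper_feats = torch.randn(2, 200, 512)
opera_feats = torch.randn(2, 32, 768)
logits = CrossAttentionFusion()(whisper_feats, opera_feats)  # shape: (2, 5)
```

Since only the fusion and classification heads are trained in this stage, the encoder outputs could be computed under `torch.no_grad()` or cached ahead of time.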
Results
The primary evaluation metric is balanced accuracy (UAR), particularly suitable given heavy class imbalance. Five-fold cross-validation results are summarized below:
| Model | Parameters | Balanced Accuracy (%) | Macro-F1 | Macro-AUC |
|---|---|---|---|---|
| ViT-from-scratch | 6.3M | 52.7 | 0.514 | 0.823 |
| EfficientNet-B2 (ImageNet) | 9.1M | 71.2 | 0.694 | 0.892 |
| CoughSense Whisper-tiny | 8.6M | 82.3 | 0.817 | 0.941 |
| Whisper-base | 39.5M | 84.7 | 0.839 | 0.952 |
| Dual-Encoder (Whisper+OPERA) | 93.1M | 85.4 | 0.851 | 0.958 |
CoughSense Whisper-tiny surpasses EfficientNet-B2 by 11.1 points of balanced accuracy at a comparable parameter count (8.6M vs. 9.1M), suggesting that speech-domain pretraining transfers better to cough acoustics than ImageNet pretraining or training from scratch. All five classes attain ≥74% recall, with four of five surpassing 80%. The COVID-19 class has the lowest recall (74.8%), reflecting how hard COVID-19 coughs are to distinguish acoustically from healthy coughs, as well as imprecise labels. Bronchitis and pneumonia, the augmented pediatric classes, achieve recalls of 80.3% and 82.4% respectively.
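Balanced accuracy (UAR) is the unweighted mean of the per-class recalls discussed above, so a model cannot score well by favoring the majority class. The short example below uses made-up labels (not data from the paper) to show how it differs from plain accuracy.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (UAR)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced toy set: a model that always predicts "healthy".
y_true = ["healthy"] * 8 + ["pneumonia"] * 2
y_pred = ["healthy"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                      # 0.8
print(balanced_accuracy(y_true, y_pred))   # 0.5
```

The always-healthy predictor reaches 80% plain accuracy but only 50% balanced accuracy, because its recall on the minority class is zero.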
Ablations indicate that each major design choice contributes to the final result, most notably active-frame pooling (+5.1 points) and QKV attention (+2.2 points), with smaller gains from Mixup, FiLM, and SupCon (roughly 0.5–1 point each).
Server-side inference achieves latency < 180 ms/recording on consumer CPUs, supporting real-time deployment. All code and benchmark splits are open-sourced, and the CoughSense model is made available for real-time mobile application use (iOS/Android).
Limitations
CoughSense's bronchitis and pneumonia data is pediatric-only; acoustic differences due to age-related anatomy mean further adult-labeled data is required. Substantial portions of the label space are based on self-report rather than confirmed diagnostics, contributing to noisy supervision. Augmentation cannot substitute for true distributional diversity. No tuberculosis class is present, and variability in mobile hardware microphone response may yield inference-time distribution shift not captured in training. All validation is retrospective; clinical deployment requires prospective, population-matched trials.
Implications and Future Directions
CoughSense demonstrates the applicability of speech-pretrained encoders (Whisper) to non-linguistic pathological audio, outperforming vision-pretrained and randomly-initialized baselines for short-event medical acoustic classification. The dual-encoder design provides an effective means of leveraging heterogeneous foundation models (speech and respiratory). The active-frame pooling paradigm is broadly transferable to other short audio tasks using ASR backbones.
These results reinforce the effectiveness of large-scale speech pretraining for pathological non-speech audio, especially when coupled with domain-adaptive pooling and comprehensive augmentation. Practically, this enables deployable, low-latency, multi-condition cough-based respiratory screening on mobile devices without bespoke hardware. Future work should pursue adult bronchitis/pneumonia data, explicit tuberculosis inclusion, device-specific domain adaptation, and on-device privacy-preserving inference.
Conclusion
CoughSense presents a robust, reproducible, and computationally efficient approach for multi-class cough disease classification, combining speech foundation model fine-tuning, active attention mechanisms, and dual-encoder fusion. The system establishes new performance baselines for smartphone cough screening—reaching 85.4% balanced accuracy—and delineates effective architectural, training, and augmentation paradigms for future medical audio diagnostic research.