CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning

Published 2 Jun 2026 in cs.LG and eess.AS | (2606.02998v1)

Abstract: Automated cough analysis offers a path to low-cost respiratory screening, but most existing work stops at binary COVID-19 detection. A practical tool needs to tell apart several respiratory conditions from one cough recording on a consumer smartphone. We present CoughSense, a system that sorts cough recordings into five classes. These are healthy, COVID-19, asthma or respiratory condition, bronchitis, and pneumonia. We aggregated 18,301 recordings from four public datasets (Coswara, CoughVID, Virufy, and the West China Hospital Pediatric Cough Dataset) and used the OpenAI Whisper encoder as a pretrained backbone for cough disease classification. The main contribution is active-frame QKV attention pooling, which restricts attention to the first 200 of 1500 encoder tokens. This avoids the silence-dilution problem that arises because a 3-second cough fills only 150 tokens of Whisper's 30-second input window. Other training parts handle the 19 to 1 class imbalance and the four-dataset domain shift. These include WeightedRandomSampler, SpecAugment, Balanced Mixup with forced minority pairing, a supervised contrastive auxiliary loss, FiLM symptom conditioning, and gradient-reversal domain adaptation. A dual-encoder model fuses Whisper with the OPERA-CT respiratory foundation model through cross-attention. CoughSense (Whisper-tiny, 8.6M parameters) reached 82.3 percent balanced accuracy on five-fold cross-validation (macro-F1 of 0.817, AUC of 0.941). It beat an ImageNet-pretrained EfficientNet-B2 by 11.1 points and a ViT trained from scratch by 29.6 points. All five classes passed 74 percent recall and four of five passed 80 percent. The dual-encoder model reached 85.4 percent balanced accuracy. Active-frame pooling is the largest single contributor across all ablation components at 5.1 points, which should help any short-audio task using Whisper as a backbone.

Authors (1)

Nikhil Vincent

Summary

The paper introduces a novel five-class cough classification system that fine-tunes Whisper and employs dual-encoder cross-attention fusion.
It leverages active-frame QKV attention pooling and advanced training strategies, such as Mixup, FiLM, and supervised contrastive loss, to mitigate class imbalance and acoustic ambiguity.
The proposed model achieves up to 85.4% balanced accuracy with low-latency server-side inference, enabling real-time multi-condition respiratory screening on mobile devices.

CoughSense: Multi-Class Respiratory Disease Recognition via Whisper and Dual-Encoder Fusion

Introduction

This paper introduces CoughSense, a five-class cough sound classification framework that fine-tunes a speech foundation model (OpenAI Whisper) and, in a dual-encoder variant, fuses it with OPERA-CT, using training procedures designed to address class imbalance and distribution shift. The target is robust, low-latency, multi-condition respiratory screening from short (1–4 s) smartphone cough recordings, suitable for real-time use on mobile devices. The methodology addresses three core challenges: severe inter-class acoustic ambiguity, pronounced class imbalance, and cross-domain variability introduced by aggregating four public datasets.

Dataset and Taxonomy

CoughSense aggregates 18,301 cough recordings from the Coswara, CoughVID, Virufy, and West China Hospital Pediatric Cough datasets. The class taxonomy has five categories: healthy, COVID-19, asthma/respiratory condition, bronchitis, and pneumonia. The bronchitis and pneumonia minority classes (91 and 82 raw recordings, respectively) come exclusively from pediatric clinical recordings; to address this bottleneck, an eight-fold structured augmentation protocol is applied. Severe class imbalance remains after augmentation (healthy to pneumonia, 19:1), so explicit balancing strategies are needed during training.

Model Architecture

Whisper Encoder Utilization

For the first time in cough disease classification, Whisper's speech encoder, pretrained on 680k hours of speech, is fine-tuned for the respiratory diagnosis domain. The rationale is that speech and cough share production mechanisms (laryngeal airflow, glottal excitation, broadband resonances), which motivates transferring the learned representations. The encoder is trained in two phases: head-only training first, then full encoder fine-tuning with differential learning rates and cosine annealing.
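The two-phase schedule described above can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's training code: the peak learning rates, the group names `encoder` and `head`, and the helper `lrs_at` are all assumptions introduced here, since the summary does not state the actual values.

```python
import math

def cosine_annealed_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine annealing: returns lr_max at step 0 and lr_min at step == total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical peak learning rates (not from the paper): a small rate for the
# pretrained encoder and a larger one for the freshly initialized head.
PEAK_LR = {"encoder": 1e-5, "head": 1e-3}

def lrs_at(step: int, total_steps: int, phase: int) -> dict:
    """Per-group learning rates. Phase 1 trains the head only; phase 2 trains both groups."""
    lrs = {name: cosine_annealed_lr(step, total_steps, peak) for name, peak in PEAK_LR.items()}
    if phase == 1:
        lrs["encoder"] = 0.0  # encoder frozen during head-only training
    return lrs
```

In a framework such as PyTorch, the same idea is usually expressed with one optimizer parameter group per learning rate; the pure-Python form is used here only to keep the arithmetic visible.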

Active-Frame QKV Attention Pooling

A significant technical contribution is active-frame QKV attention pooling, which restricts attention to the first 200 tokens (≈4 s of audio) of Whisper's 1500-token output. The encoder emits 50 tokens per second, so a 3-second cough fills only 150 tokens and the rest of the fixed 30 s input window is padding. Restricting attention prevents the signal from being diluted by silence, an effect that would dominate under naïve mean pooling over all 1500 tokens. Ablation shows a +5.1 percentage point increase in balanced accuracy, the largest of any single system component.
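The token arithmetic and the dilution effect can be illustrated with a small sketch. It assumes a single learned query vector and single-head attention, which the summary does not confirm; the function names, the identity projection matrices, and the toy 2-dimensional tokens are hypothetical. Real Whisper embeddings for silence are not zero vectors, so the numbers only show the direction of the effect.

```python
import math

TOKENS_PER_WINDOW = 1500   # Whisper encoder output length for its 30 s window
WINDOW_SECONDS = 30
ACTIVE_TOKENS = 200        # active-frame cutoff reported in the paper (about 4 s)

def tokens_for(seconds: float) -> int:
    """Number of encoder tokens covering `seconds` of audio (50 tokens per second)."""
    return math.ceil(seconds * TOKENS_PER_WINDOW / WINDOW_SECONDS)

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def active_frame_attention_pool(tokens, query, w_key, w_value, active=ACTIVE_TOKENS):
    """Attention-pool only the first `active` tokens using one query vector."""
    kept = tokens[:active]
    keys = [matvec(w_key, t) for t in kept]
    values = [matvec(w_value, t) for t in kept]
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]

def mean_pool(tokens):
    """Naive baseline: average over every token, including padding."""
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(len(tokens[0]))]

# Toy example (hypothetical values): a 3 s cough followed by silence padding.
cough, silence = [1.0, 0.0], [0.0, 0.0]
tokens = [cough] * tokens_for(3) + [silence] * (TOKENS_PER_WINDOW - tokens_for(3))
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled_all = mean_pool(tokens)
pooled_active = active_frame_attention_pool(tokens, [1.0, 0.0], identity, identity)
```

In this toy setup the mean over all 1500 tokens keeps only a tenth of the cough feature, while the pooled vector over the first 200 tokens retains most of it, because at most 50 of the attended tokens are padding and attention can down-weight them.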

Training Protocol and Regularization

Training incorporates several mechanisms to handle label scarcity, class imbalance, and domain shift:

WeightedRandomSampler to ensure balanced batch composition.
SpecAugment for robust feature-level regularization.
Balanced Mixup with tailored minority-majority pairing.
Supervised Contrastive Loss (SupCon) as an auxiliary objective on in-batch labels.
FiLM symptom conditioning, which incorporates external clinical symptom vectors (e.g., anosmia) via feature-wise affine modulation.
Gradient Reversal Layer (GRL) domain-adversarial branch to encourage domain-invariant features between clinical and crowdsourced settings.
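Two of the imbalance mechanisms listed above, weighted sampling and Balanced Mixup, can be sketched in a few lines. This is one plausible reading rather than the paper's implementation: the summary does not specify how pairs are formed, so the rule used here (every majority sample is mixed with a randomly chosen minority sample) is an assumption, as are all function names and the toy batch.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-sample weights proportional to 1 / class count, so each class has equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def balanced_mixup_pairs(labels, minority_classes, rng):
    """Pick a mixup partner per sample; majority samples are forced onto a minority partner."""
    minority_idx = [i for i, y in enumerate(labels) if y in minority_classes]
    pairs = []
    for i, y in enumerate(labels):
        if y in minority_classes or not minority_idx:
            j = rng.randrange(len(labels))   # unconstrained partner
        else:
            j = rng.choice(minority_idx)     # forced minority partner (assumed rule)
        pairs.append((i, j))
    return pairs

def mixup(x_a, x_b, y_a, y_b, lam, num_classes):
    """Convex combination of two feature vectors and of their one-hot labels."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [0.0] * num_classes
    y[y_a] += lam
    y[y_b] += 1 - lam
    return x, y

# Toy batch mirroring the 19:1 healthy-to-pneumonia ratio (class ids are hypothetical).
rng = random.Random(0)
labels = [0] * 19 + [4]
weights = inverse_frequency_weights(labels)
pairs = balanced_mixup_pairs(labels, minority_classes={4}, rng=rng)
mixed_x, mixed_y = mixup([1.0, 0.0], [0.0, 1.0], 0, 4, lam=0.7, num_classes=5)
```

Weights of this form are what PyTorch's `torch.utils.data.WeightedRandomSampler` expects as its `weights` argument; sampling with replacement then draws each class about equally often regardless of its raw count.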

Dual-Encoder Cross-Attention Fusion

CoughSense further introduces a dual-encoder hybrid that fuses Whisper's speech representations with OPERA-CT, a ViT-based respiratory foundation model. The cross-attention block uses Whisper features as queries and OPERA embeddings as keys and values; only the fusion and classification heads are trained. This configuration yields the highest performance reported in the study.
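The query/key/value roles in the fusion block can be made concrete with a single-head scaled dot-product sketch. It omits the learned projections, multiple heads, and any residual or normalization layers a real block would have, and it assumes both encoders' tokens already share one dimension; the token values and the function name `cross_attention` are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted average of the values."""
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        fused.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(len(values[0]))])
    return fused

# Toy inputs: three Whisper tokens act as queries, two OPERA tokens as keys and values.
whisper_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
opera_tokens = [[4.0, 0.0], [0.0, 4.0]]
fused = cross_attention(whisper_tokens, opera_tokens, opera_tokens)
```

The output has one row per Whisper token, so the fused sequence keeps Whisper's time resolution while each position carries information drawn from the OPERA embeddings.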

Results and Performance Analysis

The primary evaluation metric is balanced accuracy (unweighted average recall, UAR), which suits the heavy class imbalance because every class counts equally regardless of its size. Five-fold cross-validation results are summarized below:

Model	Parameters	Balanced Accuracy (%)	Macro-F1	Macro-AUC
ViT-from-scratch	6.3M	52.7	0.514	0.823
EfficientNet-B2 (ImageNet)	9.1M	71.2	0.694	0.892
CoughSense Whisper-tiny	8.6M	82.3	0.817	0.941
Whisper-base	39.5M	84.7	0.839	0.952
Dual-Encoder (Whisper+OPERA)	93.1M	85.4	0.851	0.958

CoughSense Whisper-tiny surpasses EfficientNet-B2 by 11.1 points at a comparable parameter count (8.6M vs. 9.1M), which points to an advantage of speech-domain pretraining for cough acoustics over vision-pretrained or randomly initialized models. All five classes attain at least 74% recall, and four of five exceed 80%. The COVID-19 class has the lowest recall (74.8%), reflecting both the difficulty of separating COVID-19 coughs from healthy ones acoustically and imprecise labels. Bronchitis and pneumonia, the augmented pediatric classes, achieve recalls of 80.3% and 82.4% respectively.
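Because the results are reported as balanced accuracy and per-class recall, it may help to see how the two relate. The sketch below computes both from a confusion matrix; the matrix is a made-up two-class example with a 19:1 ratio, not data from the paper.

```python
def per_class_recall(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]

def balanced_accuracy(confusion):
    """Unweighted mean of per-class recalls (UAR)."""
    recalls = per_class_recall(confusion)
    return sum(recalls) / len(recalls)

def plain_accuracy(confusion):
    """Fraction of all samples classified correctly."""
    correct = sum(row[i] for i, row in enumerate(confusion))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical degenerate classifier that always predicts the majority class.
toy_confusion = [
    [190, 0],   # majority class: all 190 correct
    [10, 0],    # minority class: all 10 missed
]
toy_plain = plain_accuracy(toy_confusion)        # 0.95
toy_balanced = balanced_accuracy(toy_confusion)  # 0.5
```

Plain accuracy rewards ignoring the minority class, while balanced accuracy drops to chance level, which is why the latter is the more informative headline number for a dataset with this degree of imbalance.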

Ablation attributes gains to each major design choice, led by active-frame pooling (+5.1 points) and QKV attention (+2.2 points), with smaller contributions from Mixup, FiLM, and SupCon (roughly 0.5–1 point each).

Server-side inference takes under 180 ms per recording on consumer CPUs, which supports real-time deployment. All code and benchmark splits are open-sourced, and the CoughSense model is made available for use in real-time mobile applications (iOS/Android).

Limitations

CoughSense's bronchitis and pneumonia data are pediatric-only; because airway anatomy, and with it cough acoustics, changes with age, labeled adult data for these classes are still required. A substantial share of the labels comes from self-report rather than confirmed diagnosis, which makes the supervision noisy. Augmentation cannot substitute for true distributional diversity. No tuberculosis class is present, and variation in smartphone microphone response may cause inference-time distribution shift that training does not capture. All validation is retrospective; clinical deployment requires prospective, population-matched trials.

Implications and Future Directions

CoughSense demonstrates the applicability of speech-pretrained encoders (Whisper) to non-linguistic pathological audio, outperforming vision-pretrained and randomly initialized baselines for short-event medical acoustic classification. The dual-encoder design provides an effective means of combining heterogeneous foundation models (speech and respiratory). The active-frame pooling approach should transfer to other short-audio tasks that use ASR backbones.

More broadly, these results support the effectiveness of large-scale speech pretraining for pathological non-speech audio, especially when coupled with pooling adapted to short events and comprehensive augmentation. Practically, this enables deployable, low-latency, multi-condition respiratory screening from audio on mobile devices without bespoke hardware. Future work should pursue adult bronchitis/pneumonia data, explicit tuberculosis inclusion, device-specific domain adaptation, and on-device privacy-preserving inference.

Conclusion

CoughSense presents a robust, reproducible, and computationally efficient approach for multi-class cough disease classification, combining speech foundation model fine-tuning, active attention mechanisms, and dual-encoder fusion. The system establishes new performance baselines for smartphone cough screening—reaching 85.4% balanced accuracy—and delineates effective architectural, training, and augmentation paradigms for future medical audio diagnostic research.
