SS-DPPN: Dual-Path Prototypical Network
- The paper introduces SS-DPPN, a dual-path framework combining a dilated TCN for 1D waveforms and a modified ResNet-50 for spectrograms to enhance cardiac audio analysis.
- It employs a hybrid contrastive loss that integrates NT-Xent and Wasserstein metrics to achieve data efficiency, robust calibration, and high sensitivity in diverse diagnostic tasks.
- Through prototypical metric learning, SS-DPPN reduces reliance on expert-annotated data while delivering transferable representations across various physiological signal applications.
The Self-Supervised Dual-Path Prototypical Network (SS-DPPN) is a framework for foundation-level representation learning and robust classification in physiological time-series analysis, with an emphasis on cardiac audio signals. Distinguished by its dual-path contrastive architecture and metric-learning paradigm, SS-DPPN pre-trains on unlabeled data to improve data efficiency and delivers well-calibrated, sensitive predictions across multiple domains from limited labels. The network's design addresses the scarcity of expert-annotated datasets and yields representations that transfer to downstream tasks involving audio-based medical diagnosis.
1. Dual-Path Contrastive Architecture
SS-DPPN employs a dual-path strategy that simultaneously exploits complementary information from raw 1D waveforms and 2D spectrograms. The first path is a dilated Temporal Convolutional Network (TCN) operating on raw cardiac waveforms, using deep residual dilated convolutions to capture long-term temporal dependencies within physiological signals. Each convolutional block implements pre-activation normalization and residual skip connections for stable optimization.
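To make the role of dilation concrete, the following is a minimal sketch of a residual dilated convolution block and of how the receptive field grows as blocks are stacked. It is not the paper's implementation: the causal left-padding, the kernel values, the dilation schedule `[1, 2, 4, 8]`, and all function names are assumptions for illustration, and the normalization step mentioned above is omitted for brevity.

```python
def dilated_conv1d(x, kernel, dilation):
    """Causal dilated 1D convolution; implicit left zero-padding keeps the length."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation  # look back j * dilation samples
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_block(x, kernel, dilation):
    """Pre-activation style block: ReLU, then dilated conv, plus identity skip.

    Normalization is left out of this sketch.
    """
    activated = [max(0.0, v) for v in x]
    conv = dilated_conv1d(activated, kernel, dilation)
    return [a + b for a, b in zip(x, conv)]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated convolutions: 1 + sum((k - 1) * d)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical configuration: kernel size 2, dilations doubling per block.
dilations = [1, 2, 4, 8]
signal = [0.0] * 16
signal[0] = 1.0  # unit impulse at t = 0

y = signal
for d in dilations:
    y = residual_block(y, [0.5, 0.5], d)

rf = receptive_field(2, dilations)
```

With these assumed settings, four blocks are enough for the impulse at `t = 0` to influence the output 15 samples later, which is the sense in which exponentially growing dilations capture long-range dependencies with few layers.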
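The parallel path utilizes a modified ResNet-50 architecture tuned for single-channel spectrogram input. Feature extraction is followed by a projection head. The two learned embeddings, written here as $z_{\text{wave}}$ and $z_{\text{spec}}$, are concatenated, then fused via a multi-layer perceptron:

$$z = \mathrm{MLP}\big([\,z_{\text{wave}} \,;\, z_{\text{spec}}\,]\big)$$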
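The concatenate-then-fuse step can be sketched as follows. This is an illustrative toy, not the paper's fusion module: the embedding sizes (8 per path), the hidden width, the random initialization, and the names `fuse`, `init_layer`, `z_wave`, and `z_spec` are all assumptions.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero bias (placeholder initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse(z_wave, z_spec, layers):
    """Concatenate the two path embeddings, then apply an MLP.

    ReLU is applied after every layer except the last.
    """
    h = list(z_wave) + list(z_spec)
    for i, (weights, bias) in enumerate(layers):
        h = linear(h, weights, bias)
        if i < len(layers) - 1:
            h = relu(h)
    return h


rng = random.Random(0)
z_wave = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for the TCN embedding
z_spec = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for the ResNet embedding
layers = [init_layer(rng, 16, 12), init_layer(rng, 12, 4)]
fused = fuse(z_wave, z_spec, layers)
```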
Self-supervised contrastive learning is realized by constructing positive pairs through random augmentations (Gaussian noise, time shifting, SpecAugment) and contrasting them against negatives drawn from other samples. This strategy encourages invariance to signal variability while capturing structural signal features essential for medical diagnostics.
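The three augmentations named above can be sketched in a few lines. The parameter values, the choice of a circular shift, the simplified SpecAugment-style masking (one frequency band and one time band set to zero), and the helper names are assumptions; the source does not specify these details.

```python
import random


def add_gaussian_noise(wave, rng, sigma=0.01):
    """Add zero-mean Gaussian noise to every sample."""
    return [v + rng.gauss(0.0, sigma) for v in wave]


def time_shift(wave, rng, max_shift=4):
    """Circularly shift the waveform by a random offset (assumed shift style)."""
    s = rng.randint(-max_shift, max_shift) % len(wave)
    if s == 0:
        return list(wave)
    return wave[-s:] + wave[:-s]


def spec_mask(spec, rng, max_freq_width=2, max_time_width=2):
    """SpecAugment-style masking on spec[freq][time]; returns a masked copy."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    fw = rng.randint(0, max_freq_width)
    f0 = rng.randint(0, n_freq - fw)
    for f in range(f0, f0 + fw):
        out[f] = [0.0] * n_time  # mask a band of frequency bins
    tw = rng.randint(0, max_time_width)
    t0 = rng.randint(0, n_time - tw)
    for row in out:
        for t in range(t0, t0 + tw):
            row[t] = 0.0  # mask a band of time frames
    return out


def make_positive_pair(wave, rng):
    """Two independently augmented views of the same recording."""
    view_a = add_gaussian_noise(time_shift(wave, rng), rng)
    view_b = add_gaussian_noise(time_shift(wave, rng), rng)
    return view_a, view_b


rng = random.Random(0)
wave = [float(i) for i in range(16)]   # toy waveform
spec = [[1.0] * 8 for _ in range(6)]   # toy spectrogram: 6 freq bins x 8 frames
view_a, view_b = make_positive_pair(wave, rng)
masked = spec_mask(spec, rng)
```

Two views produced from the same recording form a positive pair; views from different recordings serve as negatives.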
2. Hybrid Contrastive Loss Formulation
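SS-DPPN introduces a hybrid loss combining the normalized temperature-scaled cross-entropy (NT-Xent) and the Wasserstein distance. For a positive pair $(i, j)$ among $2N$ augmented views, the NT-Xent loss takes its standard form:

$$\mathcal{L}_{\text{NT-Xent}}(i, j) = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1,\, k \neq i}^{2N} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$$

where $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity, and $\tau$ is a temperature parameter.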
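A minimal sketch of the standard NT-Xent computation follows. The pairing convention (views `2k` and `2k + 1` form a positive pair), the default temperature of 0.5, and the function names are assumptions made for this example rather than settings taken from the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(embeddings, tau=0.5):
    """Mean NT-Xent loss over all 2N views.

    Assumes views 2k and 2k + 1 are the two augmentations of sample k.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner
        logits = [cosine(embeddings[i], embeddings[k]) / tau
                  for k in range(n) if k != i]
        pos = cosine(embeddings[i], embeddings[j]) / tau
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos
    return total / n


# Toy embeddings: partners nearly parallel (aligned) vs. mismatched (shuffled).
aligned = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
shuffled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [0.01, 1.0]]
loss_aligned = nt_xent(aligned)
loss_shuffled = nt_xent(shuffled)
```

The loss is lower when each view is closest to its own partner, which is the instance-discrimination behavior the objective rewards.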
The Wasserstein loss $\mathcal{L}_{W}$ aligns the global embedding distributions of the dual-path outputs. The hybrid loss then combines the two terms, with a weighting coefficient $\lambda$ controlling the trade-off between local instance discrimination and global geometric alignment; the best-performing value of $\lambda$ was determined in the experimental evaluations.
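The exact form of the Wasserstein term and of the weighting is not given here, so the sketch below rests on stated assumptions: it uses the empirical 1-Wasserstein distance between equal-size 1D samples (mean absolute difference of sorted values), averaged over embedding dimensions as a simplification, and an additive combination `nt_xent + lam * wasserstein`. The function names and the value `lam=0.2` are hypothetical.

```python
def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def per_dim_wasserstein(za, zb):
    """Average the 1D distance over embedding dimensions.

    This is a simplification, not a full multivariate Wasserstein distance.
    """
    dims = len(za[0])
    total = 0.0
    for d in range(dims):
        total += wasserstein_1d([z[d] for z in za], [z[d] for z in zb])
    return total / dims


def hybrid_loss(nt_xent_value, wasserstein_value, lam):
    """Assumed additive form: instance-level term plus weighted alignment term."""
    return nt_xent_value + lam * wasserstein_value


# Toy batches of embeddings from the two paths (2 samples, 2 dimensions each).
z_path_a = [[0.0, 0.0], [1.0, 1.0]]
z_path_b = [[1.0, 0.0], [2.0, 1.0]]
w = per_dim_wasserstein(z_path_a, z_path_b)
loss = hybrid_loss(1.0, w, lam=0.2)  # lam is a placeholder value
```

Because the 1D distance depends only on sorted values, it compares the two batches as distributions rather than sample by sample, which is what makes it a global alignment term alongside the pairwise NT-Xent term.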
This hybrid criterion ensures SS-DPPN learns instance-level invariance and maintains a globally well-structured latent space for robust downstream metric-based classification.
3. Prototypical Network Metric-Learning Classifier
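For downstream prediction, SS-DPPN integrates a prototypical network classifier. During inference, class prototypes are computed as mean embeddings of the support set $S_k$ for each class $k$:

$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i)$$

A query sample $x$ is assigned probabilities via a softmax over negative squared Euclidean distances:

$$p(y = k \mid x) = \frac{\exp\big(-\lVert f_\theta(x) - c_k \rVert^2\big)}{\sum_{k'} \exp\big(-\lVert f_\theta(x) - c_{k'} \rVert^2\big)}$$

where $f_\theta$ denotes the embedding network.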
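The prototype computation and distance-based softmax can be sketched as below. The embeddings are hand-written 2D stand-ins for encoder outputs, and the class names and function names are illustrative assumptions.

```python
import math


def class_prototypes(support_embeddings, support_labels):
    """Mean embedding per class over the support set."""
    sums, counts = {}, {}
    for z, y in zip(support_embeddings, support_labels):
        if y not in sums:
            sums[y] = [0.0] * len(z)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], z)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict_proba(query, prototypes):
    """Softmax over negative squared Euclidean distances to each prototype."""
    neg_dist = {
        y: -sum((q - c) ** 2 for q, c in zip(query, proto))
        for y, proto in prototypes.items()
    }
    m = max(neg_dist.values())  # subtract the max for numerical stability
    exps = {y: math.exp(v - m) for y, v in neg_dist.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


# Toy support set: two embeddings per class (hypothetical labels).
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
labels = ["normal", "normal", "murmur", "murmur"]
prototypes = class_prototypes(support, labels)
probs = predict_proba([1.0, 1.0], prototypes)
```

The classifier has no learned output layer: changing the support set changes the prototypes, and therefore the predictions, without retraining the encoder.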
This metric-based classifier inherently yields higher sensitivity and well-calibrated scores compared to standard parametric classifiers, particularly under clinically relevant class imbalances. Quantitative metrics (Brier score, Expected Calibration Error) corroborate the improved reliability of SS-DPPN predictions.
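For reference, the two calibration metrics named above can be computed as in the following sketch for the binary case. The binned ECE variant shown here compares the mean predicted positive-class probability with the observed positive rate in each bin; the bin count of 10 and the toy predictions are assumptions, and the source's exact evaluation protocol may differ.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: weighted gap between mean predicted probability and
    observed positive frequency within equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / len(probs)) * abs(mean_p - freq)
    return ece


# Toy predictions: confident and correct, but slightly under-confident.
probs = [0.9, 0.9, 0.1, 0.1]
labels = [1, 1, 0, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Lower is better for both metrics; a model whose predicted probabilities match observed frequencies in every bin has an ECE of zero.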
4. Empirical Performance and Benchmark Results
SS-DPPN sets state-of-the-art results on four cardiac audio benchmarks, as evidenced by:
| Dataset | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|
| CirCor DigiScope 2022 | 0.910 | 0.868 | 0.848 | 0.890 |
| PhysioNet Challenge 2016 | 0.881 | 0.922 | — | — |
| PASCAL CHSC | — | 0.831 | — | — |
| HLS-CMDS (simulated) | — | 0.970 | — | — |
SS-DPPN surpasses supervised and prior self-supervised baselines in key diagnostic metrics. Recall and sensitivity, critical for clinical screening, are notably improved by the metric-learning approach. Calibration metrics confirm the model produces trustworthy probability estimates required in medical deployment scenarios.
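As a consistency check on the CirCor DigiScope 2022 row, the F1-score can be recomputed from the reported precision and recall using the harmonic-mean definition. The function name is illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported CirCor DigiScope 2022 values: precision 0.848, recall 0.890.
f1 = f1_from_precision_recall(0.848, 0.890)
print(round(f1, 3))  # 0.868
```

The recomputed value agrees with the F1-score of 0.868 listed in the table.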
5. Data Efficiency
SS-DPPN's self-supervised pre-training achieves substantial label savings. Empirical results show that with only 25% labeled data, SS-DPPN matches or exceeds the accuracy of a fully supervised baseline trained on 75% labeled data, a three-fold reduction in annotation requirements. This matters for medical applications, where expert labeling is expensive and often limited.
6. Generalization to Related Domains
Representations learned by SS-DPPN are highly transferable. With minor fine-tuning, cardiac-trained encoders achieve an F1-score of 0.897 and AUROC of 0.947 for lung sound classification. For heart rate estimation (ECG regression), the mean absolute error is 0.9743 BPM (R² = 0.9705). This suggests SS-DPPN learns fundamental signal features that benefit broad physiological signal modeling, beyond cardiac audio alone.
7. Experimental Findings and Ablation Analysis
Systematic ablation verifies the necessity of each component. Replacing prototypical classification with a linear classifier reduces recall from 0.890 to 0.758. Calibration, reliability, and learning dynamics across datasets further reinforce the robustness of SS-DPPN. Embedding visualizations (t-SNE, UMAP) confirm natural clustering by class. Rigorous statistical validation (DeLong’s test, bootstrap sampling, McNemar’s test) indicates improvements are statistically significant.
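Of the statistical tests listed, McNemar's test is the simplest to illustrate: it compares two classifiers on the same test set using only the cases where they disagree. The sketch below implements the exact (binomial) two-sided version; the disagreement counts are hypothetical and are not taken from the paper.

```python
import math


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases model A classified correctly and model B incorrectly.
    c: cases model A classified incorrectly and model B correctly.
    Under the null hypothesis, each disagreement is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical disagreement counts between two classifiers.
p_balanced = mcnemar_exact(5, 5)    # evenly split disagreements
p_one_sided = mcnemar_exact(10, 0)  # all disagreements favor one model
```

Evenly split disagreements give no evidence of a difference, while ten disagreements all favoring one model give a p-value of about 0.002.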
Summary and Significance
SS-DPPN defines a foundation model for self-supervised, dual-path learning in cardiac audio analysis. Its architecture combines complementary temporal and spectral encoders, a hybrid contrastive loss, and a prototypical metric-learning head, resulting in highly discriminative, calibrated, transferable representations. The framework demonstrates state-of-the-art accuracy, superior clinical sensitivity, reduced label dependence, and broad generalization across physiological signals, establishing a robust basis for future work in annotation-scarce medical AI (Muna et al., 12 Oct 2025).