SS-DPPN: Dual-Path Prototypical Network

Updated 19 October 2025

The paper introduces SS-DPPN, a dual-path framework combining a dilated TCN for 1D waveforms and a modified ResNet-50 for spectrograms to enhance cardiac audio analysis.
It employs a hybrid contrastive loss that integrates NT-Xent and Wasserstein metrics to achieve data efficiency, robust calibration, and high sensitivity in diverse diagnostic tasks.
Through prototypical metric learning, SS-DPPN reduces reliance on expert-annotated data while delivering transferable representations across various physiological signal applications.

The Self-Supervised Dual-Path Prototypical Network (SS-DPPN) is a framework for foundation-level representation learning and robust classification in physiological time-series analysis, with an emphasis on cardiac audio signals. Distinguished by its dual-path contrastive architecture and metric-learning paradigm, SS-DPPN aims to maximize data efficiency and deliver well-calibrated, sensitive predictions across multiple domains, while requiring only unlabeled data for pre-training. The network's design addresses the scarcity of expert-annotated datasets and demonstrates highly transferable representations for downstream tasks involving audio-based medical diagnosis.

1. Dual-Path Contrastive Architecture

SS-DPPN employs a dual-path strategy that simultaneously exploits complementary information from raw 1D waveforms and 2D spectrograms. The first path is a dilated Temporal Convolutional Network (TCN) operating on raw cardiac waveforms $x \in \mathbb{R}^T$, leveraging deep residual dilated convolutions to capture long-term temporal dependencies within physiological signals: $\begin{aligned} h_0 &= f_\text{initial}(x) \\ h_{l+1} &= f_{\text{block},l}(h_l) \quad \text{for } l=0,\ldots,7 \\ z_{1\text{D}} &= f_\text{pool}(h_8) \end{aligned}$ Each convolutional block implements preactivation normalization and residual skip connections for stable optimization.
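
To illustrate why stacked dilated convolutions reach long temporal contexts, the receptive field of an eight-block stack can be computed directly. This is a minimal sketch under assumptions the source does not state: kernel size 3, one convolution per block, and dilations that double per block ($2^l$). The function name `receptive_field` is hypothetical and the resulting number is illustrative only.

```python
def receptive_field(num_blocks=8, kernel_size=3, convs_per_block=1):
    """Receptive field (in samples) of a stack of dilated 1D convolutions.

    Assumes block l uses dilation 2**l; this schedule is an assumption,
    not a detail taken from the SS-DPPN description.
    """
    rf = 1
    for l in range(num_blocks):
        dilation = 2 ** l
        # Each convolution widens the field by (kernel_size - 1) * dilation.
        rf += convs_per_block * (kernel_size - 1) * dilation
    return rf


print(receptive_field())  # 8 blocks, kernel 3 -> 511 samples
```

Under these assumptions the field grows exponentially with depth, whereas the same stack without dilation would cover only 17 samples.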

The parallel path utilizes a modified ResNet-50 architecture tuned for single-channel spectrogram input. Feature extraction is followed by a projection head: $P(h) = W_p \cdot \text{ReLU}(\text{BN}(W_e \cdot h))$ The two learned embeddings $z_{1\text{D}}$ and $z_{2\text{D}}$ are concatenated, then fused via a multi-layer perceptron: $Z_\text{fused} = F_\text{fusion}([z_{1\text{D}}, z_{2\text{D}}])$
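
The projection head and fusion step can be sketched in NumPy as follows. All dimensions, the random weights, the omission of learned batch-norm scale and shift, and the single linear-plus-ReLU layer standing in for $F_\text{fusion}$ are assumptions made for illustration; the encoder outputs are replaced by random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)


def batch_norm(h, eps=1e-5):
    # Normalize each feature over the batch (no learned scale/shift here).
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)


def projection_head(h, W_e, W_p):
    # P(h) = W_p . ReLU(BN(W_e . h)), applied row-wise to a batch.
    return np.maximum(batch_norm(h @ W_e.T), 0.0) @ W_p.T


def fuse(z_1d, z_2d, W_f):
    # Concatenate both path embeddings, then apply one linear + ReLU layer
    # as a simplified stand-in for the fusion MLP.
    return np.maximum(np.concatenate([z_1d, z_2d], axis=1) @ W_f.T, 0.0)


batch, d_feat, d_hid, d_proj, d_fused = 4, 16, 32, 8, 8   # hypothetical sizes
h_2d = rng.normal(size=(batch, d_feat))    # stand-in for ResNet-50 features
z_1d = rng.normal(size=(batch, d_proj))    # stand-in for pooled TCN output
W_e = rng.normal(size=(d_hid, d_feat))
W_p = rng.normal(size=(d_proj, d_hid))
W_f = rng.normal(size=(d_fused, 2 * d_proj))

z_2d = projection_head(h_2d, W_e, W_p)
z_fused = fuse(z_1d, z_2d, W_f)
print(z_fused.shape)  # (4, 8)
```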

Self-supervised contrastive learning is realized by constructing positive pairs through random augmentations (Gaussian noise, time shifting, SpecAugment) and contrasting them against hard negatives (other samples). This strategy ensures invariance to signal variability while capturing structural signal features essential for medical diagnostics.
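
A positive pair can be built by applying independent random augmentations to the same recording. The sketch below is illustrative: the noise level, shift range, mask widths, the circular form of the time shift, and all function names are assumptions rather than settings reported for SS-DPPN.

```python
import numpy as np


def add_gaussian_noise(x, rng, sigma=0.01):
    # Additive white noise on the raw waveform.
    return x + rng.normal(scale=sigma, size=x.shape)


def time_shift(x, rng, max_shift=100):
    # Circular shift by a random number of samples.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(x, shift)


def spec_mask(spec, rng, max_f=8, max_t=16):
    # SpecAugment-style masking: zero one frequency band and one time band.
    spec = spec.copy()
    n_f, n_t = spec.shape
    f = int(rng.integers(0, max_f + 1))
    f0 = int(rng.integers(0, n_f - f + 1))
    t = int(rng.integers(0, max_t + 1))
    t0 = int(rng.integers(0, n_t - t + 1))
    spec[f0:f0 + f, :] = 0.0
    spec[:, t0:t0 + t] = 0.0
    return spec


def make_positive_pair(x, rng):
    # Two independently augmented views of the same waveform.
    first = time_shift(add_gaussian_noise(x, rng), rng)
    second = time_shift(add_gaussian_noise(x, rng), rng)
    return first, second


rng = np.random.default_rng(0)
waveform = np.sin(np.linspace(0.0, 20.0, 2000))   # synthetic stand-in signal
view_a, view_b = make_positive_pair(waveform, rng)
masked = spec_mask(np.ones((64, 128)), rng)
```

Views of other recordings in the same batch then serve as negatives.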

2. Hybrid Contrastive Loss Formulation

SS-DPPN introduces a hybrid loss combining the normalized temperature-scaled cross-entropy (NT-Xent) and the Wasserstein distance. The NT-Xent loss is defined as: $L_\text{NT-Xent}(z_i, z_j) = -\log \left( \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)} \right)$ where $\text{sim}(\cdot, \cdot)$ is the cosine similarity, and $\tau$ is a temperature parameter.
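
The NT-Xent definition above can be written out for a batch of $N$ positive pairs as a minimal NumPy sketch. The temperature value `tau=0.5` and the averaging over all $2N$ views are assumptions; the source does not report either detail.

```python
import numpy as np


def nt_xent(z_a, z_b, tau=0.5):
    """Mean NT-Xent loss over 2N views; z_a[i] and z_b[i] form a positive pair."""
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # dot product = cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude k = i from the denominator
    n = len(z_a)
    partner = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Numerically stable log-sum-exp over all k != i.
    m = sim.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(2 * n), partner]))


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = nt_xent(z, z + 0.01 * rng.normal(size=z.shape))
loss_random = nt_xent(z, rng.normal(size=z.shape))
```

Nearly identical views should give a lower loss than unrelated embeddings, which is what the two example calls compare.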

The Wasserstein loss: $L_W(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|z_a - z_b\|^2 d\gamma(z_a, z_b)$ aligns the global embedding distributions of the dual-path outputs. The hybrid loss is then: $L_\text{Hybrid} = \alpha L_W + (1-\alpha)L_\text{NT-Xent}$ with $\alpha$ controlling the trade-off between local instance discrimination and global geometric alignment; in experimental evaluations, $\alpha = 0.3$ was optimal.
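
For two equal-sized sets of embeddings with uniform weights, the infimum over couplings is attained by a one-to-one matching, so the Wasserstein term can be evaluated exactly on tiny batches by brute force. The sketch below does this and then forms the weighted combination. It is illustrative only: the brute-force search is a stand-in chosen for clarity (it scales factorially), the function names are hypothetical, and the source does not say how the transport problem is solved in practice.

```python
import itertools

import numpy as np


def w2_squared_exact(a, b):
    """Exact squared 2-Wasserstein cost between two equal-sized, uniformly
    weighted point sets, found by trying every matching (tiny sets only)."""
    n = len(a)
    cost = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)  # ||a_i - b_j||^2
    rows = np.arange(n)
    return float(min(cost[rows, list(perm)].mean()
                     for perm in itertools.permutations(range(n))))


def hybrid_loss(l_w, l_nt_xent, alpha=0.3):
    # L_Hybrid = alpha * L_W + (1 - alpha) * L_NT-Xent
    return alpha * l_w + (1.0 - alpha) * l_nt_xent


rng = np.random.default_rng(0)
z_a = rng.normal(size=(5, 3))
z_b = z_a + np.array([1.0, 0.0, 0.0])   # pure translation by a unit vector
l_w = w2_squared_exact(z_a, z_b)         # translation cost equals ||shift||^2 = 1
```

With $\alpha = 0.3$, a Wasserstein term of 1.0 and an NT-Xent term of 2.0 combine to $0.3 \cdot 1.0 + 0.7 \cdot 2.0 = 1.7$.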

This hybrid criterion ensures SS-DPPN learns instance-level invariance and maintains a globally well-structured latent space for robust downstream metric-based classification.

3. Prototypical Network Metric-Learning Classifier

For downstream prediction, SS-DPPN integrates a prototypical network classifier. During inference, class prototypes are computed as mean embeddings of the support set for each class $c$: $p_c = \frac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\phi(x_i)$ A query sample $x_q$ is assigned probabilities via a softmax over negative squared Euclidean distances: $p_\phi(y=c | x_q) = \frac{ \exp(-\|f_\phi(x_q) - p_c\|^2) }{ \sum_{c'} \exp(-\|f_\phi(x_q) - p_{c'}\|^2) }$
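
The two formulas above translate into a few lines of NumPy. In this sketch the embeddings $f_\phi(x)$ are assumed to be precomputed, and the toy 2D points and function names are hypothetical.

```python
import numpy as np


def class_prototypes(support_emb, support_labels):
    # p_c = mean embedding of the support samples of class c.
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0)
                       for c in classes])
    return classes, protos


def predict_proba(query_emb, protos):
    # Softmax over negative squared Euclidean distances to each prototype.
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


support = np.array([[0.0, 0.0], [0.2, -0.2], [3.0, 3.0], [2.8, 3.2]])
labels = np.array([0, 0, 1, 1])
classes, protos = class_prototypes(support, labels)
proba = predict_proba(np.array([[0.0, 0.0], [3.0, 3.0]]), protos)
predicted = classes[proba.argmax(axis=1)]
```

Because the classifier has no trainable output layer, adding a class only requires computing one more prototype from its support samples.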

In the reported experiments, this metric-based classifier yields higher sensitivity and better-calibrated scores than standard parametric classifiers, particularly under clinically relevant class imbalance. Quantitative metrics (Brier score, Expected Calibration Error) corroborate the improved reliability of SS-DPPN predictions.
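
For reference, both calibration metrics can be computed from predicted probabilities as sketched below for the binary case. The ten equal-width bins and the binary formulation are assumptions; the source does not give the binning scheme used in the evaluation.

```python
import numpy as np


def brier_score(p_pos, y):
    # Mean squared gap between predicted positive-class probability and label.
    return float(np.mean((p_pos - y) ** 2))


def expected_calibration_error(p_pos, y, n_bins=10):
    # Confidence is the probability assigned to the predicted class.
    pred = (p_pos >= 0.5).astype(int)
    conf = np.where(pred == 1, p_pos, 1.0 - p_pos)
    correct = (pred == y).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # Weight each bin's |accuracy - confidence| gap by its sample share.
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)


y = np.array([1, 1, 0, 0])
overconfident = np.array([0.9, 0.9, 0.9, 0.9])
ece_over = expected_calibration_error(overconfident, y)  # 90% confident, 50% correct
```

In the toy example the model is 90% confident but only 50% accurate, giving an ECE of 0.4; lower values of both metrics indicate better calibration.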

4. Empirical Performance and Benchmark Results

SS-DPPN sets state-of-the-art results on four cardiac audio benchmarks, as evidenced by:

Dataset	Accuracy	F1-Score	Precision	Recall
CirCor DigiScope 2022	0.910	0.868	0.848	0.890
PhysioNet Challenge 2016	0.881	0.922	—	—
PASCAL CHSC	—	0.831	—	—
HLS-CMDS (simulated)	—	0.970	—	—

SS-DPPN surpasses supervised and prior self-supervised baselines in key diagnostic metrics. Recall and sensitivity, critical for clinical screening, are notably improved by the metric-learning approach. Calibration metrics confirm the model produces trustworthy probability estimates required in medical deployment scenarios.
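
As a quick consistency check on the CirCor DigiScope 2022 row, the F1-score can be recomputed as the harmonic mean of the reported precision and recall. This assumes all three values refer to the same class or averaging scheme, which the table does not state.

```python
def f1_from_precision_recall(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


f1 = f1_from_precision_recall(0.848, 0.890)
print(round(f1, 3))  # 0.868
```

The recomputed value matches the reported F1-score of 0.868.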

5. Data Efficiency

SS-DPPN's self-supervised pre-training achieves substantial label savings. Empirical results show that with only 25% labeled data, SS-DPPN matches or exceeds the accuracy of a fully supervised baseline trained on 75% labeled data, a three-fold reduction in annotation requirements. This is highly consequential for medical applications, where expert labeling is expensive and often limited.

6. Cross-Domain Transferability

Representations learned by SS-DPPN are highly transferable. With minor fine-tuning, cardiac-trained encoders achieve an F1-score of 0.897 and AUROC of 0.947 for lung sound classification. For heart rate estimation (ECG regression), the mean absolute error is approximately 0.9743 BPM (R² = 0.9705). This suggests SS-DPPN learns fundamental bioacoustic features beneficial for broad physiological signal modeling.
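
The two regression metrics quoted for heart rate estimation are defined as in the sketch below. The heart-rate values are made-up toy numbers used only to exercise the functions; they are not data from the paper.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    # Average absolute deviation, in the same unit as the target (BPM here).
    return float(np.mean(np.abs(y_true - y_pred)))


def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - residual variance / total variance.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


bpm_true = np.array([60.0, 72.0, 85.0, 90.0])   # hypothetical reference values
bpm_pred = np.array([61.0, 71.0, 86.0, 89.0])   # hypothetical predictions
mae = mean_absolute_error(bpm_true, bpm_pred)   # 1.0 BPM
r2 = r_squared(bpm_true, bpm_pred)
```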

7. Experimental Findings and Ablation Analysis

Systematic ablation verifies the necessity of each component. Replacing prototypical classification with a linear classifier reduces recall from 0.890 to 0.758. Calibration, reliability, and learning dynamics across datasets further reinforce the robustness of SS-DPPN. Embedding visualizations (t-SNE, UMAP) confirm natural clustering by class. Rigorous statistical validation (DeLong’s test, bootstrap sampling, McNemar’s test) indicates improvements are statistically significant.
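
Of the tests listed, McNemar's test is the simplest to illustrate: it compares two classifiers on the same samples using only the cases where they disagree. The sketch below implements the exact two-sided binomial form. Whether the paper used the exact or the chi-squared variant is not stated, and the counts in the example are hypothetical.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: samples only classifier A classified correctly.
    c: samples only classifier B classified correctly.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, disagreements split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


print(mcnemar_exact_p(0, 10))  # 0.001953125
```

With ten disagreements all favoring one classifier, the p-value is $2 / 2^{10} \approx 0.002$; an even split gives a p-value of 1.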

Summary and Significance

SS-DPPN defines a foundation model for self-supervised, dual-path learning in cardiac audio analysis. Its architecture combines complementary temporal and spectral encoders, a hybrid contrastive loss, and a prototypical metric-learning head, resulting in highly discriminative, calibrated, transferable representations. The framework demonstrates state-of-the-art accuracy, superior clinical sensitivity, reduced label dependence, and broad generalization across physiological signals, establishing a robust basis for future work in annotation-scarce medical AI (Muna et al., 12 Oct 2025).

References

Muna et al. (2025). SS-DPPN: A self-supervised dual-path foundation model for the generalizable cardiac audio representation.
