HeartCLAP: Audio Alignment in Biomed & Music
- HeartCLAP names two distinct models built on contrastive audio pretraining: a heart-rate estimator for phonocardiogram (PCG) signals and a music–text alignment model in the HeartMuLa suite.
- The biomedical variant pairs a CLAP audio encoder (12-layer ViT backbone) with a lightweight MLP regressor, reaching a best MAE of 1.88 bpm for heart-rate estimation.
- In music information retrieval, HeartCLAP’s dual-encoder contrastive learning substantially improves audio–text retrieval benchmarks over previous CLAP-style models.
HeartCLAP refers to two distinct models for processing and aligning audio data in specialized biomedical and music contexts. In biomedical signal analysis, HeartCLAP denotes a heart-rate estimation framework built on Contrastive Language–Audio Pretraining (CLAP) representations for phonocardiogram (PCG) signals (Nie et al., 27 May 2025). In music information retrieval, HeartCLAP designates the dual-encoder alignment model within the HeartMuLa foundation model suite, tailored for robust cross-modal mapping between music audio and textual descriptions (Yang et al., 15 Jan 2026). Both share core design principles regarding audio representation but address separate application domains.
1. HeartCLAP for Heart Rate Estimation from Auscultation
HeartCLAP, as introduced by Nie et al. (27 May 2025), is a system for estimating heart rate from PCG using hidden representations extracted from an in-house CLAP audio encoder. This system leverages mid-level transformer features to encode periodic events corresponding to heart sounds (S1, S2), outperforming conventional acoustic-feature and automatic speech recognition (ASR)-oriented foundation models.
CLAP Audio Encoder Architecture
- Input: 5-second mono PCG segments, resampled at 16 kHz, transformed to a 1×128×1024 log-Mel spectrogram (128 channels, 25 ms Hann window, 10 ms stride).
- Embedding Layer: Conv2D patch embedding (16×16 kernel, non-overlapping), yielding 512 audio tokens.
- Transformer Backbone: 12-layer, base Vision Transformer (ViT-B), 768 hidden units per layer, 12 self-attention heads, feed-forward size 3072, sinusoidal positional encoding.
- Layer-wise Output: The ℓ-th layer produces a token matrix H(ℓ) ∈ ℝ^(512×768), updated as H(ℓ) = TransformerLayer_ℓ(H(ℓ−1)) for ℓ = 1, …, 12.
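The patch embedding fixes the token count implied by the configuration above; a quick shape sanity check (illustrative Python, not the released code):

```python
# Shape bookkeeping for the CLAP audio encoder described above (illustrative).
# A 1 x 128 x 1024 log-Mel spectrogram, split into non-overlapping 16 x 16
# patches, yields (128 / 16) * (1024 / 16) = 8 * 64 = 512 audio tokens.
mel_bins, frames, patch, hidden = 128, 1024, 16, 768

tokens = (mel_bins // patch) * (frames // patch)   # audio tokens per 5 s segment
layer_output_shape = (tokens, hidden)              # each layer's token matrix
print(tokens, layer_output_shape)                  # 512 (512, 768)
```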
Heart Rate Estimation Pipeline
- Preprocessing: 2 kHz low-pass filtering; 5 s windows with 1 s stride; per-window mean–variance normalization; ground-truth heart-rate labels derived from S1-onset intervals [Nie et al., 2024].
- Representation Extraction: Obtain the token matrix H(ℓ) for each segment at each layer ℓ.
- Pooling and Regression: Global mean-pooling over the token dimension (yielding one 768-D vector per segment), followed by a lightweight MLP regressor that outputs the heart-rate estimate. The training loss is mean absolute error (MAE).
- Training and Evaluation: Six random 80/10/10 splits (no subject overlap), Adam optimizer, batch size 32, 50 epochs, early stopping on validation MAE.
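The windowing, normalization, and pooling steps above can be sketched in NumPy (a minimal illustration under the stated parameters; the encoder and MLP regressor are omitted, and all function names and dummy data are hypothetical):

```python
import numpy as np

SR = 16_000                     # sample rate after resampling
WIN, HOP = 5 * SR, 1 * SR       # 5 s windows, 1 s stride

def windows(pcg: np.ndarray) -> np.ndarray:
    """Slice a PCG recording into 5 s windows with 1 s stride and
    apply per-window mean-variance normalization."""
    out = []
    for start in range(0, len(pcg) - WIN + 1, HOP):
        w = pcg[start:start + WIN]
        out.append((w - w.mean()) / (w.std() + 1e-8))
    return np.stack(out)

def pool(H: np.ndarray) -> np.ndarray:
    """Global mean-pooling over the token dimension: (512, 768) -> (768,)."""
    return H.mean(axis=0)

sig = np.random.randn(10 * SR)          # dummy 10 s recording
segs = windows(sig)                     # (6, 80000): six overlapping windows
feat = pool(np.random.randn(512, 768))  # pooled embedding fed to the MLP
```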
Layer-wise Performance
| Layer | MAE (bpm), mean ± std |
|---|---|
| 1 | 2.36 ± 0.42 |
| 2 | 2.12 ± 0.39 |
| 3 | 1.94 ± 0.38 |
| 4 | 1.90 ± 0.37 |
| 5 | 1.89 ± 0.37 |
| 6 | 1.88 ± 0.37 |
| 7 | 1.92 ± 0.39 |
| ... | ... |
| 12 | 2.05 ± 0.44 |
Optimal performance occurs at the 6th layer, achieving MAE = 1.88 bpm (σ = 0.37).
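Layer selection in this setup reduces to picking the layer with the lowest validation MAE; a short sketch using the mean values from the table above (std omitted):

```python
import numpy as np

# Illustrative: MAE metric and best-layer selection from per-layer results.
def mae(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.abs(pred - target).mean())

layer_mae = {1: 2.36, 2: 2.12, 3: 1.94, 4: 1.90, 5: 1.89,
             6: 1.88, 7: 1.92, 12: 2.05}
best_layer = min(layer_mae, key=layer_mae.get)   # layer 6 at 1.88 bpm
```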
Comparison with Baseline and Foundation Models
| Model | Best MAE (bpm) | Std |
|---|---|---|
| Baseline (mel/MFCC) | 1.91 | 0.32 |
| HuBERT-Base (layer 2) | 2.26 | 0.26 |
| wav2vec2-Base (layer 3) | 2.41 | 0.48 |
| WavLM-Large (layer 5) | 2.02 | 0.25 |
| Whisper-Large (layer 4) | 2.27 | 0.37 |
| in-house CLAP (layer 6) | 1.88 | 0.37 |
Despite domain mismatch, the in-house CLAP encoder outperforms both acoustic-feature baselines and speech-trained FMs (Nie et al., 27 May 2025).
2. Model Interpretability and Insights
Attention visualization from HeartCLAP’s 6th ViT-B layer highlights cardiac cycle landmarks—S1 and S2 onsets—corresponding to inter-beat intervals critical for robust HR estimation. The system’s mid-level representations balance sensitivity to low-level acoustic features (energy, bandwidth) and higher-level context (beat periodicity). This suggests that CLAP, via its cross-modal contrastive objective, encodes non-linguistic, structured periodic patterns more efficiently than ASR-centric models.
3. Implementation Details and Pre-training
- Data splits: 6 random, 80% train / 10% validation / 10% test, ensuring no subject overlap.
- Augmentation: None beyond segment windowing and feature normalization.
- Optimization: Adam for the downstream regressor; AdamW with a one-cycle learning-rate schedule for CLAP pretraining.
- Contrastive audio–text pretraining: 30,000 steps, batch size 8192, projection to 512-D.
- No additional regularization (e.g., weight decay) during downstream regression.
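For illustration, a one-cycle learning-rate schedule over the 30,000 pretraining steps might look as follows (the peak rate and warm-up fraction here are hypothetical assumptions, not values from the paper):

```python
def one_cycle_lr(step: int, total_steps: int = 30_000,
                 peak: float = 1e-4, warmup_frac: float = 0.1) -> float:
    """Linear ramp-up to a peak learning rate, then linear decay to zero.
    (Peak value and warm-up fraction are illustrative assumptions.)"""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak * step / warmup
    return peak * (total_steps - step) / (total_steps - warmup)
```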
4. Extensions, Limitations, and Future Directions
Planned directions include:
- PCG-specialized fine-tuning: To reduce domain mismatch for CLAP representations.
- Hybrid embeddings: Fusion of MFCC and CLAP embeddings via concatenation or late fusion for enhanced regression accuracy.
- Broader vital sign tasks: Extension to respiration rate estimation, murmur/arrhythmia detection.
- Model compression: Pruning, quantization, and distillation for on-device use.
A plausible implication is that domain-adaptive pretraining or explicit integration of clinical acoustic features could further lower estimation error and increase robustness on diverse auscultatory datasets.
5. HeartCLAP in Music Information Retrieval
In the HeartMuLa ecosystem, HeartCLAP denotes a dual-encoder contrastive model for music–text alignment (Yang et al., 15 Jan 2026). Both the music and text encoders are initialized from MuQ-MuLan and projected via linear layers into a shared 1024-D space. Training employs a symmetric InfoNCE loss on mini-batches drawn from 1 million paired music clips with structured, unstructured, or natural-language annotations.
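The symmetric InfoNCE objective follows the standard CLIP/CLAP form; a NumPy sketch (the temperature value and batch handling here are assumptions, not HeartCLAP's exact settings):

```python
import numpy as np

def symmetric_info_nce(audio: np.ndarray, text: np.ndarray,
                       tau: float = 0.07) -> float:
    """audio, text: L2-normalized (N, D) embedding batches; row i of each
    modality forms a matched pair. Returns the mean of the audio->text
    and text->audio cross-entropy terms."""
    logits = audio @ text.T / tau                    # (N, N) similarities
    labels = np.arange(len(logits))

    def ce(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)         # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-logp[labels, labels].mean())

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
a /= np.linalg.norm(a, axis=1, keepdims=True)
loss_matched = symmetric_info_nce(a, a)              # diagonal pairs agree
loss_mismatched = symmetric_info_nce(a, np.roll(a, 1, axis=0))
```

A matched batch scores a lower loss than a mismatched one, which is exactly what the contrastive objective drives the two encoders toward.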
On the WikiMT-X music-text retrieval benchmark:
| Model | Text→Music R@1 | Music→Text R@1 | Text→Music mAP@10 | Music→Text mAP@10 |
|---|---|---|---|---|
| Laion-CLAP | 0.71 | 1.01 | 2.17 | 2.08 |
| MuQ-MuLan | 2.24 | 1.62 | 4.70 | 3.36 |
| HeartCLAP | 4.37 | 2.85 | 7.59 | 5.51 |
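The retrieval metrics in the table follow their standard definitions; a sketch assuming exactly one relevant item per query (the paired clip or caption):

```python
import numpy as np

def recall_at_1(sim: np.ndarray) -> float:
    """sim[i, j]: similarity of query i to candidate j; item i is the
    single relevant candidate for query i."""
    return float((sim.argmax(axis=1) == np.arange(len(sim))).mean())

def map_at_10(sim: np.ndarray) -> float:
    """Mean average precision over the top-10 ranked candidates; with one
    relevant item, AP reduces to 1 / rank (0 if outside the top 10)."""
    ranks = (-sim).argsort(axis=1)         # best candidate first
    ap = []
    for i, order in enumerate(ranks):
        hits = np.where(order[:10] == i)[0]
        ap.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return float(np.mean(ap))
```

Text→Music and Music→Text scores come from the same similarity matrix, queried row-wise and column-wise (i.e., transposed).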
HeartCLAP provides the audio–text alignment backbone enabling controlled music generation and fine-grained music retrieval, with substantial improvement over prior CLAP-style models.
6. Related Technologies and Applications
In clinical biomarker quantification, “HeartCLAP” should not be confused with the deep learning–enhanced chemiluminescence vertical-flow assay (CL-VFA) system (Han et al., 2024), although both apply neural-network methods to cardiovascular diagnostics. The CL-VFA system uses a signal-processing pipeline separate from CLAP/HeartCLAP, targeting point-of-care testing for cardiac troponin I, and illustrates the broader applicability of deep learning to cardiac healthcare.
7. Summary
HeartCLAP, in both its biomedical and music incarnations, applies representation learning, either as mid-level transformer features for vital-sign estimation or as a dual-encoder contrastive retriever for audio–text alignment, to domains requiring robust, contextually informed embeddings. Its architecture, pretraining objective, and empirical performance underscore its strengths in modeling periodic signal structure and cross-modal retrieval, with ongoing work aimed at greater domain specialization, feature integration, and deployment feasibility (Nie et al., 27 May 2025; Yang et al., 15 Jan 2026).