HeartCLAP: Audio Alignment in Biomed & Music
- HeartCLAP names two distinct models built on contrastive audio pretraining: a heart-rate estimator for phonocardiogram (PCG) signals and a music–text alignment model in the HeartMuLa suite.
- The biomedical variant pairs a CLAP audio encoder (12-layer ViT backbone) with a lightweight MLP regressor, reaching a best MAE of 1.88 bpm for heart-rate estimation.
- In music information retrieval, HeartCLAP’s dual-encoder contrastive learning substantially improves audio–text retrieval benchmarks over previous CLAP-style models.
HeartCLAP refers to two distinct models for processing and aligning audio data in specialized biomedical and music contexts. In biomedical signal analysis, HeartCLAP denotes a heart-rate estimation framework built on Contrastive Language–Audio Pretraining (CLAP) representations for phonocardiogram (PCG) signals (Nie et al., 27 May 2025). In music information retrieval, HeartCLAP designates the dual-encoder alignment model within the HeartMuLa foundation model suite, tailored for robust cross-modal mapping between music audio and textual descriptions (Yang et al., 15 Jan 2026). Both share core design principles regarding audio representation but address separate application domains.
1. HeartCLAP for Heart Rate Estimation from Auscultation
HeartCLAP, as introduced by Nie et al. (27 May 2025), is a system for estimating heart rate from PCG using hidden representations extracted from an in-house CLAP audio encoder. This system leverages mid-level transformer features to encode periodic events corresponding to heart sounds (S1, S2), outperforming conventional acoustic-feature and automatic speech recognition (ASR)-oriented foundation models.
CLAP Audio Encoder Architecture
- Input: 5-second mono PCG segments, resampled at 16 kHz, transformed to a 1×128×1024 log-Mel spectrogram (128 channels, 25 ms Hann window, 10 ms stride).
- Embedding Layer: Conv2D patch embedding (16×16 kernel, non-overlapping), yielding 512 audio tokens.
- Transformer Backbone: 12-layer, base Vision Transformer (ViT-B), 768 hidden units per layer, 12 self-attention heads, feed-forward size 3072, sinusoidal positional encoding.
- Layer-wise Output: The ℓ-th layer produces a token matrix H(ℓ) ∈ ℝ^(512×768), updated as H(ℓ) = TransformerLayer_ℓ(H(ℓ−1)) for ℓ = 1, …, 12.
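The patch embedding fixes the token count implied by the configuration above; a quick shape sanity check (illustrative Python, not the released code):

```python
# Shape bookkeeping for the CLAP audio encoder described above (illustrative).
# A 1 x 128 x 1024 log-Mel spectrogram, split into non-overlapping 16 x 16
# patches, yields (128 / 16) * (1024 / 16) = 8 * 64 = 512 audio tokens.
mel_bins, frames, patch, hidden = 128, 1024, 16, 768

tokens = (mel_bins // patch) * (frames // patch)   # audio tokens per 5 s segment
layer_output_shape = (tokens, hidden)              # each layer's token matrix
print(tokens, layer_output_shape)                  # 512 (512, 768)
```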
Heart Rate Estimation Pipeline
- Preprocessing: 2 kHz low-pass filtering; 5 s windows with 1 s stride; per-window mean–variance normalization; ground-truth heart-rate labels derived from S1-onset intervals [Nie et al., 2024].
- Representation Extraction: Obtain the token matrix H(ℓ) for each segment at each layer ℓ.
- Pooling and Regression: Global mean-pooling over the token dimension (yielding one 768-D vector per segment), followed by a lightweight MLP regressor that outputs the heart-rate estimate. The training loss is mean absolute error (MAE).
- Training and Evaluation: Six random 80/10/10 splits (no subject overlap), Adam optimizer, batch size 32, 50 epochs, early stopping on validation MAE.
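The windowing, normalization, and pooling steps above can be sketched in NumPy (a minimal illustration under the stated parameters; the encoder and MLP regressor are omitted, and all function names and dummy data are hypothetical):

```python
import numpy as np

SR = 16_000                     # sample rate after resampling
WIN, HOP = 5 * SR, 1 * SR       # 5 s windows, 1 s stride

def windows(pcg: np.ndarray) -> np.ndarray:
    """Slice a PCG recording into 5 s windows with 1 s stride and
    apply per-window mean-variance normalization."""
    out = []
    for start in range(0, len(pcg) - WIN + 1, HOP):
        w = pcg[start:start + WIN]
        out.append((w - w.mean()) / (w.std() + 1e-8))
    return np.stack(out)

def pool(H: np.ndarray) -> np.ndarray:
    """Global mean-pooling over the token dimension: (512, 768) -> (768,)."""
    return H.mean(axis=0)

sig = np.random.randn(10 * SR)          # dummy 10 s recording
segs = windows(sig)                     # (6, 80000): six overlapping windows
feat = pool(np.random.randn(512, 768))  # pooled embedding fed to the MLP
```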
Layer-wise Performance
| Layer | MAE (bpm), mean ± std |
|---|---|
| 1 | 2.36 ± 0.42 |
| 2 | 2.12 ± 0.39 |
| 3 | 1.94 ± 0.38 |
| 4 | 1.90 ± 0.37 |
| 5 | 1.89 ± 0.37 |
| 6 | 1.88 ± 0.37 |
| 7 | 1.92 ± 0.39 |
| ... | ... |
| 12 | 2.05 ± 0.44 |
Optimal performance occurs at the 6th layer, achieving MAE = 1.88 bpm (σ = 0.37).
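Layer selection in this setup reduces to picking the layer with the lowest validation MAE; a short sketch using the mean values from the table above (std omitted):

```python
import numpy as np

# Illustrative: MAE metric and best-layer selection from per-layer results.
def mae(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.abs(pred - target).mean())

layer_mae = {1: 2.36, 2: 2.12, 3: 1.94, 4: 1.90, 5: 1.89,
             6: 1.88, 7: 1.92, 12: 2.05}
best_layer = min(layer_mae, key=layer_mae.get)   # layer 6 at 1.88 bpm
```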
Comparison with Baseline and Foundation Models
| Model | Best MAE (bpm) | Std |
|---|---|---|
| Baseline (mel/MFCC) | 1.91 | 0.32 |
| HuBERT-Base (layer 2) | 2.26 | 0.26 |
| wav2vec2-Base (layer 3) | 2.41 | 0.48 |
| WavLM-Large (layer 5) | 2.02 | 0.25 |
| Whisper-Large (layer 4) | 2.27 | 0.37 |
| in-house CLAP (layer 6) | 1.88 | 0.37 |
Despite domain mismatch, the in-house CLAP encoder outperforms both acoustic-feature baselines and speech-trained FMs (Nie et al., 27 May 2025).
2. Model Interpretability and Insights
Attention visualization from HeartCLAP’s 6th ViT-B layer highlights cardiac cycle landmarks—S1 and S2 onsets—corresponding to inter-beat intervals critical for robust HR estimation. The system’s mid-level representations balance sensitivity to low-level acoustic features (energy, bandwidth) and higher-level context (beat periodicity). This suggests that CLAP, via its cross-modal contrastive objective, encodes non-linguistic, structured periodic patterns more efficiently than ASR-centric models.
3. Implementation Details and Pre-training
- Data splits: 6 random, 80% train / 10% validation / 10% test, ensuring no subject overlap.
- Augmentation: None beyond segment windowing and feature normalization.
- Optimization: Adam for the downstream regressor; AdamW with a one-cycle learning-rate schedule for CLAP pretraining.
- Contrastive audio–text pretraining: 30,000 steps, batch size 8192, projection to 512-D.
- No additional regularization (e.g., weight decay) during downstream regression.
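For illustration, a one-cycle learning-rate schedule over the 30,000 pretraining steps might look as follows (the peak rate and warm-up fraction here are hypothetical assumptions, not values from the paper):

```python
def one_cycle_lr(step: int, total_steps: int = 30_000,
                 peak: float = 1e-4, warmup_frac: float = 0.1) -> float:
    """Linear ramp-up to a peak learning rate, then linear decay to zero.
    (Peak value and warm-up fraction are illustrative assumptions.)"""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak * step / warmup
    return peak * (total_steps - step) / (total_steps - warmup)
```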
4. Extensions, Limitations, and Future Directions
Planned directions include:
- PCG-specialized fine-tuning: To reduce domain mismatch for CLAP representations.
- Hybrid embeddings: Fusion of MFCC and CLAP embeddings via concatenation or late fusion for enhanced regression accuracy.
- Broader vital sign tasks: Extension to respiration rate estimation, murmur/arrhythmia detection.
- Model compression: Pruning, quantization, and distillation for on-device use.
A plausible implication is that domain-adaptive pretraining or explicit integration of clinical acoustic features could further lower estimation error and increase robustness on diverse auscultatory datasets.
5. HeartCLAP in Music Information Retrieval
In the HeartMuLa ecosystem, HeartCLAP denotes a dual-encoder contrastive model for music–text alignment (Yang et al., 15 Jan 2026). Both the music and text encoders are initialized from MuQ-MuLan and projected via linear layers into a shared 1024-D space. Training employs a symmetric InfoNCE loss on mini-batches drawn from 1 million paired music clips with structured, unstructured, or natural-language annotations.
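The symmetric InfoNCE objective follows the standard CLIP/CLAP form; a NumPy sketch (the temperature value and batch handling here are assumptions, not HeartCLAP's exact settings):

```python
import numpy as np

def symmetric_info_nce(audio: np.ndarray, text: np.ndarray,
                       tau: float = 0.07) -> float:
    """audio, text: L2-normalized (N, D) embedding batches; row i of each
    modality forms a matched pair. Returns the mean of the audio->text
    and text->audio cross-entropy terms."""
    logits = audio @ text.T / tau                    # (N, N) similarities
    labels = np.arange(len(logits))

    def ce(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)         # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-logp[labels, labels].mean())

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
a /= np.linalg.norm(a, axis=1, keepdims=True)
loss_matched = symmetric_info_nce(a, a)              # diagonal pairs agree
loss_mismatched = symmetric_info_nce(a, np.roll(a, 1, axis=0))
```

A matched batch scores a lower loss than a mismatched one, which is exactly what the contrastive objective drives the two encoders toward.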
On the WikiMT-X music-text retrieval benchmark:
| Model | Text→Music R@1 | Music→Text R@1 | Text→Music mAP@10 | Music→Text mAP@10 |
|---|---|---|---|---|
| Laion-CLAP | 0.71 | 1.01 | 2.17 | 2.08 |
| MuQ-MuLan | 2.24 | 1.62 | 4.70 | 3.36 |
| HeartCLAP | 4.37 | 2.85 | 7.59 | 5.51 |
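The retrieval metrics in the table follow their standard definitions; a sketch assuming exactly one relevant item per query (the paired clip or caption):

```python
import numpy as np

def recall_at_1(sim: np.ndarray) -> float:
    """sim[i, j]: similarity of query i to candidate j; item i is the
    single relevant candidate for query i."""
    return float((sim.argmax(axis=1) == np.arange(len(sim))).mean())

def map_at_10(sim: np.ndarray) -> float:
    """Mean average precision over the top-10 ranked candidates; with one
    relevant item, AP reduces to 1 / rank (0 if outside the top 10)."""
    ranks = (-sim).argsort(axis=1)         # best candidate first
    ap = []
    for i, order in enumerate(ranks):
        hits = np.where(order[:10] == i)[0]
        ap.append(1.0 / (hits[0] + 1) if len(hits) else 0.0)
    return float(np.mean(ap))
```

Text→Music and Music→Text scores come from the same similarity matrix, queried row-wise and column-wise (i.e., transposed).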
HeartCLAP provides the audio–text alignment backbone enabling controlled music generation and fine-grained music retrieval, with substantial improvement over prior CLAP-style models.
6. Related Technologies and Applications
In clinical biomarker quantification, “HeartCLAP” should not be confused with the deep learning–enhanced chemiluminescence vertical-flow assay (CL-VFA) system (Han et al., 2024), although both apply neural-network methods to cardiovascular diagnostics. The CL-VFA system uses a signal-processing pipeline separate from CLAP/HeartCLAP, targeting point-of-care testing for cardiac troponin I, and illustrates the broader applicability of deep learning to cardiac healthcare.
7. Summary
HeartCLAP, in both its biomedical and music incarnations, applies representation learning, either as mid-level transformer features for vital-sign estimation or as a dual-encoder contrastive retriever for audio–text alignment, to domains requiring robust, contextually informed embeddings. Its architecture, pretraining objective, and empirical performance underscore its strengths in modeling periodic signal structure and cross-modal retrieval, with ongoing work aimed at greater domain specialization, feature integration, and deployment feasibility (Nie et al., 27 May 2025; Yang et al., 15 Jan 2026).