Cardiac-CLIP: Multi-Modal Cardiac Imaging

Updated 1 April 2026

Cardiac-CLIP is a suite of vision-language foundation models tailored for comprehensive cardiac imaging diagnostics using contrastive pretraining techniques.
It integrates multi-frame learning, EchoZoom, and multi-view fusion to capture key spatiotemporal features critical for LVEF quantification and anomaly detection.
By enabling few-shot adaptation and cross-modal retrieval, Cardiac-CLIP offers robust, generalizable performance across echocardiography, CT, and angiography modalities.

Cardiac-CLIP refers to a set of vision-language foundation models designed for comprehensive cardiac imaging interpretation using a contrastive pretraining paradigm. These architectures extend the CLIP (Contrastive Language–Image Pre-Training) framework to domains such as echocardiography, cardiac CT, and coronary angiography, leveraging video and volumetric encoders, multi-view modeling, and specialized attention mechanisms to address the spatiotemporal complexity of cardiac diagnostics.

1. Motivation and Clinical Context

Automated cardiac imaging interpretation is constrained by the need for large-scale annotated datasets and by models that lack domain-specific temporal sensitivity. Left ventricular ejection fraction (LVEF) is a key quantitative marker of systolic function, but traditional deep learning approaches to estimating it require substantial manual labeling and adapt poorly to new clinical environments with limited annotated data. Image-text CLIP variants such as EchoCLIP show that large vision-language models can capture clinical semantics, but these models typically neglect the temporal phenomena and localized anatomical details present across the cardiac cycle, which reduces clinical reliability in tasks such as LVEF quantification from echocardiogram videos (Du et al., 21 Sep 2025, Christensen et al., 2023, Takizawa et al., 26 Apr 2025).

Cardiac-CLIP frameworks are motivated by the need to:

Integrate spatial and temporal signals from video or volumetric cardiac imaging.
Enable robust few-shot adaptation across diverse hospital and device settings.
Provide generalizable foundations for both zero-shot and fine-tuned clinical predictions, including cross-modal retrieval and structured report mapping.

2. Model Architectures and Extensions

Cardiac-CLIP encompasses several modality-adapted CLIP frameworks:

Echocardiogram Video Models: CardiacCLIP ("Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner") employs a ConvNeXt-Base CLIP backbone for per-frame feature extraction, with additional modules for attention-based multi-frame learning (MFL) and multi-scale spatial encoding ("EchoZoom") (see Section 3).
Multi-View Echocardiography: A ViViT-based encoder processes up to five standard 2D echo views (LAX, SAX, 2CH, 3CH, 4CH), with BERT-based text encoding and mean-pooling for multi-view fusion (Takizawa et al., 26 Apr 2025).
3D Cardiac CT: Cardiac-CLIP adapts the framework to volumetric data using a 3D masked autoencoder (ViT-B/32) for self-supervised pretraining, followed by contrastive alignment with PubMedBERT-encoded structured reports and affinity-weighted supervision (Hu et al., 29 Jul 2025).
Coronary Angiography: DeepCORO-CLIP extends to multi-view angiogram clips, utilizing an MViT-v2 video encoder and BioMedBERT text encoder, with CLS-token attention pooling for study-level integration and transfer learning for prognostic endpoints (Harrabi et al., 18 Mar 2026).

A generalized architecture involves:

A modality-specific visual encoder (frame-based, video-based, or volumetric transformer).
A text encoder, either general-purpose or biomedical-domain (e.g., GPT-2, BERT, PubMedBERT).
Projection heads mapping both modalities to a shared latent space, optimized via contrastive objectives.
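The shared-space idea in the list above can be illustrated with a minimal sketch in plain Python. It assumes linear projection heads and cosine similarity for scoring; the function names (`project`, `l2_normalize`, `cosine_scores`) and the toy weights are hypothetical and are not taken from any of the cited models, which would use learned encoders and a tensor library.

```python
import math

def project(W, x):
    """Linear projection head: W is a (d_out x d_in) matrix, x a d_in vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def l2_normalize(v, eps=1e-12):
    """Scale v to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / (norm + eps) for c in v]

def cosine_scores(visual_feat, text_feats, W_vis, W_txt):
    """Score one visual feature against each candidate text feature."""
    z_v = l2_normalize(project(W_vis, visual_feat))
    scores = []
    for t in text_feats:
        z_t = l2_normalize(project(W_txt, t))
        scores.append(sum(a * b for a, b in zip(z_v, z_t)))
    return scores

# Toy inputs (illustrative values only, not from any cited model).
W_vis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-d visual feature -> 2-d shared space
W_txt = [[1.0, 0.0], [0.0, 1.0]]            # 2-d text feature -> 2-d shared space
scores = cosine_scores([2.0, 0.0, 5.0], [[1.0, 0.0], [0.0, 1.0]], W_vis, W_txt)
best = max(range(len(scores)), key=scores.__getitem__)  # index of best-matching text
```

Picking the highest-scoring text candidate in this way is the basic mechanism behind both zero-shot prediction from text prompts and cross-modal retrieval.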

3. Core Methodological Innovations

3.1 Multi-Frame Learning (MFL)

Standard CLIP video adaptations average frame features across the cardiac cycle, which can obscure diagnostic phases such as end-systole and end-diastole. MFL instead uses a multi-instance learning attention mechanism: for frame-level embeddings $f_i$ ($i = 1, \dots, B$, with $B$ the number of frames), a frame-scoring network with weights $W_1$, $W_2$, $W_3$ produces scalar scores $s_i$, which are softmax-normalized into attention weights $\alpha_i$:

$s_i = W_3 \tanh(W_2 \tanh(W_1 f_i))$

$\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^B \exp(s_j)}$

$F_\mathrm{agg} = \sum_{i=1}^B \alpha_i f_i$

$F_\mathrm{final} = W_\mathrm{proj} F_\mathrm{agg}$

This yields a video embedding that selectively emphasizes diagnostically salient frames, providing sensitivity to dynamic contractile features (Du et al., 21 Sep 2025).
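The four equations above can be traced in a small plain-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the weight shapes, the helper names (`matvec`, `frame_score`, `mfl_pool`), and the toy values are hypothetical, and $W_3$ is assumed to have a single row so that each score is a scalar.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def frame_score(f, W1, W2, W3):
    """s_i = W3 tanh(W2 tanh(W1 f_i)); W3 is assumed to be 1 x d, giving a scalar."""
    h1 = [math.tanh(v) for v in matvec(W1, f)]
    h2 = [math.tanh(v) for v in matvec(W2, h1)]
    return matvec(W3, h2)[0]

def softmax(scores):
    """Numerically stable softmax over the B frame scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mfl_pool(frames, W1, W2, W3, W_proj):
    """Return attention weights alpha and F_final = W_proj * sum_i alpha_i f_i."""
    alphas = softmax([frame_score(f, W1, W2, W3) for f in frames])
    dim = len(frames[0])
    f_agg = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    return alphas, matvec(W_proj, f_agg)

# Toy example: B = 3 frames with 2-d embeddings (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
alphas, video_emb = mfl_pool(frames, identity, identity, [[1.0, 1.0]], identity)
```

With these toy weights the third frame receives the largest attention weight, so the pooled embedding is pulled toward it rather than toward the plain frame average.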

3.2 EchoZoom Multi-Scale Feature Extraction

EchoZoom adds multi-scale spatial aggregation without adding model parameters. A 224×224 input is cropped into one central and four quadrant 112×112 patches. Each patch is processed by the visual encoder, and the resulting feature maps are averaged to combine global context with fine-grained border localization, which supports reliable tracking of ventricular morphology (Du et al., 21 Sep 2025).
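A minimal sketch of the cropping and averaging step, in plain Python on a nested-list image. The function names, the `encoder` callable, and the toy image are hypothetical stand-ins. Whether patches are resized before encoding, or whether the full frame is also encoded, is not specified here, so the sketch averages only the five patches.

```python
def echozoom_boxes(size=224, patch=112):
    """Return (top, left, bottom, right) boxes: one central, then four quadrants."""
    off = (size - patch) // 2
    center = (off, off, off + patch, off + patch)
    quadrants = [(t, l, t + patch, l + patch)
                 for t in (0, size - patch) for l in (0, size - patch)]
    return [center] + quadrants

def crop(image, box):
    """Slice a 2-D nested-list image to the given box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def echozoom_features(image, encoder, patch=112):
    """Encode each patch with the same encoder and average the feature vectors."""
    feats = [encoder(crop(image, b)) for b in echozoom_boxes(len(image), patch)]
    dim = len(feats[0])
    return [sum(f[d] for f in feats) / len(feats) for d in range(dim)]

# Toy stand-in encoder (assumption): mean intensity as a 1-d "feature".
def mean_encoder(p):
    return [sum(map(sum, p)) / (len(p) * len(p[0]))]

# Toy image: bright top-left quadrant, dark elsewhere.
image = [[1.0 if r < 112 and c < 112 else 0.0 for c in range(224)] for r in range(224)]
boxes = echozoom_boxes()
feat = echozoom_features(image, mean_encoder)
```

Because the same encoder is reused for every patch, the only extra cost is additional forward passes; no new weights are introduced.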

3.3 Multi-View Aggregation

For multi-view echocardiography and angiography, view-specific video embeddings $v^{(j)}$ are averaged to yield a composite study-level vector:

$\bar v = \frac{1}{N} \sum_{j=1}^N v^{(j)}$

Alternatively, DeepCORO-CLIP fuses multi-view embeddings with CLS-token-based multi-head self-attention, which allows more nuanced inter-view interactions than a plain average (Takizawa et al., 26 Apr 2025, Harrabi et al., 18 Mar 2026).
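The mean-pooling variant can be sketched in a few lines of plain Python. Representing a study as a dictionary keyed by view name and marking missing views with `None` are assumptions made for illustration; the attention-based alternative is not shown.

```python
def fuse_views(view_embeddings):
    """Mean-pool the available view embeddings; N may be fewer than five."""
    views = [v for v in view_embeddings.values() if v is not None]
    if not views:
        raise ValueError("at least one view embedding is required")
    dim = len(views[0])
    return [sum(v[d] for v in views) / len(views) for d in range(dim)]

# Toy study with three of five views available (illustrative values only).
study = {
    "LAX": [1.0, 0.0],
    "SAX": [0.0, 1.0],
    "2CH": None,
    "3CH": None,
    "4CH": [2.0, 2.0],
}
fused = fuse_views(study)
```

Averaging treats every available view equally, which is simple and robust to missing views but cannot weight a more informative view above the others.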

3.4 3D Masked Autoencoder and Affinity-Weighted Contrastive Alignment

In 3D CT, the visual encoder is pretrained to reconstruct masked volumetric patches. Structured reports are mapped to fixed pathology vectors, and semantic soft label matrices modulate the contrastive objective, aligning case similarity in both vision and text (Hu et al., 29 Jul 2025).
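One way to read "semantic soft labels modulate the contrastive objective" is sketched below in plain Python. This is an assumption-laden illustration, not the published formulation: it assumes affinities are the clipped cosine similarity between binary pathology vectors, that each row is normalized into a soft target distribution, and that the loss is a soft-target cross-entropy applied in both directions. All function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def soft_targets(pathology_vectors):
    """Row-normalized affinity matrix; each case has affinity 1 with itself."""
    n = len(pathology_vectors)
    rows = []
    for i in range(n):
        aff = [1.0 if i == j
               else max(cosine(pathology_vectors[i], pathology_vectors[j]), 0.0)
               for j in range(n)]
        total = sum(aff)
        rows.append([a / total for a in aff])
    return rows

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between soft target rows and softmax(logits) rows."""
    loss = 0.0
    for row, t in zip(logits, targets):
        lp = log_softmax(row)
        loss -= sum(ti * li for ti, li in zip(t, lp))
    return loss / len(logits)

def affinity_weighted_loss(sim, pathology_vectors):
    """Average the image->text and text->image soft-target losses."""
    targets = soft_targets(pathology_vectors)
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (soft_cross_entropy(sim, targets) + soft_cross_entropy(sim_t, targets))

# Toy batch: cases 0 and 1 share the same findings, case 2 differs.
labels = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
targets = soft_targets(labels)
sim_good = [[5.0, 5.0, 0.0], [5.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
sim_bad = [[0.0, 0.0, 5.0], [0.0, 0.0, 5.0], [5.0, 5.0, 0.0]]
loss_good = affinity_weighted_loss(sim_good, labels)
loss_bad = affinity_weighted_loss(sim_bad, labels)
```

In this toy batch, a similarity matrix that scores the two same-pathology cases equally is rewarded rather than penalized, which is the intended difference from a hard one-to-one contrastive target.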

4. Training Objectives, Loss Functions, and Evaluation

Across Cardiac-CLIP variants, the primary objective is a symmetric contrastive loss aligning video (or volume) and text in a shared space. Specific adaptations for clinical tasks include:

Ordinal regression loss for LVEF quantification: $L_\mathrm{OR} = L_\mathrm{CE} + L_\mathrm{MAE}$, where $L_\mathrm{CE}$ is categorical cross-entropy over LVEF bins and $L_\mathrm{MAE}$ is a regression refinement (Du et al., 21 Sep 2025).
InfoNCE for contrastive pretraining: Both video→text and text→video losses are computed with a learnable temperature parameter.
Affinity-weighted loss in CT: Supervision is weighted by semantic similarity between pathology vectors.
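The first two objectives in the list above can be sketched in plain Python. The sketch makes several assumptions that the text does not confirm: the temperature is parameterized by its logarithm, the LVEF prediction used for $L_\mathrm{MAE}$ is the probability-weighted mean of bin centers, and the two terms are summed without weighting. Function names and toy values are hypothetical.

```python
import math

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def info_nce(video_emb, text_emb, log_temp):
    """Symmetric InfoNCE over paired, L2-normalized embeddings (pair i matches i)."""
    scale = math.exp(-log_temp)  # 1 / temperature; log_temp would be learned
    n = len(video_emb)
    sim = [[scale * sum(a * b for a, b in zip(v, t)) for t in text_emb]
           for v in video_emb]
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2v = -sum(log_softmax([sim[j][i] for j in range(n)])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)

def lvef_ordinal_loss(logits, target_lvef, bin_centers):
    """L_OR = L_CE + L_MAE, with the prediction taken as the expected bin center."""
    lp = log_softmax(logits)
    probs = [math.exp(x) for x in lp]
    target_bin = min(range(len(bin_centers)),
                     key=lambda k: abs(bin_centers[k] - target_lvef))
    l_ce = -lp[target_bin]
    pred = sum(p * c for p, c in zip(probs, bin_centers))
    l_mae = abs(pred - target_lvef)
    return l_ce + l_mae, pred

# Toy contrastive batch: matched pairs vs. swapped pairs (illustrative values).
video = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(video, [[1.0, 0.0], [0.0, 1.0]], math.log(0.07))
loss_swapped = info_nce(video, [[0.0, 1.0], [1.0, 0.0]], math.log(0.07))

# Toy LVEF example: four bins, logits strongly favoring the 60% bin.
loss_or, pred_lvef = lvef_ordinal_loss([0.0, 0.0, 10.0, 0.0], 60.0,
                                       [20.0, 40.0, 60.0, 80.0])
```

A lower temperature sharpens the softmax over the batch, so matched pairs are separated more aggressively from the in-batch negatives.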

Evaluation metrics include:

MAE/RMSE for continuous outcomes (e.g., LVEF).
AUROC, Recall@K, Precision@K for classification and retrieval tasks.
Domain-specific endpoints, such as coronary lesion quantification, cross-modal embedding clustering, prognostic prediction (MACE AUROC), and fine-grained grading (e.g., coronary calcium) (Hu et al., 29 Jul 2025, Harrabi et al., 18 Mar 2026).
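For reference, the regression and retrieval metrics named above can be computed as in the following plain-Python sketch. It assumes that query i is paired with candidate i in the similarity matrix; the function names and toy numbers are illustrative and are not results from the cited studies.

```python
import math

def mae(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def rmse(pred, target):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def recall_at_k(sim, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sim)

# Toy LVEF predictions and targets, in percent (illustrative values only).
pred = [55.0, 60.0, 35.0]
target = [50.0, 60.0, 40.0]
lvef_mae = mae(pred, target)
lvef_rmse = rmse(pred, target)

# Toy query-by-candidate similarity matrix for retrieval.
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.3, 0.8],
       [0.1, 0.7, 0.4]]
r_at_1 = recall_at_k(sim, 1)
r_at_2 = recall_at_k(sim, 2)
```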

5. Experimental Results and Performance Comparison

Key performance highlights (echo, CT, and angiography domains):

Cardiac-CLIP Variant	Modality	Core Task	LVEF MAE (%)	AUROC / Retrieval
CardiacCLIP (MFL+EchoZoom)	Echo (EchoNet-Dynamic)	LVEF (few-shot, S=1)	7.25	n/a
EchoCLIP (image, external)	Echo	LVEF (zero-shot)	7.1	n/a
MultiView Video CLIP	Echo (MultiView, 5 views)	Report retrieval (MCMRR/R@10, V→R)	n/a	595 / 10.9%
Cardiac-CLIP (3D CT)	CT (12 centers, CT-RATE)	Abnormality Classif. (zero-shot/fine-tune)	n/a	0.84 / 0.92
Cardiac-CLIP (CT)	CT (CT-RATE)	Image→Text Retrieval (R@5)	n/a	0.62
DeepCORO-CLIP	Angiography	Stenosis AUROC (internal/external)	n/a	0.888 / 0.890
DeepCORO-CLIP	Angiography	LVEF (regression MAE)	7.3	n/a

In few-shot echo LVEF prediction, CardiacCLIP reduces MAE by 2.07 relative to EchoNet in the 1-shot setting (p < 0.01). In multi-view video echo report retrieval, the mean cross-modal retrieval rank nearly halves compared with single-frame CLIP baselines. In 3D CT, AUROC for multi-site abnormality classification reaches 0.84 zero-shot and 0.92 after fine-tuning, outperforming prior models by ≥0.07 AUROC (Hu et al., 29 Jul 2025). DeepCORO-CLIP outperforms clinical reports for QCA-matched stenosis estimation and achieves strong LVEF regression despite the cross-modality setting, in which the ground-truth LVEF comes from transthoracic echocardiography (TTE) while the input is coronary angiography (CAG) video (Harrabi et al., 18 Mar 2026).

6. Clinical Applicability, Limitations, and Prospects

Cardiac-CLIP architectures demonstrate clinical value in real-world deployment scenarios:

Few-shot adaptation: Robust learning from sparse annotations enables cross-site/hospital deployment with minimal retraining (Du et al., 21 Sep 2025).
Rapid inference: CardiacCLIP supports real-time LVEF prediction; DeepCORO-CLIP achieves PACS round-trips in ~4.2 seconds per angiography study (Harrabi et al., 18 Mar 2026).
Cross-modal retrieval: Supports image-to-text and text-to-image queries for reporting and retrospective analysis (Takizawa et al., 26 Apr 2025, Hu et al., 29 Jul 2025).

Limitations include potential domain shifts in ultrasound/video textures, lack of explicit supervision for dynamic cardiac phases (in MFL), and current focus on pre-defined pathology/attribute sets. Several variants lack end-to-end segmentation or region-specific attention, and some (notably 3D CT) are limited by dataset availability and privacy constraints (Hu et al., 29 Jul 2025).

Prospective directions involve:

Augmenting temporal modeling using transformers over MFL heads.
Incorporating segmentation or CLIP-SAM hybrids for region-guided attention.
Extending to federated/multi-center learning and integrating structured EHR data.
Generalizing to additional modalities such as cardiac MRI or PET/OCT, as indicated by DeepCORO-CLIP's blueprint for modular extension (Harrabi et al., 18 Mar 2026).

Cardiac-CLIP builds on and generalizes principles from EchoCLIP (single-frame echo CLIP) and EchoCLIP-R (long-form report modeling), extending the contrastive CLIP paradigm to the temporal and spatial complexity of cardiac imaging. Its use of video transformers, semantic soft-label affinity, and attention-based multi-instance pooling aligns with concurrent developments in general medical vision-language pretraining and with prior work on multi-modality, multi-view fusion, and contrastive representation learning for clinical applications (Christensen et al., 2023, Takizawa et al., 26 Apr 2025, Hu et al., 29 Jul 2025, Harrabi et al., 18 Mar 2026). The design choices in CLIP adaptation (video vs. frame encoding, multi-view pooling, structured prompt engineering, and hybrid ordinal-continuous regression) directly affect transferability, data efficiency, and generalization, which makes Cardiac-CLIP an extensible foundation for automated, scalable cardiac imaging interpretation.