FetalCLIP: Fetal Ultrasound V-L Model

Updated 4 July 2026

FetalCLIP is a visual-language foundation model for fetal ultrasound that learns semantically rich anatomical representations from paired images and clinician annotations.
It uses a dual-encoder CLIP design with a ViT-L image encoder and a transformer text encoder, optimized with a symmetric contrastive loss on more than 210,000 image-text pairs.
The model transfers to view classification, gestational age estimation, segmentation, and fetal CHD detection, with reported gains over general-purpose and medical CLIP baselines.

FetalCLIP is primarily a visual-language foundation model for fetal ultrasound image analysis that follows a CLIP dual-encoder design and is pretrained on paired fetal ultrasound images and text to learn a shared embedding space for fetal anatomical semantics (Maani et al., 20 Feb 2025). In the cited literature, the label also appears in a broader sense for fetal-focused CLIP instantiations, including a CLIP-driven zero-shot fetal head localizer inside the SaLIP cascade, fetal brain histology representation learning in CytoCLIP, and region-aware fetal ultrasound pretraining in SonoCLIP (Aleem et al., 2024, Ta et al., 18 Jan 2026, Su et al., 28 Jun 2026). The primary referent, however, is the fetal ultrasound foundation model introduced in 2025, whose stated aim is to provide universal fetal sonography representations for classification, gestational age estimation, congenital heart defect detection, and structure segmentation (Maani et al., 20 Feb 2025).

1. Conceptual basis and problem setting

FetalCLIP was proposed to address several difficulties that are recurrent in fetal ultrasound: speckle noise, operator-dependent acquisition, large anatomical variability across standard scanning planes and developmental stages, subtle fine-grained structures that require expert knowledge for interpretation, and the scarcity of high-quality paired multimodal data in obstetric ultrasound (Maani et al., 20 Feb 2025). The model is explicitly designed to learn universal representations tailored to fetal sonography, semantically grounded through text and intended to transfer across heterogeneous downstream tasks without requiring large task-specific datasets.

The underlying premise is that contrastive language-image learning can encode plane-level, structure-level, and developmental cues that are difficult to recover from image-only pretraining. In FetalCLIP, this is operationalized by aligning fetal ultrasound frames with captions derived from clinician annotations, gestational age, pixel spacing, and curated textbook descriptions. A plausible implication is that the textual pathway is not merely auxiliary metadata conditioning but an organizing prior for anatomical granularity, especially for cardiac and brain subviews.

Later work positions FetalCLIP as a domain-specialized alternative to general CLIP and medical VLMs such as BiomedCLIP and UniMed-CLIP, and as one of the core ultrasound foundation models in fetal plane classification benchmarks (Barrientos et al., 27 May 2026). That comparative role is important because subsequent papers use FetalCLIP both as a pretrained encoder and as a reference point for newer fetal ultrasound foundation models.

2. Pretraining corpus and data curation

The original pretraining corpus comprises 210,035 paired image-text examples: 207,943 fetal ultrasound images from routine clinical prenatal scans at Corniche Hospital plus 2,092 image-caption pairs extracted from a fetal ultrasound textbook centered on the fetal heart (Maani et al., 20 Feb 2025). The clinical cohort spans 6,493 patients with mean gestational age $148 \pm 16$ days, and 50% of images fall between 20 weeks 0 days and 21 weeks 6 days. Imaging is 2D B-mode.

Clinician annotations embedded in the images were OCR'd and standardized into 64 keywords predominantly covering fetal anatomical structures, including abdomen, brain, femur, heart, cardiac subviews such as LVOT, RVOT, 4-CH, and 3VV/3VT, and brain subviews such as thalamic, cerebellum, and ventricular (Maani et al., 20 Feb 2025). The clinical data were organized into three subgroups: 88,045 standard-view images over 12 standard views, 73,972 diverse multi-label images, and 79,757 initially unlabeled images. Confident-learning–based cleaning removed 984 samples from the standard-view subset, leaving 87,061 images, while pseudo-labeling retained 46,910 initially unlabeled images at $>90\%$ confidence.

A notable part of the curation pipeline is explicit leakage reduction. The ultrasound fan region was isolated via contouring, and colored overlays, including embedded text, were removed via Fast Marching Method inpainting (Maani et al., 20 Feb 2025). Absent report-level text for the clinical scans, GPT-4o generated five caption templates per unique set of clinician labels, integrating pixel spacing and gestational age; textbook figures were split into 2,092 subfigures and their captions refined with GPT-4o to be self-contained and to remove visual references. Because the textbook subset was 100 $\times$ smaller, it was upsampled 10 $\times$ and sharded to avoid duplicates within a batch in contrastive pretraining.
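The upsampling-and-sharding step can be illustrated with a minimal sketch. It assumes one possible reading of the description: each of the 10 copies of a textbook pair goes into a different shard, and batches are drawn from a single shard, so no batch can hold the same textbook pair twice. The function names `build_shards` and `batches` and the toy identifiers are hypothetical and are not taken from the paper.

```python
import random


def build_shards(clinical_ids, textbook_ids, upsample=10, seed=0):
    """Spread `upsample` copies of each textbook pair over `upsample` shards.

    Every shard holds each textbook pair exactly once, so a batch drawn
    from a single shard cannot contain a duplicated textbook pair.
    """
    rng = random.Random(seed)
    clinical = list(clinical_ids)
    rng.shuffle(clinical)
    shards = []
    for k in range(upsample):
        # One disjoint slice of the clinical data plus one full textbook copy.
        shard = clinical[k::upsample] + list(textbook_ids)
        rng.shuffle(shard)
        shards.append(shard)
    return shards


def batches(shard, batch_size):
    """Yield consecutive batches taken from one shard only."""
    for start in range(0, len(shard), batch_size):
        yield shard[start:start + batch_size]


# Toy data: 100 clinical pairs and 5 textbook pairs (illustrative sizes).
shards = build_shards([f"c{i}" for i in range(100)], [f"t{i}" for i in range(5)])
```

With this layout the textbook subset contributes ten times per pass over all shards, while within-batch duplicates, which would act as false negatives in a contrastive loss, are excluded by construction.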

Subsequent papers refer to the teacher corpus using different counts. A benchmark paper describes FetalCLIP as pretrained on 207,943 paired fetal ultrasound images and captions (Barrientos et al., 27 May 2026), while MobileFetalCLIP describes the teacher as trained on 246,349 fetal ultrasound image-caption pairs encompassing routine second-trimester scans, LLM-generated captions, and expert-annotated textbook material (Saeed et al., 5 Mar 2026). This suggests that later papers reference different accounting conventions or later curation snapshots rather than a single immutable corpus definition.

3. Architecture and pretraining objective

FetalCLIP follows the CLIP dual-encoder pattern with an image encoder and a text encoder projecting into a joint embedding space (Maani et al., 20 Feb 2025). The vision encoder is a ViT-L with 224 $\times$ 224 input, 14 $\times$ 14 patches, and 24 transformer layers. The text encoder is a transformer with 12 layers and a maximum input length of 117 tokens, using Byte-Pair Encoding tokenization. Both encoders map inputs to a 768-dimensional shared space via learned projections.

No new ultrasound-specific modules are introduced in the encoders; domain adaptation arises from fetal-specific paired data, ultrasound-aware textual supervision, and preprocessing to remove overlays (Maani et al., 20 Feb 2025). The pretraining objective is CLIP-style symmetric contrastive learning. The paper does not print the loss explicitly; for clarity, it can be written as the standard symmetric InfoNCE objective over a batch of $N$ pairs, where $z^{I}_{i}$ and $z^{T}_{i}$ are the projected image and text embeddings of pair $i$ and $\tau$ is the temperature:

$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\mathrm{cos}(z^{I}_{i}, z^{T}_{i})/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{cos}(z^{I}_{i}, z^{T}_{j})/\tau)} + \log \frac{\exp(\mathrm{cos}(z^{T}_{i}, z^{I}_{i})/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{cos}(z^{T}_{i}, z^{I}_{j})/\tau)} \right].$
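The objective above can be made concrete with a small dependency-free sketch. It follows the formula term by term, summing the image-to-text and text-to-image log-softmax terms and dividing by $N$. The temperature default of 0.07 is an illustrative assumption, since the value used for FetalCLIP is not reported, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def symmetric_info_nce(image_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss over N matched (image, text) embeddings.

    tau=0.07 is an assumed value for illustration only.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    # sim[i][j] = cos(z_i^I, z_j^T) / tau
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) / tau for j in range(n)]
           for i in range(n)]

    def diag_log_softmax_sum(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += row[i] - lse
        return total

    image_to_text = diag_log_softmax_sum(sim)
    text_to_image = diag_log_softmax_sum([list(col) for col in zip(*sim)])
    return -(image_to_text + text_to_image) / n


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is near zero when each image matches its own caption and large when the captions are swapped, which is the behavior the objective is meant to enforce.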

Training uses random rotation $\theta_{\mathrm{rotation}} \in [-7^\circ, 7^\circ]$, translation $\theta_{\mathrm{translation}} \in [-0.05, 0.05]$, and color jitter for brightness, contrast, and saturation in $[0.85, 1.15]$, together with weight decay (Maani et al., 20 Feb 2025). Initializing from a general medical CLIP, ViT-L/14 "pmc_vit_l_14", improves zero-shot F1 on the standard planes. Optimization details reported in the paper include the learning rate, 2,000 warmup steps, a cosine schedule, 20 epochs, mixed precision, 4 $\times$ RTX A6000 GPUs, and a batch size of 140 per GPU. The optimizer choice and temperature $\tau$ are not reported.
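The stated augmentation ranges and schedule can be collected into a short sketch. Several points are assumptions made for illustration: the warmup is taken to be linear, the cosine decay is taken to end at zero, translation is treated as a per-axis fraction of image size, and the base learning rate is left as a caller-supplied parameter. The names `sample_augmentation` and `lr_at` are hypothetical.

```python
import math
import random

WARMUP_STEPS = 2000      # stated in the text
GLOBAL_BATCH = 4 * 140   # 4 GPUs x 140 samples per GPU, from the text


def sample_augmentation(rng):
    """Draw one set of augmentation parameters from the stated ranges."""
    return {
        "rotation_deg": rng.uniform(-7.0, 7.0),
        # Assumed: translation is a fraction of image size along each axis.
        "translate_frac": (rng.uniform(-0.05, 0.05), rng.uniform(-0.05, 0.05)),
        "brightness": rng.uniform(0.85, 1.15),
        "contrast": rng.uniform(0.85, 1.15),
        "saturation": rng.uniform(0.85, 1.15),
    }


def lr_at(step, total_steps, base_lr, warmup_steps=WARMUP_STEPS):
    """Linear warmup (assumed shape) followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


params = sample_augmentation(random.Random(0))
```

The narrow rotation and jitter ranges are consistent with a domain where orientation and intensity carry anatomical meaning, so aggressive augmentation could destroy the cues the captions describe.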

The architecture is intentionally conservative. Rather than adding explicit ultrasound adapters, prompt learners, or auxiliary objectives, FetalCLIP depends on fetal-domain paired supervision and downstream frozen-encoder adaptation. This design choice becomes important in later comparisons, because several follow-up systems retain the CLIP backbone idea but add temporal modeling, mask channels, LoRA adapters, or distillation.

4. Downstream transfer and empirical profile

The original evaluation emphasizes zero-shot and frozen-encoder transfer across view classification, gestational age estimation, video-based CHD detection, and fetal structure segmentation (Maani et al., 20 Feb 2025).

Task	Protocol	Reported metric
Zero-shot planes/subplanes	PlanesDB, prompt ensembling	Mean F1
Zero-shot gestational age	HC18, prompt matching from 14w0d to 40w0d	Validity rate
CHD video detection	Frozen encoder, 16-frame clips, private 4CH videos	AUROC
Frozen-encoder segmentation	Lightweight UNETR-style decoder	DSC for brain, abdomen, and four-chamber views
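Zero-shot classification with prompt ensembling, as listed in the first row, can be sketched as follows. The sketch assumes the common CLIP recipe in which the embeddings of several prompts per class are normalized, averaged into one class prototype, and compared with the image embedding by cosine similarity. The toy two-dimensional embeddings and the function names are hypothetical; real embeddings would come from the text and image encoders.

```python
import math


def normalize(v):
    """Return v scaled to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def ensemble_prototype(prompt_embs):
    """Average the normalized prompt embeddings of one class, then renormalize."""
    normed = [normalize(e) for e in prompt_embs]
    mean = [sum(col) / len(normed) for col in zip(*normed)]
    return normalize(mean)


def zero_shot_classify(image_emb, class_to_prompt_embs):
    """Pick the class whose prompt prototype is most similar to the image."""
    img = normalize(image_emb)
    scores = {
        name: sum(a * b for a, b in zip(img, ensemble_prototype(embs)))
        for name, embs in class_to_prompt_embs.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for encoder outputs.
toy_prompts = {
    "brain": [[1.0, 0.1], [0.9, 0.0]],
    "femur": [[0.0, 1.0], [0.1, 1.0]],
}
label, scores = zero_shot_classify([0.8, 0.2], toy_prompts)
```

Averaging over several phrasings reduces sensitivity to any single prompt wording, which matters when the text encoder was trained on templated captions.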

For zero-shot classification on PlanesDB, FetalCLIP achieves the highest average F1 among the compared models, outperforming SonoNet, UniMed-CLIP, BiomedCLIP, and CLIP (Maani et al., 20 Feb 2025). Prompt construction matters: using prompts grounded in FetalCLIP’s caption templates raises mean F1 across the five standard planes compared with generic CLIP-style prompts. For HC18 zero-shot gestational age estimation, the method infers gestational age via image-text similarity against five prompts per candidate GA from 14w0d to 40w0d, and a median-of-top-15 post-processing rule improves validity relative to taking only the top similarity. Validity is defined by whether the true head circumference lies within the WHO reference percentile range for the predicted GA.
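The gestational-age retrieval step can be sketched under explicit assumptions: candidate ages are enumerated daily from 14w0d (day 98) to 40w0d (day 280), the five prompt similarities per candidate are averaged, and "median-of-top-15" is read as the median age among the 15 best-scoring candidates. The function name and the toy similarity profile are hypothetical.

```python
from statistics import median

# 14w0d .. 40w0d; daily granularity is an assumption of this sketch.
GA_DAYS = range(14 * 7, 40 * 7 + 1)


def estimate_ga_days(sim_by_day, top_k=15):
    """Return (median of the top-k candidate days, top-1 candidate day).

    sim_by_day maps a candidate age in days to its list of prompt similarities.
    """
    mean_sim = {day: sum(s) / len(s) for day, s in sim_by_day.items()}
    ranked = sorted(mean_sim, key=mean_sim.get, reverse=True)
    return median(ranked[:top_k]), ranked[0]


# Toy profile: similarity peaks near day 150, with one spurious spike at day 270.
toy = {day: [1.0 - abs(day - 150) / 100.0] * 5 for day in GA_DAYS}
toy[270] = [2.0] * 5
ga_median, ga_top1 = estimate_ga_days(toy)
```

In the toy profile the top-1 rule follows the spurious spike, while the median over the top 15 candidates stays near the main peak, which illustrates why a robust aggregate can improve validity.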

For CHD detection, the frozen image encoder is applied framewise to 418 four-chamber-view ultrasound videos, and clip-level feature concatenation yields a higher mean AUROC than feature averaging (Maani et al., 20 Feb 2025). For segmentation, a lightweight 2D UNETR-style decoder with depthwise deconvolutions and depthwise separable convolutions is trained on frozen encoder features, and average DSC is reported per view for brain, abdomen, and four-chamber, with gains over UniMed-CLIP on all three.
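The two clip-level aggregation strategies differ in a way that a short sketch makes visible. It assumes 16 frames per clip and 768-dimensional frame embeddings, as stated elsewhere in this article; the function names and toy values are hypothetical.

```python
import math


def clip_feature_concat(frame_embs):
    """Concatenate frame embeddings in temporal order (T * D values)."""
    return [x for frame in frame_embs for x in frame]


def clip_feature_mean(frame_embs):
    """Average frame embeddings over time (D values, order-invariant)."""
    n = len(frame_embs)
    return [sum(col) / n for col in zip(*frame_embs)]


# Toy clip: 16 frames, each a constant 768-dimensional vector.
frames = [[float(t)] * 768 for t in range(16)]
concat_feat = clip_feature_concat(frames)
mean_feat = clip_feature_mean(frames)
reversed_frames = frames[::-1]
```

Concatenation keeps frame order, so a downstream classifier can in principle use temporal structure such as cardiac phase, whereas averaging is invariant to frame order. That difference is one plausible reason for the reported gap, although the source does not state the cause.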

The representation analyses are consistent with a semantically structured embedding space. ScoreCAM visualizations focus on the stomach in abdomen views, the femur bone, heart circumference, cerebellar hemispheres, cavum septum pellucidum, head circumference regions, and choroid plexus, while UMAP shows clear clustering of five standard planes and separation of brain subviews (Maani et al., 20 Feb 2025). The paper also emphasizes data efficiency: with frozen encoders, training on only 32 patients can match or exceed UniMed-CLIP trained on the full dataset of 717 patients.

5. Derivatives, adaptations, and comparative systems

FetalCLIP rapidly became a backbone, baseline, and distillation teacher for fetal ultrasound research. In TPA, it is used as the frozen image-text backbone for fetal CHD classification in ultrasound videos, while a trainable temporal extractor aggregates 768-dimensional frame embeddings into a 256-dimensional video embedding aligned with class-specific prompts through a margin-hinge contrastive loss (Taratynova et al., 21 Aug 2025). On a private fetal CHD dataset, TPA is evaluated by macro F1 for CHD diagnosis, and its CVAESM module reduces both ECE and AECE in absolute terms. This positions FetalCLIP as a base representation model that can be extended from image-level to video-level reasoning without end-to-end retraining of the foundation encoders.

In low-resource image quality assessment, FetalCLIP is adapted with LoRA: the pretrained image encoder is frozen and Low-Rank Adaptation modules are inserted into the Transformer attention and feed-forward blocks (He et al., 30 Jul 2025). On ACOUSLIC-AI, the LoRA-adapted model is evaluated by accuracy and F1 with 2.4M trainable parameters, while a repurposed segmentation variant, FetalCLIPSEG, is evaluated by F1 after thresholding the predicted foreground area as a fraction of the image.
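The LoRA mechanism itself can be sketched for a single linear layer. This is the generic formulation, in which a frozen weight $W$ is augmented by a trainable low-rank update scaled by $\alpha / r$; it is not the configuration of the cited work, whose rank, scaling, and exact insertion points are not given here. The class name, rank, and alpha values are illustrative assumptions.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA pair: A (r x d_in) and B (d_out x r)."""
    return rank * (d_in + d_out)


class LoRALinear:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""

    def __init__(self, weight, rank, alpha, a_init):
        self.weight = weight                                  # frozen, d_out x d_in
        self.A = a_init                                       # trainable, rank x d_in
        self.B = [[0.0] * rank for _ in range(len(weight))]   # trainable, zero-initialized
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1, alpha=2.0, a_init=[[0.5, -0.5]])
before = layer.forward([2.0, 0.0])   # B is zero, so this equals W x
layer.B = [[1.0], [0.0]]             # stand-in for a learned update
after = layer.forward([2.0, 0.0])
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and the number of trainable parameters grows with the rank rather than with the full weight size. For a 768 $\times$ 768 projection at an assumed rank of 8, `lora_param_count` gives 12,288 added parameters.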

MobileFetalCLIP reframes FetalCLIP as a large teacher for mobile deployment constraints (Saeed et al., 5 Mar 2026). In that paper, the FetalCLIP teacher has a ViT-L/14 image encoder with 304M visual parameters and 427M total parameters, whereas the student uses a FastViT image encoder with 11.4M visual parameters. Distillation is performed with Selective Repulsive Knowledge Distillation, which keeps diagonal matched-pair alignment attractive while driving the off-diagonal term through a schedule that becomes negative. The student surpasses the teacher on zero-shot HC18 biometry validity and on brain sub-plane F1, while achieving 1.6 ms encoder latency on an iPhone 16 Pro.

Comparative benchmarking also clarifies where FetalCLIP transfers well and where it does not. In a 2026 benchmark of ultrasound foundation models for fetal plane classification, FetalCLIP is the best model in the linear probing regime, both on the Spanish in-domain test set and on an external African cohort (Barrientos et al., 27 May 2026). Under full fine-tuning, however, FetalCLIP's F1 degrades both in-domain and out-of-domain, while USFM becomes the best-performing model. The benchmark interprets this as evidence that FetalCLIP’s multimodal fetal ultrasound pretraining yields highly transferable frozen features, but that end-to-end updates on a relatively small labeled dataset can perturb or overfit those aligned features.

SonoCLIP is a more substantial architectural reworking of the fetal ultrasound CLIP idea (Su et al., 28 Jun 2026). It uses a CLIP ViT-L/14@336 backbone with a mask-channel branch fused by depthwise convolution, mixes global plane-level and region-level mask-text pairs, and trains with a sigmoid-based pairwise contrastive loss. On cross-center zero-shot evaluation over 24 fetal planes, Top-1/Top-5 accuracies are reported for FetalCLIP, for SonoCLIP without mask-guided inference, and for SonoCLIP with mask-guided inference. This comparison indicates that explicit region controllability can address limitations of global-only alignment in fetal ultrasound.
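A sigmoid-based pairwise contrastive loss can be sketched in its widely used SigLIP-style form, in which every image-text pair in the batch is scored as an independent binary decision. Whether SonoCLIP uses exactly this form is an assumption; the scale `t`, the bias `b`, and the function names are illustrative, and the inputs are assumed to be unit-normalized.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(image_emb, text_emb, t=10.0, b=-5.0):
    """Pairwise sigmoid loss: matched pairs are positives, all others negatives.

    t and b are assumed illustrative values, not reported settings.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * sum(x * y for x, y in zip(image_emb[i], text_emb[j])) + b
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n


aligned_loss = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Unlike the softmax-based InfoNCE objective, this form does not normalize over the batch, so each pair contributes independently. That property is convenient when global and region-level pairs are mixed in one batch.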

6. Broader usage, limitations, and scope expansion

The name “FetalCLIP” is also used more loosely for fetal-focused CLIP mechanisms beyond the original 2025 fetal ultrasound model. In SaLIP, the paper states that the CLIP retrieval stage effectively acts as a CLIP-based fetal ROI localizer: a zero-shot “FetalCLIP” that scores SAM-generated fetal ultrasound crops against visually descriptive text prompts for “fetal head” and selects the most semantically consistent region (Aleem et al., 2024). On HC18 fetal head segmentation, that cascade reports DSC and mIoU gains over un-prompted SAM, while remaining training- and fine-tuning-free.

CytoCLIP extends the fetal-focused CLIP concept beyond ultrasound into Nissl-stained fetal brain histology (Ta et al., 18 Jan 2026). It trains CLIP-derived models on fetal brains spanning 14–24 gestational weeks, with a low-resolution whole-region variant and a high-resolution tile variant. Headline performance is reported as F1 for whole-region classification and for tile classification, but cross-age and cross-plane generalization remain limited, with lower whole-region F1 across ages and across planes. This broadens the semantic scope of “FetalCLIP” from obstetric sonography to fetal developmental neuroanatomy.

The original FetalCLIP has several stated limitations (Maani et al., 20 Feb 2025). Most pretraining images come from routine second-trimester scans, and zero-shot gestational age estimation degrades at early and late gestation. The pretraining data contain limited explicit pathology labels beyond textbook cardiac cases, so zero-shot abnormality detection remains challenging. All encoders are trained and run at 224 $\times$ 224 because of compute constraints, the model is image-based rather than video-native, and biases across populations and gestational-age distributions are possible given the single-center clinical pretraining source. The private datasets used for several downstream experiments cannot be released due to privacy regulations.

Taken together, the literature presents FetalCLIP as both a specific fetal ultrasound foundation model and a methodological template for fetal-domain vision-language representation learning. Its main historical role is to establish that paired fetal ultrasound image-text pretraining can yield strong zero-shot classification, usable gestational-age retrieval, competitive frozen-encoder segmentation, and broad transfer with limited labeled data (Maani et al., 20 Feb 2025). Subsequent work either adapts its embeddings, compresses it for deployment, or modifies the CLIP recipe to address temporal modeling, region controllability, and domain shift.