
Foundation Models in Medical Imaging -- A Review and Outlook (2506.09095v3)

Published 10 Jun 2025 in eess.IV, cs.AI, and cs.CV

Abstract: Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.

Summary

  • The paper demonstrates that foundation models pre-trained via self-supervised learning significantly reduce the need for extensive labeled data in medical imaging.
  • It reviews key methodologies such as contrastive learning, masked image modeling, and parameter-efficient fine-tuning applied to CNN and Vision Transformer architectures.
  • The review outlines how these models enhance diagnostic accuracy across pathology, radiology, and ophthalmology while addressing technical and clinical implementation challenges.

Foundation Models (FMs) are revolutionizing medical image analysis by addressing key challenges like data scarcity and the need for task-specific models. Unlike traditional supervised learning methods that require extensive labeled data, FMs are pre-trained on large collections of unlabeled medical images to learn general-purpose visual features. These features can then be adapted to various downstream clinical tasks, often with significantly less labeled data than required by traditional methods. This review examines the development, application, and challenges of FMs across pathology, radiology, and ophthalmology.

The core technical concepts underpinning FMs in medical imaging involve large-scale pre-training, self-supervised learning (SSL), and effective adaptation strategies. Large-scale pre-training leverages vast datasets to train encoder architectures, typically Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), to extract rich, generalizable features. ViTs (2010.11929) have gained prominence due to their scalability and ability to capture long-range dependencies, although CNNs [resnet] remain effective, especially with limited data or when integrated into hybrid architectures.

Self-supervised learning is crucial because large labeled datasets are scarce in medical imaging. SSL uses pretext tasks that generate supervision signals directly from the data. These can be discriminative methods, such as contrastive learning (e.g., SimCLR (2002.05709), MoCo (1911.05722)) and self-distillation (e.g., DINO (2104.14294)), which learn to distinguish between different views of the data, or generative methods like Masked Image Modeling (MIM) (e.g., MAE (2111.06377), iBOT (2111.07832)), which reconstruct masked parts of the input. Multimodal SSL, using objectives like Image-Text Contrastive learning (ITC) as in CLIP (2103.00020), aligns representations from different modalities, enabling Vision-Language FMs (VLFMs).
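To make the contrastive objective concrete, here is a minimal PyTorch sketch of a SimCLR-style NT-Xent loss; the temperature value and tensor shapes are illustrative assumptions, not settings from any of the cited models.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """SimCLR-style contrastive loss for two augmented views of the same B images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                    # (2B, D)
    B = z1.size(0)
    sim = z @ z.t() / temperature                     # pairwise cosine similarities
    mask = torch.eye(2 * B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))        # a view is not its own positive
    # The positive for row i is its other view: i+B in the first half, i-B in the second.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Each image's positive is its other augmented view; the remaining 2B-2 embeddings in the batch serve as negatives.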

Once pre-trained, FMs are adapted to specific downstream tasks. The adaptation spectrum ranges from lightweight methods like prompt engineering (for VLMs) and linear probing (training a simple classifier on frozen embeddings) to more resource-intensive approaches. Adding task-specific heads on top of a frozen FM is common for classification or segmentation. Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA (2106.09685) and adapter tuning (1909.08059), update only a small subset of parameters, reducing computational costs. Full fine-tuning, updating all parameters, often yields the best task-specific performance but requires more data and computation.
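At the lightweight end of this spectrum, linear probing can be sketched in a few lines; the encoder, feature dimension, and class count below are placeholders for whichever FM and task are at hand.

```python
import torch
import torch.nn as nn

def build_linear_probe(encoder: nn.Module, feat_dim: int = 768, num_classes: int = 2):
    """Freeze a pre-trained FM encoder and attach a trainable linear head."""
    for p in encoder.parameters():
        p.requires_grad = False          # FM weights stay fixed
    encoder.eval()                       # freeze norm/dropout behaviour too
    head = nn.Linear(feat_dim, num_classes)

    def forward(x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():            # no gradients through the backbone
            feats = encoder(x)           # (B, feat_dim) embeddings
        return head(feats)

    return forward, head                 # optimize only head.parameters()
```

Only `head.parameters()` are passed to the optimizer, so adaptation costs a small fraction of full fine-tuning.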

Foundation Models in Pathology

Computational pathology relies heavily on analyzing gigapixel Whole Slide Images (WSIs). Due to their size, WSIs are often processed as smaller tiles in a two-stage pipeline: a pre-trained encoder extracts tile-level embeddings, followed by a task-specific head (often using Multi-Instance Learning for slide-level tasks). Early approaches used ImageNet pre-training [saillard2020predicting], but recent efforts focus on in-domain SSL on large pathology datasets.
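As an illustration of the second stage, the following is a minimal sketch of gated attention-based MIL pooling (ABMIL, Ilse et al., 2018), a common choice of slide-level head; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    """Gated attention-based MIL pooling as a slide-level head.

    Takes a bag of N tile embeddings (N, D) from a frozen encoder and
    returns slide-level logits; D, hidden size, and class count are placeholders.
    """
    def __init__(self, feat_dim: int = 1024, hidden: int = 256, num_classes: int = 2):
        super().__init__()
        self.V = nn.Linear(feat_dim, hidden)      # tanh branch
        self.U = nn.Linear(feat_dim, hidden)      # sigmoid gating branch
        self.w = nn.Linear(hidden, 1)             # scalar attention score per tile
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        scores = self.w(torch.tanh(self.V(tiles)) * torch.sigmoid(self.U(tiles)))  # (N, 1)
        attn = torch.softmax(scores, dim=0)       # attention over tiles in the bag
        slide_emb = (attn * tiles).sum(dim=0)     # weighted average, shape (D,)
        return self.classifier(slide_emb)
```

The attention weights also give a coarse per-tile relevance map, one of the few built-in interpretability signals in this pipeline.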

Tile-level pathology FMs have seen significant development using various SSL frameworks:

  • Contrastive Learning: SimCLR has been applied to large and diverse histopathology datasets, demonstrating improved embedding quality [ciga2022self]. REMEDIS [2023.09.04.23294952] combined supervised pre-training on natural images with SimCLR on medical data for enhanced robustness. Pathology-specific contrastive methods like cluster-guided contrastive learning (CCL) [wang2023retccl] and semantically relevant contrastive learning (SRCL) [wang2022transformer] address challenges posed by semantic similarities in pathology patches.
  • Self-distillation with DINO: The DINO framework has been used for ViT-based pathology FMs like HIPT (2206.02647), which captures hierarchical WSI structures, and models trained on massive proprietary datasets [campanella2023computational]. UNI (2308.15474) and Virchow (2309.07778) are prominent examples trained on large pan-cancer datasets using DINOv2 (2304.07193), showing state-of-the-art performance on various tasks. HistoEncoder (2411.11458) uses DINO on an efficient XCiT backbone.
  • Masked Image Modeling (MIM): Inspired by masked language modeling in NLP, MIM has become popular for pathology FMs. Phikon (2307.21232) used iBOT (2111.07832) on TCGA data and scaled to different ViT sizes. Subsequent models like UNI (2308.15474), Virchow (2309.07778), RudolfV (2401.04079), PLUTO (2405.07905), Hibou-L (2406.05074), H-optimus-0 [hoptimus0], Virchow2G (2408.00738), and Phikon-v2 (2409.09173) have adopted DINOv2, which pairs self-distillation with an iBOT-style masked-token objective, as the preferred framework, often scaling to billions of patches and millions of WSIs. Atlas (2501.05409) demonstrated that data diversity (stains, scanners, magnifications) can be more impactful than sheer data volume alone.

Vision-Language FMs (VLFMs) in pathology combine image and text data, often using cross-modal contrastive objectives. Models like MI-Zero [lu2023visual] and PLIP [huang2023visual] adapt the CLIP framework, aligning image patches with text descriptions from pathology reports or other sources. QUILTNET [ikezogwo2024quilt] and CONCH [lu2024visual] further scale VLFM training. Generative VLFMs like PathAsst (2305.15072), PA-LLaVA (2408.09530), PathChat (2312.07814), and Quilt-LLaVA [seyfioglu2024quilt] are instruction-tuned for multimodal dialogue and visual question answering.

Slide-level FMs aim to capture global WSI context, which is challenging due to gigapixel size. LongViT (2312.03558) uses LongNet (2307.02486) with dilated attention to process long patch sequences end-to-end. Recent models like Prov-GigaPath (2404.03512) and PRISM (2405.10254) combine tile-level encoders with LongNet or aggregation methods and Vision-Language alignment. THREADS (2501.16652) integrates molecular supervision.

The Segment Anything Model (SAM) (2304.02643), originally trained on natural images, has been explored for pathology segmentation [chauveau2023segment]. While out-of-the-box performance varies [deng2023segment], fine-tuning and adaptation methods [ranem2023exploring, zhang2023sam] show promise for specific tasks like nuclei or tumor bud segmentation.
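For reference, prompting SAM out of the box on a tile uses the standard `segment_anything` API; the checkpoint path, image, and click coordinates below are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM; the checkpoint path is a placeholder.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# `tile` stands in for an RGB uint8 patch cropped from a WSI.
tile = np.zeros((256, 256, 3), dtype=np.uint8)
predictor.set_image(tile)

# Prompt with a single foreground click (coordinates illustrative),
# e.g. on a nucleus; label 1 marks a positive (foreground) point.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[128, 128]]),
    point_labels=np.array([1]),
    multimask_output=True,   # return several candidate masks
)
best_mask = masks[scores.argmax()]   # pick the highest-scoring candidate
```

Fine-tuning approaches typically keep this promptable interface but adapt the image encoder or mask decoder to the stain and scale statistics of pathology data.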


| Model | Backbone | SSL | Pre-Training Data Size (Patches/WSIs) | Source | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|
| REMEDIS | ResNet-152 | SimCLR | 50M (29K) | TCGA | Yes | Yes |
| RetCCL | ResNet-50 | CCL | 15M (32K) | TCGA, PAIP | Yes | Yes |
| CTransPath | Swin Transformer | SRCL | 15M (32K) | TCGA, PAIP | Yes | Yes |
| HIPT | ViT-HIPT | DINO | 104M (11K) | TCGA | Yes | Yes |
| Campanella et al. | ViT-S | DINO | 3B (400K) | MSHS* | No | No |
| Lunit | ViT-S/8, ViT-S/16 | DINO | 33M (37K) | TCGA, TULIP* | Yes | Yes*/No* |
| Phikon | ViT-B/16 | iBOT | 43M (6K) | TCGA | Yes | Yes |
| UNI | ViT-L/16 | DINOv2 | 100M (100K) | Mass-100K* | Yes | No* |
| Virchow | ViT-H/14 | DINOv2 | 2B (1.5M) | MSKCC* | Yes | No |
| RudolfV | ViT-L/14 | DINOv2 | 1.2B (134K) | TCGA, Proprietary | No | No |
| PLUTO | ViT-S/8, ViT-S/16 | Modified DINOv2 | 195M (158K) | TCGA, PathAI* | No | No |
| Hibou-L | ViT-L/14 | DINOv2 | 1.2B (1.1M) | Proprietary | Yes | No |
| H-optimus-0 | ViT-G | DINOv2 | - (500K) | Proprietary | No | No |
| Virchow2G | ViT-G/14 | DINOv2 | 1.9B (3.1M) | MSKCC* | No | No |
| Phikon-v2 | ViT-L | DINOv2 | 456M (58.4K) | PANCAN-XL | Yes | Yes*/No* |
| Atlas | ViT-H/14 | RudolfV (DINOv2-based) | 520M (1.2M) | Proprietary | No | No |

*Proprietary or partially available data.

| Model | Type | Backbone | SSL | Pre-Training Data Size (Image-Text Pairs) | Source | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|
| PLIP | VLFM | ViT-B/32 + Transformer | ITC | 208K | OpenPath | Yes | Yes |
| MI-Zero | VLFM | CTransPath + GPT2-medium | ITC | - (33K) | Proprietary | Yes | No |
| QuiltNet | VLFM | ViT-B/32 + PubMedBERT | ITC | 34K | QUILT | Yes | Yes |
| CONCH | VLFM | ViT-B/16 + GPT2-medium | iBOT* + CoCa | 1.1M | PMC-Path, EDU | Yes | Yes |
| PathCLIP | VLFM | CLIP vision + Vicuna-13B text | ITC | 207K | PathCap | Yes | Yes |
| PA-LLaVA | VLFM | PLIP vision + Llama-3 text | ITC + ITM | 1.4M | PMV, PMC-OA, Quilt-1M | Yes | Yes |
| PathChat | VLFM | UNI + Llama-2 | CoCa | 100K | PMC-OA, Proprietary | Yes*/No* | Yes*/No* |
| Quilt-LLaVA | VLFM | QuiltNet vision + Vicuna text | ITC | 723K | Quilt-1M | Yes | Yes |

*iBOT is used for training their own image encoder.

| Model | Type | Backbone | SSL | Pre-training Data Size (WSIs) | Source | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|
| Giga-SSL | VFM | ResNet-18 | SimCLR | 12K | TCGA | Yes | Yes |
| LongViT | VFM | LongNet | DINO | 10K | TCGA | Yes | Yes |
| PRISM | VLFM | Virchow + BioGPT | CoCa | 590K | Proprietary | Yes | No |
| Prov-GigaPath | VLFM | ViT-G + PubMedBERT | DINOv2 + ITC | 170K | Prov-Path | Yes | No |
| mSTAR | VLFM | ViT-L | mSTAR | 26K | TCGA | No | No |
| KEEP | VLFM | UNI + PubMedBERT | Knowledge-enhanced ITC | 143K | OpenPath, Quilt-1M | Yes | Yes |
| CHIEF | VLFM | CTransPath + CLIP text | SRCL (v) + ITC (vl) | 61K | Proprietary | Yes | No |
| TITAN | VLFM | ViT-B + CONCHv1.5 | iBOT (v) + CoCa (vl) | 60K | GTEx | Yes | Yes |
| COBRA | VLFM | ABMIL + Mamba-2 | COBRA | 3K | mTCGA | Yes | Yes |
| THREADS | VFM | CONCHv1.5 (patch) + cGPT (gene) | Contrastive | 47K | MBTG-47K | No* | Yes*/No* |

*Authors train both an image encoder and a VL-alignment module. Partially available: TCGA and GTEx are open-source, but pre-training data from BWH and MGH are proprietary. Weights are planned to be released.

Foundation Models in Radiology

Radiology encompasses diverse modalities (X-ray, CT, MRI, US, PET), presenting challenges due to varied data formats (2D, 3D) and non-standardized reports. Early models focused on transformer architectures for specific tasks without large-scale SSL. Contrastive vision-language learning emerged as an early SSL approach, aligning images with text descriptions [huang_gloria_2021].

Generalist VLFMs aim to handle multiple medical imaging modalities, often using datasets from sources like PubMed Central. Models like PubMed-CLIP (2112.13906), PMC-CLIP (2303.07240), BiomedCLIP (2303.00915), and MEDVInT (2305.10415) leverage large image-text pairs for multimodal tasks. LLaVA-Med (2306.00890), Qilin-Med-VL (2310.17956), and Med-Flamingo (2308.01390) specialize in Medical Visual Question Answering (MedVQA).


| Model | Type | Imaging Domain | Backbone | SSL Objective | Data Size | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|
| PubMedCLIP | VLM | 2D Radiology* | ResNet50 + ViT-B/32 | ITC | 80K | Yes | Yes |
| UniMiSS | VM | CT (3D), X-ray | MiT | MIM | 15K | No | Yes |
| BiomedCLIP | VLM | Multimodal** | ViT-B/16 + PubMedBERT | ITC | 15M | Yes | Yes |
| PMC-CLIP | VLM | 2D Radiology* | ResNet50 + PubMedBERT | ITC + MLM | 1.65M | Yes | Yes |
| MEDVInT | VLM | Multimodal** | ResNet50 + Transformer | MLM | 227K | No | Yes |
| LLaVA-Med | VLM | Multimodal** | LLaVA | ITC | 60K | Yes | Yes |
| LVM-Med | VM | CT (2D), MRI, X-ray, US | ResNet-50 + ViT | Graph Matching | 1.3M | No | Yes |
| Med-Flamingo | VLM | Multimodal** | ViT/L-14 + LLaMA-7B | ITC | 1.6M | Yes | Yes |
| RadFM | VLM | Various Radiologies* | 3D ViT + MedLLaMA-13B | Generative ITC | 16M*** | Yes | Yes |
| Qilin-Med-VL | VLM | Multimodal** | ViT/L-14 + LLaMA-13B | ITC | 580K | Yes | Yes |
| SAT | VLM | Multimodal** | 3D U-Net + BioBERT | ITC | 302K | No | Yes |
| VISION-MAE | VM | Various Radiologies* | Swin-T | MAE | 2.5M | No | No |
| RadCLIP | VLM | CT (2D+3D), X-ray, MRI | ViT-L/14 | Contrastive | 1.2M | No | Yes |

*Wide range of 2D radiological imaging types. **Multimodal denotes a broad range of radiological imaging types as well as pathological image types. ***2D image-text pairs, comprising 15.5M 2D images and 500K 3D images.

Chest X-ray (CXR) is a subdomain with more FM development due to data availability and lower complexity. Models like MedCLIP (2210.10163) leverage unpaired data, while CheXzero (2212.00751) and CXR-CLIP (2307.07645) focus on high-quality paired data using contrastive learning and radiologist-designed prompts. KAD (2307.22287) incorporates medical knowledge graphs. UniChest (2403.13405) uses a conquer-and-divide framework for multi-source data generalization. ELIXR (2308.01317) and MAIRA-1 (2311.13668) (and MAIRA-2 (2406.04449)) use adapters for efficient VLM fine-tuning. CheXagent (2401.12208) scaled instruction tuning on a 6.1M-sample dataset.
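A sketch of CheXzero-style zero-shot classification, contrasting a positive and a negative prompt per finding; it is written against the generic `open_clip` API with a stock CLIP checkpoint as a stand-in, since the CXR-specific weights, prompts, and label set vary by model.

```python
import torch
import open_clip
from PIL import Image

# Generic CLIP as a stand-in; CheXzero/CXR-CLIP ship domain-specific weights.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

findings = ["atelectasis", "cardiomegaly", "pleural effusion"]   # illustrative labels
pos = tokenizer([f"a chest x-ray showing {f}" for f in findings])
neg = tokenizer([f"a chest x-ray with no {f}" for f in findings])

image = preprocess(Image.open("cxr.png")).unsqueeze(0)           # placeholder path
with torch.no_grad():
    img = model.encode_image(image)
    p, n = model.encode_text(pos), model.encode_text(neg)
    img, p, n = (t / t.norm(dim=-1, keepdim=True) for t in (img, p, n))
    # Per finding, softmax over its positive vs. negative prompt similarity.
    logits = torch.stack([img @ p.t(), img @ n.t()], dim=-1)     # (1, F, 2)
    probs = logits.softmax(dim=-1)[..., 0]                       # P(finding present)
```

The positive/negative prompt pairing turns open-vocabulary retrieval into calibrated per-finding probabilities without any labeled CXR training data.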


| Model | Type | Backbone | SSL Objective | Data Size (Volumes/Pairs) | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|
| BioViL-T | VLFM | Custom CNN–Transformer + BERT | Temporal Multi-Modal | 174.1K | No | Yes |
| ELIXR | VLFM | SupCon + T5 | ITC | 220K | No | Yes |
| MaCo | VLFM | ViT-B/16 + BERT | MLM | 377K | No | Yes |
| CXR-CLIP | VLFM | ResNet50 + BioClinicalBERT | ITC | 15M | Yes | Yes |
| UniChest | VLFM | ResNet50 + BioClinicalBERT | ITC | 686K | Yes | Yes |
| RAD-DINO | VFM | ViT-B/14 | DINOv2 | 838K | No | Yes/No* |
| CheXagent | VLFM | Ensemble | ITC + IC | 6.1M** | No* | No* |
| Ray-DINO | VFM | ViT-L | DINOv2 | 863K | No | Yes |

*Partially available; planned to be released. **Pre-training dataset size is not provided; the model is fine-tuned on 6.1M samples.

Extending FMs to 3D modalities like CT and MRI is challenging. Many early VLFMs for 3D data used 2D slices [lei2023clip]. While models like M3FM [niu2023medical] and MedBLIP [chen2023medblip] incorporated 3D data by processing patches, fully integrated 3D models are less common. RadFM (2308.02463) processed full 3D volumes and introduced a large 2D/3D dataset, but was limited in vision-specific tasks. M3D-LaMed (2404.00578) and SAT (2312.17183) improved 3D segmentation capabilities. RadCLIP (2403.09948) and CT-CLIP (2403.17834) adapted CLIP for 3D, with the latter using a 3D ViT on a dedicated CT dataset. Merlin (2406.06512) integrates structured EHR and unstructured text with 3D CT.
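One common way to feed full volumes to ViT-style backbones such as the 3D ViTs above is to replace 2D patch embedding with a 3D convolution whose kernel equals its stride; a minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# 3D patchification: a Conv3d whose kernel equals its stride cuts the volume
# into non-overlapping cubes and projects each one to a token.
patch_embed = nn.Conv3d(in_channels=1, out_channels=768,
                        kernel_size=(4, 16, 16), stride=(4, 16, 16))

ct = torch.randn(1, 1, 32, 224, 224)                  # (B, C, depth, H, W)
tokens = patch_embed(ct).flatten(2).transpose(1, 2)   # (B, N, 768), N = 8*14*14 = 1568
```

The cubic token grid is also why sequence lengths, and hence memory, grow so quickly for volumetric FMs compared to their 2D counterparts.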

Vision-Only models are emerging in radiology for spatially focused tasks. Early efforts used multi-view or sequence-based approaches for 3D data [jun2021medicaltransformeruniversalbrain]. Recent models employ knowledge distillation [ye_desd_2022, jiang_self-supervised_2022], MIM [xie2022unimiss], or graph-based SSL [nguyen_lvm-med_2023]. RAD-DINO (2401.10815) and Ray-DINO (2405.01469) adapted DINOv2 for medical imaging, demonstrating strong performance and generalization from image-only pre-training. VISION-MAE (2402.01034) applied MAE to 3D data. SAM has also been evaluated for radiology segmentation [mazurowski2023segment, roy_sammd_2023, gao_desam_2023], showing versatility but facing domain adaptation challenges.
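At the core of the DINO/DINOv2 recipes adapted by RAD-DINO and Ray-DINO is a momentum "teacher" maintained as an exponential moving average of the student; a minimal sketch of that update (the momentum value is illustrative):

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.996):
    """DINO-style EMA update: teacher <- m * teacher + (1 - m) * student.

    The teacher receives no gradients; the student is trained to match the
    teacher's (centered, sharpened) outputs across augmented crops.
    """
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)
```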


| Model | Type | Imaging Domain | Backbone | SSL Objective | Data Size (Volumes) | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|
| DeSD | VM | CT | 3D ResNet50 | DeSD | 11K | No | Yes |
| SMIT | VM | CT, MRI | Swin-small | MIM + Self-Distillation | 3,643 | Yes | No |
| Medical Transformer | VM | MRI | ResNet-18 + Transformer | MAE | 1,783 | No | Yes |
| M3FM | VLM | CT | 3D CT-ViT + Transformer | ITC | 163K | Yes | Yes |
| CLIP-LUNG | VLM | CT | ViT-B/16 + ResNet18 | ITC | 1,010 | No | Yes |
| Niu et al. | VM | CT | 3D ViT | Region-Contrastive | 684 | No | Yes |
| MedBLIP | VLM | MRI | MedQFormer + BioMedLM* | ITC | 30K | Yes | Yes |
| MeTSK | VM | fMRI | STGCN | GraphCL | 1,415 | No | Yes |
| Pai et al. | VM | CT | 3D ResNet | Modified SimCLR | 11.5K | Yes | Yes |
| M3D-LaMed | VLM | CT | 3D ViT + LLaMA-2 | ITC | 120K | Yes | Yes |
| Merlin | VLM | CT | I3D ResNet152 + Clinical-Longformer | Contrastive | 25K | No* | No* |
| CT-CLIP | VLM | CT | 3D CT-ViT + CXR-BERT | ITC | 26K | Yes | Yes |

*Planned to be released. The MedBLIP authors experiment with multiple language encoders; the table shows their best-performing architecture.

Foundation Models in Ophthalmology

Ophthalmology benefits from relatively larger datasets for modalities like Color Fundus Photography (CFP) and Optical Coherence Tomography (OCT). These images are rich in detail and can indicate both eye diseases and systemic conditions.

The first ophthalmology FM was RETFound (2303.10126), a ViT pre-trained on 1.6 million retinal images using MAE. It showed strong performance on detecting eye and systemic diseases with minimal fine-tuning. Variants like RETFound-Green (2405.00117) focused on efficiency, while DERETFound [yan2025expertise] integrated synthetic data and expert text tagging. FLAIR (2308.07898) used CLIP to align CFP images with text descriptions for enhanced performance. DRStageNet (2312.14891) is a CFP model specifically tailored for diabetic retinopathy staging using DINOv2.
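The masking step at the heart of RETFound's MAE pre-training can be sketched as follows; this mirrors the standard MAE recipe (75% mask ratio) rather than RETFound's exact code.

```python
import torch

def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly keep a subset of patch tokens, as in MAE pre-training.

    x: (B, N, D) patch embeddings. Returns the visible tokens plus the
    indices needed to restore patch order before decoding.
    """
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)          # random permutation of patches
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return x_visible, ids_restore               # encoder sees only ~25% of patches
```

Because the encoder processes only the visible quarter of the patches, MAE pre-training stays tractable even at the 1.6M-image scale RETFound reports.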

Multi-imaging ophthalmology FMs aim to handle diverse modalities beyond just retina. VisionFM (2310.04992) integrates multiple ophthalmic imaging types for multi-task diagnosis and segmentation, though specific architectural details are limited. EyeFound (2405.11338) is a general ophthalmic FM with a large ViT backbone and MAE pre-training, showing strong performance across diverse modalities.

Data scaling techniques like using Neural Style Transfer (NST) with contrastive learning (FundusNet (2304.06047)) and annotation-efficient methods for OCT segmentation [zhang_annotation-efficient_2023] have also been explored. SAM has been adapted for ophthalmology segmentation tasks, such as SAM-U (2307.04973) for uncertainty estimation in retinal images and SAM-OCTA (2309.11758) for OCTA image segmentation.


| Model | Type | Imaging Domain | Backbone | Self-Supervised Learning | Data Size | Model Size | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|---|
| RETFound | VM | CFP, OCT | ViT-Large | MAE | 1.6M | 307M | Yes | Yes/No* |
| FLAIR | VLM | CFP | ResNet-50 | CLIP | 285K | 23M | Yes | Yes |
| VisionFM** | VM | Various*** | - | - | 3.4M | - | No | Yes/No**** |
| DRStageNet | VM | CFP | ViT-Base | DINOv2 | 93.5K | 86.9M | No | Yes |
| DERETFound | VLM | CFP | ViT-Large | MAE | 150K + 1M***** | 307M | Yes | Yes |
| RETFound-Green | VM | CFP | ViT-Small | Token Reconstruction | 75K | 22.2M | Yes****** | Yes |
| EyeFound | VM | Various******* | ViT-Large | MAE | 2.8M | 307M | No | No |

*All datasets are publicly available except for the MIDAS dataset, which is subject to controlled access via an application process. **No architectural details available. ***CFP, FFA, OCTA, OCT, Slit-Lamp, B-Scan Ultrasound. ****All datasets are publicly available except for one private MRI dataset. *****150K real and 1M synthetic generated images. ******Planned to be released. *******CFP, FFA, ICGA, FAF, RetCam, Ocular Ultrasound, OCT, Slit-Lamp, External Eye Photo, Specular Microscope, Corneal Topography.

Compared to radiology, ophthalmology has fewer VLFMs extensively utilizing report data, possibly due to clinical workflows relying more on direct image interpretation than detailed paired reports [shweikh2023growing]. A key challenge in ophthalmology FM research is the lack of widespread inter-model comparisons on standardized benchmarks.

Challenges and Future Directions

Despite the significant progress, several challenges hinder the widespread clinical adoption of FMs in medical imaging.

Technical Challenges:

  • Lack of Open-Source Large-Scale Clinical Datasets: Privacy regulations and institutional policies limit access to large, diverse clinical datasets, impeding reproducibility and collaboration. Federated learning [li2025open] and differential privacy [wang2024fedmeki] could help.
  • Increased Computational Demands: Scaling FMs for high-resolution WSIs and 3D volumes requires significant computational resources. Developing more efficient architectures and training strategies tailored for volumetric data is crucial [dominic_improving_2023].
  • Scaling Limitations: Recent studies in pathology suggest a plateau in performance gains by simply scaling data and model size, emphasizing the need for better data curation, diverse benchmarks (beyond common public datasets), and domain-specific SSL algorithms [aben2024towards, chen2024benchmarking, campanella2024clinical]. Data quality and diversity across organs, pathologies, stains, and scanners are highlighted as critical [alber2025novel, dippel2024rudolfv].
  • Adapting to the Medical Domain: Generic SSL frameworks optimized for natural images may not fully capture unique medical image characteristics like lack of canonical orientation, color variation, and the importance of specific fields of view [kang2023benchmarking].

Practical Challenges in Clinical Settings:

  • Explainability: FMs are often black boxes, making it difficult for clinicians to understand predictions, identify biases, or trust results. Explainable AI (XAI) techniques are needed to provide transparent and interpretable insights [abbas_xdecompo_2022, pham_i-ai_2023].
  • Robustness and Domain Generalization: Models must be robust to variations introduced by different equipment, protocols, and institutions. Current models can lack robustness to confounding factors like center-specific artifacts [de2025current]. Methods like distillation can improve robustness [filiot2025distilling].
  • Fairness, Bias, and Accessibility: FMs can inherit biases from training data, leading to disparate performance across patient subgroups (e.g., by race or sex) [glocker_risk_2023, czum2023bias, khan_how_2023]. Ensuring diverse training data and developing debiasing strategies are essential. Accessibility is also a concern, as large FMs require substantial compute, highlighting the importance of efficient architectures [engelmann2024training] and PEFT [dutt_parameter-efficient_2023].
  • Regulation: Clinical adoption requires addressing risks like hallucinations (plausible but incorrect outputs) and developing mechanisms for continual learning [wang2024comprehensive] to adapt to new data and disease variants without catastrophic forgetting.

In conclusion, foundation models hold immense potential for transforming medical image analysis by enabling data-efficient, generalizable, and robust AI systems. While significant progress has been made across pathology, radiology, and ophthalmology, particularly through large-scale self-supervised learning and multimodal approaches, challenges related to data access, computational resources, domain adaptation, interpretability, bias, and regulation must be actively addressed for successful clinical deployment. Future research should focus on developing more efficient and robust FM architectures, improving data curation and diversity, creating standardized benchmarks, and advancing XAI and continual learning techniques tailored to the complexities of medical imaging.