
Foundation Models in Medical Imaging -- A Review and Outlook (2506.09095v3)

Published 10 Jun 2025 in eess.IV, cs.AI, and cs.CV

Abstract: Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.

Summary

  • The paper demonstrates that foundation models pre-trained via self-supervised learning significantly reduce the need for extensive labeled data in medical imaging.
  • It reviews key methodologies such as contrastive learning, masked image modeling, and parameter-efficient fine-tuning applied to CNN and Vision Transformer architectures.
  • The review outlines how these models enhance diagnostic accuracy across pathology, radiology, and ophthalmology while addressing technical and clinical implementation challenges.

Foundation Models (FMs) are revolutionizing medical image analysis by addressing key challenges like data scarcity and the need for task-specific models. Unlike traditional supervised learning methods that require extensive labeled data, FMs are pre-trained on large collections of unlabeled medical images to learn general-purpose visual features. These features can then be adapted to various downstream clinical tasks, often with significantly less labeled data than required by traditional methods. This review examines the development, application, and challenges of FMs across pathology, radiology, and ophthalmology.

The core technical concepts underpinning FMs in medical imaging involve large-scale pre-training, self-supervised learning (SSL), and effective adaptation strategies. Large-scale pre-training leverages vast datasets to train encoder architectures, typically Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), to extract rich, generalizable features. ViTs (2010.11929) have gained prominence due to their scalability and ability to capture long-range dependencies, although CNNs [resnet] remain effective, especially with limited data or when integrated into hybrid architectures.

Self-supervised learning is crucial because large labeled datasets are scarce in medical imaging. SSL uses pretext tasks that generate supervision signals directly from the data. These can be discriminative methods, such as contrastive learning (e.g., SimCLR (2002.05709), MoCo (1911.05722)) and self-distillation (e.g., DINO (2104.14294)), which learn to distinguish between different views of the data, or generative methods like Masked Image Modeling (MIM) (e.g., MAE (2111.06377), iBOT (2111.07832)), which reconstruct masked parts of the input. Multimodal SSL, using objectives like Image-Text Contrastive learning (ITC) as in CLIP (2103.00020), aligns representations from different modalities, enabling Vision-Language FMs (VLFMs).
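To make the contrastive objective concrete, here is a minimal PyTorch sketch of a SimCLR-style NT-Xent loss; the temperature value and tensor shapes are illustrative assumptions, not settings from any of the cited models.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """SimCLR-style contrastive loss for two augmented views of the same B images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                    # (2B, D)
    B = z1.size(0)
    sim = z @ z.t() / temperature                     # pairwise cosine similarities
    mask = torch.eye(2 * B, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))        # a view is not its own positive
    # The positive for row i is its other view: i+B in the first half, i-B in the second.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

Each image's positive is its other augmented view; the remaining 2B-2 embeddings in the batch serve as negatives.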

Once pre-trained, FMs are adapted to specific downstream tasks. The adaptation spectrum ranges from lightweight methods like prompt engineering (for VLMs) and linear probing (training a simple classifier on frozen embeddings) to more resource-intensive approaches. Adding task-specific heads on top of a frozen FM is common for classification or segmentation. Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA (2106.09685) and adapter tuning (1909.08059), update only a small subset of parameters, reducing computational costs. Full fine-tuning, updating all parameters, often yields the best task-specific performance but requires more data and computation.
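At the lightweight end of this spectrum, linear probing can be sketched in a few lines; the encoder, feature dimension, and class count below are placeholders for whichever FM and task are at hand.

```python
import torch
import torch.nn as nn

def build_linear_probe(encoder: nn.Module, feat_dim: int = 768, num_classes: int = 2):
    """Freeze a pre-trained FM encoder and attach a trainable linear head."""
    for p in encoder.parameters():
        p.requires_grad = False          # FM weights stay fixed
    encoder.eval()                       # freeze norm/dropout behaviour too
    head = nn.Linear(feat_dim, num_classes)

    def forward(x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():            # no gradients through the backbone
            feats = encoder(x)           # (B, feat_dim) embeddings
        return head(feats)

    return forward, head                 # optimize only head.parameters()
```

Only `head.parameters()` are passed to the optimizer, so adaptation costs a small fraction of full fine-tuning.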

Foundation Models in Pathology

Computational pathology relies heavily on analyzing gigapixel Whole Slide Images (WSIs). Due to their size, WSIs are often processed as smaller tiles in a two-stage pipeline: a pre-trained encoder extracts tile-level embeddings, followed by a task-specific head (often using Multi-Instance Learning for slide-level tasks). Early approaches used ImageNet pre-training [saillard2020predicting], but recent efforts focus on in-domain SSL on large pathology datasets.
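As an illustration of the second stage, the following is a minimal sketch of gated attention-based MIL pooling (ABMIL, Ilse et al., 2018), a common choice of slide-level head; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    """Gated attention-based MIL pooling as a slide-level head.

    Takes a bag of N tile embeddings (N, D) from a frozen encoder and
    returns slide-level logits; D, hidden size, and class count are placeholders.
    """
    def __init__(self, feat_dim: int = 1024, hidden: int = 256, num_classes: int = 2):
        super().__init__()
        self.V = nn.Linear(feat_dim, hidden)      # tanh branch
        self.U = nn.Linear(feat_dim, hidden)      # sigmoid gating branch
        self.w = nn.Linear(hidden, 1)             # scalar attention score per tile
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        scores = self.w(torch.tanh(self.V(tiles)) * torch.sigmoid(self.U(tiles)))  # (N, 1)
        attn = torch.softmax(scores, dim=0)       # attention over tiles in the bag
        slide_emb = (attn * tiles).sum(dim=0)     # weighted average, shape (D,)
        return self.classifier(slide_emb)
```

The attention weights also give a coarse per-tile relevance map, one of the few built-in interpretability signals in this pipeline.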

Tile-level pathology FMs have seen significant development using various SSL frameworks:

  • Contrastive Learning: SimCLR has been applied to large and diverse histopathology datasets, demonstrating improved embedding quality [ciga2022self]. REMEDIS [2023.09.04.23294952] combined supervised pre-training on natural images with SimCLR on medical data for enhanced robustness. Pathology-specific contrastive methods like cluster-guided contrastive learning (CCL) [wang2023retccl] and semantically relevant contrastive learning (SRCL) [wang2022transformer] address challenges posed by semantic similarities in pathology patches.
  • Self-distillation with DINO: The DINO framework has been used for ViT-based pathology FMs like HIPT (2206.02647), which captures hierarchical WSI structures, and models trained on massive proprietary datasets [campanella2023computational]. UNI (2308.15474) and Virchow (2309.07778) are prominent examples trained on large pan-cancer datasets using DINOv2 (2304.07193), showing state-of-the-art performance on various tasks. HistoEncoder (2411.11458) uses DINO on an efficient XCiT backbone.
  • Masked Image Modeling (MIM): Inspired by masked language modeling in NLP, MIM has become popular for pathology FMs. Phikon (2307.21232) used iBOT (2111.07832) on TCGA data and scaled to different ViT sizes. Subsequent models like UNI (2308.15474), Virchow (2309.07778), RudolfV (2401.04079), PLUTO (2405.07905), Hibou-L (2406.05074), H-optimus-0 [hoptimus0], Virchow2G (2408.00738), and Phikon-v2 (2409.09173) have adopted DINOv2, which pairs self-distillation with an iBOT-style masked-token objective, as the preferred framework, often scaling to billions of patches and millions of WSIs. Atlas (2501.05409) demonstrated that data diversity (stains, scanners, magnifications) can be more impactful than sheer data volume alone.

Vision-Language FMs (VLFMs) in pathology combine image and text data, often using cross-modal contrastive objectives. Models like MI-Zero [lu2023visual] and PLIP [huang2023visual] adapt the CLIP framework, aligning image patches with text descriptions from pathology reports or other sources. QUILTNET [ikezogwo2024quilt] and CONCH [lu2024visual] further scale VLFM training. Generative VLFMs like PathAsst (2305.15072), PA-LLaVA (2408.09530), PathChat (2312.07814), and Quilt-LLaVA [seyfioglu2024quilt] are instruction-tuned for multimodal dialogue and visual question answering.

Slide-level FMs aim to capture global WSI context, which is challenging due to gigapixel size. LongViT (2312.03558) uses LongNet (2307.02486) with dilated attention to process long patch sequences end-to-end. Recent models like Prov-GigaPath (2404.03512) and PRISM (2405.10254) combine tile-level encoders with LongNet or aggregation methods and Vision-Language alignment. THREADS (2501.16652) integrates molecular supervision.

The Segment Anything Model (SAM) (2304.02643), originally trained on natural images, has been explored for pathology segmentation [chauveau2023segment]. While out-of-the-box performance varies [deng2023segment], fine-tuning and adaptation methods [ranem2023exploring, zhang2023sam] show promise for specific tasks like nuclei or tumor bud segmentation.
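For reference, prompting SAM out of the box on a tile uses the standard `segment_anything` API; the checkpoint path, image, and click coordinates below are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM; the checkpoint path is a placeholder.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# `tile` stands in for an RGB uint8 patch cropped from a WSI.
tile = np.zeros((256, 256, 3), dtype=np.uint8)
predictor.set_image(tile)

# Prompt with a single foreground click (coordinates illustrative),
# e.g. on a nucleus; label 1 marks a positive (foreground) point.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[128, 128]]),
    point_labels=np.array([1]),
    multimask_output=True,   # return several candidate masks
)
best_mask = masks[scores.argmax()]   # pick the highest-scoring candidate
```

Fine-tuning approaches typically keep this promptable interface but adapt the image encoder or mask decoder to the stain and scale statistics of pathology data.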


| Model | Backbone | SSL | Pre-Training Data Size (Patches/WSIs) | Source | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|
| REMEDIS | ResNet-152 | SimCLR | 50M (29K) | TCGA | Yes | Yes |
| RetCCL | ResNet-50 | CCL | 15M (32K) | TCGA, PAIP | Yes | Yes |
| CTransPath | Swin Transformer | SRCL | 15M (32K) | TCGA, PAIP | Yes | Yes |
| HIPT | ViT-HIPT | DINO | 104M (11K) | TCGA | Yes | Yes |
| Campanella et al. | ViT-S | DINO | 3B (400K) | MSHS* | No | No |
| Lunit | ViT-S/8, ViT-S/16 | DINO | 33M (37K) | TCGA, TULIP* | Yes | Yes*/No* |
| Phikon | ViT-B/16 | iBOT | 43M (6K) | TCGA | Yes | Yes |
| UNI | ViT-L/16 | DINOv2 | 100M (100K) | Mass-100K* | Yes | No* |
| Virchow | ViT-H/14 | DINOv2 | 2B (1.5M) | MSKCC* | Yes | No |
| RudolfV | ViT-L/14 | DINOv2 | 1.2B (134K) | TCGA, Proprietary | No | No |
| PLUTO | ViT-S/8, ViT-S/16 | Modified DINOv2 | 195M (158K) | TCGA, PathAI* | No | No |
| Hibou-L | ViT-L/14 | DINOv2 | 1.2B (1.1M) | Proprietary | Yes | No |
| H-optimus-0 | ViT-G | DINOv2 | - (500K) | Proprietary | No | No |
| Virchow2G | ViT-G/14 | DINOv2 | 1.9B (3.1M) | MSKCC* | No | No |
| Phikon-v2 | ViT-L | DINOv2 | 456M (58.4K) | PANCAN-XL | Yes | Yes*/No* |
| Atlas | ViT-H/14 | RudolfV (DINOv2-based) | 520M (1.2M) | Proprietary | No | No |

*Proprietary or partially available data.

| Model | Type | Backbone | SSL | Pre-Training Data Size (Image-Text Pairs) | Source | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|
| PLIP | VLFM | ViT-B/32 + Transformer | ITC | 208K | OpenPath | Yes | Yes |
| MI-Zero | VLFM | CTransPath + GPT2-medium | ITC | - (33K) | Proprietary | Yes | No |
| QuiltNet | VLFM | ViT-B/32 + PubMedBERT | ITC | 34K | QUILT | Yes | Yes |
| CONCH | VLFM | ViT-B/16 + GPT2-medium | iBOT* + CoCa | 1.1M | PMC-Path, EDU | Yes | Yes |
| PathCLIP | VLFM | CLIP vision + Vicuna-13B text | ITC | 207K | PathCap | Yes | Yes |
| PA-LLaVA | VLFM | PLIP vision + Llama-3 text | ITC + ITM | 1.4M | PMV, PMC-OA, Quilt-1M | Yes | Yes |
| PathChat | VLFM | UNI + Llama-2 | CoCa | 100K | PMC-OA, Proprietary | Yes*/No* | Yes*/No* |
| Quilt-LLaVA | VLFM | QuiltNet vision + Vicuna text | ITC | 723K | Quilt-1M | Yes | Yes |

*iBOT is used for training their own image encoder.

| Model | Type | Backbone | SSL | Pre-training Data Size (WSIs) | Source | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|
| Giga-SSL | VFM | ResNet-18 | SimCLR | 12K | TCGA | Yes | Yes |
| LongViT | VFM | LongNet | DINO | 10K | TCGA | Yes | Yes |
| PRISM | VLFM | Virchow + BioGPT | CoCa | 590K | Proprietary | Yes | No |
| Prov-GigaPath | VLFM | ViT-G + PubMedBERT | DINOv2 + ITC | 170K | Prov-Path | Yes | No |
| mSTAR | VLFM | ViT-L | mSTAR | 26K | TCGA | No | No |
| KEEP | VLFM | UNI + PubMedBERT | Knowledge-enhanced ITC | 143K | OpenPath, Quilt-1M | Yes | Yes |
| CHIEF | VLFM | CTransPath + CLIP text | SRCL (v) + ITC (vl) | 61K | Proprietary | Yes | No |
| TITAN | VLFM | ViT-B + CONCHv1.5 | iBOT (v) + CoCa (vl) | 60K | GTEx | Yes | Yes |
| COBRA | VLFM | ABMIL + Mamba-2 | COBRA | 3K | mTCGA | Yes | Yes |
| THREADS | VFM | CONCHv1.5 (patch) + cGPT (gene) | Contrastive | 47K | MBTG-47K | No* | Yes*/No* |

*Authors train both an image encoder and a VL-alignment module. Partially available: TCGA and GTEx are open-source, but pre-training data from BWH and MGH are proprietary. Weights are planned to be released.

Foundation Models in Radiology

Radiology encompasses diverse modalities (X-ray, CT, MRI, US, PET), presenting challenges due to varied data formats (2D, 3D) and non-standardized reports. Early models focused on transformer architectures for specific tasks without large-scale SSL. Contrastive vision-language learning emerged as an early SSL approach, aligning images with text descriptions [huang_gloria_2021].

Generalist VLFMs aim to handle multiple medical imaging modalities, often using datasets from sources like PubMed Central. Models like PubMed-CLIP (2112.13906), PMC-CLIP (2303.07240), BiomedCLIP (2303.00915), and MEDVInT (2305.10415) leverage large image-text pairs for multimodal tasks. LLaVA-Med (2306.00890), Qilin-Med-VL (2310.17956), and Med-Flamingo (2308.01390) specialize in Medical Visual Question Answering (MedVQA).


| Model | Type | Imaging Domain | Backbone | SSL Objective | Data Size | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|
| PubMedCLIP | VLM | 2D Radiology* | ResNet50 + ViT-B/32 | ITC | 80K | Yes | Yes |
| UniMiSS | VM | CT (3D), X-ray | MiT | MIM | 15K | No | Yes |
| BiomedCLIP | VLM | Multimodal** | ViT-B/16 + PubMedBERT | ITC | 15M | Yes | Yes |
| PMC-CLIP | VLM | 2D Radiology* | ResNet50 + PubMedBERT | ITC + MLM | 1.65M | Yes | Yes |
| MEDVInT | VLM | Multimodal** | ResNet50 + Transformer | MLM | 227K | No | Yes |
| LLaVA-Med | VLM | Multimodal** | LLaVA | ITC | 60K | Yes | Yes |
| LVM-Med | VM | CT (2D), MRI, X-ray, US | ResNet-50 + ViT | Graph Matching | 1.3M | No | Yes |
| Med-Flamingo | VLM | Multimodal** | ViT/L-14 + LLaMA-7B | ITC | 1.6M | Yes | Yes |
| RadFM | VLM | Various Radiologies* | 3D ViT + MedLLaMA-13B | Generative ITC | 16M*** | Yes | Yes |
| Qilin-Med-VL | VLM | Multimodal** | ViT/L-14 + LLaMA-13B | ITC | 580K | Yes | Yes |
| SAT | VLM | Multimodal** | 3D U-Net + BioBERT | ITC | 302K | No | Yes |
| VISION-MAE | VM | Various Radiologies* | Swin-T | MAE | 2.5M | No | No |
| RadCLIP | VLM | CT (2D+3D), X-ray, MRI | ViT-L/14 | Contrastive | 1.2M | No | Yes |

*Wide range of 2D radiological imaging types. **Multimodal denotes a broad range of radiological imaging types as well as pathological image types. ***2D image-text pairs, comprising 15.5M 2D images and 500K 3D images.

Chest X-ray (CXR) is a subdomain with more FM development due to data availability and lower complexity. Models like MedCLIP (2210.10163) leverage unpaired data, while CheXzero (2212.00751) and CXR-CLIP (2307.07645) focus on high-quality paired data using contrastive learning and radiologist-designed prompts. KAD (2307.22287) incorporates medical knowledge graphs. UniChest (2403.13405) uses a conquer-and-divide framework for multi-source data generalization. ELIXR (2308.01317) and MAIRA-1 (2311.13668) (and MAIRA-2 (2406.04449)) use adapters for efficient VLM fine-tuning. CheXagent (2401.12208) scaled instruction tuning on a 6.1M-sample dataset.
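A sketch of CheXzero-style zero-shot classification, contrasting a positive and a negative prompt per finding; it is written against the generic `open_clip` API with a stock CLIP checkpoint as a stand-in, since the CXR-specific weights, prompts, and label set vary by model.

```python
import torch
import open_clip
from PIL import Image

# Generic CLIP as a stand-in; CheXzero/CXR-CLIP ship domain-specific weights.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

findings = ["atelectasis", "cardiomegaly", "pleural effusion"]   # illustrative labels
pos = tokenizer([f"a chest x-ray showing {f}" for f in findings])
neg = tokenizer([f"a chest x-ray with no {f}" for f in findings])

image = preprocess(Image.open("cxr.png")).unsqueeze(0)           # placeholder path
with torch.no_grad():
    img = model.encode_image(image)
    p, n = model.encode_text(pos), model.encode_text(neg)
    img, p, n = (t / t.norm(dim=-1, keepdim=True) for t in (img, p, n))
    # Per finding, softmax over its positive vs. negative prompt similarity.
    logits = torch.stack([img @ p.t(), img @ n.t()], dim=-1)     # (1, F, 2)
    probs = logits.softmax(dim=-1)[..., 0]                       # P(finding present)
```

The positive/negative prompt pairing turns open-vocabulary retrieval into calibrated per-finding probabilities without any labeled CXR training data.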


| Model | Type | Backbone | SSL Objective | Data Size (Volumes/Pairs) | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|
| BioViL-T | VLFM | Custom CNN–Transformer + BERT | Temporal Multi-Modal | 174.1K | No | Yes |
| ELIXR | VLFM | SupCon + T5 | ITC | 220K | No | Yes |
| MaCo | VLFM | ViT-B/16 + BERT | MLM | 377K | No | Yes |
| CXR-CLIP | VLFM | ResNet50 + BioClinicalBERT | ITC | 15M | Yes | Yes |
| UniChest | VLFM | ResNet50 + BioClinicalBERT | ITC | 686K | Yes | Yes |
| RAD-DINO | VFM | ViT-B/14 | DINOv2 | 838K | No | Yes/No* |
| CheXagent | VLFM | Ensemble | ITC + IC | 6.1M** | No* | No* |
| Ray-DINO | VFM | ViT-L | DINOv2 | 863K | No | Yes |

*Partially available; planned to be released. **Pre-training dataset size is not provided; the model is fine-tuned on 6.1M samples.

Extending FMs to 3D modalities like CT and MRI is challenging. Many early VLFMs for 3D data used 2D slices [lei2023clip]. While models like M3FM [niu2023medical] and MedBLIP [chen2023medblip] incorporated 3D data by processing patches, fully integrated 3D models are less common. RadFM (2308.02463) processed full 3D volumes and introduced a large 2D/3D dataset, but was limited in vision-specific tasks. M3D-LaMed (2404.00578) and SAT (2312.17183) improved 3D segmentation capabilities. RadCLIP (2403.09948) and CT-CLIP (2403.17834) adapted CLIP for 3D, with the latter using a 3D ViT on a dedicated CT dataset. Merlin (2406.06512) integrates structured EHR and unstructured text with 3D CT.
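One common way to feed full volumes to ViT-style backbones such as the 3D ViTs above is to replace 2D patch embedding with a 3D convolution whose kernel equals its stride; a minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# 3D patchification: a Conv3d whose kernel equals its stride cuts the volume
# into non-overlapping cubes and projects each one to a token.
patch_embed = nn.Conv3d(in_channels=1, out_channels=768,
                        kernel_size=(4, 16, 16), stride=(4, 16, 16))

ct = torch.randn(1, 1, 32, 224, 224)                  # (B, C, depth, H, W)
tokens = patch_embed(ct).flatten(2).transpose(1, 2)   # (B, N, 768), N = 8*14*14 = 1568
```

The cubic token grid is also why sequence lengths, and hence memory, grow so quickly for volumetric FMs compared to their 2D counterparts.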

Vision-Only models are emerging in radiology for spatially focused tasks. Early efforts used multi-view or sequence-based approaches for 3D data [jun2021medicaltransformeruniversalbrain]. Recent models employ knowledge distillation [ye_desd_2022, jiang_self-supervised_2022], MIM [xie2022unimiss], or graph-based SSL [nguyen_lvm-med_2023]. RAD-DINO (2401.10815) and Ray-DINO (2405.01469) adapted DINOv2 for medical imaging, demonstrating strong performance and generalization from image-only pre-training. VISION-MAE (2402.01034) applied MAE to 3D data. SAM has also been evaluated for radiology segmentation [mazurowski2023segment, roy_sammd_2023, gao_desam_2023], showing versatility but facing domain adaptation challenges.
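At the core of the DINO/DINOv2 recipes adapted by RAD-DINO and Ray-DINO is a momentum "teacher" maintained as an exponential moving average of the student; a minimal sketch of that update (the momentum value is illustrative):

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.996):
    """DINO-style EMA update: teacher <- m * teacher + (1 - m) * student.

    The teacher receives no gradients; the student is trained to match the
    teacher's (centered, sharpened) outputs across augmented crops.
    """
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)
```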


| Model | Type | Imaging Domain | Backbone | SSL Objective | Data Size (Volumes) | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|
| DeSD | VM | CT | 3D ResNet50 | DeSD | 11K | No | Yes |
| SMIT | VM | CT, MRI | Swin-small | MIM + Self-Distillation | 3,643 | Yes | No |
| Medical Transformer | VM | MRI | ResNet-18 + Transformer | MAE | 1,783 | No | Yes |
| M3FM | VLM | CT | 3D CT-ViT + Transformer | ITC | 163K | Yes | Yes |
| CLIP-LUNG | VLM | CT | ViT-B/16 + ResNet18 | ITC | 1,010 | No | Yes |
| Niu et al. | VM | CT | 3D ViT | Region-Contrastive | 684 | No | Yes |
| MedBLIP | VLM | MRI | MedQFormer + BioMedLM* | ITC | 30K | Yes | Yes |
| MeTSK | VM | fMRI | STGCN | GraphCL | 1,415 | No | Yes |
| Pai et al. | VM | CT | 3D ResNet | Modified SimCLR | 11.5K | Yes | Yes |
| M3D-LaMed | VLM | CT | 3D ViT + LLaMA-2 | ITC | 120K | Yes | Yes |
| Merlin | VLM | CT | I3D ResNet152 + Clinical-Longformer | Contrastive | 25K | No* | No* |
| CT-CLIP | VLM | CT | 3D CT-ViT + CXR-BERT | ITC | 26K | Yes | Yes |

*Planned to be released. The MedBLIP authors experiment with multiple language encoders; the table shows their best-performing architecture.

Foundation Models in Ophthalmology

Ophthalmology benefits from relatively larger datasets for modalities like Color Fundus Photography (CFP) and Optical Coherence Tomography (OCT). These images are rich in detail and can indicate both eye diseases and systemic conditions.

The first ophthalmology FM was RETFound (2303.10126), a ViT pre-trained on 1.6 million retinal images using MAE. It showed strong performance on detecting eye and systemic diseases with minimal fine-tuning. Variants like RETFound-Green (2405.00117) focused on efficiency, while DERETFound [yan2025expertise] integrated synthetic data and expert text tagging. FLAIR (2308.07898) used CLIP to align CFP images with text descriptions for enhanced performance. DRStageNet (2312.14891) is a CFP model specifically tailored for diabetic retinopathy staging using DINOv2.
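The masking step at the heart of RETFound's MAE pre-training can be sketched as follows; this mirrors the standard MAE recipe (75% mask ratio) rather than RETFound's exact code.

```python
import torch

def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly keep a subset of patch tokens, as in MAE pre-training.

    x: (B, N, D) patch embeddings. Returns the visible tokens plus the
    indices needed to restore patch order before decoding.
    """
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)          # random permutation of patches
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return x_visible, ids_restore               # encoder sees only ~25% of patches
```

Because the encoder processes only the visible quarter of the patches, MAE pre-training stays tractable even at the 1.6M-image scale RETFound reports.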

Multi-imaging ophthalmology FMs aim to handle diverse modalities beyond just retina. VisionFM (2310.04992) integrates multiple ophthalmic imaging types for multi-task diagnosis and segmentation, though specific architectural details are limited. EyeFound (2405.11338) is a general ophthalmic FM with a large ViT backbone and MAE pre-training, showing strong performance across diverse modalities.

Data scaling techniques like using Neural Style Transfer (NST) with contrastive learning (FundusNet (2304.06047)) and annotation-efficient methods for OCT segmentation [zhang_annotation-efficient_2023] have also been explored. SAM has been adapted for ophthalmology segmentation tasks, such as SAM-U (2307.04973) for uncertainty estimation in retinal images and SAM-OCTA (2309.11758) for OCTA image segmentation.


| Model | Type | Imaging Domain | Backbone | Self-Supervised Learning | Data Size | Model Size | Weights Availability | Data Availability |
|---|---|---|---|---|---|---|---|---|
| RETFound | VM | CFP, OCT | ViT-Large | MAE | 1.6M | 307M | Yes | Yes/No* |
| FLAIR | VLM | CFP | ResNet-50 | CLIP | 285K | 23M | Yes | Yes |
| VisionFM** | VM | Various*** | - | - | 3.4M | - | No | Yes/No**** |
| DRStageNet | VM | CFP | ViT-Base | DINOv2 | 93.5K | 86.9M | No | Yes |
| DERETFound | VLM | CFP | ViT-Large | MAE | 150K + 1M***** | 307M | Yes | Yes |
| RETFound-Green | VM | CFP | ViT-Small | Token Reconstruction | 75K | 22.2M | Yes****** | Yes |
| EyeFound | VM | Various******* | ViT-Large | MAE | 2.8M | 307M | No | No |

*All datasets are publicly available except for the MIDAS dataset, which is subject to controlled access via an application process. **No architectural details available. ***CFP, FFA, OCTA, OCT, Slit-Lamp, B-Scan Ultrasound. ****All datasets are publicly available except for one private MRI dataset. *****150K real and 1M synthetic generated images. ******Planned to be released. *******CFP, FFA, ICGA, FAF, RetCam, Ocular Ultrasound, OCT, Slit-Lamp, External Eye Photo, Specular Microscope, Corneal Topography.

Compared to radiology, ophthalmology has fewer VLFMs extensively utilizing report data, possibly due to clinical workflows relying more on direct image interpretation than detailed paired reports [shweikh2023growing]. A key challenge in ophthalmology FM research is the lack of widespread inter-model comparisons on standardized benchmarks.

Challenges and Future Directions

Despite the significant progress, several challenges hinder the widespread clinical adoption of FMs in medical imaging.

Technical Challenges:

  • Lack of Open-Source Large-Scale Clinical Datasets: Privacy regulations and institutional policies limit access to large, diverse clinical datasets, impeding reproducibility and collaboration. Federated learning [li2025open] and differential privacy [wang2024fedmeki] could help.
  • Increased Computational Demands: Scaling FMs for high-resolution WSIs and 3D volumes requires significant computational resources. Developing more efficient architectures and training strategies tailored for volumetric data is crucial [dominic_improving_2023].
  • Scaling Limitations: Recent studies in pathology suggest a plateau in performance gains by simply scaling data and model size, emphasizing the need for better data curation, diverse benchmarks (beyond common public datasets), and domain-specific SSL algorithms [aben2024towards, chen2024benchmarking, campanella2024clinical]. Data quality and diversity across organs, pathologies, stains, and scanners are highlighted as critical [alber2025novel, dippel2024rudolfv].
  • Adapting to the Medical Domain: Generic SSL frameworks optimized for natural images may not fully capture unique medical image characteristics like lack of canonical orientation, color variation, and the importance of specific fields of view [kang2023benchmarking].

Practical Challenges in Clinical Settings:

  • Explainability: FMs are often black boxes, making it difficult for clinicians to understand predictions, identify biases, or trust results. Explainable AI (XAI) techniques are needed to provide transparent and interpretable insights [abbas_xdecompo_2022, pham_i-ai_2023].
  • Robustness and Domain Generalization: Models must be robust to variations introduced by different equipment, protocols, and institutions. Current models can lack robustness to confounding factors like center-specific artifacts [de2025current]. Methods like distillation can improve robustness [filiot2025distilling].
  • Fairness, Bias, and Accessibility: FMs can inherit biases from training data, leading to disparate performance across patient subgroups (e.g., by race or sex) [glocker_risk_2023, czum2023bias, khan_how_2023]. Ensuring diverse training data and developing debiasing strategies are essential. Accessibility is also a concern, as large FMs require substantial compute, highlighting the importance of efficient architectures [engelmann2024training] and PEFT [dutt_parameter-efficient_2023].
  • Regulation: Clinical adoption requires addressing risks like hallucinations (plausible but incorrect outputs) and developing mechanisms for continual learning [wang2024comprehensive] to adapt to new data and disease variants without catastrophic forgetting.

In conclusion, foundation models hold immense potential for transforming medical image analysis by enabling data-efficient, generalizable, and robust AI systems. While significant progress has been made across pathology, radiology, and ophthalmology, particularly through large-scale self-supervised learning and multimodal approaches, challenges related to data access, computational resources, domain adaptation, interpretability, bias, and regulation must be actively addressed for successful clinical deployment. Future research should focus on developing more efficient and robust FM architectures, improving data curation and diversity, creating standardized benchmarks, and advancing XAI and continual learning techniques tailored to the complexities of medical imaging.