Foundation Models in Medical Imaging -- A Review and Outlook (2506.09095v3)
Abstract: Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.
Summary
- The paper demonstrates that foundation models pre-trained via self-supervised learning significantly reduce the need for extensive labeled data in medical imaging.
- It reviews key methodologies such as contrastive learning, masked image modeling, and parameter-efficient fine-tuning applied to CNN and Vision Transformer architectures.
- The review outlines how these models enhance diagnostic accuracy across pathology, radiology, and ophthalmology while addressing technical and clinical implementation challenges.
Foundation Models (FMs) are revolutionizing medical image analysis by addressing key challenges like data scarcity and the need for task-specific models. Unlike traditional supervised learning methods that require extensive labeled data, FMs are pre-trained on large collections of unlabeled medical images to learn general-purpose visual features. These features can then be adapted to various downstream clinical tasks, often with significantly less labeled data than required by traditional methods. This review examines the development, application, and challenges of FMs across pathology, radiology, and ophthalmology.
The core technical concepts underpinning FMs in medical imaging involve large-scale pre-training, self-supervised learning (SSL), and effective adaptation strategies. Large-scale pre-training leverages vast datasets to train encoder architectures, typically Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), to extract rich, generalizable features. ViTs (2010.11929) have gained prominence due to their scalability and ability to capture long-range dependencies, although CNNs [resnet] remain effective, especially with limited data or when integrated into hybrid architectures.
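For intuition, the minimal PyTorch sketch below uses the `timm` library with a generic ImageNet-pre-trained ViT-Base as a stand-in for a medical FM (an assumption, not a model from this review) to show how an encoder maps an image to a single general-purpose feature vector that downstream components consume.

```python
# Minimal sketch: extract a general-purpose embedding with a pre-trained ViT encoder.
# Illustrative only; real medical FMs (e.g., UNI, RETFound) ship their own
# checkpoints and preprocessing. Assumes the `timm` library is installed.
import timm
import torch

# num_classes=0 removes the classification head, so the model returns a single
# embedding per image rather than class logits.
encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
encoder.eval()

image = torch.randn(1, 3, 224, 224)      # stand-in for a preprocessed image or tile
with torch.no_grad():
    features = encoder(image)            # shape (1, 768) for ViT-Base
print(features.shape)
```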
Self-supervised learning is crucial because large labeled datasets are scarce in medical imaging. SSL uses pretext tasks that generate supervision signals directly from the data. These include discriminative methods such as contrastive learning (e.g., SimCLR (2002.05709), MoCo (1911.05722), DINO (2104.14294)), which learn by distinguishing between different views of the data, and generative methods such as Masked Image Modeling (MIM) (e.g., MAE (2111.06377), iBOT (2111.07832)), which reconstruct masked parts of the input. Multimodal SSL, using objectives like Image-Text Contrastive Learning (ITC) as in CLIP (2103.00020), aligns representations from different modalities, enabling Vision-Language FMs (VLFMs).
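To make the contrastive branch concrete, the sketch below implements a SimCLR-style NT-Xent loss over two augmented views of a batch; it is an illustrative PyTorch snippet, not the training recipe of any specific FM covered here.

```python
# Minimal sketch of a SimCLR-style contrastive (NT-Xent) objective.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) projections of two augmentations of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D) unit-norm embeddings
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    n = z1.size(0)
    # The positive for sample i is its other view: i+N for the first half, i-N for the second.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Usage: loss = nt_xent_loss(proj(encoder(view1)), proj(encoder(view2)))
```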
Once pre-trained, FMs are adapted to specific downstream tasks. The adaptation spectrum ranges from lightweight methods like prompt engineering (for VLMs) and linear probing (training a simple classifier on frozen embeddings) to more resource-intensive approaches. Adding task-specific heads on top of a frozen FM is common for classification or segmentation. Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA (2106.09685) and adapter tuning (1909.08059), update only a small subset of parameters, reducing computational costs. Full fine-tuning, updating all parameters, often yields the best task-specific performance but requires more data and computation.
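The lightweight end of this spectrum can be illustrated with a short linear-probing sketch: the FM backbone stays frozen and only a linear head is trained. Here `encoder` is a hypothetical stand-in for any pre-trained FM returning (N, D) embeddings.

```python
# Minimal sketch of linear probing on frozen FM embeddings.
import torch
import torch.nn as nn

def linear_probe_step(encoder: nn.Module, head: nn.Linear,
                      images: torch.Tensor, labels: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    encoder.eval()                              # backbone is frozen
    with torch.no_grad():
        feats = encoder(images)                 # (N, D) embeddings, no gradients
    logits = head(feats)                        # only the head is trainable
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# head = nn.Linear(embed_dim, num_classes); the optimizer covers head.parameters() only.
# PEFT methods such as LoRA instead inject small trainable low-rank updates into the
# otherwise frozen backbone weights.
```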
Foundation Models in Pathology
Computational pathology relies heavily on analyzing gigapixel Whole Slide Images (WSIs). Due to their size, WSIs are often processed as smaller tiles in a two-stage pipeline: a pre-trained encoder extracts tile-level embeddings, followed by a task-specific head (often using Multi-Instance Learning for slide-level tasks). Early approaches used ImageNet pre-training [saillard2020predicting], but recent efforts focus on in-domain SSL on large pathology datasets.
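The second stage of this pipeline can be sketched as an attention-based MIL head (ABMIL-style) that aggregates frozen tile embeddings into a slide-level prediction; the dimensions and head design below are illustrative, not those of any specific published model.

```python
# Minimal sketch of attention-based multiple-instance learning over tile embeddings.
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, tile_embeddings: torch.Tensor) -> torch.Tensor:
        # tile_embeddings: (num_tiles, embed_dim) produced by a frozen tile-level encoder
        weights = torch.softmax(self.attention(tile_embeddings), dim=0)   # (num_tiles, 1)
        slide_embedding = (weights * tile_embeddings).sum(dim=0)          # (embed_dim,)
        return self.classifier(slide_embedding)                           # slide-level logits

# logits = AttentionMILHead()(tile_embeddings_for_one_slide)
```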
Tile-level pathology FMs have seen significant development using various SSL frameworks:
- Contrastive Learning: SimCLR has been applied to large and diverse histopathology datasets, demonstrating improved embedding quality [ciga2022self]. REMEDIS [2023.09.04.23294952] combined supervised pre-training on natural images with SimCLR on medical data for enhanced robustness. Pathology-specific contrastive methods like cluster-guided contrastive learning (CCL) [wang2023retccl] and semantically relevant contrastive learning (SRCL) [wang2022transformer] address challenges posed by semantic similarities in pathology patches.
- Self-distillation with DINO: The DINO framework has been used for ViT-based pathology FMs like HIPT (2206.02647), which captures hierarchical WSI structures, and models trained on massive proprietary datasets [campanella2023computational]. UNI (2308.15474) and Virchow (2309.07778) are prominent examples trained on large pan-cancer datasets using DINOv2 (2304.07193), showing state-of-the-art performance on various tasks. HistoEncoder (2411.11458) uses DINO on an efficient XCiT backbone.
- Masked Image Modeling (MIM): Inspired by NLP, MIM has become popular for pathology FMs. Phikon (2307.21232) used iBOT (2111.07832) on TCGA data and scaled to different ViT sizes. Subsequent models like UNI (2308.15474), Virchow (2309.07778), RudolfV (2401.04079), PLUTO (2405.07905), Hibou-L (2406.05074), H-optimus-0 [hoptimus0], Virchow2G (2408.00738), and Phikon-v2 (2409.09173) have adopted DINOv2, which combines DINO-style self-distillation with iBOT-style masked image modeling, as the preferred framework, often scaling to billions of patches and millions of WSIs. Atlas (2501.05409) demonstrated that data diversity (stains, scanners, magnifications) can be more impactful than sheer data volume alone.
Vision-Language FMs (VLFMs) in pathology combine image and text data, often using cross-modal contrastive objectives. Models like MI-Zero [lu2023visual] and PLIP [huang2023visual] adapt the CLIP framework, aligning image patches with text descriptions from pathology reports or other sources. QuiltNet [ikezogwo2024quilt] and CONCH [lu2024visual] further scale VLFM training. Generative VLFMs like PathAsst (2305.15072), PA-LLaVA (2408.09530), PathChat (2312.07814), and Quilt-LLaVA [seyfioglu2024quilt] are instruction-tuned for multimodal dialogue and visual question answering.
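The cross-modal contrastive objective these models share can be written as a symmetric image-text contrastive (ITC) loss; the sketch below assumes placeholder encoders rather than any released checkpoint.

```python
# Minimal sketch of a CLIP-style image-text contrastive (ITC) objective.
import torch
import torch.nn.functional as F

def itc_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (N, D); row i of each comes from the same image-caption pair."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature                     # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)  # diagonal = positives
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```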
Slide-level FMs aim to capture global WSI context, which is challenging due to gigapixel size. LongViT (2312.03558) uses LongNet (2004.05150) with dilated attention to process long patch sequences end-to-end. Recent models like Prov-GigaPath (2404.03512) and PRISM (2405.10254) combine tile-level encoders with LongNet or other aggregation methods and add vision-language alignment. THREADS (2501.16652) integrates molecular supervision.
The Segment Anything Model (SAM) (2304.02643), originally trained on natural images, has been explored for pathology segmentation [chauveau2023segment]. While out-of-the-box performance varies [deng2023segment], fine-tuning and adaptation methods [ranem2023exploring, zhang2023sam] show promise for specific tasks like nuclei or tumor bud segmentation.
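For reference, out-of-the-box SAM is typically driven by point or box prompts, as in the minimal sketch below using the public `segment_anything` package; the checkpoint path, input patch, and click location are placeholders, and the adaptation methods cited above build on or fine-tune this interface.

```python
# Minimal sketch of point-prompted SAM inference on a histology patch.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # placeholder checkpoint path
predictor = SamPredictor(sam)

patch_rgb = np.zeros((256, 256, 3), dtype=np.uint8)             # stand-in for an H&E tile
predictor.set_image(patch_rgb)

x, y = 128, 128                                                 # a click on a structure of interest
masks, scores, _ = predictor.predict(
    point_coords=np.array([[x, y]]),
    point_labels=np.array([1]),                                  # 1 = foreground prompt
    multimask_output=True,                                       # return several candidate masks
)
```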
Model | Backbone | SSL | Pre-Training Data Size (Patches/WSIs) | Source | Weights Availability | Data Availability |
---|---|---|---|---|---|---|
REMEDIS | ResNet-152 | SimCLR | 50M (29K) | TCGA | Yes | Yes |
RetCCL | ResNet-50 | CCL | 15M (32K) | TCGA, PAIP | Yes | Yes |
CTransPath | SwinTransformer | SRCL | 15M (32K) | TCGA, PAIP | Yes | Yes |
HIPT | ViT-HIPT | DINO | 104M (11K) | TCGA | Yes | Yes |
Campanella et al. | ViT-S | DINO | 3B (400K) | MSHS* | No | No |
Lunit | ViT-S/8, ViT-S/16 | DINO | 33M (37K) | TCGA, TULIP* | Yes | Yes*/No* |
Phikon | ViT-B/16 | iBOT | 43M (6K) | TCGA | Yes | Yes |
UNI | ViT-L/16 | DINOv2 | 100M (100K) | Mass-100K* | Yes | No* |
Virchow | ViT-H/14 | DINOv2 | 2B (1.5M) | MSKCC* | Yes | No |
RudolfV | ViT-L/14 | DINOv2 | 1.2B (134K) | TCGA, Proprietary | No | No |
PLUTO | ViT-S/8, ViT-S/16 | Modified DINOv2 | 195M (158K) | TCGA, PathAI* | No | No |
Hibou-L | ViT-L/14 | DINOv2 | 1.2B (1.1M) | Proprietary | Yes | No |
H-optimus-0 | ViT-G | DINOv2 | - (500K) | Proprietary | No | No
Virchow2G | ViT-G/14 | DINOv2 | 1.9B (3.1M) | MSKCC* | No | No |
Phikon-v2 | ViT-L | DINOv2 | 456M (58.4K) | PANCAN-XL | Yes | Yes*/No* |
Atlas | ViT-H/14 | RudolfV (DINOv2-based) | 520M (1.2M) | Proprietary | No | No |
*Proprietary/Partially available data.
Model | Type | Backbone | SSL | Pre-Training Data Size (Image-Text Pairs) | Source | Weights Availability | Data Availability |
---|---|---|---|---|---|---|---|
PLIP | VLFM | ViT-B/32 + Transformer | ITC | 208K | OpenPath | Yes | Yes |
MI-Zero | VLFM | CTransPath + GPT2-medium | ITC | - (33K) | Proprietary | Yes | No |
QuiltNet | VLFM | ViT-B/32 + PubmedBert | ITC | 34K | QUILT | Yes | Yes |
CONCH | VLFM | ViT-B/16 + GPT2-medium | iBOT* + CoCa | 1.1M | PMC-Path, EDU | Yes | Yes |
PathCLIP | VLFM | CLIP vision + Vicuna-13b text | ITC | 207K | PathCap | Yes | Yes |
PA-LLaVA | VLFM | PLIP vision + Llama-3 text | ITC + ITM | 1.4M | PMV, PMC-OA, Quilt-1M | Yes | Yes
PathChat | VLFM | UNI + LLaMA-2 | CoCa | 100K | PMC-OA, Proprietary | Yes*/No* | Yes*/No*
Quilt-LLaVA | VLFM | QuiltNet vision + Vicuna text | ITC | 723K | Quilt-1M | Yes | Yes |
*iBOT is used for training their own image encoder.
Model | Type | Backbone | SSL | Pre-training Data Size (WSIs) | Source | Weights Availability | Data Availability |
---|---|---|---|---|---|---|---|
Giga-SSL | VFM | ResNet-18 | SimCLR | 12K | TCGA | Yes | Yes |
LongViT | VFM | LongNet | DINO | 10K | TCGA | Yes | Yes |
PRISM | VLFM | Virchow + BioGPT | CoCa | 590K | Proprietary | Yes | No |
Prov-GigaPath | VLFM | ViT-G + PubMedBERT | DINOv2 + ITC | 170K | Prov-Path | Yes | No |
mSTAR | VLFM | ViT-L | mSTAR | 26K | TCGA | No | No |
KEEP | VLFM | UNI + PubMedBERT | Knowledge-enhanced ITC | 143K | OpenPath, Quilt-1M | Yes | Yes
CHIEF | VLFM | CTransPath + CLIP text | SRCL (v) + ITC (vl) | 61K | Proprietary | Yes | No |
TITAN | VLFM | ViT-B + CONCHv1.5 | iBOT (v) + CoCa (vl) | 60K | GTEx | Yes | Yes |
COBRA | VLFM | ABMIL + Mamba-2 | COBRA | 3K | mTCGA | Yes | Yes |
THREADS | VFM | CONCHv1.5 (patch) + cGPT (gene) | Contrastive | 47K | MBTG-47K | No* | Yes*/No*
*Authors train both an image encoder and a VL-alignment module. Partially available: TCGA and GTEx are open-source, but pretraining data from BWH and MGH are proprietary. Planned to be released.
Foundation Models in Radiology
Radiology encompasses diverse modalities (X-ray, CT, MRI, US, PET), presenting challenges due to varied data formats (2D, 3D) and non-standardized reports. Early models focused on transformer architectures for specific tasks without large-scale SSL. Contrastive vision-language learning emerged as an early SSL approach, aligning images with text descriptions [huang_gloria_2021].
Generalist VLFMs aim to handle multiple medical imaging modalities, often using datasets from sources like PubMed Central. Models like PubMed-CLIP (2112.13906), PMC-CLIP (2303.07240), BiomedCLIP (2303.00915), and MEDVInT (2305.10415) leverage large image-text pairs for multimodal tasks. LLaVA-Med (2306.00890), Qilin-Med-VL (2310.17956), and Med-Flamingo (2308.01390) specialize in Medical Visual Question Answering (MedVQA).
Model | Type | Imaging Domain | Backbone | SSL Objective | Data Size | Weights Availability | Data Availability |
---|---|---|---|---|---|---|---|
PubMedCLIP | VLM | 2D Radiology* | ResNet50 + ViT-B/32 | ITC | 80K | Yes | Yes
UniMiSS | VM | CT(3D), X-Ray | MiT | MIM | 15K | No | Yes |
BiomedCLIP | VLM | Multimodal** | ViT-B/16 + PubMedBERT | ITC | 15M | Yes | Yes |
PMC-CLIP | VLM | 2D Radiology* | ResNet50 + PubmedBERT | ITC + MLM | 1.65M | Yes | Yes |
MEDVInT | VLM | Multimodal** | ResNet50 + Transformer | MLM | 227K | No | Yes |
LLaVA-Med | VLM | Multimodal** | LLaVA | ITC | 60K | Yes | Yes |
LVM-Med | VM | CT(2D), MRI, X-ray, US | ResNet-50 + ViT | Graph Matching | 1.3M | No | Yes |
Med-Flamingo | VLM | Multimodal** | ViT/L-14 + LLaMA-7B | ITC | 1.6M | Yes | Yes |
RadFM | VLM | Various Radiologies* | 3D ViT + MedLLaMA-13B | Generative ITC | 16M*** | Yes | Yes |
Qilin-Med-VL | VLM | Multimodal** | ViT/L-14 + LLaMA-13B | ITC | 580K | Yes | Yes |
SAT | VLM | Multimodal** | 3D U-Net + BioBERT | ITC | 302K | No | Yes |
VISION-MAE | VM | Various Radiologies* | Swin-T | MAE | 2.5M | No | No |
RadCLIP | VLM | CT (2D+3D), X-Ray, MRI | ViT-L/14 | Contrastive | 1.2M | No | Yes |
*Wide range of 2D radiological imaging types. **Multimodal denotes a broad range of radiological imaging types, as well as pathological image types. ***2D and 3D image-text pairs, comprising 15.5M 2D images and 500K 3D images.
Chest X-ray (CXR) is a subdomain with more FM development due to data availability and lower complexity. Models like MedCLIP (2210.10163) leverage unpaired data, while CheXzero (2212.00751) and CXR-CLIP (2307.07645) focus on high-quality paired data using contrastive learning and radiologist-designed prompts. KAD (2307.22287) incorporates medical knowledge graphs. UniChest (2403.13405) uses a conquer-and-divide framework for multi-source data generalization. ELIXR (2308.01317), MAIRA-1 (2311.13668), and MAIRA-2 (2406.04449) use adapters for efficient VLM fine-tuning. CheXAgent (2401.12208) scaled instruction tuning to a 6.1M-sample dataset.
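The prompt-based zero-shot inference used by CheXzero-style models can be sketched as follows; `image_encoder` and `text_encoder` are assumed placeholders and the prompt wording is illustrative rather than taken from any paper.

```python
# Minimal sketch of zero-shot CXR classification via positive/negative prompt similarity.
import torch

def zero_shot_probability(image: torch.Tensor, finding: str,
                          image_encoder, text_encoder) -> float:
    prompts = [finding, f"no {finding}"]                 # positive / negative prompt pair
    text_emb = text_encoder(prompts)                     # (2, D), assumed unit-normalised
    img_emb = image_encoder(image.unsqueeze(0))          # (1, D), assumed unit-normalised
    logits = img_emb @ text_emb.t()                      # cosine similarities
    probs = torch.softmax(logits / 0.07, dim=1)          # temperature-scaled softmax
    return probs[0, 0].item()                            # probability that the finding is present

# e.g. zero_shot_probability(cxr_tensor, "pleural effusion", image_encoder, text_encoder)
```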
Model | Type | Backbone | SSL Objective | Data Size (Volumes/Pairs) | Weights Availability | Data Availability |
---|---|---|---|---|---|---|
BioViL-T | VLFM | Custom CNN–Transformer + BERT | Temporal Multi-Modal | 174.1K | No | Yes |
ELIXR | VLFM | SupCon + T5 | ITC | 220K | No | Yes |
MaCo | VLFM | ViT-B/16 + BERT | MLM | 377K | No | Yes |
CXR-CLIP | VLFM | ResNet50 + BioClinicalBERT | ITC | 15M | Yes | Yes |
UniChest | VLFM | ResNet50 + BioClinicalBERT | ITC | 686K | Yes | Yes |
RAD-DINO | VFM | ViT-B/14 | DINOv2 | 838K | No | Yes/No*
CheXagent | VLFM | Ensemble | ITC + IC | 6.1M** | No* | No* |
Ray-DINO | VFM | ViT-L | DINOv2 | 863K | No | Yes
*Partially available / planned to be released. **Pre-training dataset size is not reported; the model is fine-tuned on 6.1M samples.
Extending FMs to 3D modalities like CT and MRI is challenging. Many early VLFMs for 3D data used 2D slices [lei2023clip]. While models like M3FM [niu2023medical] and MedBLIP [chen2023medblip] incorporated 3D data by processing patches, fully integrated 3D models are less common. RadFM (2308.02463) processed full 3D volumes and introduced a large 2D/3D dataset, but was limited in vision-specific tasks. M3D-LaMed (2404.00578) and SAT (2312.17183) improved 3D segmentation capabilities. RadCLIP (2403.09948) and CT-CLIP (2403.17834) adapted CLIP for 3D, with the latter using a 3D ViT on a dedicated CT dataset. Merlin (2406.06512) integrates structured EHR and unstructured text with 3D CT.
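One recurring design question for 3D FMs is how to tokenize a volume; a common choice, sketched below with illustrative sizes, is to embed non-overlapping 3D patches with a strided Conv3d before the transformer layers.

```python
# Minimal sketch of 3D patch embedding for a volume transformer.
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    def __init__(self, patch_size=(4, 16, 16), in_channels=1, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        # volume: (B, 1, D, H, W), e.g. a CT scan of shape (1, 1, 64, 224, 224)
        tokens = self.proj(volume)                    # (B, embed_dim, D', H', W')
        return tokens.flatten(2).transpose(1, 2)      # (B, num_patches, embed_dim)

# PatchEmbed3D()(torch.randn(1, 1, 64, 224, 224)).shape  ->  (1, 3136, 768)
```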
Vision-Only models are emerging in radiology for spatially focused tasks. Early efforts used multi-view or sequence-based approaches for 3D data [jun2021medicaltransformeruniversalbrain]. Recent models employ knowledge distillation [ye_desd_2022, jiang_self-supervised_2022], MIM [xie2022unimiss], or graph-based SSL [nguyen_lvm-med_2023]. RAD-DINO (2401.10815) and Ray-DINO (2405.01469) adapted DINOv2 for medical imaging, demonstrating strong performance and generalization from image-only pre-training. VISION-MAE (2402.01034) applied MAE to 3D data. SAM has also been evaluated for radiology segmentation [mazurowski2023segment, roy_sammd_2023, gao_desam_2023], showing versatility but facing domain adaptation challenges.
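The MAE-style masking at the heart of several of these vision-only models amounts to randomly dropping most patch tokens before the encoder; the sketch below shows only the masking step, with an illustrative ratio and shapes (the reconstruction loss is omitted).

```python
# Minimal sketch of MAE-style random masking over a patch-token sequence.
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, N, D). Returns visible tokens and the indices of masked tokens."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)        # random score per token
    ids_shuffle = noise.argsort(dim=1)                    # random permutation per sample
    ids_keep = ids_shuffle[:, :num_keep]                  # tokens the encoder will see
    ids_masked = ids_shuffle[:, num_keep:]                # tokens the decoder must reconstruct
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_masked

# visible, ids_masked = random_masking(patch_tokens); the encoder processes only `visible`.
```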
Model | Type | Imaging Domain | Backbone | SSL Objective | Data Size (Volumes) | Weights Availability | Data Availability |
---|---|---|---|---|---|---|---|
DeSD | VM | CT | 3D ResNet50 | DeSD | 11K | No | Yes |
SMIT | VM | CT, MRI | SWIN-small | MIM + Self-Distillation | 3643 | Yes | No |
Medical Transformer | VM | MRI | ResNet-18 + Transformer | MAE | 1783 | No | Yes |
M3FM | VLM | CT | 3D CT-ViT + Transformer | ITC | 163K | Yes | Yes |
CLIP-LUNG | VLM | CT | ViT-B/16 + ResNet18 | ITC | 1010 | No | Yes |
Niu et al. | VM | CT | 3D ViT | Region-Contrastive | 684 | No | Yes |
MedBLIP | VLM | MRI | MedQFormer + BioMedLM* | ITC | 30K | Yes | Yes |
MeTSK | VM | fMRI | STGCN | GraphCL | 1415 | No | Yes |
Pai et al. | VM | CT | 3D ResNet | Modified SimCLR | 11.5K | Yes | Yes |
M3D-LaMed | VLM | CT | 3D ViT + LLaMA-2 | ITC | 120K | Yes | Yes |
Merlin | VLM | CT | I3D ResNet152 + Clinical-Longformer | Contrastive | 25K | No* | No* |
CT-CLIP | VLM | CT | 3D CT-ViT + CXR-Bert | ITC | 26K | Yes | Yes |
*Planned to be released. Where authors experiment with multiple language encoders, the table shows their best-performing architecture.
Foundation Models in Ophthalmology
Ophthalmology benefits from relatively larger datasets for modalities like Color Fundus Photography (CFP) and Optical Coherence Tomography (OCT). These images are rich in detail and can indicate both eye diseases and systemic conditions.
The first ophthalmology FM was RETFound (2303.10126), a ViT pre-trained on 1.6 million retinal images using MAE. It showed strong performance on detecting eye and systemic diseases with minimal fine-tuning. Variants like RETFound-Green (2405.00117) focused on efficiency, while DERETFound [yan2025expertise] integrated synthetic data and expert text tagging. FLAIR (2308.07898) used CLIP to align CFP images with text descriptions for enhanced performance. DRStageNet (2312.14891) is a CFP model specifically tailored for diabetic retinopathy staging using DINOv2.
Multi-imaging ophthalmology FMs aim to handle diverse modalities beyond just retina. VisionFM (2310.04992) integrates multiple ophthalmic imaging types for multi-task diagnosis and segmentation, though specific architectural details are limited. EyeFound (2405.11338) is a general ophthalmic FM with a large ViT backbone and MAE pre-training, showing strong performance across diverse modalities.
Techniques to improve data efficiency have also been explored, such as combining Neural Style Transfer (NST) with contrastive learning (FundusNet (2304.06047)) and annotation-efficient methods for OCT segmentation [zhang_annotation-efficient_2023]. SAM has been adapted for ophthalmology segmentation tasks, such as SAM-U for uncertainty estimation in retinal images and SAM-OCTA (2309.11758) for OCTA image segmentation.
Model | Type | Imaging Domain | Backbone | Self-Supervised Learning | Data Size | Model Size | Weights Availability | Data Availability |
---|---|---|---|---|---|---|---|---|
RETFound | VM | CFP, OCT | ViT-Large | MAE | 1.6M | 307M | Yes | Yes/No* |
FLAIR | VLM | CFP | ResNet-50 | CLIP | 285K | 23M | Yes | Yes |
VisionFM** | VM | Various*** | - | - | 3.4M | - | No | Yes/No**** |
DRStageNet | VM | CFP | ViT-Base | DINOv2 | 93.5K | 86.9M | No | Yes |
DERETFound | VLM | CFP | ViT-Large | MAE | 150K + 1M***** | 307M | Yes | Yes |
RETFound-Green | VM | CFP | ViT-Small | Token Reconstruction | 75K | 22.2M | Yes****** | Yes |
EyeFound | VM | Various******* | ViT-Large | MAE | 2.8M | 307M | No | No |
*All datasets are publicly available except for the MIDAS dataset, which is subject to controlled access via an application process. **No architectural details available. ***CFP, FFA, OCTA, OCT, Slit-Lamp, B-Scan Ultrasound. ****All datasets are publicly available except for one private MRI dataset. *****150K real and 1M synthetic generated images. ******Planned to be released. *******CFP, FFA, ICGA, FAF, RetCam, Ocular Ultrasound, OCT, Slit-Lamp, External Eye Photo, Specular Microscope, Corneal Topography.
Compared to radiology, ophthalmology has fewer VLFMs extensively utilizing report data, possibly due to clinical workflows relying more on direct image interpretation than detailed paired reports [shweikh2023growing]. A key challenge in ophthalmology FM research is the lack of widespread inter-model comparisons on standardized benchmarks.
Challenges and Future Directions
Despite the significant progress, several challenges hinder the widespread clinical adoption of FMs in medical imaging.
Technical Challenges:
- Lack of Open-Source Large-Scale Clinical Datasets: Privacy regulations and institutional policies limit access to large, diverse clinical datasets, impeding reproducibility and collaboration. Federated learning [li2025open] and differential privacy [wang2024fedmeki] could help; a minimal federated-averaging sketch follows this list.
- Increased Computational Demands: Scaling FMs for high-resolution WSIs and 3D volumes requires significant computational resources. Developing more efficient architectures and training strategies tailored for volumetric data is crucial [dominic_improving_2023].
- Scaling Limitations: Recent studies in pathology suggest a plateau in performance gains by simply scaling data and model size, emphasizing the need for better data curation, diverse benchmarks (beyond common public datasets), and domain-specific SSL algorithms [aben2024towards, chen2024benchmarking, campanella2024clinical]. Data quality and diversity across organs, pathologies, stains, and scanners are highlighted as critical [alber2025novel, dippel2024rudolfv].
- Adapting to the Medical Domain: Generic SSL frameworks optimized for natural images may not fully capture unique medical image characteristics like lack of canonical orientation, color variation, and the importance of specific fields of view [kang2023benchmarking].
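As noted above, a minimal federated-averaging (FedAvg) sketch is given below; it is illustrative only, with hypothetical client data loaders and none of the secure-aggregation or differential-privacy machinery a real deployment would need.

```python
# Minimal sketch of federated averaging (FedAvg) across institutions.
import copy
import torch
import torch.nn as nn

def federated_average(global_model: nn.Module, client_loaders, local_steps: int = 1,
                      lr: float = 1e-4) -> nn.Module:
    client_states = []
    for loader in client_loaders:                     # each site trains on its own data locally
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _, (images, labels) in zip(range(local_steps), loader):
            loss = nn.functional.cross_entropy(local(images), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
        client_states.append(local.state_dict())
    # Only model weights are exchanged and averaged; raw images never leave a site.
    avg_state = {k: torch.stack([s[k].float() for s in client_states]).mean(dim=0)
                 for k in client_states[0]}
    global_model.load_state_dict(avg_state)
    return global_model
```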
Practical Challenges in Clinical Settings:
- Explainability: FMs are often black boxes, making it difficult for clinicians to understand predictions, identify biases, or trust results. Explainable AI (XAI) techniques are needed to provide transparent and interpretable insights [abbas_xdecompo_2022, pham_i-ai_2023].
- Robustness and Domain Generalization: Models must be robust to variations introduced by different equipment, protocols, and institutions. Current models can be sensitive to confounding factors like center-specific artifacts [de2025current]. Methods like distillation can improve robustness [filiot2025distilling].
- Fairness, Bias, and Accessibility: FMs can inherit biases from training data, leading to disparate performance across patient subgroups (e.g., by race or sex) [glocker_risk_2023, czum2023bias, khan_how_2023]. Ensuring diverse training data and developing debiasing strategies are essential. Accessibility is also a concern, as large FMs require substantial compute, highlighting the importance of efficient architectures [engelmann2024training] and PEFT [dutt_parameter-efficient_2023].
- Regulation: Clinical adoption requires addressing risks like hallucinations (plausible but incorrect outputs) and developing mechanisms for continual learning [wang2024comprehensive] to adapt to new data and disease variants without catastrophic forgetting.
In conclusion, foundation models hold immense potential for transforming medical image analysis by enabling data-efficient, generalizable, and robust AI systems. While significant progress has been made across pathology, radiology, and ophthalmology, particularly through large-scale self-supervised learning and multimodal approaches, challenges related to data access, computational resources, domain adaptation, interpretability, bias, and regulation must be actively addressed for successful clinical deployment. Future research should focus on developing more efficient and robust FM architectures, improving data curation and diversity, creating standardized benchmarks, and advancing XAI and continual learning techniques tailored to the complexities of medical imaging.