VersaMammo: AI Mammogram Screening Model
- VersaMammo is a mammogram-specific foundation model that employs a two-stage pre-training strategy using a Vision Transformer and CNN-based EfficientNet for enhanced lesion detection and classification.
- It integrates over 700,000 images from multi-institutional datasets to overcome challenges in diversity, generalizability, and clinical relevance seen in earlier models.
- Benchmarked across 92 clinical tasks, VersaMammo shows state-of-the-art performance in lesion detection, segmentation, classification, image retrieval, and visual question answering.
VersaMammo is a mammogram-specific foundation model designed for AI-enabled breast cancer screening and diagnosis. Developed to overcome limitations in diversity, generalizability, and clinical relevance of earlier models, VersaMammo leverages a large-scale, multi-institutional dataset and a two-stage hybrid pre-training paradigm, achieving state-of-the-art results across lesion detection, lesion segmentation, classification, image retrieval, and visual question answering tasks (Huang et al., 24 Sep 2025).
1. Model Architecture and Training Strategy
VersaMammo employs a two-stage pre-training approach. The first stage trains a Vision Transformer (ViT) via self-supervised learning on unlabeled mammograms, learning global representations through both contrastive loss and Masked Image Modeling (MIM). The contrastive loss encourages similar representations for augmentations of the same image:

$$\mathcal{L}_{\text{con}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $z_i$ and $z_j$ are embeddings of different augmented views of an image, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, and $\tau$ is the temperature parameter.
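A contrastive objective of this kind can be sketched in a few lines of plain Python. The function names, the single-negative setup, and the temperature value below are illustrative assumptions, not VersaMammo's actual implementation:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def contrastive_loss(z_i, z_j, negatives, tau=0.1):
    """InfoNCE-style loss: pull the two augmented views z_i and z_j
    together while pushing z_i away from the negative embeddings."""
    pos = math.exp(cosine_sim(z_i, z_j) / tau)
    neg = sum(math.exp(cosine_sim(z_i, z_k) / tau) for z_k in negatives)
    return -math.log(pos / (pos + neg))
```

In practice the negatives are the other images in the batch; the loss drops toward zero as the two views of the same image align and the negatives diverge.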
For MIM, the reconstruction loss is defined as:

$$\mathcal{L}_{\text{MIM}} = \left\| x - \hat{x} \right\|_2^2$$

where $x_m$ is the masked input and $\hat{x}$ is the reconstruction of the original image $x$ produced from $x_m$.
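The masked-reconstruction objective amounts to a squared error restricted to the hidden patches, since only the pixels the model never saw carry learning signal. This toy sketch (a hypothetical helper, with images flattened into lists) shows the idea:

```python
def mim_loss(original, reconstruction, mask):
    """Mean squared error over masked positions only: the model is
    scored on reconstructing pixels that were hidden from it.
    `mask` holds 1 where a patch/pixel was masked, 0 where visible."""
    masked = [(o - r) ** 2
              for o, r, m in zip(original, reconstruction, mask) if m]
    return sum(masked) / len(masked)
```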
The second stage transfers knowledge from the ViT (teacher) to a CNN-based EfficientNet-B5 (student). Supervised training combines clinical classification losses with knowledge distillation, blending direct supervised signals with teacher-informed feature alignment:

$$\mathcal{L} = \lambda_{\text{sup}} \mathcal{L}_{\text{sup}} + \lambda_{\text{KD}} \mathcal{L}_{\text{KD}} + \lambda_{\text{con}} \mathcal{L}_{\text{con}}$$

where each $\lambda$ balances its respective loss, combining the supervised, distillation, and contrastive objectives.
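The blended objective can be sketched as a feature-alignment term plus a weighted sum; the helper names and the weight values here are assumptions for illustration, not the paper's reported hyperparameters:

```python
def kd_feature_loss(student_feat, teacher_feat):
    """Feature-alignment distillation term: mean squared distance
    between the student's and the (frozen) teacher's embeddings."""
    return sum((s - t) ** 2
               for s, t in zip(student_feat, teacher_feat)) / len(student_feat)

def total_loss(l_sup, l_kd, l_con, lam=(1.0, 0.5, 0.5)):
    """Weighted combination of supervised, distillation, and
    contrastive objectives; the lam weights are hypothetical."""
    return lam[0] * l_sup + lam[1] * l_kd + lam[2] * l_con
```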
This dual-phase teacher-student framework concentrates rich global context in the initial pre-training stage and efficiently exploits localized features via the CNN architecture in the downstream model. The distillation process enables the final model to inherit both the generalizability of the large teacher and the inference efficiency of the student.
2. Training Dataset: Scale and Diversity
VersaMammo is trained on the largest curated mammogram dataset to date, comprising 706,239 images drawn from 21 separate sources (16 public and 5 private datasets). This multi-institutional composition addresses common generalization challenges in mammographic AI: cross-site variations, different imaging protocols, and diverse patient populations. Such scale enables robust learning of both universal and institution-specific imaging features, mitigating overfitting and promoting transferability.
3. Clinical Task Coverage and Benchmarking
VersaMammo is benchmarked across 92 clinically relevant tasks, subdivided into five principal categories:
- Lesion detection (localization and bounding box prediction)
- Lesion segmentation (dense mask delineation)
- Classification (including breast composition, risk assessment, BI-RADS categorization, pathological diagnosis, molecular subtyping, and more)
- Image retrieval (cross-sample similarity and case comparison)
- Visual Question Answering (VQA: answering diagnostic and anatomical questions from images)
Among these, 68 are internal (validation and test sets from the trained institutions) and 24 are external (unseen datasets, for robust generalizability assessment).
VersaMammo ranks first in 50 out of 68 internal tasks and 20 out of 24 external tasks, with average ranks of 1.5 and 1.2, respectively.
4. Performance Metrics
VersaMammo’s performance is measured using standard, well-defined metrics for each clinical task:
- Lesion detection: Mean Intersection over Union (IoU), with scores of 57.92% (high-res) and 52.69% (low-res)
- Segmentation: Mean Dice coefficient, with scores of 63.21% (high-res) and 59.04% (low-res)
- Classification: Area Under the Receiver Operating Characteristic curve (AUC), F1 score, and accuracy (ACC). For example, in BI-RADS assessment, VersaMammo achieves an average AUC of 82.95% (internal) and 67.26% (external).
- Image Retrieval and VQA: Retrieval accuracy and AUC across anatomical, compositional, and pathology questions
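The detection and segmentation metrics above have simple closed forms. This sketch computes IoU and the Dice coefficient on flattened binary masks; it is illustrative only, not the benchmark's evaluation code:

```python
def iou(pred, gt):
    """Intersection over Union for binary masks given as flat lists of 0/1."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0  # two empty masks match perfectly

def dice(pred, gt):
    """Dice coefficient: 2*|A intersect B| / (|A| + |B|)."""
    inter = sum(p & g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2 * inter / total if total else 1.0
```

Dice is always at least as large as IoU on the same pair of masks (Dice = 2*IoU / (1 + IoU)), which is worth remembering when comparing detection and segmentation numbers across papers.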
5. Advances over Prior Foundation Models
Previous foundation models for mammography were constrained by a lack of scale, limited paired image-report annotations, and narrow clinical evaluation. Generalizability issues were common due to overfitting to single-site datasets. VersaMammo breaks this pattern by:
- Utilizing broad and heterogeneous training data
- Applying self-supervised learning for feature extraction, which does not require manual annotation
- Ensuring efficient knowledge transfer to a CNN architecture for downstream deployment
- Evaluating across 92 tasks to provide a comprehensive assessment of clinical relevance
6. Clinical Utility and Translational Impact
VersaMammo supports a range of diagnostic and decision support functions, including robust lesion detection/localization, mask segmentation, risk stratification (BI-RADS), retrieval and comparison of similar cases, and answering visual questions related to breast anatomy or pathology. Its consistent ranking across internal and external benchmarks suggests broad generalizability, which is critical for deployment across diverse clinical settings. The model’s ability to standardize assessment (e.g., BI-RADS scoring) could reduce variability among radiologists and improve early breast cancer detection.
7. Future Prospects and Integration Potential
The two-stage hybrid training and knowledge distillation paradigm found in VersaMammo establishes a prototype for future foundation models in medical imaging domains. The use of self-supervised ViT teachers for robust context learning, followed by efficient student CNN deployment, is likely to be extended to other applications requiring broad generalization with clinical interpretability. A plausible implication is that further expansion of the training dataset and benchmarking tasks might drive even higher clinical utility, promoting more reliable, scalable AI tools for breast imaging and other radiological screening scenarios.
VersaMammo exemplifies the convergence of large-scale multi-institutional data, advanced pre-training strategies, and comprehensive benchmarking, presenting a robust and generalizable foundation model for AI-driven clinical interpretation of mammograms (Huang et al., 24 Sep 2025).