Multi-Institutional Mammogram Dataset

Updated 1 October 2025

Multi-institutional mammogram datasets are collections that aggregate images and clinical data from various centers, capturing diverse imaging protocols and patient demographics.
They enable robust AI model training by addressing data heterogeneity, reducing bias, and improving generalization across different vendor devices and protocols.
Advanced methodologies such as multi-instance learning, multi-view fusion, and federated learning are applied to overcome annotation disparities and scalability challenges.

A multi-institutional mammogram dataset comprises mammographic images and associated clinical data collected from multiple distinct clinical centers, institutions, or imaging sources. These datasets are critical for developing, benchmarking, and generalizing AI algorithms applied to breast cancer screening and diagnosis. Multi-institutional datasets inherently capture heterogeneity in imaging hardware, acquisition protocols, patient demographics, and annotation practices, making them indispensable for both robust algorithmic development and translational evaluation across geographically and technically diverse clinical environments.

1. Composition and Diversity

Multi-institutional mammogram datasets are constructed by aggregating imaging data and corresponding metadata from several independent sources. One of the largest reported multi-source aggregations, assembled for the VersaMammo project, includes 706,239 images from 21 sources, covering both public repositories (e.g., CBIS-DDSM, INBreast, BMCD, RSNA-Mammo) and private institutional cohorts (Huang et al., 24 Sep 2025). This diversity encompasses:

Multiple imaging vendors and device types.
Variation in imaging protocols and acquisition parameters.
Broad patient demographic representation, including datasets such as EMBED, which was specifically curated for racial and epidemiological diversity (Jeong et al., 2022).
A wide range of clinical scenarios: screening and diagnostic exams, digital breast tomosynthesis (DBT), full-field digital mammograms (FFDM), and synthetic views.
Inclusion of both processed and raw DICOM images, permitting algorithmic exploration of pre- and post-processing effects (Halling-Brown et al., 2020).

Many datasets link images to comprehensive clinical and pathological information, covering prior screening history, biopsy results, surgical outcomes, and longitudinal follow-up. Additionally, expert radiologist annotation may be present for lesion localization, mass characteristics, and structured imaging descriptors such as BI-RADS assessment (Jeong et al., 2022, Halling-Brown et al., 2020).

2. Technical and Methodological Challenges

Multi-institutional mammogram datasets raise technical challenges that single-institution datasets largely avoid:

Heterogeneity in Data Distribution: Multi-institutional sources introduce covariate shifts related to device, protocol, and population, as observed in domain transfer experiments and cross-validation protocols (Yang et al., 2023, Seyyedi et al., 2020). This variability necessitates robust pre-processing (e.g., standardized cropping (Ibragimov et al., 3 Nov 2024)), harmonization, and domain generalization mechanisms within model architectures.
Annotation Disparity: Detailed region-of-interest (ROI) annotations are often institution-specific, limiting their scalability. End-to-end weakly supervised methods such as deep multi-instance learning (MIL) frameworks have been proposed to learn from whole-image or breast-level labels without the need for ROI annotation (Zhu et al., 2016, Zhu et al., 2017, Pathak et al., 2023).
Scalability: High-resolution images (often >3k×3k pixels), massive image counts (>10⁶ images), and privacy restrictions necessitate computationally efficient algorithms for storage, transfer, and training. Federated learning frameworks have been successfully applied to multi-institutional data for breast density estimation, preserving privacy while improving generalizability (Muthukrishnan et al., 2022).
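
The standardized cropping step mentioned above can be illustrated with a minimal sketch. It assumes a 2D grayscale image with a dark background; the function name `crop_to_breast` and the `threshold` value are hypothetical choices for illustration and are not taken from the cited pipelines.

```python
import numpy as np

def crop_to_breast(image: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Crop a 2D grayscale mammogram to the bounding box of non-background pixels.

    Assumption: background is darker than tissue, so pixels whose min-max
    normalized intensity exceeds `threshold` are treated as foreground.
    """
    if image.ndim != 2:
        raise ValueError("expected a 2D grayscale image")
    img = image.astype(np.float64)
    span = img.max() - img.min()
    if span == 0:
        return image  # constant image: nothing to crop
    norm = (img - img.min()) / span
    mask = norm > threshold
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing foreground
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing foreground
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Synthetic example: a bright 4x4 "breast" region on a dark 8x10 canvas.
demo = np.zeros((8, 10))
demo[2:6, 0:4] = 1.0
cropped = crop_to_breast(demo)
```

Applying the same crop rule to every source removes much of the vendor-specific variation in background area, although real pipelines typically also handle burned-in labels and laterality.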

3. Impact on AI Algorithm Design and Evaluation

Large-scale multi-institutional datasets underpin advances in both general-purpose mammography foundation models and specialized screening algorithms. Key effects on algorithm development include:

Generalization Across Domains: Diverse training cohorts have been shown to improve generalization and reduce model sensitivity to domain shifts, as evidenced by models such as VersaMammo (Huang et al., 24 Sep 2025), MammoDG (Yang et al., 2023), and SCREENet (Seyyedi et al., 2020). Generalization is typically validated by evaluating models on held-out "unseen" domains, which may be drawn from distinct institutions or acquired on different vendor devices (Yang et al., 2023).
Benchmarking and Task Diversity: Multi-institutional resources enable the construction of comprehensive benchmarks, as exemplified by the 92-task suite used for VersaMammo (lesion detection, segmentation, classification, retrieval, VQA) (Huang et al., 24 Sep 2025). Performance metrics include AUC, F1, accuracy, Dice coefficient for segmentation, and top-k retrieval accuracy.
Reduction of Bias and Evaluation of Fairness: Datasets such as EMBED, with racially and demographically balanced cohorts, allow for the development and auditing of AI models on underrepresented populations, thus directly addressing equity in diagnostic performance (Jeong et al., 2022).
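
Evaluation on held-out "unseen" domains can be organized as a leave-one-domain-out protocol. The sketch below is one simple way to generate such splits, assuming each image carries a domain label (institution or vendor); the function name `leave_one_domain_out` and the site names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

def leave_one_domain_out(
    domains: List[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_domain, train_indices, test_indices) for each domain.

    Every sample from the held-out domain goes to the test set, so the
    model never sees that institution or vendor during training.
    """
    by_domain: Dict[str, List[int]] = defaultdict(list)
    for idx, dom in enumerate(domains):
        by_domain[dom].append(idx)
    for held_out in sorted(by_domain):
        test_idx = by_domain[held_out]
        train_idx = sorted(
            i for dom, idxs in by_domain.items() if dom != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx

# Five images from three hypothetical sites.
sites = ["site_a", "site_a", "site_b", "site_c", "site_b"]
splits = list(leave_one_domain_out(sites))
```

Reporting the metric separately for each held-out domain, rather than pooled, makes institution-specific degradation visible.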

4. Representative Datasets and Their Properties

A non-exhaustive set of major multi-institutional mammogram datasets is summarized below.

Name	Scale	Key Properties
VersaMammo	706,239 images (21 sets)	Diverse imaging sources, public/private, >90 tasks, pre-training
OPTIMAM	2.5M+ images	3 UK centers, detailed clinical outcomes, expert marking (Halling-Brown et al., 2020)
EMBED	3.5M images, 116k patients	US-based, racially balanced, lesion- and pathology-level outcome granularity (Jeong et al., 2022)
ADMANI	Millions	Australia, curated, with technical outlier labels (Li et al., 2023)
VinDr-Mammo	20,000 images	Vietnam, four-view digital mammography, external validation (Ibragimov et al., 3 Nov 2024)

Many recent benchmarks combine several of these and other datasets to promote cross-institutional evaluation and reproducibility.

5. Algorithmic Innovations Leveraging Multi-Institutional Data

The richness of multi-institutional data has motivated diverse modeling strategies:

MIL and Weak Supervision: Deep multi-instance networks for patch-level aggregation, exploiting sparsity priors due to the small fraction of malignant tissue per image (Zhu et al., 2016, Zhu et al., 2017, Pathak et al., 2023).
Multi-view Fusion: Transformer-based and context clustering models that integrate information across the standard four projections (left/right, CC/MLO) (Ibragimov et al., 3 Nov 2024, Sarker et al., 26 Feb 2024, Chen et al., 28 Apr 2025, Yang et al., 8 Jul 2025); models such as MamT⁴ and MVPT-NET exemplify advanced feature-level or attention-based view integration.
Domain Generalization and Contrastive Learning: Approaches such as MammoDG employ cross-view enhancement and multi-instance contrastive learning to combat protocol- and vendor-induced domain shifts (Yang et al., 2023).
Scalability and Efficiency: Prompt tuning updates only a small subset of parameters during multi-view adaptation (Chen et al., 28 Apr 2025), while context clustering targets computational efficiency and finer structure preservation (Yang et al., 8 Jul 2025).
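
The patch-level aggregation used in MIL can be sketched as follows. This is a simplified illustration, not the cited authors' implementation: it assumes patch probabilities have already been produced by some network, pools them with a top-k mean (k = 1 is max pooling), and adds an L1-style penalty to reflect the prior that few patches are malignant. The names `mil_image_score`, `sparse_mil_loss`, and `sparsity_weight` are hypothetical.

```python
import numpy as np

def mil_image_score(patch_probs: np.ndarray, k: int = 1) -> float:
    """Aggregate patch-level probabilities into one image-level score.

    Uses the mean of the k largest patch probabilities; k=1 is max pooling.
    """
    probs = np.asarray(patch_probs, dtype=np.float64).ravel()
    if probs.size == 0:
        raise ValueError("need at least one patch")
    k = max(1, min(k, probs.size))
    top_k = np.sort(probs)[-k:]
    return float(top_k.mean())

def sparse_mil_loss(patch_probs: np.ndarray, label: int,
                    k: int = 1, sparsity_weight: float = 0.01) -> float:
    """Image-level binary cross-entropy plus a sparsity penalty on patches."""
    probs = np.asarray(patch_probs, dtype=np.float64).ravel()
    score = np.clip(mil_image_score(probs, k), 1e-7, 1 - 1e-7)
    bce = -(label * np.log(score) + (1 - label) * np.log(1 - score))
    # Probabilities are non-negative, so their mean is a scaled L1 norm.
    return float(bce + sparsity_weight * probs.mean())

# Four patches, one of which looks suspicious.
patches = np.array([0.02, 0.05, 0.91, 0.10])
score_max = mil_image_score(patches)        # max pooling
score_top2 = mil_image_score(patches, k=2)  # mean of the two largest
```

Because only an image-level label enters the loss, the same code applies to sources that provide no ROI annotations.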

6. Data Sharing, Governance, and Prospects

Multi-institutional cohorts require frameworks for secure data sharing, harmonization, and community engagement:

Data Governance and Tools: Centralized repositories (e.g., OPTIMAM, EMBED) employ pseudonymization, sharing agreements, cloud-based storage, and APIs for data access and exploration (Halling-Brown et al., 2020, Jeong et al., 2022).
Privacy-Preserving Computation: Federated learning enables collaborative model training across institutions without patient data exchange, demonstrating strong performance with only marginal drops compared to centralized training (Muthukrishnan et al., 2022).
Prospects: Ongoing expansion includes incorporation of additional imaging modalities (tomosynthesis, MRI), continued accrual of annotated cases, and integration with multi-modal (vision-language) clinical records for advanced tasks such as report generation and visual QA (Huang et al., 24 Sep 2025).
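
The federated training described above exchanges model parameters rather than patient images. A minimal sketch of the server-side aggregation step is given below, assuming FedAvg-style weighting by local sample count, which is a common rule; the aggregation scheme of the cited study is not specified here, and the function name `federated_average` is hypothetical.

```python
import numpy as np
from typing import Dict, List

def federated_average(site_weights: List[Dict[str, np.ndarray]],
                      site_sizes: List[int]) -> Dict[str, np.ndarray]:
    """Combine per-site model parameters into one global model.

    Each site's parameters are weighted by its share of the total number
    of training samples. Only parameters leave a site, never images.
    """
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need one sample count per site")
    total = float(sum(site_sizes))
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged: Dict[str, np.ndarray] = {}
    for name in site_weights[0]:
        averaged[name] = sum(
            w[name] * (n / total) for w, n in zip(site_weights, site_sizes)
        )
    return averaged

# Two hypothetical institutions with 100 and 300 local exams.
site_a = {"w": np.array([1.0, 2.0])}
site_b = {"w": np.array([3.0, 6.0])}
global_model = federated_average([site_a, site_b], [100, 300])
```

In a full system this step alternates with local training rounds at each institution, and secure aggregation or differential privacy may be layered on top.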

A plausible implication is that, as model generalization remains limited by the diversity and scale of training data, continued institutional collaboration and open dataset contributions will be critical for advancing clinically robust CAD systems. The use of multi-task, multi-domain, and multi-modal benchmarks will likely remain the gold standard for future model evaluation.

7. Challenges and Considerations

Remaining challenges include:

Annotation Harmonization: Differences in annotation guidelines and quality persist between centers, complicating supervised learning across datasets (Zhu et al., 2016, Zhu et al., 2017).
Technical Artifacts: Automated outlier detection is necessary to exclude images with implants, improper exposure, or artifacts, as in the ADMANI dataset, where convolutional VAEs and classical image processing achieve improved but still imperfect outlier recall (Li et al., 2023).
Domain Shift: Even with large-scale aggregation, models may still show degraded performance on previously unseen institution-specific data, requiring ongoing methodological developments in domain adaptation and generalization (Yang et al., 2023, Huang et al., 24 Sep 2025).

In conclusion, multi-institutional mammogram datasets have catalyzed significant methodological advances, enabling more robust and generalizable models, fairer evaluation, and large-scale clinical validation. The field is now moving toward foundation models that cover a wide spectrum of clinical tasks and are explicitly tailored to the technical and biological diversity captured by multi-institutional data resources.