Whole Mammogram Classification
- Whole mammogram classification is a method that infers global diagnostic labels from entire images without relying on pixel-level lesion annotations.
- It integrates techniques like multi-instance, multi-view, and transformer models to tackle challenges such as high-resolution data and extreme class imbalance.
- Current systems employ transfer and curriculum learning approaches to enhance performance metrics, delivering scalable and explainable breast cancer screening solutions.
Whole mammogram classification is a computational paradigm in breast imaging wherein a model infers global or clinically relevant labels (malignant/benign, cancer subtype, screening outcome, etc.) from the entirety of a mammographic image or multi-image exam, without explicit pixel-level lesion annotation. This setting presents acute algorithmic challenges: lesions may occupy only a minute fraction of high-resolution images, clinically valid features extend across domains (texture, shape, global context), and standard supervised architectures must contend with class imbalance and weak labeling. The field spans feature engineering and shallow classifiers, end-to-end deep learning with transfer and curriculum strategies, multi-instance and multi-view learning, transformer and attention models, and clinical hybrid systems. Key performance drivers are the ability to localize salient regions, leverage spatial context, aggregate across views or exams, and generalize across hardware or populations.
1. Problem Definition and Motivations
Whole mammogram classification seeks to predict case-level attributes from native, full-field images without requiring radiologist-drawn ROIs or pixel-wise lesion markup. Key motivations are scalability to large collections, independence from labor-intensive annotation, and potential to support population-scale screening and therapy planning. Typical target variables include malignancy (binary or multi-class), cancer molecular subtypes (e.g., luminal vs. non-luminal), BI-RADS grades, or patient-level risk assessment.
This problem differs fundamentally from ROI-based detection or segmentation: models must learn to locate and weigh subtle candidate regions while integrating global and contextual cues. Algorithms must address extreme spatial sparsity (microcalcifications), broad tissue heterogeneity, and frequently imbalanced classes.
2. Early Feature Engineering and Classical Classification
Early pipelines decomposed mammograms into frequency and texture features, followed by classifier ensembles. Discrete wavelet transforms (DWT; Daubechies or biorthogonal bases) and Zernike moments have been widely used to extract multi-resolution texture and orthogonal shape features from the whole image (Lima et al., 2017, Tang et al., 2020). Statistically robust feature selection via information gain or deviation maximization can both enhance performance and reduce feature dimensionality.
Classification is then performed with SVMs (including weighted or multi-kernel extensions for feature weighting), neural networks (MLPs trained with backpropagation), extreme learning machines (ELM), decision trees, or naive Bayes. The multi-kernel approach (Symlet 8 wavelet + Zernike + linear-kernel SVM) achieves notable accuracy (94.1%) with sub-second training, roughly 50× faster than previous feature-based methods. Ensemble voting further boosts accuracy to 96–97%, with comparable gains in sensitivity and specificity (Tang et al., 2020); a minimal sketch of such a pipeline follows the list below.
Such methods are:
- Efficient (sub-second training),
- Amenable to incremental/online updates,
- Limited by the discriminative power of hand-engineered features.
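The classical pipeline admits a compact illustration. The snippet below is a minimal sketch assuming PyWavelets, mahotas, and scikit-learn as stand-ins for the published implementations; the sub-band energy features, Zernike radius/degree, and SVM settings are illustrative assumptions rather than the exact configurations of the cited papers.

```python
# Classical whole-mammogram pipeline sketch: wavelet energies + Zernike moments -> linear SVM.
import numpy as np
import pywt                      # PyWavelets
import mahotas                   # Zernike moments
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def wavelet_energy_features(img, wavelet="sym8", level=3):
    """Mean energy of each detail sub-band of a 2-D discrete wavelet transform."""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    feats = []
    for detail in coeffs[1:]:                                  # (cH, cV, cD) per level
        feats.extend(float(np.mean(np.square(band))) for band in detail)
    return feats

def zernike_features(img, radius=128, degree=8):
    """Orthogonal Zernike shape moments of the (downsampled) breast region."""
    return list(mahotas.features.zernike_moments(img, radius, degree=degree))

def extract_features(images):
    return np.array([wavelet_energy_features(im) + zernike_features(im) for im in images])

def train_classifier(images, labels):
    # Linear-kernel SVM with class weighting to offset benign/malignant imbalance.
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", class_weight="balanced"))
    clf.fit(extract_features(images), labels)
    return clf
```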
3. End-to-End Deep Learning: Single-View and Multi-Instance Learning
Deep convolutional models, especially transfer-trained on large natural or medical corpora, have been successfully adapted to whole-mammogram input. Current methods include ResNet-based classifiers with finetuning (Panambur et al., 2023), multi-scale CNNs with curriculum learning (Lotter et al., 2017), and hybrid U-Net architectures for jointly learned segmentation/classification (Zhang et al., 2018).
A prominent strategy is Multi-Instance Learning (MIL), treating a mammogram as a bag of spatial patches (instances) without instance-level labels (Zhu et al., 2016, Zhu et al., 2017). CNN feature maps are processed so that each patch yields a malignancy probability, aggregated via max-pooling, top-k assignment, or sparse L₁ penalization to enforce the priors that masses are rare and spatially confined. The sparse-MIL variant achieves AUC 0.89 on INbreast, closing ~90% of the gap to full-annotation pipelines without any pixel-wise labels.
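The aggregation step can be made concrete with a short sketch. Below is a minimal PyTorch MIL head over CNN feature-map cells, assuming top-k pooling (k=1 recovers max-pooling) and an L1 sparsity term; the layer sizes and penalty weight are illustrative assumptions, not the published architectures.

```python
# Minimal multi-instance head: each spatial cell yields a malignancy probability,
# pooled with top-k and regularized toward sparsity (masses are rare and confined).
import torch
import torch.nn as nn

class MILHead(nn.Module):
    def __init__(self, in_channels, k=4, sparsity_weight=1e-3):
        super().__init__()
        self.scorer = nn.Conv2d(in_channels, 1, kernel_size=1)  # per-patch logit
        self.k = k
        self.sparsity_weight = sparsity_weight

    def forward(self, feature_map):
        # feature_map: (B, C, H, W) from a CNN backbone
        patch_prob = torch.sigmoid(self.scorer(feature_map)).flatten(1)  # (B, H*W)
        topk = patch_prob.topk(self.k, dim=1).values
        image_prob = topk.mean(dim=1)           # top-k pooling; k=1 -> max pooling
        sparsity = patch_prob.abs().mean()      # L1 prior on patch probabilities
        return image_prob, self.sparsity_weight * sparsity

# Usage (illustrative): loss = BCE(image_prob, label) + sparsity_penalty
```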
Ablation studies consistently show gains by pretraining on related tasks (lesion, abnormality, mass/calcification classification), imposing explicit spatial or label-sparsity priors, and jointly optimizing multi-task losses.
4. Transfer Learning and Curriculum Strategies
Transfer learning from auxiliary abnormality detection tasks is now central for domain adaptation. For luminal subtype prediction, ResNet-18 is first pretrained on a multi-label multi-class abnormality task (distinguishing mass, calcification, and malignancy). Full-network finetuning allows low-level filters to adapt, an essential step for molecular subtype separation (Panambur et al., 2023). This approach elevates mean AUC from 0.5358 (baseline, direct ImageNet finetuning) to 0.6688 (p<0.0001) and boosts F1-score for the minority class by 0.309.
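A hedged sketch of this two-stage recipe, assuming torchvision's ResNet-18 and standard PyTorch losses (the class counts and learning rates are illustrative, not the published hyperparameters):

```python
# Stage 1: multi-label abnormality pretraining; Stage 2: full-network subtype fine-tuning.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Linear(backbone.fc.in_features, 3)      # mass, calcification, malignancy
abnormality_criterion = nn.BCEWithLogitsLoss()            # multi-label objective
pretrain_opt = torch.optim.Adam(backbone.parameters(), lr=1e-4)
# ... train on abnormality labels ...

# Full-network fine-tuning for luminal vs. non-luminal subtype:
backbone.fc = nn.Linear(backbone.fc.in_features, 2)
for p in backbone.parameters():
    p.requires_grad = True                                 # let low-level filters adapt too
finetune_opt = torch.optim.Adam(backbone.parameters(), lr=1e-5)
subtype_criterion = nn.CrossEntropyLoss()
```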
Curriculum learning architectures, employing staged patch-level pretraining (with segmentation masks) and transfer of convolutional weights to global image or exam-level models, have yielded state-of-the-art AUROC (0.92 on DDSM) (Lotter et al., 2017). Notably, omission of lesion dataset pretraining leads to sharp performance declines (AUC 0.65), underscoring the importance of curriculum steps.
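The curriculum's weight-transfer step amounts to a partial state-dict copy from the patch-level model into the image- or exam-level model. The sketch below assumes a shared ResNet-18 backbone and strict=False loading; it illustrates one reasonable implementation rather than the published code.

```python
# Copy patch-pretrained convolutional weights into the whole-image model;
# the image-level classifier head is re-initialized and trained from scratch.
from torchvision import models

patch_model = models.resnet18(num_classes=5)   # stage 1: patch-level lesion classes (assumed)
image_model = models.resnet18(num_classes=2)   # stage 2: whole-image malignancy

backbone_weights = {k: v for k, v in patch_model.state_dict().items()
                    if not k.startswith("fc.")}            # drop the shape-mismatched head
image_model.load_state_dict(backbone_weights, strict=False)
# Omitting this transfer corresponds to the AUC 0.92 -> 0.65 ablation drop noted above.
```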
5. Multi-View, Transformer, and Attention-Based Approaches
Screening exams routinely comprise multiple views per breast (craniocaudal, CC, and mediolateral oblique, MLO); the computational literature has advanced from independent view processing to architectures that explicitly exploit inter-view correlations. Dual-view, multi-view, and even four-view transformer networks (e.g., MamT⁴ (Ibragimov et al., 3 Nov 2024), MV-Swin-T (Sarker et al., 26 Feb 2024), DCHA-Net (Wang et al., 2023), BRAIxMVCCL (Chen et al., 2022), Mammo-Clustering (Yang et al., 8 Jul 2025)) now integrate shifted-window attention, cross-view dynamic attention blocks, hybrid correlation losses, and context clustering modules.
MV-Swin-T employs two-stage omni-attention to blend CC and MLO features at the patch level, yielding substantial AUC gains over single-view baselines (CBIS-DDSM: +11%, VinDr-Mammo: marginal gain, but both near 96% AUC). BRAIxMVCCL fuses global consistency alignment with local co-occurrence attention, achieving AUC~0.95 on ADMANI-1 and strong generalization to cross-domain sets (Chen et al., 2022). MamT⁴ uses frozen CNN features and transformer attention across four standard mammogram views, reflecting radiologist practices and achieving ~84% ROC-AUC on VinDr-Mammo (Ibragimov et al., 3 Nov 2024). Mammo-Clustering capitalizes on a computationally efficient context clustering backbone with triple (global, feature-local, patch-local) fusion, surpassing transformer alternatives for localization and discrimination (Yang et al., 8 Jul 2025).
DCHA-Net further introduces row-wise hybrid attention and dual-view correlation losses to align corresponding strip-like tissue regions in CC and MLO, yielding state-of-the-art accuracy/AUC on INbreast and CBIS-DDSM with ablation showing additive benefits from correlation maximization (Wang et al., 2023).
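A minimal sketch of cross-view fusion in the spirit of these dual-view attention models, where CC patch tokens attend to MLO tokens and vice versa before a joint classifier; the token dimension, head count, and symmetric design are illustrative assumptions, not any single published architecture:

```python
# Symmetric cross-view attention over per-view patch tokens, followed by a joint classifier.
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    def __init__(self, dim=256, heads=8, num_classes=2):
        super().__init__()
        self.cc_to_mlo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlo_to_cc = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, cc_tokens, mlo_tokens):
        # cc_tokens, mlo_tokens: (B, N_patches, dim) from per-view backbones
        cc_fused, _ = self.cc_to_mlo(cc_tokens, mlo_tokens, mlo_tokens)   # CC queries MLO
        mlo_fused, _ = self.mlo_to_cc(mlo_tokens, cc_tokens, cc_tokens)   # MLO queries CC
        pooled = torch.cat([cc_fused.mean(dim=1), mlo_fused.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

# Usage: logits = CrossViewFusion()(cc_feats, mlo_feats)
```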
6. Clinical Hybrid Models and Multi-Task Fusion
Real-world screening and triage demand hybrid systems integrating multiple sources and tasks. Multi-task pipelines fuse breast density predictors, global lesion classifiers (from each view), and object detection/ROI subnetworks (Wimmer et al., 2021). Patient-level meta-models combine scores and high-dimensional deep features via MLPs and CNN branches, yielding clear AUC improvements—score fusion AUC +0.02, feature fusion +0.04 over baseline ensembling. This "glass-box" workflow allows clinical drilldown from global pathology status to radiological features and bounding-box ROIs, supporting radiologist review, prioritization, and explanation.
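A hedged sketch of such patient-level fusion, with a score branch (density and per-view lesion probabilities) and a deep-feature branch combined by a small MLP meta-model; the input sizes and two-branch layout are assumptions about one plausible pipeline, not the cited system:

```python
# Patient-level meta-model: fuse low-dimensional scores with pooled deep features.
import torch
import torch.nn as nn

class PatientMetaModel(nn.Module):
    def __init__(self, n_scores=5, feat_dim=512, num_classes=2):
        super().__init__()
        self.score_branch = nn.Sequential(nn.Linear(n_scores, 32), nn.ReLU())
        self.feature_branch = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.head = nn.Linear(32 + 128, num_classes)

    def forward(self, scores, deep_features):
        # scores: (B, n_scores), e.g. breast density + per-view lesion probabilities
        # deep_features: (B, feat_dim), pooled CNN embeddings from the views
        fused = torch.cat([self.score_branch(scores),
                           self.feature_branch(deep_features)], dim=-1)
        return self.head(fused)
```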
Beyond deep ensembling, explainable systems such as DeepMiner (Wu et al., 2018) probe final-layer units of CNNs for interpretable BI-RADS concepts, with radiologist annotation of units (e.g., spiculation, margin) and generation of class activation maps aligned to lesion regions. Such approaches have shown evidence of expert-aligned explanation and even discovery of new diagnostic primitives.
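Class activation maps of this kind follow the standard CAM recipe: weight the final convolutional feature map by the classifier weights of the class of interest. The helper below is a generic sketch assuming a global-average-pooling architecture; it is not DeepMiner's exact procedure.

```python
# Generic class activation map: project final conv features onto the class weight vector.
import torch

def class_activation_map(feature_map, fc_weight, class_idx):
    """feature_map: (C, H, W) last conv activations; fc_weight: (num_classes, C)."""
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], feature_map)
    cam = torch.relu(cam)
    return cam / (cam.max() + 1e-8)   # normalize to [0, 1] for overlay on the mammogram
```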
7. Performance Benchmarks, Limitations, and Future Directions
Performance metrics for whole-mammogram classification typically encompass AUC, accuracy, sensitivity/specificity, and F1-score. Current benchmarks (from public sets such as DDSM, INbreast, CBIS-DDSM, VinDr-Mammo, and CMMD) reveal:
- Feature-engineered SVM/voting models reach 94–97% accuracy (Lima et al., 2017, Tang et al., 2020)
- End-to-end multi-instance or curriculum models reach AUC 0.89–0.92 (Zhu et al., 2016, Lotter et al., 2017, Zhang et al., 2018)
- Multi-view attention models exceed 95% AUC (VinDr-Mammo, ADMANI-1, INbreast) (Sarker et al., 26 Feb 2024, Chen et al., 2022, Ibragimov et al., 3 Nov 2024, Yang et al., 8 Jul 2025)
- Hybrid meta-model pipelines deliver AUC up to 0.962 for lesion detection (Wimmer et al., 2021)
- Subtype classification from whole-image labels currently achieves 0.669 AUC (CMMD) for luminal discrimination (Panambur et al., 2023)
Major limitations include:
- Lack of pixel- or ROI-level annotations for precise lesion localization, adversely affecting microcalcification sensitivity
- Computational and memory burdens for high-resolution input
- Weak generalization when patch or region pretraining is missing
- Need for further validation on large, diverse, and prospectively collected cohorts
Future directions prioritize:
- Weakly-supervised and multi-task learning for unified screening, BI-RADS, and molecular prediction
- Efficient architectures (context clustering, attention, transformer fusion) for scalable deployment
- Integration of radiomics and clinical metadata for hybrid feature models
- Enhanced explainability and radiologist-in-the-loop workflows
Whole mammogram classification remains a central, rapidly evolving axis of AI-enabled breast imaging, with ongoing convergence of multi-view modeling, computational efficiency, and clinical workflow support.