ConvNeXt-MIL-XGBoost: Cancer Risk Prediction Pipeline
- The paper introduces a modular pipeline that integrates a frozen ConvNeXt backbone, gated attention MIL, and XGBoost for breast cancer recurrence risk stratification from H&E WSIs.
- It couples a Transformer-inspired convolutional encoder with gated attention pooling to aggregate discriminative features across image patches into slide-level representations.
- The study demonstrates clinical relevance by achieving 73.5% accuracy and producing interpretable attention maps to support diagnostic decisions.
ConvNeXt-MIL-XGBoost is a modular weakly-supervised deep learning pipeline designed for automated stratification of breast cancer recurrence risk from Hematoxylin and Eosin (H&E) stained whole-slide images (WSIs). Developed and benchmarked alongside CLAM-SB and ABMIL models for predicting 5-year recurrence risk tiers, the method integrates a frozen ConvNeXt-Base neural feature extractor, a gated-attention Multiple Instance Learning (MIL) aggregator, and an XGBoost gradient-boosted decision tree classifier. The pipeline leverages robust natural-image priors and interpretable feature aggregation for genomics-correlated risk prediction, with demonstrated efficacy on a dataset of 210 patient cases (Chen et al., 21 Dec 2025).
1. ConvNeXt Backbone for Patch-Level Feature Extraction
The first stage of ConvNeXt-MIL-XGBoost decomposes each WSI into 256×256 px tissue-containing patches, which are independently encoded with the ConvNeXt-Base convolutional neural network. ConvNeXt-Base employs a “Transformer-inspired” design characterized by large depth-wise kernels, inverted bottlenecks, and LayerNorm, facilitating multi-scale texture and context capture. Core configuration parameters include per-stage depths of [3, 3, 27, 3] blocks; channel dimensions of [128, 256, 512, 1024]; a 4×4 patch-embedding convolution at the input; and 2×2 strided convolutions between stages. Patch representations are finalized by an additional fully-connected projection to a 1024-dimensional vector, for approximately 90 million backbone parameters.
ImageNet-1k pre-trained weights initialize ConvNeXt, accelerating convergence and imparting robust feature representations given the small sample size (210 WSIs). No fine-tuning is performed; the backbone serves as a frozen encoder for subsequent stages, ensuring computational efficiency and reproducibility.
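The sketch below illustrates this frozen-encoder stage in PyTorch, assuming the timm implementation of ConvNeXt-Base; the explicit `proj` layer for the 1024-dimensional patch embedding and the batch-wise interface are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn
import timm

# Frozen ConvNeXt-Base encoder: ImageNet-1k weights, classification head removed.
backbone = timm.create_model("convnext_base", pretrained=True, num_classes=0)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # no fine-tuning; the backbone is a fixed feature extractor

# Hypothetical linear projection to the 1024-d patch embedding described in the text.
proj = nn.Linear(backbone.num_features, 1024)

@torch.no_grad()
def encode_patches(patches: torch.Tensor) -> torch.Tensor:
    """patches: (N, 3, 256, 256) tissue patches from one WSI -> (N, 1024) embeddings."""
    feats = backbone(patches)   # (N, backbone.num_features) pooled features
    return proj(feats)          # (N, 1024) patch embeddings
```

Freezing the backbone means patch embeddings can be computed once per WSI and cached to disk, so the downstream MIL and XGBoost stages train on pre-extracted features.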
2. Attention-Based Multiple Instance Learning Aggregation
Following patch encoding, the MIL framework aggregates the instance-level features $\{\mathbf{h}_1, \dots, \mathbf{h}_N\}$ of a slide with $N$ patches into a single slide-level embedding. The aggregation employs a gated attention pooling mechanism:

$$
a_i = \frac{\exp\left\{\mathbf{w}^{\top}\left(\tanh(\mathbf{V}\mathbf{h}_i) \odot \sigma(\mathbf{U}\mathbf{h}_i)\right)\right\}}{\sum_{j=1}^{N}\exp\left\{\mathbf{w}^{\top}\left(\tanh(\mathbf{V}\mathbf{h}_j) \odot \sigma(\mathbf{U}\mathbf{h}_j)\right)\right\}},
\qquad
\mathbf{z} = \sum_{i=1}^{N} a_i\,\mathbf{h}_i,
$$

where $\mathbf{V}$ and $\mathbf{U}$ are learnable weight matrices, $\sigma$ denotes the sigmoid function, $\mathbf{w}$ is a projection vector, $\odot$ indicates element-wise multiplication, and the softmax ensures $\sum_{i=1}^{N} a_i = 1$.
To address the pronounced class imbalance (only 21 slides fall in the medium-risk tier), the attention module is trained with Focal Loss, using class weights that disproportionately up-weight the medium-risk class. The optimizer is Adam (β₁ = 0.9, β₂ = 0.999) with a batch size of 8 slides. The attention model thereby aggregates diverse instances while focusing on the most discriminative tissue regions.
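A compact PyTorch sketch of the gated attention pooling defined above, together with a focal-loss wrapper, is given below; the hidden width (256), the focusing parameter γ = 2, and the per-class weights are placeholders for illustration rather than the paper's reported values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionMIL(nn.Module):
    """Gated attention MIL head: (N, D) patch embeddings -> slide embedding, logits, attention."""
    def __init__(self, in_dim: int = 1024, hidden_dim: int = 256, n_classes: int = 3):
        super().__init__()
        self.V = nn.Linear(in_dim, hidden_dim)   # tanh branch
        self.U = nn.Linear(in_dim, hidden_dim)   # sigmoid gating branch
        self.w = nn.Linear(hidden_dim, 1)        # projection to scalar attention score
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, h: torch.Tensor):
        # h: (N, in_dim) patch embeddings for one slide
        scores = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))  # (N, 1)
        a = torch.softmax(scores, dim=0)          # attention weights, sum to 1 over patches
        z = (a * h).sum(dim=0)                    # (in_dim,) slide-level embedding
        logits = self.classifier(z)               # (n_classes,) risk-tier logits
        return z, logits, a.squeeze(-1)

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: torch.Tensor | None = None) -> torch.Tensor:
    """Multi-class focal loss; gamma and the per-class alpha weights are placeholders."""
    ce = F.cross_entropy(logits, targets, weight=alpha, reduction="none")
    pt = torch.exp(-ce)                           # probability assigned to the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```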
3. XGBoost Classification on Structured Slide Embeddings
The trained attention pooling module outputs for each WSI a 1024-dimensional slide embedding $\mathbf{z}$, supplemented by auxiliary features including the MIL logits, their softmax probabilities, attention-distribution summary statistics (mean, median, quartiles, skewness), and the patch count $N$, all concatenated into a single structured feature vector per slide.
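A sketch of how such a structured feature vector might be assembled is shown below; the helper name `build_slide_features` and the exact ordering of the auxiliary statistics are illustrative assumptions.

```python
import numpy as np
from scipy.stats import skew

def build_slide_features(z: np.ndarray, logits: np.ndarray, attention: np.ndarray) -> np.ndarray:
    """Concatenate slide embedding, MIL logits/probabilities, attention statistics, and patch count."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                        # softmax probabilities
    q25, q50, q75 = np.percentile(attention, [25, 50, 75])
    attn_stats = np.array([attention.mean(), q50, q25, q75, skew(attention)])
    patch_count = np.array([attention.shape[0]], dtype=np.float64)
    return np.concatenate([z, logits, probs, attn_stats, patch_count])
```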
Classification utilizes XGBoost, a gradient-boosted decision tree algorithm, targeting slide-level three-tier risk prediction (low, medium, high). The multi-class softmax objective combines cross-entropy loss with tree complexity penalties:

$$
\mathcal{L} = \sum_{i}\ell\left(y_i, \hat{y}_i\right) + \sum_{k}\Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda\lVert w\rVert_2^2,
$$

where $\ell$ is the multi-class cross-entropy, $f_k$ the $k$-th tree, $T$ its number of leaves, and $w$ its leaf weights. The ensemble uses a maximum tree depth of 6 with L₂ regularization on the leaf weights; the number of trees, learning rate, and regularization strength are tuned on validation splits, and XGBoost applies second-order (Newton-style) gradient optimization.
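The sketch below illustrates this stage with the xgboost Python API on a synthetic stand-in for the slide feature matrix; the maximum depth of 6 follows the text, while the number of trees, learning rate, and L₂ penalty are placeholders to be tuned on the validation split.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in for the 210 structured slide feature vectors and three risk tiers.
rng = np.random.default_rng(0)
X = rng.normal(size=(210, 1040))
y = rng.integers(0, 3, size=210)

clf = xgb.XGBClassifier(
    objective="multi:softprob",   # multi-class softmax objective
    max_depth=6,                  # from the text
    n_estimators=300,             # placeholder; tuned on the validation split
    learning_rate=0.1,            # placeholder
    reg_lambda=1.0,               # placeholder L2 penalty on leaf weights
    eval_metric="mlogloss",
)
clf.fit(X[:170], y[:170], eval_set=[(X[170:190], y[170:190])], verbose=False)
risk_tier = clf.predict(X[190:])  # 0 = low, 1 = medium, 2 = high (assumed label encoding)
```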
4. Training Protocol and Data Management
ConvNeXt-MIL-XGBoost training follows a 5-fold cross-validation procedure, with data splits stratified by risk class to preserve distributional balance. For each fold, 80% of slides are allocated to training the attention module and downstream classifier (the ConvNeXt backbone remains frozen), 10% to MIL validation and early stopping, and 10% to an unseen test set.
Risk labels are based on consensus between the 21-gene Recurrence Score and clinicopathological adjudication. MIL model training employs Focal Loss, while XGBoost operates on multi-class logistic loss. Adam optimizer (attention module) and XGBoost’s native optimizer ensure efficient convergence. No data augmentation is utilized beyond randomized patch ordering; patch segmentation and extraction are deterministic.
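The split logic can be sketched with scikit-learn as follows, assuming each held-out 20% fold is halved into validation and test partitions to approximate the 80/10/10 scheme; the seeding and exact partitioning used by the authors are not specified.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def stratified_folds(labels, n_folds: int = 5, seed: int = 0):
    """Yield (train, val, test) index arrays stratified by risk tier, roughly 80/10/10 per fold."""
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, held_out in skf.split(np.zeros(len(labels)), labels):
        # Split each held-out 20% fold into equal validation (early stopping) and test halves.
        val_idx, test_idx = train_test_split(
            held_out, test_size=0.5, stratify=labels[held_out], random_state=seed
        )
        yield train_idx, val_idx, test_idx

# Example with a 210-case cohort and three risk tiers (synthetic labels).
y = np.random.default_rng(0).integers(0, 3, size=210)
for fold, (tr, va, te) in enumerate(stratified_folds(y)):
    print(f"fold {fold}: train={len(tr)}, val={len(va)}, test={len(te)}")
```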
5. Comparative Performance and Ablation Insights
ConvNeXt-MIL-XGBoost achieves a mean classification accuracy of 73.5% across the five cross-validation folds. Comparative results are summarized:
| Model | Mean AUC | Mean Accuracy |
|---|---|---|
| CLAM-SB | | |
| ABMIL | | |
| ConvNeXt-MIL-XGBoost | – | 73.5% |
No AUC was computed for the XGBoost pipeline, and formal significance testing was omitted. Notably, ConvNeXt-Base provides superior patch embeddings to ResNet-18, raising classification accuracy by ≈5%. Gated attention (sigmoid × tanh gating) improves aggregation over mean pooling (+4% accuracy). XGBoost also outperforms an equivalent multi-layer perceptron (MLP) classifier trained on the same input features, with the MLP achieving only 70.0% accuracy and exhibiting greater hyperparameter sensitivity.
6. Clinical Deployment Considerations
The ConvNeXt-MIL-XGBoost framework supports integration as a decision-support module in digital pathology environments. Automated processing of digitized slides can provide rapid three-tier recurrence risk scores and deploy attention heatmap overlays to highlight diagnostically relevant tissue regions. This workflow can prioritize high-risk cases for review and assist in adjudicating borderline cases.
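A minimal sketch of how patch-level attention weights could be rasterized into such a heatmap overlay is given below; the patch coordinates, downsampling factor, and normalization are assumptions for illustration, not part of the published pipeline.

```python
import numpy as np

def attention_heatmap(coords, attention, wsi_shape, patch_size: int = 256, downsample: int = 32):
    """Rasterize per-patch attention weights into a low-resolution heatmap aligned to the WSI.

    coords:    (N, 2) top-left (x, y) pixel coordinates of each patch at full resolution.
    attention: (N,) attention weights from the MIL module (sum to 1).
    wsi_shape: (height, width) of the WSI at full resolution.
    """
    h, w = wsi_shape[0] // downsample, wsi_shape[1] // downsample
    heatmap = np.zeros((h, w), dtype=np.float32)
    ps = max(patch_size // downsample, 1)
    for (x, y), a in zip(coords, attention):
        r, c = y // downsample, x // downsample
        heatmap[r:r + ps, c:c + ps] = a          # paint each patch footprint with its weight
    if heatmap.max() > 0:
        heatmap /= heatmap.max()                 # normalize to [0, 1] for colormap overlay
    return heatmap
```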
Recommended deployment steps include multi-center prospective validation, regulatory approval, and integration with laboratory information systems. Ongoing monitoring for model drift caused by staining or scanner variation, together with periodic re-calibration on local cohorts, is advised to sustain clinical efficacy.
7. Modular Architecture and Interpretability
ConvNeXt-MIL-XGBoost is structured to decouple representation learning (ConvNeXt backbone), instance aggregation (attention MIL), and structured classification (XGBoost). This modularity yields an interpretable pipeline for genomics-correlated risk prediction, supporting transparent diagnostic audits and facilitating adaptation to evolving clinical requirements. The model’s attention maps and multi-tier output structure bolster its utility within computational pathology workflows, promoting rapid, cost-effective clinical decision support while maintaining methodological robustness (Chen et al., 21 Dec 2025).