ConvNeXt-MIL-XGBoost: Cancer Risk Prediction Pipeline
- The paper introduces a modular pipeline that integrates a frozen ConvNeXt backbone, gated attention MIL, and XGBoost for breast cancer recurrence risk stratification from H&E WSIs.
- It couples a Transformer-inspired convolutional encoder with gated attention pooling to aggregate discriminative features across image patches into slide-level representations.
- The study demonstrates clinical relevance by achieving 73.5% accuracy and producing interpretable attention maps to support diagnostic decisions.
ConvNeXt-MIL-XGBoost is a modular weakly-supervised deep learning pipeline designed for automated stratification of breast cancer recurrence risk from Hematoxylin and Eosin (H&E) stained whole-slide images (WSIs). Developed and benchmarked alongside CLAM-SB and ABMIL models for predicting 5-year recurrence risk tiers, the method integrates a frozen ConvNeXt-Base neural feature extractor, a gated-attention Multiple Instance Learning (MIL) aggregator, and an XGBoost gradient-boosted decision tree classifier. The pipeline leverages robust natural-image priors and interpretable feature aggregation for genomics-correlated risk prediction, with demonstrated efficacy on a dataset of 210 patient cases (Chen et al., 21 Dec 2025).
1. ConvNeXt Backbone for Patch-Level Feature Extraction
The first stage of ConvNeXt-MIL-XGBoost decomposes each WSI into 256×256 px tissue-containing patches, which are independently encoded with the ConvNeXt-Base convolutional neural network. ConvNeXt-Base employs a “Transformer-inspired” design characterized by large depth-wise kernels, inverted bottlenecks, and LayerNorm, facilitating multi-scale texture and context capture. Core configuration parameters include per-stage depths of [3, 3, 27, 3] blocks; channel dimensions of [128, 256, 512, 1024]; a 4×4 patch-embedding convolution at the input; and 2×2 strided convolutions between stages. Patch representations are finalized by an additional fully-connected projection to a 1024-dimensional vector, for approximately 90 million backbone parameters.
ImageNet-1k pre-trained weights initialize ConvNeXt, accelerating convergence and imparting robust feature representations given the small sample size (210 WSIs). No fine-tuning is performed; the backbone serves as a frozen encoder for subsequent stages, ensuring computational efficiency and reproducibility.
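The sketch below illustrates this frozen-encoder stage in PyTorch, assuming the timm implementation of ConvNeXt-Base; the explicit `proj` layer for the 1024-dimensional patch embedding and the batch-wise interface are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn
import timm

# Frozen ConvNeXt-Base encoder: ImageNet-1k weights, classification head removed.
backbone = timm.create_model("convnext_base", pretrained=True, num_classes=0)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # no fine-tuning; the backbone is a fixed feature extractor

# Hypothetical linear projection to the 1024-d patch embedding described in the text.
proj = nn.Linear(backbone.num_features, 1024)

@torch.no_grad()
def encode_patches(patches: torch.Tensor) -> torch.Tensor:
    """patches: (N, 3, 256, 256) tissue patches from one WSI -> (N, 1024) embeddings."""
    feats = backbone(patches)   # (N, backbone.num_features) pooled features
    return proj(feats)          # (N, 1024) patch embeddings
```

Freezing the backbone means patch embeddings can be computed once per WSI and cached to disk, so the downstream MIL and XGBoost stages train on pre-extracted features.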
2. Attention-Based Multiple Instance Learning Aggregation
Following patch encoding, the MIL framework aggregates the instance-level features $\{\mathbf{h}_1, \dots, \mathbf{h}_N\}$ of a slide with $N$ patches into a single slide-level embedding. The aggregation employs a gated attention pooling mechanism:

$$
a_i = \frac{\exp\left\{\mathbf{w}^{\top}\left(\tanh(\mathbf{V}\mathbf{h}_i) \odot \sigma(\mathbf{U}\mathbf{h}_i)\right)\right\}}{\sum_{j=1}^{N}\exp\left\{\mathbf{w}^{\top}\left(\tanh(\mathbf{V}\mathbf{h}_j) \odot \sigma(\mathbf{U}\mathbf{h}_j)\right)\right\}},
\qquad
\mathbf{z} = \sum_{i=1}^{N} a_i\,\mathbf{h}_i,
$$

where $\mathbf{V}$ and $\mathbf{U}$ are learnable weight matrices, $\sigma$ denotes the sigmoid function, $\mathbf{w}$ is a projection vector, $\odot$ indicates element-wise multiplication, and the softmax ensures $\sum_{i=1}^{N} a_i = 1$.
To address the pronounced class imbalance (only 21 slides fall in the medium-risk tier), the attention module is trained with Focal Loss, using class weights that disproportionately up-weight the medium-risk class. The optimizer is Adam (β₁ = 0.9, β₂ = 0.999) with a batch size of 8 slides. The attention model thereby aggregates diverse instances while focusing on the most discriminative tissue regions.
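A compact PyTorch sketch of the gated attention pooling defined above, together with a focal-loss wrapper, is given below; the hidden width (256), the focusing parameter γ = 2, and the per-class weights are placeholders for illustration rather than the paper's reported values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionMIL(nn.Module):
    """Gated attention MIL head: (N, D) patch embeddings -> slide embedding, logits, attention."""
    def __init__(self, in_dim: int = 1024, hidden_dim: int = 256, n_classes: int = 3):
        super().__init__()
        self.V = nn.Linear(in_dim, hidden_dim)   # tanh branch
        self.U = nn.Linear(in_dim, hidden_dim)   # sigmoid gating branch
        self.w = nn.Linear(hidden_dim, 1)        # projection to scalar attention score
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, h: torch.Tensor):
        # h: (N, in_dim) patch embeddings for one slide
        scores = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))  # (N, 1)
        a = torch.softmax(scores, dim=0)          # attention weights, sum to 1 over patches
        z = (a * h).sum(dim=0)                    # (in_dim,) slide-level embedding
        logits = self.classifier(z)               # (n_classes,) risk-tier logits
        return z, logits, a.squeeze(-1)

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: torch.Tensor | None = None) -> torch.Tensor:
    """Multi-class focal loss; gamma and the per-class alpha weights are placeholders."""
    ce = F.cross_entropy(logits, targets, weight=alpha, reduction="none")
    pt = torch.exp(-ce)                           # probability assigned to the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```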
3. XGBoost Classification on Structured Slide Embeddings
The trained attention pooling module outputs for each WSI a 1024-dimensional slide embedding $\mathbf{z}$, supplemented by auxiliary features including the MIL logits, their softmax probabilities, attention-distribution summary statistics (mean, median, quartiles, skewness), and the patch count $N$, all concatenated into a single structured feature vector per slide.
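A sketch of how such a structured feature vector might be assembled is shown below; the helper name `build_slide_features` and the exact ordering of the auxiliary statistics are illustrative assumptions.

```python
import numpy as np
from scipy.stats import skew

def build_slide_features(z: np.ndarray, logits: np.ndarray, attention: np.ndarray) -> np.ndarray:
    """Concatenate slide embedding, MIL logits/probabilities, attention statistics, and patch count."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                        # softmax probabilities
    q25, q50, q75 = np.percentile(attention, [25, 50, 75])
    attn_stats = np.array([attention.mean(), q50, q25, q75, skew(attention)])
    patch_count = np.array([attention.shape[0]], dtype=np.float64)
    return np.concatenate([z, logits, probs, attn_stats, patch_count])
```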
Classification utilizes XGBoost, a gradient-boosted decision tree algorithm, targeting slide-level three-tier risk prediction (low, medium, high). The multi-class softmax objective combines cross-entropy loss with tree complexity penalties:

$$
\mathcal{L} = \sum_{i}\ell\left(y_i, \hat{y}_i\right) + \sum_{k}\Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda\lVert w\rVert_2^2,
$$

where $\ell$ is the multi-class cross-entropy, $f_k$ the $k$-th tree, $T$ its number of leaves, and $w$ its leaf weights. The ensemble uses a maximum tree depth of 6 with L₂ regularization on the leaf weights; the number of trees, learning rate, and regularization strength are tuned on validation splits, and XGBoost applies second-order (Newton-style) gradient optimization.
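The sketch below illustrates this stage with the xgboost Python API on a synthetic stand-in for the slide feature matrix; the maximum depth of 6 follows the text, while the number of trees, learning rate, and L₂ penalty are placeholders to be tuned on the validation split.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in for the 210 structured slide feature vectors and three risk tiers.
rng = np.random.default_rng(0)
X = rng.normal(size=(210, 1040))
y = rng.integers(0, 3, size=210)

clf = xgb.XGBClassifier(
    objective="multi:softprob",   # multi-class softmax objective
    max_depth=6,                  # from the text
    n_estimators=300,             # placeholder; tuned on the validation split
    learning_rate=0.1,            # placeholder
    reg_lambda=1.0,               # placeholder L2 penalty on leaf weights
    eval_metric="mlogloss",
)
clf.fit(X[:170], y[:170], eval_set=[(X[170:190], y[170:190])], verbose=False)
risk_tier = clf.predict(X[190:])  # 0 = low, 1 = medium, 2 = high (assumed label encoding)
```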
4. Training Protocol and Data Management
ConvNeXt-MIL-XGBoost training follows a 5-fold cross-validation procedure, with data splits stratified by risk class to preserve distributional balance. For each fold, 80% of slides are allocated to training the attention module and downstream classifier (the ConvNeXt backbone remains frozen), 10% to MIL validation and early stopping, and 10% to an unseen test set.
Risk labels are based on consensus between the 21-gene Recurrence Score and clinicopathological adjudication. MIL model training employs Focal Loss, while XGBoost operates on multi-class logistic loss. Adam optimizer (attention module) and XGBoost’s native optimizer ensure efficient convergence. No data augmentation is utilized beyond randomized patch ordering; patch segmentation and extraction are deterministic.
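The split logic can be sketched with scikit-learn as follows, assuming each held-out 20% fold is halved into validation and test partitions to approximate the 80/10/10 scheme; the seeding and exact partitioning used by the authors are not specified.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def stratified_folds(labels, n_folds: int = 5, seed: int = 0):
    """Yield (train, val, test) index arrays stratified by risk tier, roughly 80/10/10 per fold."""
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, held_out in skf.split(np.zeros(len(labels)), labels):
        # Split each held-out 20% fold into equal validation (early stopping) and test halves.
        val_idx, test_idx = train_test_split(
            held_out, test_size=0.5, stratify=labels[held_out], random_state=seed
        )
        yield train_idx, val_idx, test_idx

# Example with a 210-case cohort and three risk tiers (synthetic labels).
y = np.random.default_rng(0).integers(0, 3, size=210)
for fold, (tr, va, te) in enumerate(stratified_folds(y)):
    print(f"fold {fold}: train={len(tr)}, val={len(va)}, test={len(te)}")
```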
5. Comparative Performance and Ablation Insights
ConvNeXt-MIL-XGBoost achieves a mean classification accuracy of 73.5% across the five cross-validation folds. Comparative results are summarized:
| Model | Mean AUC | Mean Accuracy |
|---|---|---|
| CLAM-SB | | |
| ABMIL | | |
| ConvNeXt-MIL-XGBoost | – | 73.5% |
No AUC was computed for the XGBoost pipeline, and formal significance testing was omitted. Notably, ConvNeXt-Base provides superior patch embeddings to ResNet-18, raising classification accuracy by ≈5%. Gated attention (sigmoid × tanh gating) improves aggregation over mean pooling (+4% accuracy). XGBoost also outperforms an equivalent multi-layer perceptron (MLP) classifier trained on the same input features, with the MLP achieving only 70.0% accuracy and exhibiting greater hyperparameter sensitivity.
6. Clinical Deployment Considerations
The ConvNeXt-MIL-XGBoost framework supports integration as a decision-support module in digital pathology environments. Automated processing of digitized slides can provide rapid three-tier recurrence risk scores and deploy attention heatmap overlays to highlight diagnostically relevant tissue regions. This workflow can prioritize high-risk cases for review and assist in adjudicating borderline cases.
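A minimal sketch of how patch-level attention weights could be rasterized into such a heatmap overlay is given below; the patch coordinates, downsampling factor, and normalization are assumptions for illustration, not part of the published pipeline.

```python
import numpy as np

def attention_heatmap(coords, attention, wsi_shape, patch_size: int = 256, downsample: int = 32):
    """Rasterize per-patch attention weights into a low-resolution heatmap aligned to the WSI.

    coords:    (N, 2) top-left (x, y) pixel coordinates of each patch at full resolution.
    attention: (N,) attention weights from the MIL module (sum to 1).
    wsi_shape: (height, width) of the WSI at full resolution.
    """
    h, w = wsi_shape[0] // downsample, wsi_shape[1] // downsample
    heatmap = np.zeros((h, w), dtype=np.float32)
    ps = max(patch_size // downsample, 1)
    for (x, y), a in zip(coords, attention):
        r, c = y // downsample, x // downsample
        heatmap[r:r + ps, c:c + ps] = a          # paint each patch footprint with its weight
    if heatmap.max() > 0:
        heatmap /= heatmap.max()                 # normalize to [0, 1] for colormap overlay
    return heatmap
```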
Recommended deployment steps include multi-center prospective validation, regulatory approval, and integration with laboratory information systems. Ongoing monitoring for model drift caused by staining or scanner variation, together with periodic re-calibration on local cohorts, is advised to sustain clinical efficacy.
7. Modular Architecture and Interpretability
ConvNeXt-MIL-XGBoost is structured to decouple representation learning (ConvNeXt backbone), instance aggregation (attention MIL), and structured classification (XGBoost). This modularity yields an interpretable pipeline for genomics-correlated risk prediction, supporting transparent diagnostic audits and facilitating adaptation to evolving clinical requirements. The model’s attention maps and multi-tier output structure bolster its utility within computational pathology workflows, promoting rapid, cost-effective clinical decision support while maintaining methodological robustness (Chen et al., 21 Dec 2025).