
CLAM-SB: Enhanced MIL for Breast Cancer

Updated 28 December 2025
  • The paper introduces a two-layer MLP classifier, expanded attention capacity, and robust regularization to outperform prior MIL methods in breast cancer recurrence risk stratification.
  • It leverages high-dimensional feature extraction using UNI and CONCH pre-trained models to convert H&E whole-slide images into informative patch embeddings.
  • GELU activations improve gradient flow, while focal loss and label smoothing manage class imbalance in a low-sample regime.

The CLAM-SB model is a modified multiple instance learning (MIL) architecture designed for predictive computational pathology, specifically applied to the stratification of breast cancer recurrence risk from hematoxylin and eosin (H&E) stained whole-slide images (WSIs). Developed and evaluated as part of a comprehensive comparison of MIL frameworks, CLAM-SB builds on the original CLAM design while introducing key architectural and regularization enhancements to improve classification of 5-year recurrence risk tiers defined by molecular genomics.

1. Model Architecture

CLAM-SB follows an MIL paradigm in which each WSI is represented as a “bag” $X = \{x_1, \dots, x_N\}$ of $N$ non-overlapping $256 \times 256$ pixel patches. Each patch is transformed into a high-dimensional feature vector $h_i$ of dimension $d_{\rm in} = 1024$ using a pre-trained feature extractor (UNI or CONCH). An instance encoder (a single fully connected layer) then compresses $h_i$ to a lower-dimensional embedding $u_i \in \mathbb{R}^{512}$, followed by GELU activation and Dropout regularization.

A gated attention module computes un-normalized attention scores for each patch embedding:

$$A_i = \sigma(W_p u_i + b_p) \odot \tanh(W_a u_i + b_a) \in \mathbb{R}^{384}$$

$$a_i = w^\top A_i + c$$

where $W_a, W_p \in \mathbb{R}^{384 \times 512}$, $w \in \mathbb{R}^{384}$, and $\odot$ denotes elementwise multiplication. Attention weights $\alpha_i$ are calculated via softmax:

$$\alpha_i = \frac{\exp(a_i)}{\sum_{j=1}^{N} \exp(a_j)}$$

The slide-level embedding $z = \sum_{i=1}^{N} \alpha_i u_i$ is input to a two-layer MLP classifier ($512 \to 256 \to 3$), with GELU and Dropout between layers, yielding logits $\ell \in \mathbb{R}^3$ corresponding to the three risk classes. Final output probabilities $\hat{p}_k$ are computed by softmax.
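For concreteness, the following is a minimal PyTorch sketch of this forward pass, using the dimensions stated above ($1024 \to 512$ encoder, 384-unit gated attention, $512 \to 256 \to 3$ classifier, Dropout 0.4). Class and attribute names are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLAMSBSketch(nn.Module):
    def __init__(self, d_in=1024, d_emb=512, d_attn=384, n_classes=3, p_drop=0.4):
        super().__init__()
        # Instance encoder: FC -> GELU -> Dropout, compressing 1024 -> 512
        self.encoder = nn.Sequential(
            nn.Linear(d_in, d_emb), nn.GELU(), nn.Dropout(p_drop)
        )
        # Gated attention: tanh branch (W_a) gated elementwise by sigmoid branch (W_p)
        self.attn_a = nn.Linear(d_emb, d_attn)   # W_a u_i + b_a
        self.attn_p = nn.Linear(d_emb, d_attn)   # W_p u_i + b_p
        self.attn_w = nn.Linear(d_attn, 1)       # scalar score a_i = w^T A_i + c
        # Two-layer MLP classifier: 512 -> 256 -> 3 with GELU and Dropout between
        self.classifier = nn.Sequential(
            nn.Linear(d_emb, 256), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(256, n_classes),
        )

    def forward(self, h):                  # h: (N, 1024) patch features for one WSI
        u = self.encoder(h)                # (N, 512) patch embeddings
        A = torch.sigmoid(self.attn_p(u)) * torch.tanh(self.attn_a(u))  # (N, 384)
        a = self.attn_w(A)                 # (N, 1) un-normalized scores
        alpha = F.softmax(a, dim=0)        # attention weights over the bag
        z = (alpha * u).sum(dim=0)         # (512,) slide-level embedding
        return self.classifier(z)          # (3,) logits over risk classes
```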

2. Architectural Modifications to CLAM

CLAM-SB introduces significant deviations from baseline CLAM:

  • Classifier Depth: The original single-layer classifier is replaced with a two-layer MLP ($512 \to 256 \to 3$) with intermediate Dropout.
  • Activation Function: GELU replaces ReLU throughout, using the tanh approximation

$$\mathrm{GELU}(x) \approx \frac{x}{2}\left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right)\right]$$

  • Attention Capacity: The attention network’s hidden dimension is increased from 256 to 384.
  • Regularization: Dropout of 0.4 is applied in the encoder, attention module, and classification head.
  • Loss Functions: Focal loss (see the sketch following this list),

$$\mathrm{FL}(p_t) = -\alpha_t (1-p_t)^\gamma \log(p_t), \quad \gamma = 2, \;\; \alpha_{\mathrm{medium}} = 3.0,$$

addresses severe class imbalance (particularly the under-represented medium-risk class). Label smoothing with $\varepsilon = 0.1$ is also applied:

$$y' = (1-\varepsilon)\,y + \frac{\varepsilon}{K}\mathbf{1}, \quad K = 3$$
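A minimal PyTorch sketch combining both loss components, applying the focal modulation per class against the smoothed targets: only $\gamma = 2$, $\alpha_{\mathrm{medium}} = 3.0$, and $\varepsilon = 0.1$ come from the text; the weights for the low- and high-risk classes are placeholders.

```python
import torch
import torch.nn.functional as F

def focal_loss_with_smoothing(logits, target, alpha, gamma=2.0, eps=0.1):
    """Focal loss with label-smoothed targets over K classes."""
    K = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)             # log p_k
    p = log_p.exp()
    # Smoothed targets: y' = (1 - eps) * y + eps / K
    y_smooth = (1.0 - eps) * F.one_hot(target, K).float() + eps / K
    # Per-class focal modulation alpha_k * (1 - p_k)^gamma
    focal = alpha.unsqueeze(0) * (1.0 - p) ** gamma
    return -(y_smooth * focal * log_p).sum(dim=-1).mean()

# Class order (low, medium, high); only alpha_medium = 3.0 is from the text.
alpha = torch.tensor([1.0, 3.0, 1.0])
loss = focal_loss_with_smoothing(torch.randn(4, 3), torch.tensor([0, 1, 2, 1]), alpha)
```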

3. Data Processing and Feature Extraction

WSIs in vendor “.sdpc” format are converted to “.svs”. Tissue segmentation is carried out on a low-magnification downsampled image via adaptive Gaussian blur, HSV color space conversion, Otsu thresholding on the saturation channel, morphological filtering, and mask extraction.
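These steps map directly onto standard OpenCV calls. The sketch below is illustrative only; the blur and morphology kernel sizes are assumptions rather than values from the paper.

```python
import cv2
import numpy as np

def tissue_mask(thumbnail_bgr: np.ndarray) -> np.ndarray:
    """Binary tissue mask from a low-magnification thumbnail (0/255 uint8)."""
    blurred = cv2.GaussianBlur(thumbnail_bgr, (7, 7), 0)          # suppress noise
    hsv = cv2.cvtColor(blurred, cv2.COLOR_BGR2HSV)                # HSV conversion
    sat = hsv[:, :, 1]                                            # saturation channel
    _, mask = cv2.threshold(sat, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # Otsu threshold
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)        # fill small holes
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)         # remove specks
    return mask
```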

Subsequently, nonoverlapping $256 \times 256$ patches are extracted from within the tissue mask; their locations are stored in HDF5 files. Patch features are generated using the TRIDENT toolbox with two pre-trained models:

  • UNI: ViT-L/16 (self-supervised DINOv2)
  • CONCH: vision-language foundation model (contrastive image-text pre-training)

Each patch is resized and encoded into a 1024-dimensional feature vector, which is saved for subsequent MIL processing.
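A minimal sketch of the bookkeeping this implies, assuming an HDF5 layout with "coords" and "features" datasets (dataset names and the mostly-tissue threshold are assumptions, scaling between the low-magnification mask and full-resolution coordinates is omitted, and TRIDENT handles the real extraction):

```python
import h5py
import numpy as np

def patch_grid(mask: np.ndarray, patch: int = 256) -> np.ndarray:
    """Top-left (x, y) of nonoverlapping patches that are mostly tissue."""
    H, W = mask.shape
    coords = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            if mask[y:y + patch, x:x + patch].mean() > 127:  # >50% of a 0/255 mask
                coords.append((x, y))
    return np.asarray(coords, dtype=np.int64)

def save_bag(path: str, coords: np.ndarray, features: np.ndarray) -> None:
    """Persist patch locations (N, 2) and embeddings (N, 1024) for MIL."""
    with h5py.File(path, "w") as f:
        f.create_dataset("coords", data=coords)
        f.create_dataset("features", data=features.astype(np.float32))
```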

4. Training Protocol and Hyperparameters

CLAM-SB is trained with stratified 5-fold cross-validation on 210 WSIs (per fold: approximately 168 training, 42 validation). Optimization uses the Adam algorithm with a learning rate of $3 \times 10^{-5}$, linear warm-up over 5 epochs, and weight decay of $1 \times 10^{-4}$. Dropout is fixed at 0.4. Training proceeds for up to 100 epochs with early stopping (patience 10) and a batch size of one WSI per step. The loss combines the bag-level focal loss with a CLAM-style instance-level pseudo-labeling loss weighted at 0.5, and label smoothing with $\varepsilon = 0.1$ is applied. The attention hidden size is 384 and the encoder output size is 512.
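The protocol can be wired together in a compact loop. The sketch below uses only the hyperparameters stated above; `model`, `bag_loss`, the loaders, and `evaluate` are caller-supplied placeholders, the warm-up start factor is an assumption, and the 0.5-weighted instance loss is folded into `bag_loss` for brevity.

```python
import torch

def train_clam_sb(model, bag_loss, train_loader, val_loader, evaluate,
                  max_epochs=100, patience=10):
    opt = torch.optim.Adam(model.parameters(), lr=3e-5, weight_decay=1e-4)
    # Linear warm-up over the first 5 epochs (start factor is an assumption)
    warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.2, total_iters=5)
    best_val, bad = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for features, label in train_loader:      # one WSI bag per step
            opt.zero_grad()
            logits = model(features.squeeze(0))   # (3,) slide-level logits
            loss = bag_loss(logits.unsqueeze(0), label)
            loss.backward()
            opt.step()
        warmup.step()
        val_loss = evaluate(model, val_loader)    # caller-supplied validation metric
        if val_loss < best_val:
            best_val, bad = val_loss, 0
        else:
            bad += 1
            if bad >= patience:                   # early stopping, patience 10
                break
    return model
```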

5. Performance Evaluation

In five-fold cross-validation, CLAM-SB (using both UNI and CONCH features) achieved:

  • Mean AUC: 0.836
  • Mean accuracy: 76.2%

For reference, ABMIL (multi-head gated attention) attained mean AUC 0.767 and accuracy 70.9%, while ConvNeXt-MIL-XGBoost achieved accuracy 73.5% and macro F1-score 0.492.

Model                  | Mean AUC | Accuracy | Macro F1
CLAM-SB (UNI+CONCH)    | 0.836    | 76.2%    | –
ABMIL                  | 0.767    | 70.9%    | –
ConvNeXt-MIL-XGBoost   | –        | 73.5%    | 0.492

Editor’s term: “SB” denotes the set of enhancements described above, applied on top of baseline CLAM.

6. Analysis of Model Efficacy

CLAM-SB’s performance advantages are attributed to several factors:

  • Expanded Attention Capacity: The increase to 384 hidden units in the attention module enables richer modeling of subtle, high-dimensional histological cues.
  • Advanced Nonlinearities: GELU activation and a deeper classifier architecture enhance gradient flow and the learning of complex interactions.
  • Robust Regularization: Aggressive Dropout systematically reduces overfitting in the low-sample regime.
  • Improved Optimization for Imbalanced Data: Focal loss with class re-weighting, combined with label smoothing, improves detection of the under-represented medium-risk class.
  • Multi-Modal Pre-trained Features: Combining UNI (visual) and CONCH (vision-language) feature embeddings yields a more expressive patch representation.

Collectively, these modifications promote robust feature aggregation and classifier calibration, enabling superior stratification of breast cancer recurrence risk relative to alternative MIL implementations (Chen et al., 21 Dec 2025).
