
TransMIL: Transformer MIL for WSIs

Updated 11 March 2026
  • TransMIL is a transformer-based framework for MIL that explicitly models morphological and spatial correlations in WSIs to address i.i.d. limitations.
  • It employs efficient transformer self-attention with Nyström approximation, allowing scalable processing of up to 25,000 patch embeddings.
  • Experimental results show improved tumor subtype classification accuracy and AUROC, especially when using domain-specific pretraining such as KimiaNet.

TransMIL is a transformer-based framework for Multiple Instance Learning (MIL) designed for whole slide image (WSI) classification in digital pathology. It addresses the limitations of classical MIL—specifically, the independence-and-identical-distribution (i.i.d.) assumption—by explicitly modeling morphological and spatial correlations among image patches. TransMIL has established new benchmarks in tasks such as tumor subtype classification and metastasis detection, combining algorithmic efficiency with interpretability and the ability to incorporate domain-specific pretraining.

1. Motivation and Theoretical Foundation

Traditional deep MIL approaches, such as max-pooling, mean-pooling, and attention-based pooling (e.g., ABMIL, CLAM, DSMIL), operate under the i.i.d. assumption for bag instances, neglecting contextual or spatial dependencies critical in computational pathology. TransMIL introduces a correlated MIL framework, grounded in mathematical results:

  • Any continuous bag-level function can be approximated by combining morphological ($f(\cdot)$) and spatial ($h(\cdot)$) encodings such that

$$\Bigl| S(\mathbf{X}) - g\bigl( \{ f(\mathbf{x}) + h(\mathbf{x}) \}_{\mathbf{x} \in \mathbf{X}} \bigr) \Bigr| < \varepsilon,$$

for any $\varepsilon > 0$.

  • Entropy analysis shows that explicitly modeling correlations among instances reduces uncertainty compared to the i.i.d. case:

$$H(\Theta_1, \ldots, \Theta_n) \leq \sum_{t=1}^{n} H(\Theta_t).$$

This theoretical basis motivates TransMIL's architecture, which leverages global context modeling via transformer attention to capture pairwise and higher-order patch dependencies (Shao et al., 2021).
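The entropy inequality above can be checked on a toy case. The sketch below (plain Python, numbers chosen for illustration) computes the joint entropy of two perfectly correlated binary instance labels and compares it with the sum of their marginal entropies, which is what an i.i.d. treatment would assume:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy case: two perfectly correlated binary labels, X = Y ~ Bernoulli(0.5).
joint = [0.5, 0.0, 0.0, 0.5]   # P(X=0,Y=0), P(X=0,Y=1), P(X=1,Y=0), P(X=1,Y=1)
h_joint = entropy(joint)            # 1.0 bit: correlation is modeled
h_iid = entropy([0.5, 0.5]) * 2     # 2.0 bits: instances treated independently
```

Modeling the correlation here halves the uncertainty, which is exactly the gap the inequality bounds.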

2. Data Preprocessing and Patch Embedding

WSI classification with TransMIL requires several pre-processing steps:

  • Segmentation and Tiling: Each WSI is tissue-segmented (e.g., using the CLAM contour-based or Otsu thresholding methods) and split into non-overlapping patches, typically $256 \times 256$ at $20\times$ or $512 \times 512$ at $40\times$ magnification, discarding those with insufficient tissue content.
  • Patch Feature Encoding:
    • Generic pretraining: Standard deep networks (e.g., ResNet-50 or DenseNet-121 pre-trained on ImageNet) map each patch $p_i$ to a feature vector $f_i \in \mathbb{R}^{d}$ ($d=2048$ for ResNet-50, $d=1024$ for DenseNet-121).
    • Domain-specific pretraining: KimiaNet (DenseNet-121 pre-trained on ≈11,000 TCGA WSIs) yields $1024$-dimensional features with filters tuned to histopathological texture and color (Chitnis et al., 2023), enhancing downstream transformer performance.
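The tiling step can be sketched with numpy alone. This is a minimal illustration, not the CLAM pipeline: a fixed intensity cutoff (200, a hypothetical value) stands in for Otsu/contour segmentation, and `min_tissue` is an assumed threshold on the fraction of non-background pixels:

```python
import numpy as np

def tile_slide(slide, patch=256, min_tissue=0.1):
    """Split a grayscale slide array into non-overlapping patches and keep
    those with enough 'tissue' pixels (intensity < 200 as a crude proxy
    for proper Otsu/contour-based segmentation)."""
    H, W = slide.shape
    kept = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            p = slide[y:y + patch, x:x + patch]
            if (p < 200).mean() >= min_tissue:   # enough non-background pixels
                kept.append(((y, x), p))
    return kept

# 512x512 dummy slide: white background with one dark "tissue" quadrant.
slide = np.full((512, 512), 255, dtype=np.uint8)
slide[:256, :256] = 80
patches = tile_slide(slide)   # only the tissue quadrant survives
```

Each kept patch would then be passed through the frozen encoder to produce its $d$-dimensional feature vector.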

3. Architecture and Computational Pipeline

TransMIL processes a WSI as a bag of $N$ patch embeddings $\{f_1, \ldots, f_N\}$ and aggregates them using a transformer pipeline:

3.1 Transformer Module

  • Linear Projection & Position Encoding: Patch features are projected (learned linear mapping) and augmented with multi-scale spatial context via the Pyramid Positional Encoding Generator (PPEG), using parallel $3\times 3$, $5\times 5$, and $7\times 7$ convolutions.
  • “Squaring” the Sequence: Patch sequences are padded or repeated to the nearest perfect square. A learnable class token is prepended.
  • Self-Attention Layers (2×): Each consists of:

    $$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

    For each head $h$,

    $$A_h = \mathrm{softmax}\!\left( \frac{Q_h K_h^{T}}{\sqrt{d_k}} \right) V_h$$

    Outputs from all heads are concatenated and projected via $W_O$, followed by a feed-forward sub-layer (2-layer MLP, ReLU/GeLU), residual connections, and layer normalization.

  • Efficient Self-Attention (Nyströmformer): Classical $O(N^2)$ attention is replaced by a Nyström approximation, selecting $m \ll N$ landmark queries/keys to approximate the attention kernel with $O(Nm)$ complexity. This enables aggregation over up to 25,000 patch embeddings per slide (Shao et al., 2021, Sens et al., 2023).
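The two tricks above can be sketched in a few lines of numpy. This is a single-head illustration with no learned projections, and it picks landmarks by strided selection rather than the segment means the Nyströmformer paper uses, so treat it as a sketch of the factorization, not the production layer:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def square_pad(X):
    """Pad a length-N sequence to the nearest perfect square by
    repeating its first rows, as in TransMIL's 'squaring' step."""
    n = X.shape[0]
    target = int(np.ceil(np.sqrt(n))) ** 2
    return np.concatenate([X, X[: target - n]], axis=0)

def nystrom_attention(Q, K, V, m):
    """O(N*m) Nystrom approximation of softmax(Q K^T / sqrt(d)) V using
    m strided landmarks (the paper uses segment means instead)."""
    N, d = Q.shape
    idx = np.linspace(0, N - 1, m).astype(int)
    scale = np.sqrt(d)
    F = softmax(Q @ K[idx].T / scale)        # N x m
    A = softmax(Q[idx] @ K[idx].T / scale)   # m x m
    B = softmax(Q[idx] @ K.T / scale)        # m x N
    return F @ np.linalg.pinv(A) @ (B @ V)   # N x d, never forms N x N
```

With $m = N$ the factorization reduces to exact attention (since $A A^{+} A = A$); in practice a few hundred landmarks keep memory linear in the bag size.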

3.2 Slide-Level Aggregation and Classification

  • Pooling: Mean-pooling is performed across the instance dimension post-transformer:

$$s = \frac{1}{N}\sum_{i=1}^{N} Z_{\text{out}}[i,:]$$

Alternatively, the final state of the class token is used as a learned bag representation.

  • Prediction Head: A linear classifier produces logits for $C$ classes; a softmax yields predictions $p \in \mathbb{R}^{C}$.
  • Loss: Standard cross-entropy loss is used,

$$\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{C} \mathbf{1}_{y=c} \log p_c$$

Optionally, auxiliary losses such as Bag Embedding Loss (BEL) may be included (Sens et al., 2023).
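The aggregation-and-classification step above amounts to a mean pool followed by a linear head and cross-entropy. A minimal numpy sketch, with toy weights chosen only to make the behavior visible:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def slide_logits(Z_out, W, b):
    """Mean-pool transformer outputs over the instance dimension,
    then apply a linear prediction head (W, b are placeholder weights)."""
    s = Z_out.mean(axis=0)   # bag representation, shape (d,)
    return W @ s + b         # logits for C classes

def cross_entropy(logits, y):
    """Negative log-likelihood of the true class y after softmax."""
    p = softmax(logits)
    return -np.log(p[y])

# Toy check: a head that strongly favors class 0 gives low loss for y=0.
Z = np.ones((5, 3))                      # 5 instances, d = 3
W = np.array([[5.0, 0, 0], [0, 0, 0]])   # C = 2 classes (hypothetical weights)
b = np.zeros(2)
loss = cross_entropy(slide_logits(Z, W, b), y=0)
```

Swapping the mean pool for the class-token readout only changes `slide_logits` to return `W @ Z_out[0] + b` when the token is kept at position 0.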

4. Bag Embedding Loss (BEL) and Enhanced Supervision

BEL is a margin-based cosine similarity loss introduced to encourage discriminative bag representations under weak supervision, particularly beneficial for rare class separation:

  • For class $c$, with bag embedding $b_{c,\text{new}}$ and historical class centroid $b_{c,\text{old}}$ (an EMA of previous embeddings), BEL is:

$$\mathcal{L}_{\mathrm{BEL}} = \frac{1}{2}\bigl[1 - S(b_{c,\mathrm{new}}, b_{c,\mathrm{old}})\bigr] + \frac{1}{2(|C|-1)} \sum_{\substack{c' \in C \\ c' \neq c}} \max\bigl\{0,\; S(b_{c,\mathrm{new}}, b_{c',\mathrm{old}}) - m\bigr\}$$

where $S(\cdot,\cdot)$ is cosine similarity and $m$ is a margin hyperparameter.

  • BEL explicitly clusters same-class bag embeddings and separates different-class embeddings, providing a more robust representation space and measurably improving accuracy and AUROC, particularly for rare classes.
  • On BRACS, TransMIL with BEL achieved $0.60 \pm 0.02$ accuracy ($+0.06$ absolute) and $0.76 \pm 0.01$ AUROC ($+0.04$) over the base model (Sens et al., 2023).
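The BEL formula maps directly to code. A minimal sketch, assuming centroids are stored in a per-class dict (the EMA update itself is omitted), with orthogonal toy centroids chosen purely for illustration:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bel_loss(b_new, c, centroids, margin=0.2):
    """Bag Embedding Loss for a bag of class c.
    centroids: dict mapping each class to its EMA centroid b_{c,old}."""
    pull = 0.5 * (1.0 - cos_sim(b_new, centroids[c]))          # attract own centroid
    others = [k for k in centroids if k != c]
    push = sum(max(0.0, cos_sim(b_new, centroids[k]) - margin)  # repel the rest
               for k in others)
    return pull + push / (2.0 * len(others))

# Toy 2-class setup with orthogonal centroids (hypothetical values).
centroids = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
aligned = bel_loss(np.array([1.0, 0.0]), 0, centroids, margin=0.0)   # pulled in: ~0
confused = bel_loss(np.array([0.0, 1.0]), 0, centroids, margin=0.0)  # penalized: ~1
```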

5. Experimental Evaluation and Comparative Performance

TransMIL has been validated on diverse computational pathology benchmarks:

| Dataset | Method | AUC (%) | Accuracy (%) | Confidence (%) |
| --- | --- | --- | --- | --- |
| Glioma (5-way) | TransMIL + DenseNet-121 | 95.52 ± 2.76 | 90.58 ± 1.93 | 93.28 ± 4.98 |
| Glioma (5-way) | TransMIL + KimiaNet | 96.85 ± 1.20 | 89.36 ± 2.30 | 93.57 ± 3.43 |
| CAMELYON16 | i.i.d. MIL baselines | 87.60–88.09 | n/a | n/a |
| CAMELYON16 | TransMIL (correlated MIL) | 93.09 | n/a | n/a |
| TCGA-NSCLC | Best competing MIL | 94.63 | n/a | n/a |
| TCGA-NSCLC | TransMIL | 96.03 | n/a | n/a |
| TCGA-RCC | Best competing MIL | 98.41 | n/a | n/a |
| TCGA-RCC | TransMIL | 98.82 | n/a | n/a |

All results are reported as mean ± standard deviation across multiple splits and initializations (Shao et al., 2021, Chitnis et al., 2023).

Notably, TransMIL converges faster, reaching peak validation AUC in roughly 50 epochs versus 100–150 for attention-pooled MIL baselines. Clinical applicability is supported by high prediction confidence (average predicted class-probability above 93% on test slides) and by per-patch attention heatmaps that align with expert-annotated tumor regions.

6. Domain-Specific Pretraining and Model Confidence

Domain-specific patch encoders such as KimiaNet—pre-trained on hematoxylin & eosin (H&E) stained WSIs—improve both the discriminative capacity and stability of TransMIL:

  • When deploying TransMIL with KimiaNet versus generic ImageNet pretraining, mean confidence rises slightly (93.57% vs 93.28%) and variance across runs decreases (±3.43 vs. ±4.98), indicating more stable and reliable predictions (Chitnis et al., 2023).
  • The filters in domain-specific encoders are tuned to morphological and color traits in histopathology, which, under multi-head self-attention, yield sharper, more biologically meaningful attention matrices focusing on diagnostically relevant patches.
  • This effect is particularly relevant for pathology, where confidence and interpretability are critical for clinical decision-support.

7. Interpretability, Limitations, and Future Directions

TransMIL produces attention maps highlighting regions in WSIs contributing most to the predicted diagnosis, supporting expert review and trust. Overlaying attention scores on tissue slides produces heatmaps that correspond closely to pathologist-annotated areas.
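The overlay step described above is mechanically simple: reshape the per-patch attention scores back into the slide's patch grid and upsample each cell to patch resolution. A minimal sketch, assuming scores arrive in row-major grid order (toy values below):

```python
import numpy as np

def attention_heatmap(scores, grid_hw, patch=256):
    """Reshape per-patch attention scores into the slide's patch grid and
    upsample to pixel resolution by block replication (nearest neighbor)."""
    h, w = grid_hw
    grid = np.asarray(scores, dtype=float).reshape(h, w)
    grid = (grid - grid.min()) / (np.ptp(grid) + 1e-8)   # normalize to [0, 1]
    return np.kron(grid, np.ones((patch, patch)))        # pixel-level heatmap

# Four patches on a 2x2 grid; patch=4 keeps the toy output small.
heat = attention_heatmap([0.1, 0.9, 0.2, 0.4], (2, 2), patch=4)
```

The resulting array can be alpha-blended over the slide thumbnail to produce the heatmaps compared against pathologist annotations.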

Limitations include:

  • Memory and computation, although linearized by the Nyström approximation, are still significant for very large WSIs or at high magnification.
  • The current pipeline does not implement hierarchical or multi-resolution patch aggregation.
  • Global centroid buffers for auxiliary losses such as BEL add memory and bookkeeping overhead that grows with the number of classes.

Future extensions suggested include hierarchical transformers for multi-resolution analysis, graph-based patch modeling, semi/self-supervised WSI pretraining, and adaptation to survival analysis or spatial distribution prediction (Shao et al., 2021, Sens et al., 2023).

References

  • "Domain-Specific Pre-training Improves Confidence in Whole Slide Image Classification" (Chitnis et al., 2023)
  • "BEL: A Bag Embedding Loss for Transformer enhances Multiple Instance Whole Slide Image Classification" (Sens et al., 2023)
  • "TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification" (Shao et al., 2021)
