TransMIL: Transformer MIL for WSIs
- TransMIL is a transformer-based framework for MIL that explicitly models morphological and spatial correlations in WSIs to address i.i.d. limitations.
- It employs efficient transformer self-attention with Nyström approximation, allowing scalable processing of up to 25,000 patch embeddings.
- Experimental results show improved tumor subtype classification accuracy and AUROC, especially when using domain-specific pretraining such as KimiaNet.
TransMIL is a transformer-based framework for Multiple Instance Learning (MIL) designed for whole slide image (WSI) classification in digital pathology. It addresses a key limitation of classical MIL, namely the assumption that bag instances are independent and identically distributed (i.i.d.), by explicitly modeling morphological and spatial correlations among image patches. TransMIL has established new benchmarks in tasks such as tumor subtype classification and metastasis detection, combining algorithmic efficiency with interpretability and the ability to incorporate domain-specific pretraining.
1. Motivation and Theoretical Foundation
Traditional deep MIL approaches, such as max-pooling, mean-pooling, and attention-based pooling (e.g., ABMIL, CLAM, DSMIL), operate under the i.i.d. assumption for bag instances, neglecting contextual or spatial dependencies critical in computational pathology. TransMIL introduces a correlated MIL framework, grounded in mathematical results:
- Any continuous bag-level function $f$ can be approximated by a function $g$ of combined morphological encodings $m_i$ and spatial encodings $s_i$ of the instances, such that
$$\bigl|\, f(X) - g\bigl(\{(m_i, s_i)\}_{i=1}^{n}\bigr) \,\bigr| < \varepsilon$$
for any $\varepsilon > 0$.
- Entropy analysis shows that explicitly modeling correlations among instances reduces uncertainty compared to the i.i.d. case:
$$H(X_1, \dots, X_n) \;\le\; \sum_{i=1}^{n} H(X_i),$$
with equality only when the instances are independent.
This theoretical basis motivates TransMIL's architecture, which leverages global context modeling via transformer attention to capture pairwise and higher-order patch dependencies (Shao et al., 2021).
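The entropy claim is easy to verify numerically. A minimal sketch, assuming two correlated binary "instances", comparing the joint entropy against the sum of the marginal entropies:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution of two correlated binary instances X1, X2:
# rows index X1, columns index X2 (an illustrative example).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_x1 = joint.sum(axis=1)  # marginal of X1
p_x2 = joint.sum(axis=0)  # marginal of X2

h_joint = entropy(joint.ravel())
h_sum = entropy(p_x1) + entropy(p_x2)

# Modeling the correlation leaves less uncertainty than the i.i.d. view:
assert h_joint < h_sum
print(h_joint, h_sum)
```

Here each marginal is uniform (1 bit each, so the i.i.d. bound is 2 bits), while the joint entropy is about 1.72 bits: accounting for the dependence strictly reduces uncertainty.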
2. Data Preprocessing and Patch Embedding
WSI classification with TransMIL requires several pre-processing steps:
- Segmentation and Tiling: Each WSI is tissue-segmented (e.g., using the CLAM contour-based or Otsu thresholding methods) and split into non-overlapping patches (e.g., $256 \times 256$ pixels at $20\times$ magnification), discarding those with insufficient tissue content.
- Patch Feature Encoding:
- Generic pretraining: Standard deep networks (e.g., ResNet-50 or DenseNet-121 pre-trained on ImageNet) map each patch to a $1024$-dimensional feature vector (from a truncated ResNet-50, as in CLAM-style pipelines, or the final DenseNet-121 pooling layer).
- Domain-specific pretraining: KimiaNet (DenseNet-121 pre-trained on ≈11,000 TCGA WSIs) yields $1024$-dimensional features with filters tuned to histopathological texture and color (Chitnis et al., 2023), enhancing downstream transformer performance.
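The tiling-and-filtering step can be sketched as follows; the array-based slide, patch size, and brightness threshold are illustrative stand-ins for a real WSI reader (e.g., OpenSlide) and the exact CLAM/Otsu procedure:

```python
import numpy as np

def tile_and_filter(slide, patch=256, min_tissue=0.5, white_thresh=220):
    """Split an RGB slide array into non-overlapping patches and keep
    those with enough non-background (tissue) pixels.
    `slide`: H x W x 3 uint8 array, a stand-in for a WSI level read.
    Returns kept patches and their (row, col) grid coordinates.
    Thresholds are illustrative, not taken from the papers."""
    H, W, _ = slide.shape
    patches, coords = [], []
    for r in range(0, H - patch + 1, patch):
        for c in range(0, W - patch + 1, patch):
            p = slide[r:r + patch, c:c + patch]
            # crude background test: bright pixels count as "white"
            tissue_frac = np.mean(p.mean(axis=-1) < white_thresh)
            if tissue_frac >= min_tissue:
                patches.append(p)
                coords.append((r // patch, c // patch))
    return np.array(patches), coords

# toy slide: dark "tissue" block in the top-left, white background elsewhere
slide = np.full((512, 512, 3), 255, dtype=np.uint8)
slide[:256, :256] = 120
patches, coords = tile_and_filter(slide)
print(len(patches), coords)  # 1 patch kept, at grid cell (0, 0)
```

The kept patches would then be fed through the frozen encoder (ResNet-50, DenseNet-121, or KimiaNet) to produce the bag of feature vectors.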
3. Architecture and Computational Pipeline
TransMIL processes a WSI as a bag of patch embeddings and aggregates them using a transformer pipeline:
3.1 Transformer Module
- Linear Projection & Position Encoding: Patch features are projected (learned linear mapping) and augmented with multi-scale spatial context via the Pyramid Positional Encoding Generator (PPEG), using parallel $3\times3$, $5\times5$, and $7\times7$ convolutions.
- “Squaring” the Sequence: Patch sequences are padded, by repeating leading embeddings, to the nearest perfect-square length, enabling the 2-D reshape required by PPEG. A learnable class token is prepended.
- Self-Attention Layers (2×): Each consists of:
- Multi-head self-attention (MHSA): for each head $h$,
$$\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}}\right) V_h,$$
with outputs from all heads concatenated and projected via $W^{O}$.
- Feed-forward sub-layer (2-layer MLP with ReLU/GeLU), residual connections, and layer normalization.
- Efficient Self-Attention (Nyströmformer): Classical attention is replaced by a Nyström approximation, selecting $m \ll n$ landmark queries/keys to approximate the attention kernel with $O(n)$ complexity. This enables aggregation over up to 25,000 patch embeddings per slide (Shao et al., 2021, Sens et al., 2023).
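The "squaring" step and the landmark-based Nyström approximation can be sketched in a few lines of numpy; `square_pad`, the segment-mean landmark choice, and the plain pseudo-inverse are didactic simplifications of the Nyströmformer's optimized iterative routine:

```python
import numpy as np

def square_pad(X):
    """Pad a length-n sequence to the next perfect-square length by
    repeating its leading entries (the 'squaring' step before PPEG)."""
    n = len(X)
    N = int(np.ceil(np.sqrt(n))) ** 2
    return np.concatenate([X, X[:N - n]], axis=0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, m=8):
    """Nystrom-approximated self-attention (single head). Landmarks are
    means over m contiguous segments; the m x m kernel is inverted with
    a pseudo-inverse. A didactic sketch, not the production version."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    idx = np.array_split(np.arange(n), m)
    Qm = np.stack([Q[i].mean(axis=0) for i in idx])  # (m, d) landmark queries
    Km = np.stack([K[i].mean(axis=0) for i in idx])  # (m, d) landmark keys
    F = softmax(Q @ Km.T * scale)    # (n, m)
    A = softmax(Qm @ Km.T * scale)   # (m, m)
    B = softmax(Qm @ K.T * scale)    # (m, n)
    return F @ np.linalg.pinv(A) @ (B @ V)  # approximates softmax(QK^T/sqrt(d)) V

rng = np.random.default_rng(0)
n, d = 64, 16
Q, K, V = rng.normal(size=(3, n, d))
exact = softmax(Q @ K.T / np.sqrt(d)) @ V
approx = nystrom_attention(Q, K, V, m=8)
print(np.abs(exact - approx).max())  # elementwise gap vs. exact attention
```

The cost of the three small matrix products scales linearly in `n` for fixed `m`, which is what makes bags of tens of thousands of patches tractable.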
3.2 Slide-Level Aggregation and Classification
- Pooling: Mean-pooling is performed across the instance dimension post-transformer:
$$z = \frac{1}{n} \sum_{i=1}^{n} h_i.$$
Alternatively, the final state of the class token is used as the learned bag representation.
- Prediction Head: A linear classifier produces logits for $C$ classes; a softmax yields predictions $\hat{y}$.
- Loss: Standard cross-entropy loss is used,
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c.$$
Optionally, auxiliary losses such as Bag Embedding Loss (BEL) may be included (Sens et al., 2023).
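Putting the aggregation step together, a minimal numpy sketch of mean-pooling, the linear head, and the cross-entropy loss (all names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def slide_prediction(H, W, b, y_true):
    """Mean-pool transformer outputs H (n x d) into a bag embedding,
    apply a linear head (W: C x d, b: C), and return softmax
    probabilities plus the cross-entropy loss for the integer label
    y_true. A minimal sketch of the aggregation/prediction step."""
    z = H.mean(axis=0)             # bag embedding, shape (d,)
    logits = W @ z + b             # shape (C,)
    probs = softmax(logits)
    loss = -np.log(probs[y_true])  # cross-entropy with a one-hot target
    return probs, loss

rng = np.random.default_rng(1)
n, d, C = 100, 32, 4
H = rng.normal(size=(n, d))
W, b = rng.normal(size=(C, d)) * 0.1, np.zeros(C)
probs, loss = slide_prediction(H, W, b, y_true=2)
print(probs.sum(), loss)  # probabilities sum to 1; loss = -log p(y_true)
```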
4. Bag Embedding Loss (BEL) and Enhanced Supervision
BEL is a margin-based cosine similarity loss introduced to encourage discriminative bag representations under weak supervision, particularly beneficial for rare class separation:
- For class $y$, with bag embedding $z$ and historical class centroid $c_y$ (an EMA of previous embeddings), BEL takes a margin form such as
$$\mathcal{L}_{\mathrm{BEL}} = \max\!\left(0,\; m - s(z, c_y) + \max_{k \neq y} s(z, c_k)\right),$$
where $s(\cdot,\cdot)$ is cosine similarity and $m$ a margin hyperparameter.
- BEL explicitly clusters same-class bag embeddings and separates different-class embeddings, providing a more robust representation space and measurably improving accuracy and AUROC, particularly for rare classes.
- On BRACS, TransMIL with BEL improved both accuracy and AUROC over the base model (Sens et al., 2023).
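A hedged sketch of a margin-based cosine loss of this kind, with EMA-updated class centroids; the exact BEL formulation is given in Sens et al. (2023) and may differ in detail:

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

class BagEmbeddingLoss:
    """Margin-based cosine-similarity loss over per-class centroid
    buffers, updated by exponential moving average (EMA). An
    illustrative reconstruction, not the verbatim BEL of Sens et al."""
    def __init__(self, num_classes, dim, margin=0.2, ema=0.9):
        self.centroids = np.zeros((num_classes, dim))
        self.margin, self.ema = margin, ema

    def __call__(self, z, y):
        # pull toward the own-class centroid, push from the nearest other class
        pos = cos_sim(z, self.centroids[y])
        neg = max(cos_sim(z, c) for k, c in enumerate(self.centroids) if k != y)
        loss = max(0.0, self.margin - pos + neg)
        # EMA update of the historical centroid for class y
        self.centroids[y] = self.ema * self.centroids[y] + (1 - self.ema) * z
        return loss

bel = BagEmbeddingLoss(num_classes=3, dim=8)
l1 = bel(np.ones(8), y=0)  # centroids start at zero, so loss equals the margin
print(l1)  # → 0.2
```

The per-class centroid buffer is what introduces the bookkeeping overhead noted in the limitations below when the class count grows.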
5. Experimental Evaluation and Comparative Performance
TransMIL has been validated on diverse computational pathology benchmarks:
| Dataset | Method | AUC (%) | Accuracy (%) | Confidence (%) |
|---|---|---|---|---|
| Glioma (5-way) | TransMIL + DenseNet-121 | 95.52 ± 2.76 | 90.58 ± 1.93 | 93.28 ± 4.98 |
| Glioma (5-way) | TransMIL + KimiaNet | 96.85 ± 1.20 | 89.36 ± 2.30 | 93.57 ± 3.43 |
| CAMELYON16 | TransMIL (i.i.d. MIL) | 87.60 – 88.09 | n/a | n/a |
| CAMELYON16 | TransMIL (correlated MIL) | 93.09 | n/a | n/a |
| TCGA-NSCLC | Best competing MIL | 94.63 | n/a | n/a |
| TCGA-NSCLC | TransMIL | 96.03 | n/a | n/a |
| TCGA-RCC | Best competing MIL | 98.41 | n/a | n/a |
| TCGA-RCC | TransMIL | 98.82 | n/a | n/a |
Where variance is shown, results are reported as mean ± standard deviation across multiple splits and initializations (Shao et al., 2021, Chitnis et al., 2023).
Notably, TransMIL converges faster, reaching peak validation AUC in ~50 epochs versus 100–150 for attention-based pooled MIL baselines. Clinical applicability is supported by high prediction confidence (average predicted class-probability on test slides above 93%) and by interpretability through per-patch attention heatmaps that align with expert-annotated tumor regions.
6. Domain-Specific Pretraining and Model Confidence
Domain-specific patch encoders such as KimiaNet—pre-trained on hematoxylin & eosin (H&E) stained WSIs—improve both the discriminative capacity and stability of TransMIL:
- When deploying TransMIL with KimiaNet versus generic ImageNet pretraining, mean confidence rises slightly (93.57% vs 93.28%) and variance across runs decreases (±3.43 vs. ±4.98), indicating more stable and reliable predictions (Chitnis et al., 2023).
- The filters in domain-specific encoders are tuned to morphological and color traits in histopathology, which, under multi-head self-attention, yield sharper, more biologically meaningful attention matrices focusing on diagnostically relevant patches.
- This effect is particularly relevant for pathology, where confidence and interpretability are critical for clinical decision-support.
7. Interpretability, Limitations, and Future Directions
TransMIL produces attention maps highlighting regions in WSIs contributing most to the predicted diagnosis, supporting expert review and trust. Overlaying attention scores on tissue slides produces heatmaps that correspond closely to pathologist-annotated areas.
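Producing such a heatmap amounts to scattering per-patch attention scores back onto the tiling grid; a minimal sketch with illustrative names:

```python
import numpy as np

def attention_heatmap(scores, coords, grid_shape):
    """Scatter per-patch attention scores back onto the slide's patch
    grid, yielding a low-resolution heatmap that can be upsampled and
    alpha-blended over a WSI thumbnail. `coords` are (row, col) grid
    positions from the tiling step; names here are illustrative."""
    heat = np.zeros(grid_shape)
    for s, (r, c) in zip(scores, coords):
        heat[r, c] = s
    # normalize to [0, 1] for display
    span = heat.max() - heat.min()
    return (heat - heat.min()) / span if span > 0 else heat

scores = [0.9, 0.1, 0.4]
coords = [(0, 0), (0, 1), (1, 1)]
heat = attention_heatmap(scores, coords, (2, 2))
print(heat)  # highest value (1.0 after normalization) at grid cell (0, 0)
```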
Limitations include:
- Memory and computation, although linearized by the Nyström approximation, are still significant for very large WSIs or at high magnification.
- The current pipeline does not implement hierarchical or multi-resolution patch aggregation.
- Global centroid buffers for auxiliary losses such as BEL introduce minor but nontrivial memory and bookkeeping overheads when class counts rise.
Future extensions suggested include hierarchical transformers for multi-resolution analysis, graph-based patch modeling, semi/self-supervised WSI pretraining, and adaptation to survival analysis or spatial distribution prediction (Shao et al., 2021, Sens et al., 2023).
References
- "Domain-Specific Pre-training Improves Confidence in Whole Slide Image Classification" (Chitnis et al., 2023)
- "BEL: A Bag Embedding Loss for Transformer enhances Multiple Instance Whole Slide Image Classification" (Sens et al., 2023)
- "TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification" (Shao et al., 2021)