
MambaMIL+ for Gigapixel WSI Analysis

Updated 26 December 2025
  • MambaMIL+ is a multiple instance learning framework for gigapixel whole-slide images that integrates overlapping scanning, a selective stripe position encoder (S²PE), and contextual token selection (CTS) to improve spatial context and diagnostic accuracy.
  • It employs a linear-complexity state-space model to efficiently capture long-range dependencies, yielding measurable gains in classification, molecular prediction, and survival analysis.
  • The framework uses advanced positional encoding and entropy-based token selection to mitigate memory decay and enhance long-sequence feature integration.

MambaMIL+ is a state-of-the-art multiple instance learning (MIL) framework designed for long-sequence modeling in gigapixel whole-slide images (WSIs), particularly in computational pathology. Building on the linear-complexity Mamba state space model, MambaMIL+ introduces three substantial innovations—overlapping scanning, a selective stripe position encoder (S²PE), and contextual token selection (CTS)—which collectively enhance spatial context integration and stabilize memory for long-range dependency modeling. MambaMIL+ sets new benchmarks in diagnostic classification, molecular prediction, and survival analysis across diverse datasets and feature extractors, demonstrating consistent performance improvements over previous MIL methods (Zeng et al., 19 Dec 2025).

1. Background and Formalization

In the MIL paradigm applied to WSIs, a slide $X$ is partitioned into $N$ overlapping or non-overlapping patches $\{x_i\}_{i=1}^N$, with each patch $x_i \in \mathbb{R}^{H \times W \times 3}$. Each patch is converted to a $D$-dimensional embedding $p_i = f(x_i) \in \mathbb{R}^D$ by a frozen CNN or transformer-based backbone (e.g., ResNet-50, PLIP, CONCH). The sequence $\mathcal{P} = \{p_1, \ldots, p_N\}$ is input to a MIL module (here, MambaMIL+), which aggregates embeddings and outputs a bag-level prediction $\hat{Y} = g(p_1, \ldots, p_N)$ for tasks such as classification or survival analysis. Weak supervision and ultra-long sequence lengths pose major modeling challenges, including context integration, memory decay, and computational scalability (Zeng et al., 19 Dec 2025).
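
To ground this formalization, here is a minimal sketch of a generic MIL forward pass in PyTorch, with mean pooling standing in for the aggregator $g$; the class name `MeanPoolMIL` and the dimensions are illustrative assumptions, not the MambaMIL+ architecture.

```python
import torch
import torch.nn as nn

class MeanPoolMIL(nn.Module):
    """Generic MIL head: maps N patch embeddings to one bag-level prediction.

    Mean pooling is a placeholder for the aggregator g; MambaMIL+ replaces
    it with a selective state-space sequence model.
    """
    def __init__(self, embed_dim: int = 512, n_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (N, D) precomputed by a frozen backbone f(x_i)
        bag_repr = patch_embeddings.mean(dim=0)  # aggregate instance features
        return self.classifier(bag_repr)         # bag-level logits Y_hat

# Example: a slide with N = 10,000 patches embedded to D = 512 dimensions
logits = MeanPoolMIL()(torch.randn(10_000, 512))
```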

2. MambaMIL+ Architecture and Innovations

MambaMIL+ leverages the Mamba selective state-space dual (SSD) model, which achieves global receptive fields and strictly linear cost in sequence length ($O(L)$). To overcome memory and context limitations in baseline MambaMIL, MambaMIL+ introduces the following:

  • Overlapping Scanning: WSIs are scanned using four interleaved, shifted grids with stride $s$ and overlap $o = P - s$ (where $P$ is patch size), yielding $4N$ patches and explicit spatial redundancy. The sequences from all four grids are concatenated, quadrupling the token count and embedding neighborhood context directly into the instance-level representation.
  • Selective Stripe Position Encoder (S²PE): To counter order bias and encode positional information, tokens are spatially mapped back into an $H \times W \times D$ grid using a bijection $\mathcal{T}$. A 1D dilated convolution is applied exclusively along the vertical (stripe) axis, then the features are flattened, providing directional positional encoding focused on pathologist-relevant features.
  • Contextual Token Selection (CTS): An auxiliary instance learner head $g_\theta$ predicts logits $g_\theta(p_i)$ for each patch, yielding softmax predictions $\hat{y}_i$. Per-token entropy $H_i = -\sum_c \hat{y}_{i,c}\,\log \hat{y}_{i,c}$ quantifies uncertainty. The top $r\%$ most-uncertain tokens are masked (with mask $M$), and only selected tokens are injected into the SSD update, while masked tokens propagate the previous state:

$$h_i = \begin{cases} h_{i-1}, & i \in \mathbb{S}_r \\ \bar{A}_i h_{i-1} + \bar{B}_i p_i, & \text{otherwise} \end{cases}$$

This mechanism prioritizes discriminative evidence and mitigates exponential memory decay, addressing a key SSM limitation (Zeng et al., 19 Dec 2025).
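
The masked update above can be sketched as follows, assuming a diagonal (elementwise) SSD recurrence for readability; the function name `cts_masked_scan`, the `mask_ratio` argument, and the shapes of the discretized parameters `A_bar`/`B_bar` are illustrative assumptions, not the paper's implementation.

```python
import torch

def cts_masked_scan(p, logits, A_bar, B_bar, mask_ratio=0.15):
    """Selective scan with Contextual Token Selection (CTS).

    p:            (L, D) patch embeddings
    logits:       (L, C) instance-learner outputs g_theta(p_i)
    A_bar, B_bar: (L, D) per-step discretized SSM parameters (diagonal case)

    The top `mask_ratio` fraction of tokens by predictive entropy carry the
    previous state forward instead of injecting their input.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # H_i per token
    k = int(mask_ratio * len(entropy))
    masked = torch.zeros(len(entropy), dtype=torch.bool)
    masked[entropy.topk(k).indices] = True                    # i in S_r

    h = torch.zeros_like(p[0])
    for i in range(len(p)):
        if not masked[i]:
            h = A_bar[i] * h + B_bar[i] * p[i]  # standard SSD update
        # else: h_i = h_{i-1}; the uncertain token leaves the state untouched
    return h
```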

3. Algorithmic Workflow

A typical MambaMIL+ pipeline proceeds as follows:

  1. Patch Extraction: The WSI is tiled into a fourfold-overlapped patch sequence (yielding $Z \in \mathbb{R}^{4N \times D}$ after feature extraction).
  2. Feature Extraction: Each patch is converted to a $D$-dimensional vector via a pretrained CNN or vision transformer.
  3. Instance Learner & CTS: The logit head predicts $\hat{y}_i$, computes $H_i$, and masks the most-uncertain tokens to form $M$.
  4. Spatial Encoding (S²PE): Masked features are mapped to a grid, stripe-convolved, and flattened.
  5. Sequence Modeling: S²PE-enhanced tokens are processed through the SSD, using CTS-masked updates.
  6. Aggregation & Prediction: The final SSD hidden state $h_{4N}$ is fed to a slide-level MLP or Cox head, depending on the task.
  7. Loss Function: End-to-end training via bag-level cross-entropy (classification), Cox partial likelihood (survival, sketched below), and an optional instance-level auxiliary loss.
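
To make the survival objective in step 7 concrete, the sketch below implements the standard negative Cox partial log-likelihood (Breslow form) over a batch of slide-level risk scores; this is the textbook formulation, assumed here rather than taken from the paper's code.

```python
import torch

def cox_partial_likelihood_loss(risk, time, event):
    """Negative Cox partial log-likelihood (Breslow approximation).

    risk:  (B,) predicted log-risk scores from the slide-level Cox head
    time:  (B,) observed survival or censoring times
    event: (B,) 1 if the event was observed, 0 if censored
    """
    event = event.float()
    order = torch.argsort(time, descending=True)  # sort so that each risk set
    risk, event = risk[order], event[order]       # is a prefix of the batch
    log_cumsum = torch.logcumsumexp(risk, dim=0)  # log sum over the risk set
    # Sum over observed events of: risk_i - log sum_{j: t_j >= t_i} exp(risk_j)
    return -((risk - log_cumsum) * event).sum() / event.sum().clamp_min(1.0)
```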

Key hyperparameters include patch size ($P = 224$ px), overlap ($o = 112$ px), SSD memory window (full sequence), CTS masking ratio ($r \in [5\%, 25\%]$), and S²PE kernel/dilation (kernel = 3, dilation = 2). The optimizer is Adam, and evaluation is via up to $5 \times 5$-fold cross-validation (Zeng et al., 19 Dec 2025).
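
The stripe encoding in S²PE can be sketched as a depthwise 1D dilated convolution along the vertical axis of the reconstructed token grid, using the reported kernel size 3 and dilation 2; the residual formulation and the class name `StripePositionEncoder` are assumptions for illustration, and the mapping $\mathcal{T}$ back to the grid is taken as given.

```python
import torch
import torch.nn as nn

class StripePositionEncoder(nn.Module):
    """1D dilated conv along the vertical (stripe) axis of an HxWxD grid,
    mirroring the S2PE idea: directional positional context, then re-flatten."""
    def __init__(self, dim: int, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2  # preserve the grid height
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=pad, dilation=dilation, groups=dim)

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (H, W, D) tokens mapped back to their slide positions via T
        H, W, D = grid.shape
        cols = grid.permute(1, 2, 0)    # (W, D, H): one vertical stripe per column
        cols = cols + self.conv(cols)   # convolve down each stripe (residual)
        return cols.permute(2, 0, 1).reshape(H * W, D)  # flatten back to tokens

# Example: a 64 x 64 token grid with D = 512 returns (4096, 512) tokens
tokens = StripePositionEncoder(dim=512)(torch.randn(64, 64, 512))
```

Restricting the convolution to one axis is what gives the encoding its directional, stripe-shaped receptive field, in contrast to the generic positional encodings that the ablations below found to degrade Mamba performance.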

4. Empirical Performance and Results

MambaMIL+ was evaluated on 20 benchmarks spanning diagnostic classification (e.g., BRACS-7, Camelyon16/17, NSCLC), molecular prediction (e.g., BRCA-PAM50, CRC-Molecular), and survival analysis (multiple TCGA cohorts). The following improvements were observed relative to previous best approaches:

  • Diagnostic Classification:
    • With ResNet-50: +1.5% AUC, +2.9% ACC, +3.0% F1-score
    • With PLIP: +0.9% AUC, +1.1% ACC, +0.9% F1
    • With CONCH: +0.5% AUC, +2.0% ACC, +1.1% F1
  • Molecular Prediction:
    • Mean +1.4–3.4% AUC (with largest absolute gains for CONCH features)
  • Survival Analysis:
    • With ResNet-50: +1.5% C-Index
    • With PLIP: +0.6% C-Index
    • With CONCH: +0.7% C-Index

All reported gains are statistically significant at $p < 0.05$ (paired t-test). Ablation studies indicate that each of the three core components (overlapping scanning, CTS, and S²PE) contributes independently, and their combination yields maximal gain: for example, on five diagnostic tasks, MambaMIL+ achieves +1.9% AUC, +3.6% ACC, +3.6% F1 over a vanilla Mamba backbone (Zeng et al., 19 Dec 2025).

5. Technical Ablations and Design Analysis

Ablation studies in (Zeng et al., 19 Dec 2025) revealed:

  • Overlapping Scanning alone improved AUC by +0.9%
  • CTS alone improved AUC by +0.8%
  • S²PE alone improved AUC by +0.7%
  • Co-application of all three yielded the maximal observed improvement.
  • CTS using entropy-based selection outperformed random, uniform, attention-driven, or pre-pruning alternatives by ≈1.5% AUC.
  • S²PE provided a unique benefit due to the structured directional encoding. Generic position encodings (e.g., Transformers’ sinusoidal) degraded Mamba performance, suggesting Mamba’s position modeling requires adaptively matched schemes.
  • The quadrupling of token count from overlapping scanning increased memory consumption. This suggests memory optimization remains an open direction.

6. Related Mamba and MIL Approaches

MambaMIL+ exists in the context of a spectrum of Mamba/MIL innovations. Noteworthy related directions:

  • LBMamba (Locally Bi-Directional Mamba): LBMamba enables bidirectional context by embedding a thread-private local backward scan. When integrated into MambaMIL as MambaMIL+, this yields up to +3.06% AUC and +3.39% F1 over vanilla MambaMIL with only ≈2% additional per-block runtime (Zhang et al., 19 Jun 2025). Alternate bidirectional mechanisms (e.g., global bi-Mamba) increase cost more substantially.
  • Sequence Reordering (SR-Mamba): Exposes SSMs to differently ordered contexts, improving discriminative feature mining in sparse, scattered regions, and reducing overfitting; however, spatial context is still limited compared to explicit spatial enhancements in MambaMIL+ (Yang et al., 11 Mar 2024).
  • Non-pathology Applications: MambaMIL+ (as implemented for robotic imitation learning) demonstrates that the compact SSM-based encoding delivers superior real-world control robustness and output smoothness versus LSTMs and Transformers, though the primary architectural innovations (overlapping scan, S²PE, CTS) are specific to the computational pathology use case (Tsuji, 4 Sep 2024).

7. Limitations and Future Directions

Identified limitations and forward-looking areas include:

  • Overlapping scanning increases computational and memory demands; future work may explore learned, sparse, or adaptive overlapping schemes.
  • The sequential scan, even when spatialized, may retain a strong order bias. Full 2D state-space models could further mitigate this, though they must address slide regions without tissue.
  • CTS depends on the reliability of the instance learner; improved instance-level supervision, or active/semi-supervised learning, could further enhance results.
  • The methodology’s generalization to rare cancer types, multimodal data integration (e.g., clinical metadata), and further efficiency refinements remain active research avenues (Zeng et al., 19 Dec 2025).
