Mamba-Based MIL Encoder
- The Mamba-based MIL encoder is a neural model that replaces quadratic attention with linear-time state-space recurrences for efficient MIL.
- Variants employ multi-scale fusion and sequence reordering to capture long-range dependencies in whole slide image analysis.
- These architectures offer improved parameter efficiency and predictive performance over conventional transformer-based methods on pathology benchmarks.
A Mamba-based MIL encoder is a class of neural sequence models for Multiple Instance Learning (MIL), constructed around Mamba state space modules. These encoders—also termed selective scan state-space models—replace attention mechanisms with linear-time, recurrent state updates, offering computational efficiency and the ability to model long-range dependencies in massive instance sets such as those arising in Whole Slide Image (WSI) analysis. Mamba-based MIL frameworks have rapidly advanced, enabling scalable and modular architectures that outperform standard transformer-based methods in both efficiency and, in many cases, predictive performance across diverse computational pathology benchmarks (Dwarampudi et al., 2 Feb 2026, Yang et al., 2024, Zhang et al., 19 Jun 2025, Khan et al., 25 Sep 2025).
1. Architecture of Mamba-based MIL Encoders
Mamba-based MIL encoders process bags of instances—typically image patches or tiles—by linearly projecting the features into a latent space and then passing them through stacks of Mamba blocks. Each Mamba block implements a discretized state-space recurrence. In standard architectures such as MambaMIL and MARBLE, the patch/tile features are first extracted by a frozen backbone (e.g., ResNet-50, PLIP, or UNI), projected to a lower-dimensional embedding, and organized into sequences. The subsequent encoder stack may employ various enhancements:
- Parallel Multi-scale Fusion (MARBLE): For WSIs with hierarchical magnifications, MARBLE tiles the image into sets of patches at multiple scales. Mamba-2 blocks encode each level in parallel, while finer levels incorporate coarser-level parent context through token-aligned fusion driven by lightweight linear projections and optional gating (Dwarampudi et al., 2 Feb 2026).
- Sequence Reordering (the SR-Mamba module of MambaMIL): sequences are processed both in their original order and in a permuted (reordered) form through parallel Mamba branches, capturing non-local dependencies and enriching feature expressiveness (Yang et al., 2024).
- Hybrid GNN-Mamba (SlideMamba): This variant combines a GNN branch with a Mamba-based sequence branch, fusing their outputs via entropy-based adaptive weighting (Khan et al., 25 Sep 2025).
Pooling over the resulting instance representations uses global mean/max pooling, attention pooling, or latent-query aggregation, followed by an MLP head for slide-level or survival prediction.
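This pipeline can be summarized in a skeletal sketch. The class below is illustrative only: dimensions, the mean-pooling aggregator, and the `block_fn` hook (which stands in for any Mamba-style sequence layer, e.g. the `mamba_ssm` package's `Mamba` module) are assumptions rather than settings from the cited papers.

```python
import torch
import torch.nn as nn

class MambaMILEncoder(nn.Module):
    """Skeleton of a Mamba-based MIL encoder: pre-extracted patch features ->
    linear projection -> stack of Mamba-style blocks -> mean pooling -> MLP head."""

    def __init__(self, feat_dim=1024, d_model=256, n_layers=2, n_classes=2,
                 block_fn=lambda d: nn.Identity()):  # block_fn: placeholder for a Mamba layer
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, d_model), nn.GELU())
        self.blocks = nn.ModuleList([block_fn(d_model) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, n_classes))

    def forward(self, feats):            # feats: (1, n_patches, feat_dim), one bag per step
        x = self.proj(feats)             # (1, L, d_model)
        for blk in self.blocks:
            x = x + blk(x)               # residual sequence block
        x = self.norm(x)
        bag = x.mean(dim=1)              # global mean pooling over instances
        return self.head(bag)            # slide-level logits

# With the official Mamba layer installed, one might pass e.g.:
# encoder = MambaMILEncoder(block_fn=lambda d: Mamba(d_model=d))  # from mamba_ssm import Mamba
```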
2. State-space Recurrence and Algorithmic Complexity
The fundamental mathematical core of a Mamba-based MIL encoder is the discretized linear state-space recurrence. Given input tokens $x_t$, hidden states $h_t \in \mathbb{R}^{N}$, and output tokens $y_t$, the minimal formulation is
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$
where $\bar{A}$ and $\bar{B}$ are discretizations of the continuous-time parameters $A \in \mathbb{R}^{N \times N}$ and $B$ under a step size $\Delta$, with $A$ often factorized or diagonalized for efficiency. The convolutional view interprets the model as a 1D convolution with a learnable kernel, yielding the same mapping as the recurrence but enabling parallel computation (Dwarampudi et al., 2 Feb 2026, Yang et al., 2024).
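For completeness, unrolling this recurrence from a zero initial state gives the convolutional form referenced above; this is the standard linear-SSM identity, written in the notation of this section rather than a formula specific to the cited papers:

$$
y_t \;=\; \sum_{k=0}^{t} C\,\bar{A}^{\,k}\,\bar{B}\,x_{t-k},
\qquad
\bar{K} \;=\; \big(C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^{2}\bar{B},\; \dots\big),
\qquad
y \;=\; x \ast \bar{K},
$$

so a bag of instances can be processed either as a left-to-right scan or as a single long convolution with kernel $\bar{K}$.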
Key algorithmic characteristics include:
- Linear Complexity: Each Mamba block processes a sequence of length $L$ in $O(L)$ time and memory, removing the $O(L^2)$ bottleneck of self-attention.
- Input-dependent Parameters: Mamba's selective scan is realized by making the discretization step $\Delta$ and the projections $B$ and $C$ input-dependent, typically via small gating networks or selection modules (Yang et al., 2024, Zhang et al., 19 Jun 2025).
- Cross-scale Fusion: Multi-scale variants, such as MARBLE, employ level-specific linear projections to merge local features with parent (coarser) context prior to Mamba encoding, optionally modulated by a gating scalar (Dwarampudi et al., 2 Feb 2026).
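A minimal sketch of the selective scan underlying the first two points, assuming a diagonal state matrix and a simple first-order discretization; production Mamba kernels replace this Python loop with a fused, hardware-aware parallel scan, so shapes and the projection layout here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveScan(nn.Module):
    """Toy selective scan: per-token Delta, B, C are predicted from the input,
    then a diagonal discretized SSM is unrolled sequentially in O(L)."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Log-parameterized diagonal state matrix A (kept negative for stability).
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        # One projection emits the input-dependent Delta, B, and C per token.
        self.param_proj = nn.Linear(d_model, d_model + 2 * d_state)
        self.d_model, self.d_state = d_model, d_state

    def forward(self, x):                                     # x: (batch, L, d_model)
        n_batch, L, _ = x.shape
        delta, B_t, C_t = self.param_proj(x).split(
            [self.d_model, self.d_state, self.d_state], dim=-1)
        delta = F.softplus(delta)                             # positive step sizes
        A = -torch.exp(self.A_log)                            # (d_model, d_state)
        A_bar = torch.exp(delta.unsqueeze(-1) * A)            # (batch, L, d_model, d_state)
        B_bar = delta.unsqueeze(-1) * B_t.unsqueeze(2)        # first-order discretization
        h = x.new_zeros(n_batch, self.d_model, self.d_state)
        ys = []
        for t in range(L):                                    # linear-time recurrence
            h = A_bar[:, t] * h + B_bar[:, t] * x[:, t].unsqueeze(-1)
            ys.append((h * C_t[:, t].unsqueeze(1)).sum(-1))   # y_t = C_t h_t
        return torch.stack(ys, dim=1)                         # (batch, L, d_model)
```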
3. Enhancements: Multi-scale, Bidirectional, and Fusion Strategies
Several innovations have emerged to improve the representational power and practicality of Mamba-based MIL encoders:
- MARBLE (Multi-scale Adaptive Recurrent Biomedical Linear-time Encoder): MARBLE encodes each magnification level in WSIs in parallel, integrating coarse context into finer levels through direct parent-feature fusion. Each fine-level token is combined with its coarser-level parent feature by a lightweight, level-specific linear projection (schematically, $\tilde{x}_i^{(\ell)} = W^{(\ell)}\big[x_i^{(\ell)} \,\|\, x_{p(i)}^{(\ell-1)}\big]$), with optional gating, allowing coarse-to-fine reasoning in a single pass (Dwarampudi et al., 2 Feb 2026).
- LBMamba (Locally Bi-directional Mamba): To mitigate the receptive field limitations of unidirectional SSMs, LBMamba introduces local backward scans within each CUDA thread's window, re-establishing local bidirectionality while maintaining hardware efficiency. In stacked LBVim backbones, global bidirectionality is achieved by alternating sequence reversal at each layer (Zhang et al., 19 Jun 2025).
- Sequence Reordering (SR-Mamba): In SR-Mamba, the input is processed via a dual-branch arrangement: the standard order and a segment-wise permuted form, serving as implicit feature-level augmentation and enhancing generalization (Yang et al., 2024).
- Entropy-based Branch Fusion (SlideMamba): SlideMamba dynamically fuses the outputs of a GNN branch and a Mamba branch by estimating their prediction entropies, adaptively weighting according to branch confidence at each block (Khan et al., 25 Sep 2025).
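As a concrete illustration of the last point, the sketch below fuses two branch outputs with entropy-derived weights; the exact confidence transform and the per-block application used in SlideMamba may differ, so this weighting rule should be read as an assumption.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_fusion(logits_gnn, logits_mamba, eps=1e-8):
    """Fuse two sets of class logits, giving more weight to the branch whose
    softmax distribution has lower entropy (i.e. higher confidence)."""
    def entropy(logits):
        p = F.softmax(logits, dim=-1)
        return -(p * (p + eps).log()).sum(dim=-1, keepdim=True)   # (batch, 1)

    h_g, h_m = entropy(logits_gnn), entropy(logits_mamba)
    conf = torch.stack([-h_g, -h_m], dim=0)        # lower entropy -> higher score
    w = F.softmax(conf, dim=0)                     # normalized fusion weights, (2, batch, 1)
    return w[0] * logits_gnn + w[1] * logits_mamba

# Example: fused = entropy_weighted_fusion(torch.randn(1, 2), torch.randn(1, 2))
```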
4. MIL Pooling, Output Heads, and Objective Functions
After sequence encoding, instance representations are aggregated into a slide- or bag-level summary through MIL pooling mechanisms:
- Attention Pooling: Standard in MARBLE and MambaMIL, a learnable attention vector produces weights over instances, and the summary vector is computed as a weighted sum (Dwarampudi et al., 2 Feb 2026, Yang et al., 2024).
- Global Mean/Max Pooling: SlideMamba and some configurations of MARBLE use mean or max pooling over all instances (Khan et al., 25 Sep 2025).
- Latent-query Aggregators: LBVim backbones utilize a linear attention mechanism with learnable queries for aggregation (Zhang et al., 19 Jun 2025).
Final prediction heads are selected per task: for categorical classification, a small MLP with softmax and cross-entropy loss; for survival prediction, a Cox proportional-hazards head with the concordance index as the evaluation metric (Yang et al., 2024, Dwarampudi et al., 2 Feb 2026).
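A minimal sketch of attention pooling followed by a classification head, in the ABMIL style described above; the attention hidden width (128) and the two-class head are illustrative choices rather than settings taken from the cited papers.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Score each instance with a small MLP, softmax the scores over the bag,
    and return the attention-weighted sum as the slide-level summary."""

    def __init__(self, d_model: int, d_attn: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d_model, d_attn), nn.Tanh(),
                                   nn.Linear(d_attn, 1))

    def forward(self, h):                              # h: (1, n_instances, d_model)
        a = torch.softmax(self.score(h), dim=1)        # (1, n_instances, 1) weights
        return (a * h).sum(dim=1), a                   # summary (1, d_model), weights

pool = AttentionPooling(d_model=256)
summary, weights = pool(torch.randn(1, 5000, 256))     # 5000 patches in the bag
logits = nn.Linear(256, 2)(summary)                    # slide-level classification head
```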
5. Parameter Efficiency and Scalability
Mamba-based MIL encoders are intrinsically more parameter- and compute-efficient than transformer-based counterparts:
- No Quadratic Attention Matrices: All SSM and fusion operations scale linearly in the sequence length, with cross-scale fusion adding only a small number of extra parameters per additional scale in MARBLE (Dwarampudi et al., 2 Feb 2026).
- No Positional Embeddings: SSMs inherently model sequence order, omitting the need for explicit position encoding in most configurations (Yang et al., 2024).
- Overfitting Resistance: Techniques such as dual-order processing (SR-Mamba), linear projection to reduced rank, and sequence-level augmentation have been reported to enhance generalization and stabilize training (Yang et al., 2024); a minimal sketch of the reordering step appears at the end of this section.
In practice, the total parameter count for architectures such as MARBLE is dominated by the Mamba-2 block parameters and is only modestly increased by additional scales or fusion layers.
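To make the dual-order processing mentioned above concrete, the sketch below applies a segment-wise reordering to an instance sequence; the exact permutation, segment length, and padding used in SR-Mamba may differ, so treat these choices as assumptions.

```python
import torch

def segment_reorder(x: torch.Tensor, seg_len: int) -> torch.Tensor:
    """Group a (1, L, d) instance sequence into segments of length `seg_len`
    and read them column-wise, interleaving tokens from different segments.
    Zero-padding to a multiple of `seg_len` is an illustrative choice."""
    _, L, d = x.shape
    pad = (-L) % seg_len
    if pad:
        x = torch.cat([x, x.new_zeros(1, pad, d)], dim=1)
    n_seg = x.shape[1] // seg_len
    # (1, n_seg, seg_len, d) -> swap segment/position axes -> flatten back to a sequence.
    return x.view(1, n_seg, seg_len, d).transpose(1, 2).reshape(1, -1, d)

# A dual-branch encoder would run one Mamba pass over x and another over
# segment_reorder(x, seg_len), then combine the two resulting bag representations.
```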
6. Empirical Results and Comparative Performance
Mamba-based MIL encoders have demonstrated strong and consistent performance gains across multiple public WSI datasets. As summarized in the following table, systems such as MARBLE (Dwarampudi et al., 2 Feb 2026), LBMamba (Zhang et al., 19 Jun 2025), SlideMamba (Khan et al., 25 Sep 2025), and MambaMIL (Yang et al., 2024) report improved metrics in AUC, accuracy, and C-index relative to transformer-based or conventional MIL approaches—often with reduced compute budgets.
| Model | Linear-Time | Multi-Scale | Bidirectional | AUC Gain | Reference |
|---|---|---|---|---|---|
| MARBLE | ✓ | ✓ | No | +6.9% | (Dwarampudi et al., 2 Feb 2026) |
| LBMambaMIL | ✓ | No | ✓ (local) | +1.67% | (Zhang et al., 19 Jun 2025) |
| MambaMIL (SR) | ✓ | No | No | +2.6–2.7% | (Yang et al., 2024) |
| SlideMamba | ✓ | No | No | +0.087–0.361 | (Khan et al., 25 Sep 2025) |
Reported improvements correspond to the best gains on selected datasets and metrics versus strong MIL baselines; values are quoted as reported in the respective papers (relative percentages or absolute score differences) and are not directly comparable across rows.
7. Implementation Considerations and Best Practices
Implementation of Mamba-based MIL encoders leverages available SSM/Mamba modules in standard deep learning frameworks:
- Layer Construction: Each scale or sequence level is mapped to a Mamba-2/S4 module; fusion between levels (MARBLE) is implemented with a small linear layer and parent-index bookkeeping (Dwarampudi et al., 2 Feb 2026).
- Branch Integration: For hybrid models (e.g., SlideMamba), feature fusion is performed blockwise, with entropy weights estimated from per-branch softmax outputs (Khan et al., 25 Sep 2025).
- Training Protocols: AdamW is the typical optimizer, used with batch size 1 (one bag per step, per the MIL paradigm), dropout and batch normalization applied at each stage, and standard learning-rate schedules.
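Under those assumptions, a minimal training-loop sketch; it reuses the illustrative `MambaMILEncoder` skeleton from Section 1, and the learning rate, weight decay, epoch count, and synthetic bag are placeholders rather than settings from the cited papers.

```python
import torch

# Assumes the MambaMILEncoder sketch from Section 1 is in scope.
model = MambaMILEncoder(feat_dim=1024, d_model=256, n_classes=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-5)
criterion = torch.nn.CrossEntropyLoss()

bags = [(torch.randn(1, 4096, 1024), torch.tensor([1]))]   # one synthetic WSI bag

model.train()
for epoch in range(2):                        # learning-rate schedule omitted for brevity
    for feats, label in bags:                 # batch size 1: one bag per optimizer step
        logits = model(feats)                 # (1, n_classes)
        loss = criterion(logits, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```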
Empirical results suggest that, combined with appropriate architectural scaling (number of layers, dimension, local window sizes in LBMamba), these models provide both efficient and highly accurate solutions to gigapixel-scale WSI analysis and related applications in computational pathology.