MARBLE: Adaptive Multi-scale Biomedical Encoder
- MARBLE is a unified framework for whole-slide image analysis that integrates multi-scale feature embeddings with efficient linear-time state-space models.
- It processes gigapixel-resolution images in parallel using an explicit coarse-to-fine fusion mechanism, overcoming the quadratic bottlenecks of transformers.
- Empirical evaluations on public datasets demonstrate significant improvements in accuracy and AUC, establishing MARBLE as a new baseline for MIL-based WSI analysis.
The Multi-scale Adaptive Recurrent Biomedical Linear-time Encoder (MARBLE) is a unified, end-to-end framework for whole-slide image (WSI) analysis that leverages a purely Mamba-based, multi-scale multiple instance learning (MIL) approach. MARBLE processes gigapixel-resolution WSIs by integrating parallel, multi-scale analysis with explicit coarse-to-fine fusion of visual features and linear-time, state-space sequence encoders. This methodology eliminates the quadratic computational bottlenecks inherent to transformer-based self-attention models and achieves efficient, scalable, and generalizable multi-scale WSI encoding with minimal parameter overhead (Dwarampudi et al., 2 Feb 2026).
1. Architectural Foundations
MARBLE formalizes the WSI input as a multi-resolution pyramid of $S$ distinct magnifications, indexed from $s = 0$ (coarsest) to $s = S-1$ (finest). For each level $s$, the raw slide is partitioned into $N_s$ patches, and feature embeddings $x_{s,i} \in \mathbb{R}^d$, $i = 1, \dots, N_s$, are extracted. Each sequence is then encoded in parallel by an independent Mamba-2 block $f_s$. The overall wall-clock time is therefore dictated by the largest $N_s$ across all scales. The architecture ensures that all scales are processed simultaneously, maintaining modularity and scalability.
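A minimal PyTorch sketch of this parallel, per-level encoding skeleton follows. All names are illustrative, and the `PerLevelEncoder` uses a GRU purely as a placeholder linear-time sequence model standing in for the actual Mamba-2 blocks, which are not reproduced here.

```python
import torch
import torch.nn as nn

class PerLevelEncoder(nn.Module):
    """Stand-in for one independent block f_s (GRU as a placeholder
    linear-time sequence model; MARBLE uses Mamba-2)."""
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N_s, d)
        out, _ = self.rnn(x)
        return out

class MultiScaleEncoder(nn.Module):
    """One encoder per magnification level; levels are independent,
    so in practice they can run concurrently."""
    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.levels = nn.ModuleList([PerLevelEncoder(dim) for _ in range(num_levels)])

    def forward(self, feats_per_level):
        # feats_per_level: list of (B, N_s, d) tensors, coarse -> fine
        return [f(x) for f, x in zip(self.levels, feats_per_level)]

# Usage: three levels with growing patch counts N_s.
enc = MultiScaleEncoder(dim=64, num_levels=3)
feats = [torch.randn(1, n, 64) for n in (16, 64, 256)]
outs = enc(feats)  # wall-clock dominated by the largest N_s
```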
A central innovation is explicit coarse-to-fine reasoning: for every resolution level $s > 0$, each patch token is conditioned on its unique parent from the preceding coarser level using a learned fusion block. This is implemented as
$$\tilde{x}_{s,i} = g_s\!\big(x_{s,i},\; x_{s-1,\pi(i)}\big),$$
with $\pi(\cdot)$ mapping fine-level patches to their unique parent. The fused tokens $\tilde{x}_{s,i}$ serve as the input for Mamba-based state-space modeling.
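Assuming axis-aligned patch grids and an integer downsampling ratio between adjacent levels (an assumption; the source only states that each fine patch has a unique coarse parent), the parent map $\pi$ reduces to coordinate arithmetic:

```python
def parent_index(i: int, grid_w_fine: int, grid_w_coarse: int, ratio: int) -> int:
    """Map a fine-level patch index (row-major) to its unique coarse parent.

    Assumes each coarse patch covers a ratio x ratio block of fine patches.
    """
    row, col = divmod(i, grid_w_fine)
    return (row // ratio) * grid_w_coarse + (col // ratio)

# Example: a 4x4 fine grid over a 2x2 coarse grid (ratio 2).
# Fine patch 6 sits at (row=1, col=2) -> coarse parent (0, 1) -> index 1.
assert parent_index(6, grid_w_fine=4, grid_w_coarse=2, ratio=2) == 1
```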
2. State-space Modeling with Mamba
Each parallel stream leverages a discrete-time state-space model, instantiated using the Mamba-2 architecture. For each level $s$ and token index $t$:
$$h_{s,t} = A_s h_{s,t-1} + B_s \tilde{x}_{s,t}, \qquad y_{s,t} = C_s h_{s,t},$$
with $h_{s,0} = 0$. The state transition matrix $A_s$ is diagonal or low-rank plus diagonal, while $B_s$ and $C_s$ are chosen to allow rank-one or sparse factorization, guaranteeing that per-token update and memory costs remain constant in sequence length. This establishes $O(N_s)$ processing time per level, in contrast to the $O(N_s^2)$ time of self-attention mechanisms.
Notably, there is no recurrence or additional cross-level modeling beyond the fused input $\tilde{x}_{s,t}$, which greatly simplifies implementation and parallelization.
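The sketch below shows the sequential form of this recurrence with a diagonal state matrix. It is a deliberate simplification: in Mamba-2 the state dynamics are input-dependent and the scan is parallelized, neither of which is modeled here.

```python
import torch

def diagonal_ssm_scan(x, a, B, C):
    """Sequential form of h_t = a * h_{t-1} + B x_t,  y_t = C h_t.

    x: (N, d) token sequence; a: (m,) diagonal of A; B: (m, d); C: (d, m).
    Cost is O(N) in sequence length, vs O(N^2) for self-attention.
    """
    N, _ = x.shape
    h = torch.zeros(a.shape[0])
    ys = []
    for t in range(N):          # exactly one state update per token
        h = a * h + B @ x[t]
        ys.append(C @ h)
    return torch.stack(ys)      # (N, d)

# Toy usage: 8 tokens of dimension 4, state size 16, stable decay a < 1.
y = diagonal_ssm_scan(torch.randn(8, 4), torch.rand(16) * 0.9,
                      torch.randn(16, 4), torch.randn(4, 16))
```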
3. Coarse-to-fine Conditioning and Regularization
The cross-scale fusion module introduces adaptive gating between fine- and coarse-level features through a learned weight $\alpha$. This explicit concatenative conditioning enables direct hierarchical reasoning previously unavailable in single-scale MIL or standard transformer approaches.
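One plausible realization of this gated, concatenative fusion is sketched below; the exact parameterization of the gate $\alpha$ and of the projection is an assumption, not the paper's stated design.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse a fine token with its coarse parent via a learned gate alpha."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, fine: torch.Tensor, parent: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([fine, parent], dim=-1)   # explicit concatenation
        alpha = self.gate(cat)                    # adaptive, per-token gate
        return alpha * fine + (1 - alpha) * self.proj(cat)

# Usage: fuse 10 fine tokens with their gathered coarse parents.
fused = GatedFusion(64)(torch.randn(10, 64), torch.randn(10, 64))
```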
Parameter efficiency is maintained: each additional level introduces only the parameters of its level-specific fusion block $g_s$, with the majority of parameters concentrated in the shared Mamba state-space blocks. Regularization schemes include the following (a minimal sketch follows the list):
- Random coarse-branch drop: during training, a random fraction of level-0 (coarse) tokens is nullified and their fine-level descendants pruned, preventing over-reliance on any single scale and promoting generalization.
- Scan-order neutrality: tokens are permuted prior to encoding, enforcing permutation invariance and reducing spatial positional bias.
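A minimal sketch of both schemes, assuming the parent-index representation from Section 1; drop rates and tensor layouts are illustrative.

```python
import torch

def coarse_branch_drop(coarse_mask, parent_of_fine, drop_frac=0.2):
    """Randomly nullify a fraction of coarse tokens and prune their
    fine-level descendants (rates and layout are assumptions).

    coarse_mask: (N0,) bool keep-mask; parent_of_fine: (N1,) long parent ids.
    Returns the updated (coarse_mask, fine_mask).
    """
    drop = torch.rand(coarse_mask.shape[0]) < drop_frac
    coarse_mask = coarse_mask & ~drop
    fine_mask = coarse_mask[parent_of_fine]  # descendants of dropped parents go too
    return coarse_mask, fine_mask

def shuffle_tokens(x):
    """Scan-order neutrality: random permutation before encoding."""
    return x[torch.randperm(x.shape[0])]

# Usage: 4 coarse tokens, each with 2 fine children.
keep0 = torch.ones(4, dtype=torch.bool)
parents = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
keep0, keep1 = coarse_branch_drop(keep0, parents, drop_frac=0.5)
```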
4. Computational Complexity
MARBLE's linear-time complexity is central to its scalability. For $S$ scales with input sequence lengths $N_0, \dots, N_{S-1}$, the following holds:
| Component | MARBLE Complexity | Transformer Complexity |
|---|---|---|
| Per-level computation | $O(N_s)$ | $O(N_s^2)$ |
| Overall wall-clock (parallel) | $O(\max_s N_s)$ | $O(\max_s N_s^2)$ |
| Total memory (all levels) | $O(\sum_s N_s)$ | $O(\sum_s N_s^2)$ |
MARBLE thus provides true linear-time encoding over both spatial and magnification axes, enabling practical analysis of gigapixel WSIs spanning multiple scales. This represents a substantial computational advantage over standard transformer-based models, whose self-attention matrices scale quadratically with input size (Dwarampudi et al., 2 Feb 2026).
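To make the gap concrete, a back-of-the-envelope comparison for a single level (the patch count is illustrative, not taken from the paper):

```python
# Illustrative cost comparison for one level with N = 40,000 fine patches.
N = 40_000
linear_updates = N            # one state update per token: ~4.0e4
attention_pairs = N * N       # pairwise score matrix: ~1.6e9
print(attention_pairs // linear_updates)  # 40,000x more work for attention
```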
5. Empirical Evaluation and Comparative Performance
MARBLE has been empirically validated on five public WSI datasets: PANDA prostate, TCGA-NSCLC, and three TCGA survival cohorts (KIRP, LUAD, STAD). Relative to established MIL baselines—ABMIL, CLAM, TransMIL, S4-MIL, MambaMIL variants—MARBLE consistently achieves state-of-the-art or superior results:
- PANDA: Accuracy gain of +20.25 percentage points (0.5075 → 0.7100); AUC gain of +6.94 points (0.8184 → 0.8878).
- TCGA-NSCLC: Accuracy +1.16 (0.8850 → 0.8966); AUC +1.04 (0.9626 → 0.9730).
- Survival cohorts (KIRP, LUAD, STAD): Average C-index increase of +2.3 points (e.g., STAD: 0.6428 → 0.6510).
Ablation studies confirm that the two-scale ("coarse + fine") variant outperforms either single-magnification model across multiple metrics, substantiating the efficacy of explicit cross-scale conditioning.
6. Significance and Context within the Field
MARBLE demonstrates that parallel, explicit multi-scale processing with linear-time state-space modeling can overcome both the memory and computational limitations of attention-based frameworks for WSI, without sacrificing representational capacity or generalization. Its architecture provides a reproducible template for scalable analysis of multi-resolution biomedical images and establishes a new performance baseline for MIL-based WSI modeling.
This approach invites potential broadening to other domains where multi-scale data and efficient sequence modeling are essential and suggests a paradigm shift away from reliance on self-attention toward efficient state-space mechanisms for large-scale biomedical analysis (Dwarampudi et al., 2 Feb 2026).