
MARBLE: Adaptive Multi-scale Biomedical Encoder

Updated 27 March 2026
  • MARBLE is a unified framework for whole-slide image analysis that integrates multi-scale feature embeddings with efficient linear-time state-space models.
  • It processes gigapixel-resolution images in parallel using an explicit coarse-to-fine fusion mechanism, overcoming the quadratic bottlenecks of transformers.
  • Empirical evaluations on public datasets demonstrate significant improvements in accuracy and AUC, establishing MARBLE as a new baseline for MIL-based WSI analysis.

The Multi-scale Adaptive Recurrent Biomedical Linear-time Encoder (MARBLE) is a unified, end-to-end framework for whole-slide image (WSI) analysis that leverages a purely Mamba-based multi-scale multiple instance learning (MIL) approach. MARBLE processes gigapixel-resolution WSIs by integrating parallel, multi-scale analysis with explicit coarse-to-fine fusion of visual features and linear-time, state-space sequence encoders. This methodology eliminates the quadratic computational bottlenecks inherent to transformer-based self-attention models and achieves efficient, scalable, and generalizable multi-scale WSI encoding with minimal parameter overhead (Dwarampudi et al., 2 Feb 2026).

1. Architectural Foundations

MARBLE formalizes the WSI input as a multi-resolution pyramid composed of $S+1$ distinct magnifications, indexed from $k=0$ (coarsest) to $k=S$ (finest). For each level $k$, the raw slide is partitioned into $T_k$ patches, and feature embeddings $\mathbf{X}^{(k)} = [\mathbf{x}_1^{(k)}, \dots, \mathbf{x}_{T_k}^{(k)}]^\top$, where $\mathbf{x}_i^{(k)} \in \mathbb{R}^D$, are extracted. Each sequence $\mathbf{X}^{(k)}$ is then encoded in parallel by an independent Mamba-2 block $M_k$; the overall wall-clock time is therefore dictated by the largest $T_k$ across all scales. The architecture ensures that all scales are processed simultaneously, maintaining modularity and scalability.
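To make the pyramid concrete, the following is a minimal sketch of a container for the multi-scale input, assuming PyTorch tensors; the class and field names (`WSIPyramid`, `feats`, `parent`) are illustrative and not taken from the paper.

```python
import torch
from dataclasses import dataclass

@dataclass
class WSIPyramid:
    """Multi-resolution WSI features for levels k = 0 (coarsest) .. S (finest).

    feats[k]  : (T_k, D) patch embeddings X^(k) at magnification level k.
    parent[k] : (T_k,) index p_k(i) of patch i's unique parent at level k-1
                (parent[0] is unused, since level 0 has no coarser parent).
    """
    feats: list
    parent: list

    @property
    def num_levels(self) -> int:
        return len(self.feats)  # S + 1 magnification levels
```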

A central innovation is explicit coarse-to-fine reasoning: for every resolution level $k > 0$, each patch token $\mathbf{x}_i^{(k)}$ is conditioned on its unique parent from the preceding coarser level using a learned fusion block. This is implemented as

$$\mathbf{c}_i^{(k)} = \mathbf{y}^{(k-1)}_{p_k(i)}, \qquad \tilde{\mathbf{x}}_i^{(k)} = \phi^{(k)}\!\left([\mathbf{x}_i^{(k)} \,\Vert\, \mathbf{c}_i^{(k)}]\right) = W^{(k)} [\mathbf{x}_i^{(k)}; \mathbf{c}_i^{(k)}] + b^{(k)},$$

with $p_k(i)$ mapping fine-level patches to their unique parent. The fused tokens $\tilde{\mathbf{x}}_i^{(k)}$ serve as the input for Mamba-based state-space modeling.
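A minimal PyTorch sketch of the fusion block $\phi^{(k)}$, assuming the parent mapping $p_k$ is materialized as an index tensor; the class and argument names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class CoarseToFineFusion(nn.Module):
    """Computes x~_i = W [x_i ; c_i] + b, where c_i is the encoded parent token."""

    def __init__(self, dim: int):
        super().__init__()
        # W^(k) in R^{D x 2D} plus bias b^(k), matching the equation above
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x_fine: torch.Tensor, y_coarse: torch.Tensor,
                parent_idx: torch.Tensor) -> torch.Tensor:
        # x_fine:     (T_k, D)     raw fine-level patch embeddings
        # y_coarse:   (T_{k-1}, D) Mamba outputs from the coarser level
        # parent_idx: (T_k,)       p_k(i), each fine patch's unique parent index
        c = y_coarse[parent_idx]                          # gather context c_i^(k)
        return self.proj(torch.cat([x_fine, c], dim=-1))  # fused tokens
```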

2. State-space Modeling with Mamba

Each parallel stream leverages a discrete-time state-space model, instantiated using the Mamba-2 architecture. For each level $k$ and token $t$:

$$\mathbf{h}_t^{(k)} = A^{(k)} \mathbf{h}_{t-1}^{(k)} + B^{(k)} \tilde{\mathbf{x}}_t^{(k)}, \qquad \mathbf{y}_t^{(k)} = C^{(k)} \mathbf{h}_t^{(k)} + D^{(k)} \tilde{\mathbf{x}}_t^{(k)},$$

with $\mathbf{h}_t^{(k)}, \mathbf{y}_t^{(k)} \in \mathbb{R}^D$. The state-transition matrix $A^{(k)}$ is diagonal or low-rank plus diagonal, while $B^{(k)}, C^{(k)}, D^{(k)}$ are chosen to allow rank-one or sparse factorization, guaranteeing that per-token updates and memory requirements scale linearly as $\mathcal{O}(D)$. Processing time is therefore $\mathcal{O}(T_k D)$ per level, in contrast to the quadratic time of self-attention mechanisms.
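For intuition, here is a sequential reference implementation of the recurrence, assuming purely diagonal (elementwise) $A, B, C, D$; real Mamba-2 blocks use input-dependent parameters and a hardware-efficient parallel scan, so this loop is illustrative only.

```python
import torch

def diagonal_ssm_scan(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                      C: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Reference for h_t = A h_{t-1} + B x_t and y_t = C h_t + D x_t.

    x: (T, D) fused tokens; A, B, C, D: (D,) diagonal parameters. Each
    update touches O(D) values, giving O(T * D) time for a whole level.
    """
    T, dim = x.shape
    h = torch.zeros(dim)
    ys = []
    for t in range(T):
        h = A * h + B * x[t]           # O(D) state update
        ys.append(C * h + D * x[t])    # O(D) output read-out
    return torch.stack(ys)             # (T, D) encoded sequence
```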

Notably, there is no recurrence or additional cross-level modeling beyond the fused input $\tilde{\mathbf{x}}_t^{(k)}$, greatly simplifying implementation and parallelization.

3. Coarse-to-fine Conditioning and Regularization

The cross-scale fusion module $\phi^{(k)}$ introduces adaptive gating between fine- and coarse-level features through a learned weight $W^{(k)} \in \mathbb{R}^{D \times 2D}$. This explicit concatenative conditioning enables direct hierarchical reasoning previously unavailable in single-scale MIL or standard transformer approaches.

Parameter efficiency is maintained: each additional level introduces only $\mathcal{O}(D^2)$ parameters specific to $\phi^{(k)}$, with the majority concentrated in the shared Mamba state-space blocks. Regularization schemes include:

  • Random coarse-branch drop: during training, a fraction $\alpha$ of level-0 (coarse) tokens is randomly nullified and their subordinate descendants pruned, preventing over-reliance on any single scale and promoting generalization (see the sketch after this list).
  • Scan-order neutrality: Tokens are permuted prior to encoding, enforcing permutation invariance and reducing spatial positional bias.
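Below is a simplified two-level sketch of these regularizers, reusing the parent-index representation from earlier; the function names and the default $\alpha$ are illustrative, and the paper's exact pruning procedure may differ.

```python
import torch

def coarse_branch_drop(x0: torch.Tensor, x1: torch.Tensor,
                       parent_idx: torch.Tensor, alpha: float = 0.25):
    """Drop a random fraction alpha of level-0 tokens and prune their
    level-1 descendants (alpha = 0.25 is an assumed default)."""
    keep = torch.rand(x0.shape[0]) >= alpha          # surviving coarse tokens
    new_idx = torch.cumsum(keep.long(), dim=0) - 1   # old -> new coarse indices
    fine_keep = keep[parent_idx]                     # fine token survives iff its parent does
    return (x0[keep],                                # pruned coarse tokens
            x1[fine_keep],                           # pruned fine tokens
            new_idx[parent_idx[fine_keep]])          # remapped parent indices

def shuffle_tokens(x: torch.Tensor) -> torch.Tensor:
    """Scan-order neutrality: permute tokens before feeding the encoder."""
    return x[torch.randperm(x.shape[0])]
```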

4. Computational Complexity

MARBLE's linear-time complexity is central to its scalability. For $K$ scales and input sequence lengths $T_k$, the following holds:

| Component | MARBLE Complexity | Transformer Complexity |
|---|---|---|
| Per-level computation | $\mathcal{O}(T_k D)$ | $\mathcal{O}(T_k^2 D)$ |
| Overall wall-clock (parallel) | $\mathcal{O}(\max_k T_k D)$ | $\mathcal{O}(T^2 D)$ |
| Total memory (all levels) | $\mathcal{O}(\sum_k T_k D)$ | $\mathcal{O}(T^2)$ |

MARBLE thus provides true linear-time encoding over both spatial and magnification axes, enabling practical analysis of gigapixel WSIs spanning multiple scales. This represents a substantial computational advantage over standard transformer-based models, whose self-attention matrices scale quadratically with input size (Dwarampudi et al., 2 Feb 2026).
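A back-of-envelope comparison makes the gap concrete; the patch counts and feature dimension below are hypothetical, chosen only to illustrate the scaling behavior.

```python
# Hypothetical sizes for a gigapixel slide tiled at two magnifications.
D = 768                                             # feature dimension (assumed)
T = {"coarse (k=0)": 4_000, "fine (k=1)": 64_000}   # patches per level (assumed)

for level, t in T.items():
    linear = t * D           # O(T_k D): MARBLE's per-level cost
    quadratic = t * t * D    # O(T_k^2 D): self-attention's per-level cost
    print(f"{level}: {linear:.1e} vs {quadratic:.1e} ops "
          f"(attention is {t:,}x larger)")
```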

5. Empirical Evaluation and Comparative Performance

MARBLE has been empirically validated on five public WSI datasets: PANDA prostate, TCGA-NSCLC, and three TCGA survival cohorts (KIRP, LUAD, STAD). Relative to established MIL baselines (ABMIL, CLAM, TransMIL, S4-MIL, and MambaMIL variants), MARBLE consistently matches or exceeds the state of the art:

  • PANDA: Accuracy gain of +20.25 percentage points (0.5075 → 0.7100); AUC gain of +6.94 points (0.8184 → 0.8878).
  • TCGA-NSCLC: Accuracy +1.16 (0.8850 → 0.8966); AUC +1.04 (0.9626 → 0.9730).
  • Survival cohorts (KIRP, LUAD, STAD): Average C-index increase of +2.3 points (e.g., STAD: 0.6428 → 0.6510).

Ablation studies confirm that the two-scale ("coarse + fine") variant outperforms either single-magnification model along multiple metrics, substantiating the efficacy of explicit cross-scale conditioning.

6. Significance and Context within the Field

MARBLE demonstrates that parallel, explicit multi-scale processing with linear-time state-space modeling can overcome both the memory and computational limitations of attention-based frameworks for WSI, without sacrificing representational capacity or generalization. Its architecture provides a reproducible template for scalable analysis of multi-resolution biomedical images and establishes a new performance baseline for MIL-based WSI modeling.

This approach invites extension to other domains where multi-scale data and efficient sequence modeling are essential, and it suggests a shift away from reliance on self-attention toward efficient state-space mechanisms for large-scale biomedical analysis (Dwarampudi et al., 2 Feb 2026).

References

1. Dwarampudi et al. MARBLE: Multi-scale Adaptive Recurrent Biomedical Linear-time Encoder. 2 Feb 2026.
