MARBLE: Adaptive Multi-scale Biomedical Encoder
- MARBLE is a unified framework for whole-slide image analysis that integrates multi-scale feature embeddings with efficient linear-time state-space models.
- It processes gigapixel-resolution images in parallel using an explicit coarse-to-fine fusion mechanism, overcoming the quadratic bottlenecks of transformers.
- Empirical evaluations on public datasets demonstrate significant improvements in accuracy and AUC, establishing MARBLE as a new baseline for MIL-based WSI analysis.
The Multi-scale Adaptive Recurrent Biomedical Linear-time Encoder (MARBLE) is a unified, end-to-end framework for whole-slide image (WSI) analysis that leverages a purely Mamba-based, multi-scale multiple instance learning (MIL) approach. MARBLE processes gigapixel-resolution WSIs by integrating parallel, multi-scale analysis with explicit coarse-to-fine fusion of visual features and linear-time, state-space sequence encoders. This methodology eliminates the quadratic computational bottlenecks inherent to transformer-based self-attention models and achieves efficient, scalable, and generalizable multi-scale WSI encoding with minimal parameter overhead (Dwarampudi et al., 2 Feb 2026).
1. Architectural Foundations
MARBLE formalizes the WSI input as a multi-resolution pyramid of $S$ distinct magnifications, indexed from $s = 0$ (coarsest) to $s = S-1$ (finest). For each level $s$, the raw slide is partitioned into $N_s$ patches, and feature embeddings $x_{s,i} \in \mathbb{R}^d$, $i = 1, \dots, N_s$, are extracted. Each sequence is then encoded in parallel by an independent Mamba-2 block $f_s$. The overall wall-clock time is therefore dictated by the largest $N_s$ across all scales. The architecture ensures that all scales are processed simultaneously, maintaining modularity and scalability.
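A minimal PyTorch sketch of this parallel, per-level encoding skeleton follows. All names are illustrative, and the `PerLevelEncoder` uses a GRU purely as a placeholder linear-time sequence model standing in for the actual Mamba-2 blocks, which are not reproduced here.

```python
import torch
import torch.nn as nn

class PerLevelEncoder(nn.Module):
    """Stand-in for one independent block f_s (GRU as a placeholder
    linear-time sequence model; MARBLE uses Mamba-2)."""
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N_s, d)
        out, _ = self.rnn(x)
        return out

class MultiScaleEncoder(nn.Module):
    """One encoder per magnification level; levels are independent,
    so in practice they can run concurrently."""
    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.levels = nn.ModuleList([PerLevelEncoder(dim) for _ in range(num_levels)])

    def forward(self, feats_per_level):
        # feats_per_level: list of (B, N_s, d) tensors, coarse -> fine
        return [f(x) for f, x in zip(self.levels, feats_per_level)]

# Usage: three levels with growing patch counts N_s.
enc = MultiScaleEncoder(dim=64, num_levels=3)
feats = [torch.randn(1, n, 64) for n in (16, 64, 256)]
outs = enc(feats)  # wall-clock dominated by the largest N_s
```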
A central innovation is explicit coarse-to-fine reasoning: for every resolution level $s > 0$, each patch token is conditioned on its unique parent from the preceding coarser level using a learned fusion block. This is implemented as
$$\tilde{x}_{s,i} = g_s\!\big(x_{s,i},\; x_{s-1,\pi(i)}\big),$$
with $\pi(\cdot)$ mapping fine-level patches to their unique parent. The fused tokens $\tilde{x}_{s,i}$ serve as the input for Mamba-based state-space modeling.
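Assuming axis-aligned patch grids and an integer downsampling ratio between adjacent levels (an assumption; the source only states that each fine patch has a unique coarse parent), the parent map $\pi$ reduces to coordinate arithmetic:

```python
def parent_index(i: int, grid_w_fine: int, grid_w_coarse: int, ratio: int) -> int:
    """Map a fine-level patch index (row-major) to its unique coarse parent.

    Assumes each coarse patch covers a ratio x ratio block of fine patches.
    """
    row, col = divmod(i, grid_w_fine)
    return (row // ratio) * grid_w_coarse + (col // ratio)

# Example: a 4x4 fine grid over a 2x2 coarse grid (ratio 2).
# Fine patch 6 sits at (row=1, col=2) -> coarse parent (0, 1) -> index 1.
assert parent_index(6, grid_w_fine=4, grid_w_coarse=2, ratio=2) == 1
```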
2. State-space Modeling with Mamba
Each parallel stream leverages a discrete-time state-space model, instantiated using the Mamba-2 architecture. For each level $s$ and token index $t$:
$$h_{s,t} = A_s h_{s,t-1} + B_s \tilde{x}_{s,t}, \qquad y_{s,t} = C_s h_{s,t},$$
with $h_{s,0} = 0$. The state transition matrix $A_s$ is diagonal or low-rank plus diagonal, while $B_s$ and $C_s$ are chosen to allow rank-one or sparse factorization, guaranteeing that per-token update and memory costs remain constant in sequence length. This establishes $O(N_s)$ processing time per level, in contrast to the $O(N_s^2)$ time of self-attention mechanisms.
Notably, there is no recurrence or additional cross-level modeling beyond the fused input $\tilde{x}_{s,t}$, which greatly simplifies implementation and parallelization.
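The sketch below shows the sequential form of this recurrence with a diagonal state matrix. It is a deliberate simplification: in Mamba-2 the state dynamics are input-dependent and the scan is parallelized, neither of which is modeled here.

```python
import torch

def diagonal_ssm_scan(x, a, B, C):
    """Sequential form of h_t = a * h_{t-1} + B x_t,  y_t = C h_t.

    x: (N, d) token sequence; a: (m,) diagonal of A; B: (m, d); C: (d, m).
    Cost is O(N) in sequence length, vs O(N^2) for self-attention.
    """
    N, _ = x.shape
    h = torch.zeros(a.shape[0])
    ys = []
    for t in range(N):          # exactly one state update per token
        h = a * h + B @ x[t]
        ys.append(C @ h)
    return torch.stack(ys)      # (N, d)

# Toy usage: 8 tokens of dimension 4, state size 16, stable decay a < 1.
y = diagonal_ssm_scan(torch.randn(8, 4), torch.rand(16) * 0.9,
                      torch.randn(16, 4), torch.randn(4, 16))
```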
3. Coarse-to-fine Conditioning and Regularization
The cross-scale fusion module introduces adaptive gating between fine- and coarse-level features through a learned weight $\alpha$. This explicit concatenative conditioning enables direct hierarchical reasoning previously unavailable in single-scale MIL or standard transformer approaches.
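One plausible realization of this gated, concatenative fusion is sketched below; the exact parameterization of the gate $\alpha$ and of the projection is an assumption, not the paper's stated design.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse a fine token with its coarse parent via a learned gate alpha."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, fine: torch.Tensor, parent: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([fine, parent], dim=-1)   # explicit concatenation
        alpha = self.gate(cat)                    # adaptive, per-token gate
        return alpha * fine + (1 - alpha) * self.proj(cat)

# Usage: fuse 10 fine tokens with their gathered coarse parents.
fused = GatedFusion(64)(torch.randn(10, 64), torch.randn(10, 64))
```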
Parameter efficiency is maintained: each additional level introduces only the parameters of its level-specific fusion block $g_s$, with the majority of parameters concentrated in the shared Mamba state-space blocks. Regularization schemes include the following (a minimal sketch follows the list):
- Random coarse-branch drop: during training, a random fraction of level-0 (coarse) tokens is nullified and their fine-level descendants pruned, preventing over-reliance on any single scale and promoting generalization.
- Scan-order neutrality: tokens are permuted prior to encoding, enforcing permutation invariance and reducing spatial positional bias.
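A minimal sketch of both schemes, assuming the parent-index representation from Section 1; drop rates and tensor layouts are illustrative.

```python
import torch

def coarse_branch_drop(coarse_mask, parent_of_fine, drop_frac=0.2):
    """Randomly nullify a fraction of coarse tokens and prune their
    fine-level descendants (rates and layout are assumptions).

    coarse_mask: (N0,) bool keep-mask; parent_of_fine: (N1,) long parent ids.
    Returns the updated (coarse_mask, fine_mask).
    """
    drop = torch.rand(coarse_mask.shape[0]) < drop_frac
    coarse_mask = coarse_mask & ~drop
    fine_mask = coarse_mask[parent_of_fine]  # descendants of dropped parents go too
    return coarse_mask, fine_mask

def shuffle_tokens(x):
    """Scan-order neutrality: random permutation before encoding."""
    return x[torch.randperm(x.shape[0])]

# Usage: 4 coarse tokens, each with 2 fine children.
keep0 = torch.ones(4, dtype=torch.bool)
parents = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
keep0, keep1 = coarse_branch_drop(keep0, parents, drop_frac=0.5)
```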
4. Computational Complexity
MARBLE's linear-time complexity is central to its scalability. For $S$ scales with input sequence lengths $N_0, \dots, N_{S-1}$, the following holds:
| Component | MARBLE Complexity | Transformer Complexity |
|---|---|---|
| Per-level computation | $O(N_s)$ | $O(N_s^2)$ |
| Overall wall-clock (parallel) | $O(\max_s N_s)$ | $O(\max_s N_s^2)$ |
| Total memory (all levels) | $O(\sum_s N_s)$ | $O(\sum_s N_s^2)$ |
MARBLE thus provides true linear-time encoding over both spatial and magnification axes, enabling practical analysis of gigapixel WSIs spanning multiple scales. This represents a substantial computational advantage over standard transformer-based models, whose self-attention matrices scale quadratically with input size (Dwarampudi et al., 2 Feb 2026).
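To make the gap concrete, a back-of-the-envelope comparison for a single level (the patch count is illustrative, not taken from the paper):

```python
# Illustrative cost comparison for one level with N = 40,000 fine patches.
N = 40_000
linear_updates = N            # one state update per token: ~4.0e4
attention_pairs = N * N       # pairwise score matrix: ~1.6e9
print(attention_pairs // linear_updates)  # 40,000x more work for attention
```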
5. Empirical Evaluation and Comparative Performance
MARBLE has been empirically validated on five public WSI datasets: PANDA prostate, TCGA-NSCLC, and three TCGA survival cohorts (KIRP, LUAD, STAD). Relative to established MIL baselines—ABMIL, CLAM, TransMIL, S4-MIL, MambaMIL variants—MARBLE consistently achieves state-of-the-art or superior results:
- PANDA: Accuracy gain of +20.25 percentage points (0.5075 → 0.7100); AUC gain of +6.94 points (0.8184 → 0.8878).
- TCGA-NSCLC: Accuracy +1.16 (0.8850 → 0.8966); AUC +1.04 (0.9626 → 0.9730).
- Survival cohorts (KIRP, LUAD, STAD): Average C-index increase of +2.3 points (e.g., STAD: 0.6428 → 0.6510).
Ablation studies confirm that the two-scale ("coarse + fine") variant outperforms either single-magnification model across multiple metrics, substantiating the efficacy of explicit cross-scale conditioning.
6. Significance and Context within the Field
MARBLE demonstrates that parallel, explicit multi-scale processing with linear-time state-space modeling can overcome both the memory and computational limitations of attention-based frameworks for WSI, without sacrificing representational capacity or generalization. Its architecture provides a reproducible template for scalable analysis of multi-resolution biomedical images and establishes a new performance baseline for MIL-based WSI modeling.
This approach invites potential broadening to other domains where multi-scale data and efficient sequence modeling are essential and suggests a paradigm shift away from reliance on self-attention toward efficient state-space mechanisms for large-scale biomedical analysis (Dwarampudi et al., 2 Feb 2026).