Weak-Mamba-UNet: Weakly Supervised Segmentation

Updated 16 April 2026

The paper introduces a weakly supervised segmentation framework that fuses CNN, ViT, and state-space modules to capture local and global image features.
It leverages a collaborative cross-supervisory loop with partial cross-entropy and pseudo-label Dice loss to refine segmentation from limited scribble annotations.
Experimental results on the ACDC MRI dataset show a mean Dice of 0.9171 and a 95% Hausdorff Distance of 3.96, both better than the compared scribble-supervised baselines.

Weak-Mamba-UNet is a weakly supervised medical image segmentation framework that fuses Convolutional Neural Networks (CNN), Vision Transformers (ViT), and Visual Mamba (VMamba) architectures within a collaborative, cross-supervisory learning loop. It is specifically designed for applications where annotations are sparse or imprecise, such as scribble-based labels, by leveraging the complementary strengths of encoder–decoder networks built from convolution, attention, and state-space modules (Wang et al., 2024).

1. Constituent Architectures

Weak-Mamba-UNet comprises three U-shaped encoder–decoder subnetworks, termed “views,” each adopting a distinct architectural paradigm yet sharing an identical high-level connection structure. The three subnetworks are:

CNN-based UNet: Employs four-level downsampling and upsampling pathways, with double 3×3 convolutions, ReLU activations, and Batch Normalization at each stage. Skip connections concatenate feature maps from encoder to decoder, optimizing for local spatial detail.
Swin Transformer-based SwinUNet: Utilizes patch embedding, shifted-window multi-head self-attention (SWA), and standard MLP blocks in a three-level encoder–decoder structure. Patch merging and skip connections preserve multi-scale context; the SWA layers capture global contextual dependencies efficiently via windowed self-attention.
VMamba-based Mamba-UNet: Implements state-space modeling via Visual Mamba blocks for long-range spatial dependencies. The encoder and decoder both operate in three resolution stages. Each VMamba block is mathematically formulated as a continuous-discrete SSM:

$\frac{dh(t)}{dt} = A h(t) + B u(t),\quad y(t) = C h(t) + D u(t)$

Discretizing with step size $\Delta t=1$ yields:

$h_k = \Phi h_{k-1} + \Gamma x_k,\quad z_k = C h_k + D x_k$

where $x_k$ is the input feature (flattened token) and $z_k$ is the output, the discrete counterparts of $u(t)$ and $y(t)$; $h_k$ is the hidden state, and $(\Phi, \Gamma, C, D)$ are learned. Outputs are post-processed via feed-forward networks and normalization layers.
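
The discrete recurrence above can be traced by hand on a toy sequence. The following is a minimal sketch in plain Python, assuming a scalar token stream, a zero initial state, and hand-picked (not learned) parameters; the function name `ssm_scan` is hypothetical. It does not reproduce the selective, multi-directional scan that an actual VMamba block performs over 2D feature maps.

```python
def ssm_scan(xs, Phi, Gamma, C, D):
    """Apply h_k = Phi h_{k-1} + Gamma x_k and z_k = C h_k + D x_k to a scalar sequence."""
    n = len(Phi)
    h = [0.0] * n  # assumed initial hidden state h_0 = 0
    zs = []
    for x in xs:
        # State update: matrix-vector product plus the input injection.
        h = [sum(Phi[i][j] * h[j] for j in range(n)) + Gamma[i] * x for i in range(n)]
        # Readout: linear projection of the state plus a direct skip term.
        zs.append(sum(C[i] * h[i] for i in range(n)) + D * x)
    return zs


# Toy 2-state system; the values are illustrative only.
Phi = [[0.5, 0.0], [0.0, 0.25]]
Gamma = [1.0, 1.0]
C = [1.0, 1.0]
D = 0.0

zs = ssm_scan([1.0, 0.0, 0.0], Phi, Gamma, C, D)
print(zs)  # [2.0, 0.75, 0.3125]
```

A single impulse at the first token keeps influencing later outputs through the hidden state, which is the mechanism behind the long-range dependency claim. The cost is one state update per token, so the scan is linear in sequence length.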

This tripartite architectural ensemble is essential to achieving complementary modeling of local detail (CNN), global context (ViT), and efficient long-range interactions (VMamba).

2. Collaborative and Cross-Supervisory Training Mechanism

The defining feature of Weak-Mamba-UNet is its collaborative, cross-supervised training protocol for weak supervision. The system is trained with two forms of supervision:

Partial Cross-Entropy Loss (pCE) on scribble-labeled pixels: The loss is applied only to pixels annotated in the scribbled mask set $\Omega_L$ :

$\mathcal{L}_{\mathrm{pce}}^i = -\sum_{p\in\Omega_L}\sum_{k=1}^K y^{(p)}_{\mathrm{s},k}\log(y^{(p)}_{\mathrm{p},k})$

where $y_{\mathrm{s}}$ denotes the sparse annotation, and $y_{\mathrm{p}}$ is the predicted softmax probability for class $k$.

Pseudo-Label Dice Loss across the ensemble: At each iteration, each subnetwork predicts a segmentation mask, and a soft pseudo-label is synthesized by convex combination:

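$y_{\mathrm{pl}} = \alpha\, y_{\mathrm{p}}^{1} + \beta\, y_{\mathrm{p}}^{2} + \gamma\, y_{\mathrm{p}}^{3}$

with $\alpha+\beta+\gamma=1$ and $\alpha,\beta,\gamma \ge 0$ redrawn per batch, where $y_{\mathrm{p}}^{i}$ is the prediction of network $i$. For each network $i \in \{1,2,3\}$, the Dice loss to this pseudo-label is:

$\mathcal{L}_{\mathrm{dice}}^i = 1 - \frac{2\sum_{p} y_{\mathrm{p}}^{i,(p)}\, y_{\mathrm{pl}}^{(p)}}{\sum_{p} y_{\mathrm{p}}^{i,(p)} + \sum_{p} y_{\mathrm{pl}}^{(p)}}$

The final objective summed over all models is:

$\mathcal{L}_{\mathrm{total}} = \sum_{i=1}^{3}\left(\mathcal{L}_{\mathrm{pce}}^i + \mathcal{L}_{\mathrm{dice}}^i\right)$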

This loop enables each subnetwork to iteratively refine itself and its peers via mutual pseudo-label supervision, which is crucial for learning from highly incomplete annotation.
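
The two supervision terms and the mixing step can be sketched on a toy example. This is a minimal illustration in plain Python, not the authors' implementation: the `UNLABELED` marker, the function names, the uniform sampling of the convex weights, and the exact soft Dice form (products summed over pixels and classes) are all assumptions made for the sketch.

```python
import math
import random

UNLABELED = -1  # hypothetical marker for pixels that carry no scribble


def partial_cross_entropy(probs, scribble, eps=1e-8):
    """Sum of -log p[label] over scribble-labeled pixels; unlabeled pixels are skipped."""
    return -sum(math.log(p[label] + eps)
                for p, label in zip(probs, scribble) if label != UNLABELED)


def convex_weights(n, rng):
    """Draw n non-negative weights that sum to 1 (sampling scheme is an assumption)."""
    raw = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def mix_predictions(preds, weights):
    """Convex combination of per-network probability maps, indexed preds[i][pixel][class]."""
    n_pix, n_cls = len(preds[0]), len(preds[0][0])
    return [[sum(w * pred[p][k] for w, pred in zip(weights, preds))
             for k in range(n_cls)] for p in range(n_pix)]


def soft_dice_loss(pred, target, eps=1e-8):
    """1 - 2*sum(pred*target) / (sum(pred) + sum(target)), over all pixels and classes."""
    inter = sum(a * b for pr, tr in zip(pred, target) for a, b in zip(pr, tr))
    denom = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 - 2.0 * inter / (denom + eps)


# Toy batch: 3 networks, 3 pixels, 2 classes; only the first two pixels are scribbled.
preds = [
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],  # stands in for the CNN view
    [[0.8, 0.2], [0.3, 0.7], [0.4, 0.6]],  # stands in for the Swin view
    [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]],  # stands in for the Mamba view
]
scribble = [0, 1, UNLABELED]

rng = random.Random(0)
weights = convex_weights(len(preds), rng)   # redrawn for every batch
pseudo = mix_predictions(preds, weights)    # soft pseudo-label over the whole image
total_loss = sum(partial_cross_entropy(p, scribble) + soft_dice_loss(p, pseudo)
                 for p in preds)
```

The third pixel contributes nothing to the partial cross-entropy term but still enters every Dice term through the pseudo-label, which is how the unlabeled region receives a training signal.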

3. Algorithmic Workflow and Inference

The Weak-Mamba-UNet workflow consists of the following steps:

Forward Pass: Each network $i$ processes the input image $x$ to yield a prediction $y_{\mathrm{p}}^{i}$.
Pseudo-Label Construction: New convex weights are drawn, and the pseudo-label $y_{\mathrm{pl}}$ is computed.
Loss Computation: Each network is penalized with $\mathcal{L}_{\mathrm{pce}}^i$ on labeled pixels and $\mathcal{L}_{\mathrm{dice}}^i$ with respect to $y_{\mathrm{pl}}$ over the entire image.
Parameter Update: Stochastic gradient descent steps update each network’s parameters.
Test-Time Output: At inference, the final prediction is the argmax after averaging the logits from the three subnetworks:

$\hat{y} = \arg\max_{k}\ \frac{1}{3}\sum_{i=1}^{3} o^{i}_{k}$

where $o^{i}_{k}$ is the class-$k$ logit produced by network $i$ at a given pixel.

Optionally, the VMamba-based model alone may be used for efficient deployment.
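
The test-time rule can be illustrated with a small sketch in plain Python. The function name `ensemble_predict` and the toy logits are hypothetical; the input layout `logits_per_net[network][pixel][class]` is an assumption chosen for readability.

```python
def ensemble_predict(logits_per_net):
    """Average logits across networks per pixel, then take the argmax over classes."""
    n_net = len(logits_per_net)
    n_pix = len(logits_per_net[0])
    n_cls = len(logits_per_net[0][0])
    labels = []
    for p in range(n_pix):
        avg = [sum(net[p][k] for net in logits_per_net) / n_net for k in range(n_cls)]
        labels.append(max(range(n_cls), key=avg.__getitem__))
    return labels


# Toy logits for 2 pixels and 3 classes from three networks.
cnn = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
vit = [[0.0, 3.0, 0.0], [0.0, 1.0, 0.5]]
mamba = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]

print(ensemble_predict([cnn, vit, mamba]))  # [0, 2]
```

For the second pixel, two of the three networks individually prefer class 1, yet the averaged logits select class 2 because one network is far more confident. Averaging logits therefore behaves differently from majority voting over hard labels.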

4. Scribble-Based Annotation Paradigm

Scribble annotation provides weak spatial supervision by marking 1–3-pixel-wide lines within each label region. The mask preprocessing strategy follows Valvano et al. (2021), converting dense ground truth into sparse, narrow strokes that annotate only ≈2% of the available pixels. These sparse labels serve as hard anchors for loss calculation, allowing the model to explore plausible segmentations in the vast unlabeled regions by relying on model-driven consistency and pseudo-label refinement.
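
To make the sparsity figure concrete, the annotated fraction of a scribble mask can be computed directly. This is a minimal sketch in plain Python; the `UNLABELED` marker value and the function name are assumptions, and real scribble datasets may encode unlabeled pixels differently.

```python
UNLABELED = -1  # hypothetical marker for pixels without a scribble


def annotated_fraction(scribble_mask, unlabeled=UNLABELED):
    """Fraction of pixels in a 2D mask that carry a scribble label."""
    flat = [v for row in scribble_mask for v in row]
    return sum(v != unlabeled for v in flat) / len(flat)


# Toy 10x10 mask with two labeled pixels: one foreground, one background.
mask = [[UNLABELED] * 10 for _ in range(10)]
mask[2][3] = 1  # a foreground scribble pixel
mask[7][6] = 0  # a background scribble pixel

print(annotated_fraction(mask))  # 0.02
```

Note that background scribbles count as labeled pixels too, since class 0 is a supervised class rather than an absence of annotation.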

5. Experimental Results and Quantitative Assessment

Empirical validation on the ACDC MRI cardiac segmentation dataset (100 patients, 4 anatomical classes, 224×224 resolution) demonstrates the efficacy of the Weak-Mamba-UNet strategy. Metrics are mean Dice, Accuracy, Precision, Sensitivity, and Specificity (higher is better), together with 95% Hausdorff Distance (HD) and Average Surface Distance (ASD) (lower is better). Results under scribble supervision are as follows:

Framework + Network	Dice ↑	Acc ↑	Pre ↑	Sen ↑	Spe ↑	HD ↓	ASD ↓
pCE + UNet	0.7620	0.9807	0.6799	0.9174	0.9823	151.06	54.65
Gated CRF + UNet	0.9046	0.9955	0.8890	0.9304	0.9922	7.43	2.08
Gated CRF + SwinUNet	0.8995	0.9955	0.8920	0.9175	0.9904	6.66	1.62
Weak-Mamba-UNet (ours)	0.9171	0.9963	0.9095	0.9309	0.9920	3.96	0.88

Ablation studies indicate that model heterogeneity (CNN, SwinUNet, VMamba) is essential to reaching peak performance; homogeneous tri-ensembles (such as 3×SwinUNet) underperform considerably (Dice drops to 0.7446).
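
The overlap-based metrics in the table follow from confusion counts on a per-class binary mask. The sketch below, in plain Python, uses the standard definitions; the function name `binary_metrics` is hypothetical, the evaluation code used in the paper is not shown in this article, and the distance-based metrics (HD, ASD) are omitted because they require surface extraction.

```python
def binary_metrics(pred, truth):
    """Dice, accuracy, precision, sensitivity, specificity for flat 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)

    def ratio(num, den):
        # Return 0.0 when a metric is undefined (empty denominator).
        return num / den if den else 0.0

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "acc": ratio(tp + tn, tp + fp + fn + tn),
        "pre": ratio(tp, tp + fp),
        "sen": ratio(tp, tp + fn),
        "spe": ratio(tn, tn + fp),
    }


# Toy masks for one class: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 0, 0, 0, 1, 1, 0, 0]
metrics = binary_metrics(pred, truth)
```

Because background pixels dominate cardiac MRI slices, true negatives inflate accuracy and specificity. This is consistent with the table, where those two columns stay above 0.98 for every method while Dice varies widely.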

6. Significance, Insights, and Variations

The integration of state-space modeling via VMamba blocks provides efficient and expressive long-range context propagation at linear computational cost, a critical factor for scalable weak supervision. The cross-supervisory pseudo-label loop mitigates overfitting to sparse supervisory signals and enables robust mask refinement. This collaborative training schema enables the ensemble to outpace single-model weakly supervised approaches and homogeneous ensembles.

A plausible implication is that introducing additional architectural diversity (e.g., fusing other SSMs or explicit uncertainty modeling) could further improve label efficiency in low-annotation regimes.

7. Reproducibility and Implementation

Weak-Mamba-UNet is publicly available (https://github.com/ziyangwang007/Mamba-UNet), is implemented in PyTorch ≥1.8, and requires CUDA 11 support. The repository provides scripts for both training (with configurable backbones, annotation paths, and hyperparameters) and inference (ensemble or single-subnetwork) (Wang et al., 2024).

References

"Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation" (Wang et al., 2024)
