Weak-Mamba-UNet: Weakly Supervised Segmentation

Updated 16 April 2026
  • The paper introduces a weakly supervised segmentation framework that fuses CNN, ViT, and state-space modules to capture local and global image features.
  • It leverages a collaborative cross-supervisory loop with partial cross-entropy and pseudo-label Dice loss to refine segmentation from limited scribble annotations.
  • Experimental results on the ACDC MRI dataset show a mean Dice of 0.9171 and a 95% Hausdorff Distance of 3.96, the best among the compared scribble-supervised baselines.

Weak-Mamba-UNet is a weakly supervised medical image segmentation framework that fuses Convolutional Neural Networks (CNN), Vision Transformers (ViT), and Visual Mamba (VMamba) architectures within a collaborative, cross-supervisory learning loop. It is specifically designed for applications where annotations are sparse or imprecise, such as scribble-based labels, by leveraging the complementary strengths of encoder–decoder networks built from convolution, attention, and state-space modules (Wang et al., 2024).

1. Constituent Architectures

Weak-Mamba-UNet comprises three U-shaped encoder–decoder subnetworks, termed “views,” each adopting a distinct architectural paradigm yet sharing an identical high-level connection structure. The three subnetworks are:

  1. CNN-based UNet: Employs four-level downsampling and upsampling pathways, with double 3×3 convolutions, ReLU activations, and Batch Normalization at each stage. Skip connections concatenate feature maps from encoder to decoder, optimizing for local spatial detail.
  2. Swin Transformer-based SwinUNet: Utilizes patch embedding, shifted-window multi-head self-attention (SWA), and standard MLP blocks in a three-level encoder–decoder structure. Patch merging and skip connections preserve multi-scale context; the SWA layers capture global contextual dependencies efficiently via windowed self-attention.
  3. VMamba-based Mamba-UNet: Implements state-space modeling via Visual Mamba blocks for long-range spatial dependencies. The encoder and decoder both operate in three resolution stages. Each VMamba block is mathematically formulated as a continuous-discrete SSM:

$$\frac{dh(t)}{dt} = A h(t) + B u(t),\quad y(t) = C h(t) + D u(t)$$

Discretized (with $\Delta t = 1$), this yields:

$$h_k = \Phi h_{k-1} + \Gamma x_k,\quad z_k = C h_k + D x_k$$

where $x_k$ is the input feature (flattened token), $h_k$ is the hidden state, and $(\Phi, \Gamma, C, D)$ are learned. Outputs are post-processed via feed-forward networks and normalization layers.
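The recurrence above can be illustrated with a small NumPy sketch. It is not the VMamba implementation: the function names are hypothetical, the state matrix is assumed diagonal, and zero-order hold is assumed as the discretization rule (the text only states that $\Delta t = 1$). Real VMamba blocks also make the parameters input-dependent and scan 2D feature maps along several directions, which this sketch does not model.

```python
import numpy as np

def zoh_discretize(a_diag, B, dt=1.0):
    """Zero-order-hold discretization for a diagonal A = diag(a_diag).

    Assumed rule (not stated in the source):
    Phi = exp(A*dt), Gamma = A^{-1} (Phi - I) B. Entries of a_diag must be nonzero.
    """
    phi_diag = np.exp(a_diag * dt)
    gamma = ((phi_diag - 1.0) / a_diag)[:, None] * B
    return np.diag(phi_diag), gamma

def ssm_scan(x, Phi, Gamma, C, D):
    """Run h_k = Phi h_{k-1} + Gamma x_k, z_k = C h_k + D x_k over a token sequence.

    x: (L, d_in) flattened tokens; returns z: (L, d_out).
    """
    h = np.zeros(Phi.shape[0])          # hidden state starts at zero
    outputs = []
    for x_k in x:                       # one pass over the sequence: cost linear in L
        h = Phi @ h + Gamma @ x_k
        outputs.append(C @ h + D @ x_k)
    return np.stack(outputs)

# Toy usage: 2 state dimensions, scalar input and output.
a = np.array([-1.0, -2.0])
Phi, Gamma = zoh_discretize(a, np.ones((2, 1)))
z = ssm_scan(np.ones((200, 1)), Phi, Gamma, np.ones((1, 2)), np.zeros((1, 1)))
```

With a constant unit input, the hidden state settles at $-B/a$ per state dimension, so the last output in this toy run approaches 1.5. The single loop over tokens is what gives the linear cost in sequence length.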

This tripartite architectural ensemble is essential to achieving complementary modeling of local detail (CNN), global context (ViT), and efficient long-range interactions (VMamba).

2. Collaborative and Cross-Supervisory Training Mechanism

The distinguishing feature of Weak-Mamba-UNet is its collaborative, cross-supervised training protocol for weak supervision. The system is trained with two forms of supervision:

  • Partial Cross-Entropy Loss (pCE) on scribble-labeled pixels: The loss is applied only to pixels in the scribble-annotated set $\Omega_L$:

$$\mathcal{L}_{\mathrm{pce}}^i = -\sum_{p\in\Omega_L}\sum_{k=1}^K y^{(p)}_{\mathrm{s},k}\log(y^{(p)}_{\mathrm{p},k})$$

where $y_{\mathrm{s}}$ denotes the sparse annotation, and $y_{\mathrm{p}}$ is the predicted softmax probability for class $k$.

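  • Pseudo-Label Dice Loss across the ensemble: At each iteration, each subnetwork predicts a segmentation mask, and a soft pseudo-label is synthesized by convex combination:

$$y_{\mathrm{pl}} = \alpha\, y_{\mathrm{p}}^{1} + \beta\, y_{\mathrm{p}}^{2} + \gamma\, y_{\mathrm{p}}^{3}$$

with $\alpha + \beta + \gamma = 1$ and $\alpha, \beta, \gamma \ge 0$ redrawn per batch. For each network $i$, the Dice loss to this pseudo-label is:

$$\mathcal{L}_{\mathrm{dice}}^i = \mathrm{Dice}\big(y_{\mathrm{p}}^{i},\, y_{\mathrm{pl}}\big)$$

The final objective summed over all models is:

$$\mathcal{L}_{\mathrm{total}} = \sum_{i=1}^{3}\left(\mathcal{L}_{\mathrm{pce}}^i + \mathcal{L}_{\mathrm{dice}}^i\right)$$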

This loop enables each subnetwork to iteratively refine itself and its peers via mutual pseudo-label supervision, which is crucial for learning from highly incomplete annotation.
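The two supervision signals described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code. Several details are assumptions: the `IGNORE` marker for unlabeled pixels, drawing the convex weights from a Dirichlet distribution, the particular soft-Dice form, and the unweighted sum of the two losses. All function names are hypothetical.

```python
import numpy as np

IGNORE = -1  # hypothetical marker for pixels without a scribble label

def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def partial_cross_entropy(probs, scribble, eps=1e-8):
    """Cross-entropy summed over scribble-labeled pixels only.

    probs: (K, H, W) softmax output; scribble: (H, W) ints, IGNORE where unlabeled.
    """
    rows, cols = np.nonzero(scribble != IGNORE)
    p_true = probs[scribble[rows, cols], rows, cols]   # probability of the annotated class
    return float(-np.log(p_true + eps).sum())

def mix_pseudo_label(prob_list, rng):
    """Soft pseudo-label as a random convex combination of the predictions."""
    w = rng.dirichlet(np.ones(len(prob_list)))         # assumption: nonnegative, sums to 1
    return sum(wi * p for wi, p in zip(w, prob_list)), w

def soft_dice_loss(probs, target, eps=1e-8):
    """1 minus the class-averaged soft Dice overlap (assumed variant)."""
    inter = (probs * target).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + target.sum(axis=(1, 2))
    return float(1.0 - np.mean((2.0 * inter + eps) / (denom + eps)))

def total_loss(prob_list, scribble, rng):
    """Sum over networks of pCE on scribbles plus Dice to the shared pseudo-label."""
    pseudo, _ = mix_pseudo_label(prob_list, rng)
    return sum(partial_cross_entropy(p, scribble) + soft_dice_loss(p, pseudo)
               for p in prob_list)

# Toy usage: three "networks", 4 classes, an 8x8 image with five scribbled pixels.
rng = np.random.default_rng(0)
probs = [softmax(rng.normal(size=(4, 8, 8))) for _ in range(3)]
scribble = np.full((8, 8), IGNORE)
scribble[2, 2:6] = 1
scribble[6, 1] = 0
loss = total_loss(probs, scribble, rng)
```

In an autograd framework the pseudo-label would typically be treated as a constant target, so that gradients flow only through each network's own prediction. That detail is also an assumption here.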

3. Algorithmic Workflow and Inference

The Weak-Mamba-UNet workflow consists of the following steps:

  1. Forward Pass: Each network $f_i$ processes input $X$ to yield a prediction $y_{\mathrm{p}}^{i}$.
  2. Pseudo-Label Construction: New convex weights are drawn, and the pseudo-label $y_{\mathrm{pl}}$ is computed.
  3. Loss Computation: Each network is penalized with $\mathcal{L}_{\mathrm{pce}}^i$ on labeled pixels and $\mathcal{L}_{\mathrm{dice}}^i$ with respect to $y_{\mathrm{pl}}$ over the entire image.
  4. Parameter Update: Stochastic gradient descent steps update each network’s parameters.
  5. Test-Time Output: At inference, the final prediction is the argmax after averaging the logits from the three subnetworks:

$$\hat{y}^{(p)} = \arg\max_{k}\, \frac{1}{3}\sum_{i=1}^{3} o^{i,(p)}_{k}$$

where $o^{i,(p)}_{k}$ denotes the class-$k$ logit of network $i$ at pixel $p$.

Optionally, the VMamba-based model alone may be used for efficient deployment.
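The test-time rule (average the logits, then take the argmax) can be sketched as below. The function name and the toy values are hypothetical. Passing a single-element list corresponds to deploying one subnetwork alone.

```python
import numpy as np

def ensemble_predict(logits_list):
    """Average per-network logits, then take the per-pixel argmax over classes.

    logits_list: list of (K, H, W) arrays, one per subnetwork.
    Returns an (H, W) integer label map.
    """
    mean_logits = np.mean(np.stack(logits_list, axis=0), axis=0)
    return mean_logits.argmax(axis=0)

# Toy 1x1 image with two classes. Two networks mildly prefer class 1,
# one network strongly prefers class 0.
a = np.array([[[2.0]], [[0.0]]])
b = np.array([[[0.0]], [[1.0]]])
c = np.array([[[0.0]], [[0.5]]])
label = ensemble_predict([a, b, c])[0, 0]   # averaged logits: 2/3 vs 0.5
```

In this toy case the averaged logits favor class 0 even though two of the three networks individually pick class 1, which shows how logit averaging differs from a majority vote over hard labels.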

4. Scribble-Based Annotation Paradigm

Scribble annotation provides weak spatial supervision by marking 1–3-pixel-wide lines within each label region. The mask preprocessing strategy follows Valvano et al. (2021), converting dense ground truth into sparse, narrow strokes that annotate only ≈2% of the available pixels. These sparse labels serve as hard anchors for loss calculation, allowing the model to explore plausible segmentations in the vast unlabeled regions by relying on model-driven consistency and pseudo-label refinement.

5. Experimental Results and Quantitative Assessment

Empirical validation on the ACDC MRI cardiac segmentation dataset (100 patients, 4 anatomical classes, 224×224 resolution) demonstrates the efficacy of the Weak-Mamba-UNet strategy. Metrics include mean Dice, Accuracy, Precision, Sensitivity, and Specificity (higher is better), together with the 95% Hausdorff Distance (HD) and Average Surface Distance (ASD) (lower is better). Performance under weak (scribble) supervision is as follows:

| Framework + Network | Dice ↑ | Acc ↑ | Pre ↑ | Sen ↑ | Spe ↑ | HD ↓ | ASD ↓ |
|---|---|---|---|---|---|---|---|
| pCE + UNet | 0.7620 | 0.9807 | 0.6799 | 0.9174 | 0.9823 | 151.06 | 54.65 |
| Gated CRF + UNet | 0.9046 | 0.9955 | 0.8890 | 0.9304 | 0.9922 | 7.43 | 2.08 |
| Gated CRF + SwinUNet | 0.8995 | 0.9955 | 0.8920 | 0.9175 | 0.9904 | 6.66 | 1.62 |
| Weak-Mamba-UNet (ours) | 0.9171 | 0.9963 | 0.9095 | 0.9309 | 0.9920 | 3.96 | 0.88 |
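For reference, the overlap-based columns in the table follow from pixel-wise confusion counts. The sketch below computes them for a single binary mask. The function name is hypothetical, and how the reported figures are averaged across classes and cases is not specified here, so that step is left out. HD and ASD need surface-distance computations and are not covered.

```python
import numpy as np

def binary_metrics(pred, truth):
    """Dice, accuracy, precision, sensitivity, specificity for one binary mask pair."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = int(np.sum(pred & truth))      # foreground predicted and present
    fp = int(np.sum(pred & ~truth))     # foreground predicted but absent
    fn = int(np.sum(~pred & truth))     # foreground missed
    tn = int(np.sum(~pred & ~truth))    # background correctly rejected

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "acc": ratio(tp + tn, tp + tn + fp + fn),
        "pre": ratio(tp, tp + fp),
        "sen": ratio(tp, tp + fn),
        "spe": ratio(tn, tn + fp),
    }

# Toy 2x2 example: one true positive, one false positive, two true negatives.
m = binary_metrics([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

For the toy masks this gives Dice 2/3, accuracy 0.75, precision 0.5, sensitivity 1.0, and specificity 2/3. Accuracy and specificity are dominated by background pixels, which is consistent with those columns sitting near 0.99 for every method in the table while Dice and HD separate the methods more clearly.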

Ablation studies indicate that model heterogeneity (CNN, SwinUNet, VMamba) is essential to reaching peak performance; homogeneous tri-ensembles (such as 3×SwinUNet) underperform considerably (Dice drops to 0.7446).

6. Significance, Insights, and Variations

The integration of state-space modeling via VMamba blocks provides efficient and expressive long-range context propagation at linear computational cost, a critical factor for scalable weak supervision. The cross-supervisory pseudo-label loop mitigates overfitting to sparse supervisory signals and enables robust mask refinement. This collaborative training schema enables the ensemble to outpace single-model weakly supervised approaches and homogeneous ensembles.

A plausible implication is that introducing additional architectural diversity (e.g., fusing other SSMs or explicit uncertainty modeling) could further improve label efficiency in low-annotation regimes.

7. Reproducibility and Implementation

Weak-Mamba-UNet is publicly available (https://github.com/ziyangwang007/Mamba-UNet), is implemented in PyTorch ≥1.8, and requires CUDA 11 support. The repository provides scripts for training (with configurable backbones, annotation paths, and hyperparameters) and for inference (ensemble or single-subnetwork) (Wang et al., 2024).

References

  • "Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation" (Wang et al., 2024)