Weak-Mamba-UNet: Weakly Supervised Segmentation
- The paper introduces a weakly supervised segmentation framework that fuses CNN, ViT, and state-space modules to capture local and global image features.
- It leverages a collaborative cross-supervisory loop with partial cross-entropy and pseudo-label Dice loss to refine segmentation from limited scribble annotations.
- Experimental results on the ACDC MRI dataset show a mean Dice of 0.9171 and a 95% Hausdorff Distance of 3.96, compared with 6.66 for the best listed baseline.
Weak-Mamba-UNet is a weakly supervised medical image segmentation framework that fuses Convolutional Neural Networks (CNN), Vision Transformers (ViT), and Visual Mamba (VMamba) architectures within a collaborative, cross-supervisory learning loop. It is specifically designed for applications where annotations are sparse or imprecise, such as scribble-based labels, by leveraging the complementary strengths of encoder–decoder networks built from convolution, attention, and state-space modules (Wang et al., 2024).
1. Constituent Architectures
Weak-Mamba-UNet comprises three U-shaped encoder–decoder subnetworks, termed “views,” each adopting a distinct architectural paradigm yet sharing an identical high-level connection structure. The three subnetworks are:
- CNN-based UNet: Employs four-level downsampling and upsampling pathways, with double 3×3 convolutions, ReLU activations, and Batch Normalization at each stage. Skip connections concatenate feature maps from encoder to decoder, optimizing for local spatial detail.
- Swin Transformer-based SwinUNet: Utilizes patch embedding, shifted-window multi-head self-attention (SWA), and standard MLP blocks in a three-level encoder–decoder structure. Patch merging and skip connections preserve multi-scale context; the SWA layers capture global contextual dependencies efficiently via windowed self-attention.
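- VMamba-based Mamba-UNet: Implements state-space modeling via Visual Mamba blocks for long-range spatial dependencies. The encoder and decoder both operate in three resolution stages. Each VMamba block is mathematically formulated as a continuous-discrete SSM:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

Discretized (with step size $\Delta$), yielding:

$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k$$

where $x_k$ is the input feature (flattened token), $h_k$ is the hidden state, $\bar{A}$ and $\bar{B}$ are the discretized counterparts of $A$ and $B$, and $A$, $B$, $C$ are learned. Outputs are post-processed via feed-forward networks and normalization layers.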
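The discretized state-space recurrence described for the VMamba view can be illustrated with a minimal NumPy sketch. It assumes a diagonal state matrix and zero-order-hold discretization, which is the usual choice in the Mamba family but is an assumption here rather than something stated in this article. Actual VMamba blocks also use input-dependent parameters and a multi-directional 2D scan, both omitted. The names `discretize_zoh`, `ssm_scan`, and `delta` are illustrative.

```python
import numpy as np

def discretize_zoh(a_diag, b, delta):
    """Zero-order-hold discretization of a diagonal continuous-time SSM (assumed scheme)."""
    a_bar = np.exp(delta * a_diag)          # exp(delta * A), elementwise because A is diagonal
    b_bar = (a_bar - 1.0) / a_diag * b      # (delta A)^-1 (exp(delta A) - I) delta B
    return a_bar, b_bar

def ssm_scan(x, a_diag, b, c, delta):
    """Apply h_k = a_bar * h_{k-1} + b_bar * x_k and y_k = c . h_k along a token sequence."""
    a_bar, b_bar = discretize_zoh(a_diag, b, delta)
    h = np.zeros_like(a_diag)
    y = np.empty(len(x))
    for k, x_k in enumerate(x):
        h = a_bar * h + b_bar * x_k         # hidden-state update
        y[k] = c @ h                        # linear readout
    return y

rng = np.random.default_rng(0)
a_diag = -np.abs(rng.normal(size=4)) - 0.1  # negative entries keep the recurrence stable
b = rng.normal(size=4)
c = rng.normal(size=4)
tokens = rng.normal(size=16)                # one flattened row of scalar features
out = ssm_scan(tokens, a_diag, b, c, delta=0.1)
print(out.shape)
```

The loop costs time linear in the number of tokens, which is the property that makes this family attractive for long flattened image sequences.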
This tripartite architectural ensemble is essential to achieving complementary modeling of local detail (CNN), global context (ViT), and efficient long-range interactions (VMamba).
2. Collaborative and Cross-Supervisory Training Mechanism
The distinguishing feature of Weak-Mamba-UNet is its collaborative, cross-supervised training protocol for weak annotations. The system is trained with two forms of supervision:
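- Partial Cross-Entropy Loss ($\mathcal{L}_{\mathrm{pCE}}$) on scribble-labeled pixels: The loss is applied only to pixels annotated in the scribbled mask set $\Omega_L$:

$$\mathcal{L}_{\mathrm{pCE}}(p, y) = -\sum_{i \in \Omega_L} \sum_{c} y_{i,c} \log p_{i,c}$$

where $y_{i,c}$ denotes the sparse annotation at pixel $i$, and $p_{i,c}$ is the predicted softmax probability for class $c$.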
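- Pseudo-Label Dice Loss across the ensemble: At each iteration, each subnetwork $m \in \{\mathrm{cnn}, \mathrm{vit}, \mathrm{mamba}\}$ predicts a segmentation mask $p_m$, and a soft pseudo-label is synthesized by convex combination:

$$\hat{y} = \alpha\, p_{\mathrm{cnn}} + \beta\, p_{\mathrm{vit}} + \gamma\, p_{\mathrm{mamba}}$$

with $\alpha, \beta, \gamma \ge 0$ and $\alpha + \beta + \gamma = 1$, redrawn per batch. For each network $m$, the Dice loss to this pseudo-label is:

$$\mathcal{L}_{\mathrm{Dice}}(p_m, \hat{y}) = 1 - \frac{2 \sum_i p_{m,i}\, \hat{y}_i}{\sum_i p_{m,i} + \sum_i \hat{y}_i}$$

The final objective summed over all models is:

$$\mathcal{L}_{\mathrm{total}} = \sum_{m} \left[ \mathcal{L}_{\mathrm{pCE}}(p_m, y) + \mathcal{L}_{\mathrm{Dice}}(p_m, \hat{y}) \right]$$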
This loop enables each subnetwork to iteratively refine itself and its peers via mutual pseudo-label supervision, which is crucial for learning from highly incomplete annotation.
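The two loss terms and their sum over the three networks can be sketched in NumPy as follows. This is an illustrative reading of the description above, not the released PyTorch code. Several choices are assumptions: unlabeled pixels are marked with a hypothetical `ignore_index` of 255, the partial cross-entropy is averaged rather than summed over labeled pixels, the convex weights are drawn from a uniform Dirichlet distribution, and the two terms are added without a weighting factor.

```python
import numpy as np

def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def partial_cross_entropy(probs, scribble, ignore_index):
    """Mean cross-entropy over scribble-labeled pixels only.
    probs: (C, H, W) softmax output; scribble: (H, W) integer labels."""
    labeled = scribble != ignore_index
    if not labeled.any():
        return 0.0
    rows, cols = np.nonzero(labeled)
    p_true = probs[scribble[rows, cols], rows, cols]    # probability of the annotated class
    return float(-np.log(p_true + 1e-8).mean())

def soft_dice_loss(probs, target, eps=1e-6):
    """1 - Dice between two (C, H, W) soft maps, averaged over classes."""
    inter = (probs * target).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + target.sum(axis=(1, 2))
    return float(1.0 - ((2.0 * inter + eps) / (denom + eps)).mean())

def mix_pseudo_label(prob_list, rng):
    """Convex combination of probability maps with freshly drawn weights."""
    w = rng.dirichlet(np.ones(len(prob_list)))          # nonnegative and sums to 1 (assumed sampler)
    return sum(w_i * p for w_i, p in zip(w, prob_list))

def total_loss(logits_list, scribble, ignore_index, rng):
    probs = [softmax(z) for z in logits_list]
    pseudo = mix_pseudo_label(probs, rng)               # shared target for every network
    return sum(
        partial_cross_entropy(p, scribble, ignore_index) + soft_dice_loss(p, pseudo)
        for p in probs
    )

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
logits = [rng.normal(size=(C, H, W)) for _ in range(3)]  # stand-ins for the three networks
scribble = np.full((H, W), 255)
scribble[2, 1:6] = 1                                     # a short stroke for class 1
scribble[6, 2:5] = 0                                     # a short background stroke
loss = total_loss(logits, scribble, ignore_index=255, rng=rng)
print(round(loss, 4))
```

In an autograd framework the pseudo-label would typically be detached so that it acts as a fixed target; whether the original implementation does this is not stated here.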
3. Algorithmic Workflow and Inference
The Weak-Mamba-UNet workflow consists of the following steps:
- Forward Pass: Each network $f_m$ processes input $x$ to yield a prediction $p_m = f_m(x)$.
- Pseudo-Label Construction: New convex weights are drawn, and the pseudo-label $\hat{y}$ is computed.
- Loss Computation: Each network is penalized with $\mathcal{L}_{\mathrm{pCE}}$ on labeled pixels and $\mathcal{L}_{\mathrm{Dice}}$ with respect to $\hat{y}$ over the entire image.
- Parameter Update: Stochastic gradient descent steps update each network’s parameters.
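- Test-Time Output: At inference, the final prediction is the argmax after averaging the logits from the three subnetworks:

$$\hat{y}_{\mathrm{test}} = \arg\max_{c}\; \frac{1}{3} \sum_{m \in \{\mathrm{cnn},\, \mathrm{vit},\, \mathrm{mamba}\}} z_{m,c}(x)$$

where $z_{m,c}(x)$ is the class-$c$ logit produced by subnetwork $m$.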
Optionally, the VMamba-based model alone may be used for efficient deployment.
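The test-time rule, averaging logits and then taking the per-pixel argmax, is short enough to sketch directly. The function name `ensemble_predict` and the random arrays standing in for network outputs are illustrative only.

```python
import numpy as np

def ensemble_predict(logits_list):
    """Average per-network logits of shape (C, H, W) and take the per-pixel argmax."""
    mean_logits = np.mean(np.stack(logits_list, axis=0), axis=0)
    return mean_logits.argmax(axis=0)

rng = np.random.default_rng(0)
# Random stand-ins for the outputs of the three subnetworks on one 8x8 image with 4 classes.
logits_cnn, logits_vit, logits_mamba = (rng.normal(size=(4, 8, 8)) for _ in range(3))

mask = ensemble_predict([logits_cnn, logits_vit, logits_mamba])
mamba_only = ensemble_predict([logits_mamba])   # single-network deployment
print(mask.shape, mask.dtype)
```

Passing a one-element list reduces the function to a plain argmax, which corresponds to deploying the VMamba-based model alone.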
4. Scribble-Based Annotation Paradigm
Scribble annotation provides weak spatial supervision by marking 1–3-pixel-wide lines within each label region. The mask preprocessing strategy follows Valvano et al. (2021), converting dense ground truth into sparse, narrow strokes that annotate only ≈2% of the available pixels. These sparse labels serve as hard anchors for loss calculation, allowing the model to explore plausible segmentations in the vast unlabeled regions by relying on model-driven consistency and pseudo-label refinement.
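To make the sparsity concrete, the following sketch measures how much of an image a scribble mask actually labels. The mask is hand-made for illustration and does not reproduce the Valvano et al. preprocessing. The value 255 used for unlabeled pixels is a hypothetical convention, and `scribble_coverage` is an illustrative helper.

```python
import numpy as np

IGNORE = 255  # hypothetical marker for unlabeled pixels

def scribble_coverage(scribble, ignore_index=IGNORE):
    """Return the fraction of labeled pixels and the pixel count per annotated class."""
    labeled = scribble != ignore_index
    classes, counts = np.unique(scribble[labeled], return_counts=True)
    return float(labeled.mean()), dict(zip(classes.tolist(), counts.tolist()))

scribble = np.full((32, 32), IGNORE, dtype=np.int64)
scribble[4, 2:12] = 0       # 10-pixel background stroke
scribble[16, 10:18] = 1     # 8-pixel stroke for class 1
scribble[20:23, 25] = 2     # 3-pixel vertical stroke for class 2

fraction, per_class = scribble_coverage(scribble)
print(f"{fraction:.3%}", per_class)   # 21 of 1024 pixels are labeled, roughly 2%
```

Only the labeled pixels contribute to the partial cross-entropy term; every other pixel is supervised solely through the pseudo-label.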
5. Experimental Results and Quantitative Assessment
Empirical validation on the ACDC MRI cardiac segmentation dataset (100 patients, 4 anatomical classes, 224×224 resolution) demonstrates the efficacy of the Weak-Mamba-UNet strategy. Metrics include mean Dice, Accuracy, Precision, Sensitivity, and Specificity (higher is better), together with 95% Hausdorff Distance (HD) and Average Surface Distance (ASD) (lower is better). Results under scribble supervision are as follows:
| Framework + Network | Dice ↑ | Acc ↑ | Pre ↑ | Sen ↑ | Spe ↑ | HD ↓ | ASD ↓ |
|---|---|---|---|---|---|---|---|
| pCE + UNet | 0.7620 | 0.9807 | 0.6799 | 0.9174 | 0.9823 | 151.06 | 54.65 |
| Gated CRF + UNet | 0.9046 | 0.9955 | 0.8890 | 0.9304 | 0.9922 | 7.43 | 2.08 |
| Gated CRF + SwinUNet | 0.8995 | 0.9955 | 0.8920 | 0.9175 | 0.9904 | 6.66 | 1.62 |
| Weak-Mamba-UNet (ours) | 0.9171 | 0.9963 | 0.9095 | 0.9309 | 0.9920 | 3.96 | 0.88 |
Ablation studies indicate that model heterogeneity (CNN, SwinUNet, VMamba) is essential to reaching peak performance; homogeneous tri-ensembles (such as 3×SwinUNet) underperform considerably (Dice drops to 0.7446).
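For readers who want to relate the overlap columns of the table to pixel counts, here is a minimal sketch of how Dice, Accuracy, Precision, Sensitivity, and Specificity are conventionally computed for one binary class mask. How the paper aggregates these over classes and patients is not specified in this article, and the surface-distance metrics (HD, ASD) are not covered. The function name is illustrative.

```python
import numpy as np

def binary_overlap_metrics(pred, gt):
    """Dice, accuracy, precision, sensitivity, and specificity for two boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = int(np.sum(pred & gt))      # true positives
    fp = int(np.sum(pred & ~gt))     # false positives
    fn = int(np.sum(~pred & gt))     # false negatives
    tn = int(np.sum(~pred & ~gt))    # true negatives

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }

gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True                  # 16 foreground pixels
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 2:6] = True                # same square shifted down by one row

metrics = binary_overlap_metrics(pred, gt)
print(metrics)
```

In this toy case 12 of the 16 foreground pixels overlap, so Dice, precision, and sensitivity all equal 0.75, while accuracy and specificity stay high because most pixels are background. The same effect explains why the Acc and Spe columns in the table vary far less than Dice.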
6. Significance, Insights, and Variations
The integration of state-space modeling via VMamba blocks provides efficient and expressive long-range context propagation at linear computational cost, a critical factor for scalable weak supervision. The cross-supervisory pseudo-label loop mitigates overfitting to sparse supervisory signals and enables robust mask refinement. This collaborative training schema enables the ensemble to outpace single-model weakly supervised approaches and homogeneous ensembles.
A plausible implication is that introducing additional architectural diversity (e.g., fusing other SSMs or explicit uncertainty modeling) could further improve label efficiency in low-annotation regimes.
7. Reproducibility and Implementation
Weak-Mamba-UNet is publicly available (https://github.com/ziyangwang007/Mamba-UNet), is implemented in PyTorch ≥1.8, and requires CUDA 11 support. The repository provides scripts for both training (with configurable backbones, annotation paths, and hyperparameters) and inference (ensemble or single-subnetwork) (Wang et al., 2024).
References
- "Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation" (Wang et al., 2024)