Semi-Mamba-UNet: Hybrid SSM Networks

Updated 5 March 2026

Semi-Mamba-UNet is a family of U-shaped encoder-decoder architectures that integrate state-space Mamba (VSS) blocks with the classic UNet framework for efficient segmentation and super-resolution.
It leverages linear-time global context modeling through VSS modules to capture long-range pixel dependencies, enabling robust performance in scenarios with scarce supervision.
Variants ranging from fully VSS-based to hybrid designs demonstrate superior accuracy and efficiency in medical imaging and remote sensing applications, validated by competitive Dice and mIoU scores.

Semi-Mamba-UNet is a family of U-shaped encoder–decoder neural architectures that integrate state-space model (SSM) blocks derived from the Mamba model into the UNet framework for segmentation and super-resolution applications, particularly excelling when labeled data is limited or pixel-level reconstruction of long-range context is essential. This architectural theme appears across several problem domains—including 2D/3D medical image segmentation, high-resolution remote sensing, and medical image super-resolution—encompassing both pure and hybrid designs with semi-supervised learning paradigms (Liu et al., 2024, Ma et al., 2024, Wang et al., 2024, Zhu et al., 2024, Lumetti et al., 2024, Ji et al., 2024, Huang et al., 30 May 2025). The recurrent innovation in these models is sparsely or partially replacing standard convolutional or transformer blocks with Mamba-based Visual State Space (VSS) modules, exploiting their linear-time global context modeling and memory footprint advantages to achieve state-of-the-art performance, especially in regimes of scarce supervision.

1. Fundamental Concepts and Design Rationale

Semi-Mamba-UNet architectures arise from the intersection of state-space modeling (as synthesized in the Mamba architecture), the U-Net encoder–decoder backbone, and recent advances in semi-supervised and self-supervised learning. The core motivation is to overcome fundamental deficits of both CNNs (limited local receptive fields) and ViTs (quadratic attention complexity) in modeling medical or high-resolution visual data, where both local detail and global context must be recovered at high accuracy and efficiency (Liu et al., 2024, Ma et al., 2024).

Mamba’s contribution centers on its SSM-based block (“Visual Mamba” or VSS), which processes a token sequence with a linear recurrence ($h_{t+1} = A\,h_t + B\,x_t$, $y_t = C\,h_t + D\,x_t$) whose parameters can be made input-dependent, and which is adapted to 2D or 3D data via selective scan or multidirectional flattening (Lumetti et al., 2024, Wang et al., 2024). This enables efficient, global context propagation across arbitrarily long pixel or voxel sequences.

2. Architectures: Encoder–Decoder Hybridization

The Semi-Mamba-UNet umbrella encompasses a wide spectrum, from fully Mamba-based to hybrid designs fusing CNN and SSM representations.

Fully VSS Mamba-UNet: Both encoder and decoder stages are constructed from stacks of VSS blocks. Patch merging/expansion down- and up-sample features, while skip connections are fused via additional VSS units to propagate multi-scale context (Wang et al., 2024).
Hybrid Semi-Mamba-UNet: Only specific stages (e.g., mid-to-deep encoder blocks or decoder blocks) utilize VSS blocks; the others remain convolutional. Adaptation/fusion is performed via 1×1 convolutions or linear projections, as in ACM-UNet (Huang et al., 30 May 2025).
Parallel Models for Cross-Supervision: Distinct UNet branches—one Mamba-based (VSS), the other CNN—are trained jointly and coupled only through loss-level regularization, enabling robust cross-pseudolabeling and feature-level consistency (Ma et al., 2024).

Distinct architectural instantiations include:

3D medical segmentation: A unidirectional or bidirectional Mamba layer is inserted before each encoder downsampling step, preserving the classic 3D UNet structure while promoting sequence-based global context (Lumetti et al., 2024).
Remote sensing: Only the decoder path employs VSS blocks for efficiency; a lightweight CNN or vision transformer encoder extracts local features, and a Local Supervision Module (LSM) regularizes decoder representations (Zhu et al., 2024).

3. Semi-Supervised and Self-Supervised Training Strategies

Semi-Mamba-UNet implementations universally target regimes with abundant unlabeled data or limited annotations, leveraging multiple learning strategies:

Teacher–Student/EMA: Networks are duplicated into “student” and exponentially-averaged “teacher” variants. The student is trained conventionally, while the teacher provides stable targets for unlabeled data via pseudo-labels. Consistency is enforced at the logit or intermediate feature level, with confidence thresholding on teacher predictions (e.g., max-probability ≥ 0.95) (Liu et al., 2024, Wang et al., 2024).
Cross-Pseudo-Labeling and Consistency: When hybrid/parallel backbones are used, each sub-network generates hard pseudo-labels for unlabeled images, training the other to match, with loss given by cross-entropy and Dice on these cross-pseudo labels (Ma et al., 2024).
Pixel-wise Contrastive/Consistency Loss: Feature-level contrastive or MSE loss aligns the representations of paired backbones (e.g., VMamba and CNN UNet) on both labeled and unlabeled data, typically via Frobenius-norm over the token/feature grid (Ma et al., 2024).
Multi-Scale Deep Supervision: Auxiliary supervision is applied at outputs of various decoder stages, increasing the robustness and error signal propagation during semi-supervised training (Liu et al., 2024, Zhu et al., 2024).
Self-Prior Perturbation (Super-resolution): For image restoration, a self-prior is generated by perturbing a randomly selected subregion of the input (e.g., brightness inpainting), forcing the network to inpaint that region and thereby learn fine-grained texture and intensity priors (Ji et al., 2024).
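
The teacher–student strategy above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the training code of any cited paper: parameters are held in plain dictionaries, the decay of 0.99 is an assumed value, and the function names are hypothetical. The 0.95 confidence threshold is the example value quoted in the list.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Return teacher <- decay * teacher + (1 - decay) * student, per parameter."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

def confident_pseudo_labels(teacher_probs, threshold=0.95):
    """teacher_probs: (N, K) class probabilities per pixel.

    Returns hard labels (N,) and a boolean mask (N,) that keeps only pixels
    whose maximum teacher probability reaches the threshold.
    """
    labels = teacher_probs.argmax(axis=1)
    mask = teacher_probs.max(axis=1) >= threshold
    return labels, mask

def masked_cross_entropy(student_probs, labels, mask, eps=1e-6):
    """Mean cross-entropy over confident pixels (0.0 if none pass the mask)."""
    if not mask.any():
        return 0.0
    picked = student_probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(picked[mask], eps, 1.0))))
```

In a cross-pseudo-labeling setup, the same two helper functions could be applied symmetrically, with each branch's predictions supplying the labels for the other.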

4. Core Mathematical Formulations and Processing Mechanics

The backbone of each Semi-Mamba-UNet is the Visual Mamba or state-space block. Given input tokens/features $x_t$, the update proceeds as:

$\begin{aligned} h_{t+1} &= A h_t + B x_t \\ y_t &= C h_t + D x_t \end{aligned}$

where $A, B, C, D$ are learnable matrices (typically diagonal or low-rank), optionally made input-dependent at each time step (the selective S6 formulation).
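
The recurrence can be made concrete with a minimal NumPy sketch. It assumes a diagonal $A$, fixed (not input-dependent) parameters, and a zero initial state; the name `ssm_scan` and the array shapes are illustrative and not taken from any cited implementation. The loop visits each token once, which is where the linear cost in sequence length comes from.

```python
import numpy as np

def ssm_scan(x, A, B, C, D):
    """Run h_{t+1} = A h_t + B x_t, y_t = C h_t + D x_t over a sequence.

    x: (T, d_in) inputs
    A: (n,) diagonal of the state matrix
    B: (n, d_in), C: (d_out, n), D: (d_out, d_in)
    The initial state h_0 is assumed to be zero.
    """
    T = x.shape[0]
    h = np.zeros(A.shape[0])
    y = np.empty((T, C.shape[0]))
    for t in range(T):
        y[t] = C @ h + D @ x[t]   # output reads the current state h_t
        h = A * h + B @ x[t]      # diagonal A acts elementwise on the state
    return y
```

A selective variant would recompute some of these parameters from $x_t$ at every step, and practical implementations replace the Python loop with a parallel scan.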

For 2D/3D imaging, VSS blocks implement 2D/3D selective scan (SS2D/ISS2D) by flattening along multiple spatial axes/directions (e.g., left-right, right-left, top-bottom, bottom-top), invoking the state-space recurrence for each direction, and then aggregating outputs via a learned fusion/gating network to form the final output feature map (Ji et al., 2024, Lumetti et al., 2024).
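
A simplified illustration of the multi-directional scan follows. As assumptions for the sketch, every direction uses the same scalar-decay recurrence ($A = a$, $B = C = D = 1$), and the four outputs are merged by a plain sum in place of the learned fusion/gating network; `scan_1d` and `four_direction_scan` are hypothetical names.

```python
import numpy as np

def scan_1d(seq, a):
    """seq: (L, C). Per-channel recurrence with A = a and B = C = D = 1."""
    h = np.zeros(seq.shape[1])
    out = np.empty(seq.shape, dtype=float)
    for t in range(seq.shape[0]):
        out[t] = h + seq[t]   # y_t = h_t + x_t
        h = a * h + seq[t]    # h_{t+1} = a * h_t + x_t
    return out

def four_direction_scan(feat, a=0.9):
    """feat: (H, W, C). Sum of scans along four flattening orders."""
    H, W, C = feat.shape
    rows = feat.reshape(H * W, C)                     # row by row, left to right
    cols = feat.transpose(1, 0, 2).reshape(H * W, C)  # column by column, top to bottom

    out = np.zeros((H, W, C), dtype=float)
    out += scan_1d(rows, a).reshape(H, W, C)
    out += scan_1d(rows[::-1], a)[::-1].reshape(H, W, C)                     # reversed row order
    out += scan_1d(cols, a).reshape(W, H, C).transpose(1, 0, 2)
    out += scan_1d(cols[::-1], a)[::-1].reshape(W, H, C).transpose(1, 0, 2)  # reversed column order
    return out
```

Feeding an impulse at one corner of a small map yields a nonzero response at every position, which is the global-context effect the directional scans are meant to provide.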

The supervised segmentation loss combines per-pixel cross-entropy and Dice: $L_{\mathrm{sup}} = \mathrm{CE}(p_\theta(x), y) + \lambda_{\mathrm{Dice}} (1-\mathrm{Dice}(p_\theta(x),y))$. Depending on the domain and supervision regime, feature- or logit-consistency, cross-pseudo-label, self-prior, or perceptual losses are added to this term.
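
The supervised term can be written out directly. The sketch below assumes flattened pixels, softmax probabilities as input, and a soft Dice averaged over classes; the smoothing constant `eps` and the default $\lambda_{\mathrm{Dice}} = 1$ are illustrative choices rather than values reported in the cited papers.

```python
import numpy as np

def supervised_loss(probs, labels, lambda_dice=1.0, eps=1e-6):
    """CE + lambda_dice * (1 - Dice).

    probs: (N, K) softmax probabilities per pixel
    labels: (N,) integer class ids
    """
    n, k = probs.shape
    onehot = np.eye(k)[labels]

    # Per-pixel cross-entropy on the probability of the true class.
    true_class_prob = probs[np.arange(n), labels]
    ce = -np.mean(np.log(np.clip(true_class_prob, eps, 1.0)))

    # Soft Dice per class, then averaged over classes.
    inter = (probs * onehot).sum(axis=0)
    denom = probs.sum(axis=0) + onehot.sum(axis=0)
    dice = np.mean((2.0 * inter + eps) / (denom + eps))

    return float(ce + lambda_dice * (1.0 - dice))
```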

5. Empirical Performance and Ablation Results

Across diverse applications and datasets, Semi-Mamba-UNet models consistently demonstrate superior or competitive accuracy, efficiency, and memory trade-offs relative to both classical CNN-based UNets and transformer-based approaches.

Medical Image Segmentation: On datasets such as ACDC and Synapse, fully or semi-Mamba-UNet designs achieve higher Dice scores and improved boundary metrics over CNN and ViT baselines; e.g., Dice = 0.9281 (ACDC) and Dice = 0.6429 (Synapse) with a low parameter count and FLOPs (Wang et al., 2024).
Remote Sensing: UNetMamba delivers mIoU gains of 0.87% (LoveDA) and 0.39% (ISPRS Vaihingen) over UNetFormer while reducing parameter count and memory footprint (Zhu et al., 2024).
Semi-Supervised Regimes: With 5–10% labeled data, cross-supervised Semi-Mamba-UNet yields Dice = 0.8386 (5%) and 0.9114 (10%) on ACDC, exceeding seven state-of-the-art semi-supervised baselines (Ma et al., 2024).
Super-Resolution: SMamba-UNet achieves the highest PSNR/SSIM across IXI and fastMRI compared to CNN and transformer competitors, demonstrating sharper boundaries and texture recovery (Ji et al., 2024).
3D Applications: SegMamba boosts Dice by up to 2–3 percentage points and reduces HD95 in brain tumor and abdominal tasks, with multi-directional variants further improving fine structure segmentation at the cost of additional compute (Lumetti et al., 2024).

Ablation studies show that selective insertion of VSS blocks (mid-to-deep layers) offers the best trade-off between accuracy and computation. Multi-scale supervision, local supervision modules (for detail recovery), and feature-consistency penalties each yield measurable gains in both quantitative and qualitative metrics.

6. Domain-Specific Extensions and Variants

Several notable variants tailor the Semi-Mamba-UNet principle to specific modalities or tasks:

ACM-UNet: Integrates pretrained CNNs at shallow encoder levels with deep Mamba/VSS blocks via a lightweight adapter and wavelet-based multi-scale decoder refinement. The “semi” variant partially replaces CNN blocks with VSS in the encoder, resulting in marginal accuracy decreases but improved computational efficiency (Huang et al., 30 May 2025).
UNetMamba: For remote sensing, only the decoder path employs VSS blocks, enabling efficient high-resolution segmentation with minimal extra overhead. The local supervision module further boosts fine-grained accuracy without runtime penalty (Zhu et al., 2024).
SMamba-UNet: Merges improved 2D-Selective-Scan (ISS2D) blocks with self-prior brightness perturbation for super-resolution, tightly coupling global context modeling with enhanced local detail synthesis (Ji et al., 2024).
SegMamba/SegMambaSkip: For 3D segmentation, Mamba-Layers are placed before downsampling/upsampling, or even replace skip connection convolutions, balancing memory and boundary accuracy (Lumetti et al., 2024).
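
The self-prior brightness perturbation mentioned for SMamba-UNet can be pictured with a small sketch. The details here are assumptions made for illustration only: a single rectangular patch, a multiplicative brightness change, and the hypothetical parameters `patch_frac` and `scale_range`. The source states only that a randomly selected subregion is perturbed and that the network is trained to restore it.

```python
import numpy as np

def perturb_brightness(image, rng, patch_frac=0.25, scale_range=(0.5, 1.5)):
    """image: (H, W) float array in [0, 1]. Returns (perturbed, mask).

    A random rectangle covering patch_frac of each side has its brightness
    rescaled; mask marks the perturbed pixels so a loss can focus on them.
    """
    H, W = image.shape
    ph, pw = max(1, int(H * patch_frac)), max(1, int(W * patch_frac))
    top = int(rng.integers(0, H - ph + 1))
    left = int(rng.integers(0, W - pw + 1))
    scale = rng.uniform(*scale_range)

    perturbed = image.copy()
    patch = image[top:top + ph, left:left + pw] * scale
    perturbed[top:top + ph, left:left + pw] = np.clip(patch, 0.0, 1.0)

    mask = np.zeros((H, W), dtype=bool)
    mask[top:top + ph, left:left + pw] = True
    return perturbed, mask
```

A training step would then pass `perturbed` through the network and compare its output with the original `image`, optionally weighting pixels inside `mask` more heavily.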

7. Practical Considerations, Limitations, and Implications

Semi-Mamba-UNet models consistently demonstrate robust semi-supervised and self-supervised learning, high parameter- and FLOP-efficiency, and improved generalization under data scarcity. However, they introduce several challenges:

Training stability on noisy pseudo-labels or unlabeled data requires careful loss ramp-up, gating, and confidence thresholding.
The computational cost of multi-directional or deep VSS block insertion can be substantial, particularly in 3D applications.
The benefit of SSM-based blocks may be more pronounced in tasks demanding long-range contextual integration, rather than uniform grid sampling or pure local analysis.
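
One common way to implement the loss ramp-up mentioned in the first point is a sigmoid-shaped schedule on the weight of the unsupervised term. This schedule is an assumption borrowed from general semi-supervised practice, not a detail confirmed by the cited papers, and the defaults for `rampup_steps` and `max_weight` are placeholders.

```python
import math

def sigmoid_rampup(step, rampup_steps):
    """Weight that rises smoothly from exp(-5) at step 0 to 1.0 at rampup_steps."""
    if rampup_steps <= 0:
        return 1.0
    progress = min(max(step, 0), rampup_steps) / rampup_steps
    phase = 1.0 - progress
    return math.exp(-5.0 * phase * phase)

def total_loss(sup_loss, unsup_loss, step, rampup_steps=1000, max_weight=0.1):
    """Supervised loss plus a gradually enabled unsupervised (pseudo-label) term."""
    weight = max_weight * sigmoid_rampup(step, rampup_steps)
    return sup_loss + weight * unsup_loss
```

Early in training the pseudo-labels are unreliable, so the small initial weight keeps them from dominating the supervised signal.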

A plausible implication is that as the availability of labeled data continues to bottleneck medical and Earth observation AI, the Semi-Mamba-UNet paradigm is likely to proliferate, either as pure or hybrid SSM backbones or as regularization/consistency modules, especially in cross-domain, multi-modal, or federated learning environments.

References:

(Liu et al., 2024) Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining
(Ma et al., 2024) Semi-Mamba-UNet: Pixel-Level Contrastive and Pixel-Level Cross-Supervised Visual Mamba-based UNet
(Wang et al., 2024) Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation
(Huang et al., 30 May 2025) ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation
(Zhu et al., 2024) UNetMamba: An Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images
(Lumetti et al., 2024) Taming Mambas for Voxel Level 3D Medical Image Segmentation
(Ji et al., 2024) Self-Prior Guided Mamba-UNet Networks for Medical Image Super-Resolution