Mamba-Based UNet: Efficient Global Segmentation

Updated 29 May 2026

Mamba-based UNets integrate state-space models into a U-Net topology to achieve efficient long-range dependency modeling and strong segmentation performance.
Mamba-based UNet is a deep learning framework that fuses visual SSM modules with multi-scale aggregation to preserve global context while retaining local details.
Practical applications in medical imaging showcase notable improvements in Dice scores and FLOPs reduction compared to CNN and Transformer baselines.

Mamba-Based UNet

A Mamba-based UNet is a deep learning framework that integrates state-space models (SSMs), most notably the Mamba architecture, into the classical U-Net encoder–decoder topology, aiming to achieve efficient long-range dependency modeling with linear computational complexity. By replacing or augmenting standard convolutional (CNN) or self-attention (ViT) blocks with vision-adapted Mamba SSM modules, Mamba-based UNet models improve global context modeling while keeping the multi-scale and locality-preserving characteristics of U-Net. Recent work has extended these frameworks with multi-scale convolution, directional SSMs, and custom upsampling/aggregation, resulting in state-of-the-art performance across diverse segmentation and dense prediction tasks in medical imaging and beyond (Chen et al., 2024).

1. Architectural Foundations and Model Variants

The canonical Mamba-based UNet adopts a U-shaped encoder–decoder structure with skip connections at each scale. The encoder typically consists of a vision Mamba backbone (e.g., VMamba V2, pretrained on ImageNet-1k), partitioned into four hierarchical stages:

Stage 1 uses patch embedding (e.g., 4×4 non-overlapping) and a single VSS (Visual State-Space) block.
Stages 2–4 employ patch-merging for downsampling and VSS/MSVSS (Multi-Scale Visual State-Space) blocks.
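To make the four-stage hierarchy concrete, the following sketch traces feature-map sizes through the encoder. It assumes that patch embedding divides the resolution by the patch size and that every patch-merging step halves the resolution and doubles the channel width; the function name and the base width of 96 channels are illustrative choices, not values taken from the source.

```python
def encoder_stage_shapes(height, width, patch=4, base_channels=96, num_stages=4):
    """Return (stage, H, W, C) for each encoder stage.

    Assumptions (not from the source): patch merging halves H and W and
    doubles C; base_channels=96 is only an illustrative starting width.
    """
    shapes = []
    h, w, c = height // patch, width // patch, base_channels
    for stage in range(1, num_stages + 1):
        if stage > 1:
            # Patch merging between stages: downsample by 2, widen by 2.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((stage, h, w, c))
    return shapes


for stage, h, w, c in encoder_stage_shapes(224, 224):
    print(f"stage {stage}: {h}x{w} positions, {c} channels")
```

Under these assumptions a 224×224 input yields feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution, which are the scales the decoder later mirrors through its skip connections.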

The decoder mirrors this hierarchy, with each upsampling stage comprising:

Large Kernel Patch Expanding (LKPE) or equivalent upsample layer.
Feature fusion of the decoder output with encoder skip connections, often via custom MSVSS or MS-FFN modules.
Final projection to image resolution and class logits via a last upsampling and classification layer.

Underlying this scaffold, the Mamba SSM is used to model long-range dependencies through block-wise processing (directional scan followed by SSM), typically following the VMamba SS2D scheme: flattening features along multiple scan orders (top-bottom, bottom-top, left-right, right-left), applying a 1D Mamba block to each sequence, then recomposing 2D features by summing the outputs.
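The scan-and-merge step described above can be sketched in NumPy as follows. This is a minimal illustration rather than the VMamba implementation: `ss2d_scan_merge` is a hypothetical name, and `seq_fn` stands in for the 1D Mamba block (here a causal cumulative sum, chosen only because it is easy to reason about).

```python
import numpy as np


def ss2d_scan_merge(x, seq_fn):
    """Apply a 1D sequence operator along four scan orders and sum the results.

    x:      (H, W, C) feature map
    seq_fn: maps an (L, C) sequence to an (L, C) sequence; a stand-in for
            the 1D Mamba block (assumption for illustration)
    """
    H, W, C = x.shape
    row_major = x.reshape(H * W, C)                     # left-to-right, row by row
    col_major = x.transpose(1, 0, 2).reshape(H * W, C)  # top-to-bottom, column by column

    out = np.zeros_like(x, dtype=float)
    # Forward and reversed row-major scans, mapped back to (H, W, C).
    out += seq_fn(row_major).reshape(H, W, C)
    out += seq_fn(row_major[::-1])[::-1].reshape(H, W, C)
    # Forward and reversed column-major scans, mapped back to (H, W, C).
    out += seq_fn(col_major).reshape(W, H, C).transpose(1, 0, 2)
    out += seq_fn(col_major[::-1])[::-1].reshape(W, H, C).transpose(1, 0, 2)
    return out


def causal_cumsum(seq):
    """Toy causal operator: each position sees itself and everything before it."""
    return np.cumsum(seq, axis=0)


x = np.arange(24, dtype=float).reshape(2, 3, 4)
merged = ss2d_scan_merge(x, causal_cumsum)
print(merged.shape)
```

With the toy causal operator, each forward/reverse pair lets every position see the whole map, which is the reason for scanning in opposite directions: a single causal scan only exposes the positions that precede the current one.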

Several variants exist: pure Mamba-based designs (e.g., LightM-UNet, which replaces all blocks with Mamba modules for extreme efficiency (Liao et al., 2024)), hybrid CNN+Mamba designs (e.g., ACM-UNet, which uses CNNs for local features and Mamba VSS blocks for global context, linked by an inter-stage adapter (Huang et al., 30 May 2025)), and attention-fused hybrids (e.g., ASP-VMUNet, UltraLBM-UNet, PGM-UNet).

2. Mathematical Formulation of State Space and 2D Adaptation

The mathematical foundation of Mamba-based blocks is the discrete-time, linear state-space model:

$$x(t) = A\,x(t-1) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)$$

where $x(t) \in \mathbb{R}^n$ is the hidden state, $u(t) \in \mathbb{R}^d$ is the input, and $y(t) \in \mathbb{R}^k$ is the output. The learnable matrices $A, B, C, D$ are typically parameterized for fast scan/inference and hardware efficiency, yielding $O(L+n)$ complexity for sequence length $L$.
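As a direct reading of the recurrence above, the sketch below steps the state-space model through a sequence with fixed matrices. It is a reference loop only: the name `ssm_scan` is hypothetical, and Mamba itself makes parts of the parameterization input-dependent and uses a hardware-aware scan instead of a Python loop.

```python
import numpy as np


def ssm_scan(A, B, C, D, u, x0=None):
    """Run x(t) = A x(t-1) + B u(t), y(t) = C x(t) + D u(t) over a sequence.

    A: (n, n), B: (n, d), C: (k, n), D: (k, d), u: (L, d)
    Returns y with shape (L, k). Matrices are fixed here for clarity
    (simplifying assumption; not the selective Mamba parameterization).
    """
    x = np.zeros(A.shape[0]) if x0 is None else x0
    outputs = []
    for u_t in u:
        x = A @ x + B @ u_t            # state update
        outputs.append(C @ x + D @ u_t)  # readout
    return np.stack(outputs)


# Scalar example: an impulse input decays geometrically through the state.
A = np.array([[0.5]])
B = np.array([[1.0]])
C = np.array([[1.0]])
D = np.array([[0.0]])
u = np.array([[1.0], [0.0], [0.0]])
print(ssm_scan(A, B, C, D, u).ravel())  # -> [1.   0.5  0.25]
```

The loop performs one state update per sequence element, so its cost grows linearly with the sequence length, in contrast to the pairwise comparisons of self-attention.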

For vision (2D data), the Mamba block is applied after flattening the spatial map along four orthogonal scan orders (SS2D), producing direction-sensitive representations. Some advanced designs supplement this with multi-scale and diagonal aggregation via depth-wise convolutions (e.g., MS-FFN in MSVSS blocks), ensuring both axis-aligned and non-axis-aligned (diagonal) dependencies are effectively modeled (Chen et al., 2024).

In pure SSM blocks, a typical flow is:

Linear projection (channel expansion)
Depthwise convolution for locality
Nonlinearity (e.g., SiLU or GELU)
Directional SS2D SSM
LayerNorm and linear projection
Residual connection
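The six steps listed above can be strung together as in the following NumPy sketch. It follows the order of the list literally and is not a faithful reproduction of any published block: all function and parameter names are hypothetical, LayerNorm is shown without learnable scale and shift, and the directional SS2D step is passed in as a placeholder callable.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def layer_norm(x, eps=1e-5):
    # Normalizes over the channel axis; learnable affine terms omitted (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def depthwise_conv3x3(x, kernels):
    """x: (H, W, C); kernels: (3, 3, C); zero padding, stride 1, one filter per channel."""
    H, W, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + H, j:j + W, :] * kernels[i, j]
    return out


def vss_block(x, w_in, dw_kernels, w_out, ss2d):
    """x: (H, W, C); w_in: (C, E); dw_kernels: (3, 3, E); w_out: (E, C)."""
    h = x @ w_in                           # 1. linear projection (channel expansion)
    h = depthwise_conv3x3(h, dw_kernels)   # 2. depthwise convolution for locality
    h = silu(h)                            # 3. nonlinearity
    h = ss2d(h)                            # 4. directional SS2D SSM (placeholder callable)
    h = layer_norm(h) @ w_out              # 5. LayerNorm and linear projection
    return x + h                           # 6. residual connection


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))
out = vss_block(
    x,
    w_in=rng.normal(size=(8, 16)),
    dw_kernels=rng.normal(size=(3, 3, 16)),
    w_out=rng.normal(size=(16, 8)),
    ss2d=lambda h: h,  # identity stand-in for the real scan
)
print(out.shape)
```

Because of the residual connection, the block preserves the input shape, so blocks can be stacked freely within a stage.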

3. Multi-Scale Aggregation and Enhanced Feature Learning

A primary innovation in recent Mamba-based UNet architectures, such as MSVM-UNet, is the explicit design of multi-scale aggregation within each block:

Multi-Scale Feed-Forward Networks (MS-FFN): After SSM/four-directional scan, features are passed through channel-expansion, then multiple depth-wise convolutions with varying kernel sizes (e.g., 1×1, 3×3, 5×5). Their summed outputs, together with a residual, are projected back to the base channel dimension.
Purpose: This module aggregates spatial context at multiple scales, including diagonal and off-axis dependencies, which are poorly captured by one-dimensional SSM scanning alone.
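A minimal NumPy sketch of the multi-scale aggregation described above is given below. It is one plausible reading of the description, not the MSVM-UNet code: the names are hypothetical, the residual is taken to be the expanded feature map added to the summed branch outputs, and activations and normalization are left out for brevity.

```python
import numpy as np


def depthwise_conv(x, kernel):
    """x: (H, W, C); kernel: (k, k, C) with odd k; zero 'same' padding, stride 1."""
    k = kernel.shape[0]
    p = k // 2
    H, W, _ = x.shape
    padded = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += padded[i:i + H, j:j + W, :] * kernel[i, j]
    return out


def ms_ffn(x, w_expand, kernels, w_project):
    """x: (H, W, C); w_expand: (C, E); kernels: list of (k, k, E); w_project: (E, C)."""
    h = x @ w_expand                                       # channel expansion
    multi = sum(depthwise_conv(h, kern) for kern in kernels)  # 1x1, 3x3, 5x5 branches, summed
    return (h + multi) @ w_project                         # residual, then project back


rng = np.random.default_rng(0)
C, E = 8, 32
x = rng.normal(size=(6, 6, C))
kernels = [rng.normal(size=(k, k, E)) for k in (1, 3, 5)]
y = ms_ffn(x, rng.normal(size=(C, E)), kernels, rng.normal(size=(E, C)))
print(y.shape)
```

The 3×3 and 5×5 kernels have nonzero weights at diagonal offsets, so each output position mixes information from diagonal neighbours directly, which the axis-aligned 1D scans only reach after many sequence steps.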

This principle is echoed in other SSM-UNet hybrids, where multi-branch processing, channel attention, and local-global mixing are vital for matching or exceeding CNN/Transformer baselines—especially for small, low-contrast, or ambiguous structures (Chen et al., 2024, Xu et al., 14 Jun 2025, Rahman et al., 22 Apr 2026).

4. Upsampling, Decoder Design, and Skip Connections

To maximize high-resolution detail retention, Mamba-based UNets typically introduce skip connections from encoder to decoder at matching scales. In decoders, specialized upsampling layers account for both channel-wise and spatial context:

Large Kernel Patch Expanding (LKPE): Achieves up-resolving by combining 1×1 convolution (for channel mixing), batch/layer normalization, nonlinearity, and 3×3 depthwise convolution, followed by rearrangement and normalization. This process maintains the structural integrity of feature maps, avoiding the loss of fine details common with naive reshaping or patch-expand operations (Chen et al., 2024).
Skip Fusion: Advanced variants dynamically combine skip features and upsampled decoder outputs via attention or additive/concatenative fusion, often followed by further SSM or multi-scale convolutional processing—leading to robust hierarchical feature integration (Huang et al., 30 May 2025, Xu et al., 14 Jun 2025).
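The rearrangement step of a patch-expanding layer and the simplest (concatenative) skip fusion can be sketched as follows. This covers only those two operations, under assumed names and shapes; the convolution, normalization, and nonlinearity stages of LKPE, and the attention-based fusion variants, are omitted.

```python
import numpy as np


def rearrange_expand(x, scale=2):
    """Move channel groups into space: (H, W, C*scale*scale) -> (H*scale, W*scale, C)."""
    H, W, channels = x.shape
    C = channels // (scale * scale)
    x = x.reshape(H, W, scale, scale, C)   # split channels into a (scale, scale, C) grid
    x = x.transpose(0, 2, 1, 3, 4)         # (H, scale, W, scale, C)
    return x.reshape(H * scale, W * scale, C)


def fuse_skip(decoder_feat, encoder_feat, w_fuse):
    """Concatenate along channels, then mix with a pointwise projection.

    decoder_feat, encoder_feat: (H, W, C); w_fuse: (2C, C) (hypothetical layout)
    """
    fused = np.concatenate([decoder_feat, encoder_feat], axis=-1)
    return fused @ w_fuse


rng = np.random.default_rng(0)
low_res = rng.normal(size=(7, 7, 4 * 16))   # decoder features before upsampling
skip = rng.normal(size=(14, 14, 16))        # encoder features at the target scale
up = rearrange_expand(low_res, scale=2)
out = fuse_skip(up, skip, rng.normal(size=(32, 16)))
print(up.shape, out.shape)
```

The rearrangement itself is a pure permutation of values, so any smoothing or context mixing has to come from the layers placed before it, which is the role the source assigns to the convolutions in LKPE.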

5. Computational Complexity and Implementation Considerations

The principal advantage of the Mamba-based UNet is that its computational cost is linear in the number of spatial positions or sequence elements, a substantial reduction over the quadratic cost of self-attention used in ViT-based UNets. For instance:

Parameter Efficiency: Pure Mamba UNet variants (e.g., LightM-UNet) achieve up to 116× reduction in parameter count versus classical CNN/Transformer UNet baselines, with competitive or superior segmentation accuracy (Liao et al., 2024).
FLOPs Reduction: FLOP counts drop as well, though by a smaller factor (e.g., 21× fewer for LightM-UNet than for nnU-Net at 2D/3D scale).
Pragmatic Aspects: Models are typically trained using AdamW or SGD (learning rate roughly 5e-4 to 1e-3), a combined Dice + cross-entropy loss, and strong augmentation. Early stopping, batch/layer norm stabilization, deep supervision, and hybrid upsampling are common.
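The linear-versus-quadratic contrast that opens this section can be illustrated with a rough count of token interactions. The figures below are unit-free scaling counts under an assumed 4×4 patch embedding, not FLOP estimates for any specific model, and the function name is hypothetical.

```python
def token_mixing_cost(height, width, patch=4):
    """Count tokens and the interaction terms that scale with them.

    'linear' models one state update per token (SSM scan);
    'quadratic' models one score per token pair (self-attention).
    Constant factors such as channel width are deliberately ignored.
    """
    tokens = (height // patch) * (width // patch)
    return {"tokens": tokens, "linear": tokens, "quadratic": tokens * tokens}


for side in (224, 448, 896):
    cost = token_mixing_cost(side, side)
    print(f"{side}x{side}: tokens={cost['tokens']}, "
          f"linear={cost['linear']}, quadratic={cost['quadratic']}")
```

Doubling the image side multiplies the token count, and therefore the linear term, by 4, while the quadratic term grows by 16; this gap is why the advantage matters most for high-resolution inputs.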

Model scaling is effective up to moderate depths/widths; over-scaling without extra data can lead to diminishing returns or degraded accuracy (Jiang et al., 2024). Adaptive modules (e.g., knowledge distillation, attention-guided fusion) support deployment in ultra-lightweight, resource-constrained settings (Fan et al., 25 Dec 2025, Rahman et al., 22 Apr 2026).

6. Benchmarks, Applications, and Comparative Performance

Mamba-based UNets have demonstrated state-of-the-art results across several domains:

Medical Image Segmentation: MSVM-UNet achieves 85.00% Dice and 14.75 mm HD95 on Synapse (8-organ CT), outperforming previous Mamba baselines by 2.62 percentage points in Dice and 1.47 mm in HD95 (Chen et al., 2024). On ACDC (cardiac MRI), it attains 92.58% Dice (RV 91.00%, Myo 90.35%, LV 96.39%).
Fetal Ultrasound, Remote Sensing, and Skin Lesions: Hybrid variants (e.g., SS-MCAT-SSM in MS-UMamba, PVM in ASP-VMUNet, multi-branch/attention in MambaLiteUNet) yield robust performance in challenging low-contrast and domain-variant settings (Xu et al., 14 Jun 2025, Bao et al., 25 Mar 2025, Rahman et al., 22 Apr 2026).
Generic Vision and Speech: Extensions to speech separation and enhancement, super-resolution, and high-resolution remote sensing confirm the architecture’s generality (Dang et al., 2024, Wang et al., 2024, Ji et al., 2024, Zhu et al., 2024).
Comparison to Baselines: Across datasets and domains, Mamba-based UNets compete with or outperform CNN, Transformer, and hybrid models, especially when efficiency and global–local fusion are paramount.
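Since the results above are reported as Dice scores, a small sketch of how that metric is computed may help when reading them. The function names are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here; evaluation protocols differ on that case.

```python
import numpy as np


def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def mean_dice(pred_labels, target_labels, class_ids):
    """Average the per-class Dice over the given foreground class ids."""
    scores = [dice_score(pred_labels == c, target_labels == c) for c in class_ids]
    return float(np.mean(scores))


pred = np.array([[1, 1], [0, 2]])
target = np.array([[1, 0], [0, 2]])
print(dice_score(pred == 1, target == 1))
print(mean_dice(pred, target, class_ids=[1, 2]))
```

Multi-class figures such as the ACDC result quoted above are averages of per-class scores of this kind, expressed as percentages.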

A summary of representative results for MSVM-UNet (Chen et al., 2024):

Model	Synapse DSC (%)	Synapse HD95 (mm)	ACDC DSC (%)	Params (M)	FLOPs (G)
MSVM-UNet	85.00	14.75	92.58	35.93	15.53
VM-UNet	82.38	16.22	–	22.77	4.40
D-LKA Net	80.89	22.31	–	–	–
PVT-EMCAD-B2	81.19	15.36	–	–	–

7. Limitations and Future Directions

Mamba-based UNet models retain several open challenges and technical limitations:

Directional Sensitivity: Standard 1D scan SSMs cannot naturally capture non-orthogonal (diagonal, multi-scale) dependencies in 2D/3D. This is partially mitigated through enhanced MSVSS blocks and auxiliary convolution, but perfect 2D/3D context remains elusive (Chen et al., 2024).
Efficiency–Accuracy Trade-off: Ultra-light variants may lag in representing complex or highly variable structures. Hybridization with local pattern modules (CNN, wavelet, attention) is often necessary for top-tier segmentation fidelity in edge-device deployments (Fan et al., 25 Dec 2025, Huang et al., 30 May 2025).
State Space Hyperparameterization: The choice of SSM state dimension, number of scan directions, and block depth affects both convergence and stability. Larger models may overfit or show diminishing returns without sufficient data or additional pretraining.

Future improvements are likely to center on more flexible 2D/3D SSMs, self-supervised pretraining strategies, efficient multi-modal adaptation, and dynamic or adaptive scan/fusion methods tailored to specific application domains.

References:

MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation (Chen et al., 2024)
LightM-UNet: Mamba Assists in Lightweight UNet for Medical Image Segmentation (Liao et al., 2024)
MS-UMamba: An Improved Vision Mamba Unet for Fetal Abdominal Medical Image Segmentation (Xu et al., 14 Jun 2025)
MLLA-UNet: Mamba-like Linear Attention UNet (Jiang et al., 2024)
MM-UNet: Meta Mamba UNet for Medical Image Segmentation (Xie et al., 21 Mar 2025)
Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining (Liu et al., 2024)
ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation (Bao et al., 25 Mar 2025)
UltraLBM-UNet: Ultralight Bidirectional Mamba-based Model for Skin Lesion Segmentation (Fan et al., 25 Dec 2025)
MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation (Rahman et al., 22 Apr 2026)
ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation (Huang et al., 30 May 2025)
Self-Prior Guided Mamba-UNet Networks for Medical Image Super-Resolution (Ji et al., 2024)