Hybrid CNN-Mamba UNet

Updated 1 April 2026

The paper introduces a hybrid CNN-Mamba UNet that fuses CNN operators for local detail with Mamba SSM modules for efficient modeling of long-range dependencies.
It employs an encoder–decoder architecture with advanced fusion and skip connection techniques to maintain both high-frequency details and global contextual awareness.
Empirical evaluations on medical imaging benchmarks reveal state-of-the-art segmentation accuracy, improved mIoU/DSC metrics, and reduced computational costs.

A hybrid CNN-Mamba UNet refers to a class of encoder–decoder architectures that merge convolutional neural network (CNN) operators for local feature extraction with Mamba-based State Space Model (SSM) modules to capture long-range dependencies in imaging data. These architectures have become prominent in medical image segmentation and related spatial modeling tasks, combining the strengths of local inductive bias from CNNs with the efficient global context acquisition, memory, and linear complexity scaling of SSMs (notably, the Mamba model). Approaches recognized under this term include VM-UNetV2, MM-UNet, MS-UMamba, HyM-UNet, ACM-UNet, and others across diverse scientific domains (Zhang et al., 2024, Xie et al., 21 Mar 2025, Xu et al., 14 Jun 2025, Chen et al., 22 Nov 2025, Huang et al., 30 May 2025).

1. Theoretical Foundations and Motivation

CNNs are effective for local feature extraction due to their spatially local receptive field, translation equivariance, and efficient implementation. However, clinical and scientific imaging tasks often demand modeling semantic information distributed over extended spatial ranges, where pure CNNs exhibit limited performance due to their inherently local connections and slowly growing receptive field (Zhang et al., 2024). Transformer architectures address this with global self-attention, but incur $\mathcal{O}(N^2)$ time and memory in the pixel count $N$, limiting scalability.

State Space Models, and in particular the Mamba architecture, offer an alternative: global context is modeled by structured linear recurrences (SSMs) with $\mathcal{O}(N)$ or $\mathcal{O}(N\log N)$ complexity, implemented as either global convolution with a long kernel or an efficient parallel scan (Zhang et al., 2024, Xie et al., 21 Mar 2025). Mamba augments classical SSMs by input-dependent “gating” in the state-evolution and output mappings, further enhancing modeling power while preserving efficiency.

The hybridization paradigm is thus motivated by:

CNN path: strong at local spatial pattern learning, efficient at low-level structure.
Mamba/SSM path: strong at global, long-range dependency modeling with linear cost.
Hybrid fusion: enables the network to capture both spatially-local and semantically-global features, yielding state-of-the-art accuracy in boundary-sensitive or context-dependent segmentation tasks.

2. Canonical Architectural Patterns

2.1 Encoder–Decoder Skeleton

Most hybrid CNN-Mamba UNets follow the classical U-Net blueprint: a multi-stage, downsampling encoder, a bottleneck, and a multi-stage upsampling decoder, with skip connections joining encoder and decoder layers at matching spatial scales (Zhang et al., 2024, Huang et al., 30 May 2025). Several key design variants are observed:

Block-level hybridization: Each block in the encoder/decoder consists of parallel or sequential CNN and Mamba branches, whose outputs are fused by additive, multiplicative, or concatenation schemes (Xie et al., 21 Mar 2025, Xu et al., 14 Jun 2025, Chen et al., 22 Nov 2025).
Stage-level hybridization: Shallow encoder stages are pure CNNs for high-frequency detail; deeper stages use VSS (Visual State Space) blocks for global context (Chen et al., 22 Nov 2025, Zhang et al., 2024).
Adapter-based fusion: Lightweight adapters map CNN channel dimensions to those required by Mamba blocks and vice versa (Huang et al., 30 May 2025).
Multi-branch or dual-branch: Some architectures run CNN and Mamba-based encoder–decoder paths in parallel and merge their outputs through evidence-guided consistency, prompt fusion, or multi-focus attention modules (Zhang et al., 25 Mar 2025, Han et al., 2024).
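The fusion schemes and channel adapters listed above can be illustrated with a minimal NumPy sketch. It is not taken from any of the cited models: the function names `pointwise_adapter` and `fuse`, the channel counts, and the random weights are all hypothetical, and the two input arrays merely stand in for the outputs of a CNN branch and a Mamba branch.

```python
import numpy as np

def pointwise_adapter(x, weight):
    """1x1 convolution written as a per-pixel linear map over channels.

    x: (C_in, H, W) feature map; weight: (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, x)

def fuse(local_feat, global_feat, mode="add"):
    """Fuse two branch outputs with matching (C, H, W) shapes."""
    if mode == "add":
        return local_feat + global_feat
    if mode == "mul":
        return local_feat * global_feat
    if mode == "concat":
        return np.concatenate([local_feat, global_feat], axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")

rng = np.random.default_rng(0)
cnn_out = rng.standard_normal((64, 32, 32))    # stand-in for a CNN branch output
mamba_out = rng.standard_normal((96, 32, 32))  # stand-in for a Mamba branch output

# Hypothetical adapter weights mapping 96 Mamba channels to 64 CNN channels.
adapter_w = rng.standard_normal((64, 96)) / np.sqrt(96)
mamba_aligned = pointwise_adapter(mamba_out, adapter_w)

fused_add = fuse(cnn_out, mamba_aligned, "add")     # channels unchanged
fused_cat = fuse(cnn_out, mamba_aligned, "concat")  # channels doubled
```

Additive and multiplicative fusion keep the channel count fixed, whereas concatenation doubles it and usually calls for a further projection before the next block.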

2.2 Core Module: Visual State Space (Mamba) Block

The Visual State Space (VSS) Block is the principal operator for global feature modeling via SSM (Zhang et al., 2024). It is mathematically specified as:

Continuous-time dynamics:

$h'(t) = A\,h(t) + B\,u(t),\quad y(t) = C\,h(t) + D\,u(t)$

Discretization (zero-order hold, step $\Delta$):

$\bar{A} = \exp(\Delta A),\quad \bar{B} = (A^{-1}(\bar{A}-I)) B$

$h[t] = \bar{A} h[t-1] + \bar{B} u[t],\quad y[t] = C h[t]$
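As a rough illustration of the discretization and recurrence above, the following NumPy sketch assumes a diagonal state matrix $A$ (so that $\exp(\Delta A)$ and $A^{-1}$ act elementwise) and fixed, input-independent $\Delta$, $B$, and $C$. The helper names and all numerical values are hypothetical. It also checks the equivalence, noted earlier, between the recurrence and a global convolution with the kernel $K[k] = C\,\bar{A}^{k}\,\bar{B}$.

```python
import numpy as np

def zoh_discretize(a_diag, B, delta):
    """Zero-order-hold discretization for a diagonal state matrix A.

    a_diag: (n,) nonzero diagonal of A; B: (n,) input vector.
    Returns the diagonal of A_bar and the vector B_bar.
    """
    a_bar = np.exp(delta * a_diag)
    b_bar = (a_bar - 1.0) / a_diag * B  # A^{-1} (A_bar - I) B for diagonal A
    return a_bar, b_bar

def ssm_scan(u, a_bar, b_bar, C):
    """Run h[t] = A_bar h[t-1] + B_bar u[t], y[t] = C h[t] over a 1D sequence."""
    h = np.zeros_like(a_bar)
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        h = a_bar * h + b_bar * u_t
        y[t] = C @ h
    return y

def ssm_kernel(a_bar, b_bar, C, length):
    """Equivalent convolution kernel K[k] = C A_bar^k B_bar."""
    powers = a_bar[None, :] ** np.arange(length)[:, None]  # (length, n)
    return powers @ (C * b_bar)

rng = np.random.default_rng(0)
n, L = 4, 16
a_diag = -np.abs(rng.standard_normal(n)) - 0.1  # negative entries keep the state stable
B = rng.standard_normal(n)
C = rng.standard_normal(n)
u = rng.standard_normal(L)

a_bar, b_bar = zoh_discretize(a_diag, B, delta=0.1)
y_scan = ssm_scan(u, a_bar, b_bar, C)
y_conv = np.convolve(u, ssm_kernel(a_bar, b_bar, C, L))[:L]
```

The two outputs agree only because the parameters here do not depend on the input. Once $\Delta$, $B$, and $C$ become input-dependent, as in Mamba, the fixed-kernel convolution no longer applies and the recurrence is evaluated with a scan instead. The direct term $D\,u$ is left out of the sketch, matching the discrete output equation above.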

For 2D feature maps, the selective-scan (SS2D or ISS2D) process runs state-space recurrences along rows and/or columns, often in multiple directions. Many variants alternate spatial axes, combine multi-diagonal passes, or fuse directionally-weighted outputs (Zhang et al., 2024, Ji et al., 2024).
Hybridization: the input first passes through a 1×1 or depthwise 3×3 convolution for embedding and is then split between parallel local (CNN) and SSM (Mamba) branches. Outputs are typically fused via element-wise sum or product, followed by normalization and activation (e.g., SiLU) (Zhang et al., 2024, Chen et al., 22 Nov 2025, Xu et al., 14 Jun 2025).
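The multi-directional scanning idea can be sketched in a few lines of NumPy. This is a simplified illustration rather than the SS2D or ISS2D operator of any cited model: `linear_scan` is a toy scalar recurrence standing in for a full selective SSM, the four traversal orders (row-major and column-major, each forward and backward) are one common choice, and the outputs are merged by a plain sum.

```python
import numpy as np

def linear_scan(seq, decay=0.9):
    """Toy stand-in for a 1D SSM: h[t] = decay * h[t-1] + (1 - decay) * seq[t]."""
    out = np.empty_like(seq, dtype=float)
    h = 0.0
    for t, value in enumerate(seq):
        h = decay * h + (1.0 - decay) * value
        out[t] = h
    return out

def four_direction_scan(x, decay=0.9):
    """Scan an (H, W) map along four traversal orders and sum the results."""
    H, W = x.shape
    row_major = x.reshape(-1)    # left-to-right, then top-to-bottom
    col_major = x.T.reshape(-1)  # top-to-bottom, then left-to-right

    out = np.zeros((H, W))
    out += linear_scan(row_major, decay).reshape(H, W)
    out += linear_scan(row_major[::-1], decay)[::-1].reshape(H, W)
    out += linear_scan(col_major, decay).reshape(W, H).T
    out += linear_scan(col_major[::-1], decay)[::-1].reshape(W, H).T
    return out

# Response to a single active pixel at the centre of a 5x5 map.
impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0
response = four_direction_scan(impulse)
```

Because the forward scans reach every position after the active pixel and the backward scans reach every position before it, the summed response is nonzero everywhere: a single layer already has a global receptive field, at a cost linear in the number of pixels.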

2.3 Skip Connections and Fusion Modules

Hybrid CNN-Mamba UNets rarely use plain skip concatenation. Instead, skip features are fused using advanced attention-based modules to better integrate multi-scale semantic and boundary information:

Semantics and Detail Infusion (SDI): Fuses all encoder levels into each decoder stage using CBAM attention, channel alignment, size-matching, smoothing, and a final learned fusion (Zhang et al., 2024).
Attention-Based Dynamic Feature Fusion (ADFF): Parallel spatial and channel attention followed by fusion, applied at each encoder–decoder merge point (Xu et al., 14 Jun 2025).
Mamba-Guided Fusion Skip (MGF-Skip): Decoder features gate encoder features via convolutional activations, suppressing background and enhancing ambiguous or noisy boundaries (Chen et al., 22 Nov 2025).
Multi-Scale or Multi-Focus Attention: Merges local and global paths with channel attention, frequently integrating both average and max-pooled statistics for robust selection (Zhang et al., 25 Mar 2025, Liu et al., 2024).
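Two of the mechanisms named above, decoder-gated skips and channel attention built from average- and max-pooled statistics, can be sketched as follows. This is a schematic NumPy illustration only: it does not reproduce the SDI, ADFF, or MGF-Skip modules, and the function names, tensor sizes, and random weights are hypothetical placeholders for learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(encoder_feat, decoder_feat, gate_w):
    """Gate encoder features with a map computed from decoder features.

    encoder_feat, decoder_feat: (C, H, W); gate_w: (C, C) weights of a 1x1 conv.
    """
    gate = sigmoid(np.einsum("oc,chw->ohw", gate_w, decoder_feat))
    return encoder_feat * gate  # gate values in (0, 1) rescale each position

def channel_attention(x, w1, w2):
    """CBAM-style channel attention from average- and max-pooled statistics.

    x: (C, H, W); w1: (C_r, C) and w2: (C, C_r) form a shared two-layer MLP.
    """
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)

    avg_stat = x.mean(axis=(1, 2))
    max_stat = x.max(axis=(1, 2))
    weights = sigmoid(mlp(avg_stat) + mlp(max_stat))  # one weight per channel
    return x * weights[:, None, None]

rng = np.random.default_rng(0)
enc = rng.standard_normal((8, 16, 16))  # stand-in encoder skip features
dec = rng.standard_normal((8, 16, 16))  # stand-in upsampled decoder features

gated = gated_skip(enc, dec, rng.standard_normal((8, 8)) * 0.1)
attended = channel_attention(
    gated, rng.standard_normal((2, 8)), rng.standard_normal((8, 2))
)
```

In both functions the attention output is a multiplicative mask between 0 and 1, so the fused features can only be attenuated relative to the skip input. Which positions or channels are suppressed is learned in practice.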

3. Mathematical Properties and Computational Complexity

Hybrid CNN-Mamba UNets leverage the following computational advantages:

SSM/Mamba Layer: Linear complexity $\mathcal{O}(L \cdot N)$, where $L$ = sequence length (e.g., $H \times W$) and $N$ = state dimension. No quadratic cost, unlike self-attention (Zhang et al., 2024, Xie et al., 21 Mar 2025, Ji et al., 2024).
Local Convolutions: $\mathcal{O}(L \cdot k^2)$ per input-output channel pair for $k \times k$ kernels; receptive field accumulates slowly as depth increases.
End-to-End Models: Hybrid models often have substantially fewer parameters and FLOPs than Transformer- or attention-rich variants. VM-UNetV2, for instance, reports 17.9M parameters, 4.4 GFLOPs, and 32.6 FPS at 3×256×256 input (Zhang et al., 2024). Pure SSM variants can go even lower (LightM-UNet: as few as 1.1M parameters (Liao et al., 2024)), but at the cost of local detail if CNN components are not present.
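To make the scaling behaviour concrete, the sketch below counts approximate multiply-add operations for one self-attention layer, one SSM layer, and one convolution layer. The formulas are rough leading-order estimates that ignore projections, normalization, and constant factors. The function names are hypothetical, and the channel width `d = 96` and state size `n = 16` are illustrative assumptions, not values from the cited papers.

```python
def attention_cost(L, d):
    """Score matrix plus weighted sum: about 2 * L^2 * d multiply-adds."""
    return 2 * L * L * d

def ssm_cost(L, d, n):
    """State update plus readout per token: about 2 * L * d * n multiply-adds."""
    return 2 * L * d * n

def conv_cost(L, c_in, c_out, k):
    """k x k convolution: L * k^2 * c_in * c_out multiply-adds."""
    return L * k * k * c_in * c_out

L = 64 * 64    # tokens in a 64x64 feature map
d, n = 96, 16  # illustrative channel width and state size (assumptions)

ratio = attention_cost(L, d) / ssm_cost(L, d, n)   # equals L / n
growth_attention = attention_cost(4 * L, d) / attention_cost(L, d)
growth_ssm = ssm_cost(4 * L, d, n) / ssm_cost(L, d, n)
```

Under these assumptions the attention layer costs $L/n$ times as much as the SSM layer. Doubling the resolution along each axis quadruples $L$, which multiplies the attention estimate by 16 but the SSM and convolution estimates by only 4.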

Ablation studies across models (VM-UNetV2, HyM-UNet, PGM-UNet) consistently show that the addition of SSM/Mamba blocks to deep or bottleneck layers yields significant gains in global context modeling (IoU, DSC increases of 1–2.5%), while preserving CNNs in early or shallow layers maintains local segmentation detail (Zhang et al., 2024, Chen et al., 22 Nov 2025, Xie et al., 21 Mar 2025).

4. Training Regimens and Empirical Performance

Hybrid CNN-Mamba UNets are trained using multi-term losses, including Dice, binary cross-entropy, focal loss, and auxiliary supervision at intermediate feature maps. Training strategies leverage modern optimizers (AdamW, SGD with momentum), cosine or polynomial learning rate schedules, and extensive geometric and photometric data augmentation (Zhang et al., 2024, Chen et al., 22 Nov 2025, Xu et al., 14 Jun 2025).
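A compound Dice plus binary cross-entropy objective of the kind described above might look like the following NumPy sketch for a single binary mask. The equal default weights and the smoothing constants are assumptions chosen for illustration; the cited works use their own weightings and may add focal or auxiliary terms.

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|), smoothed by eps."""
    inter = np.sum(prob * target)
    denom = np.sum(prob) + np.sum(target)
    return float(1.0 - (2.0 * inter + eps) / (denom + eps))

def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy on probabilities clipped away from 0 and 1."""
    p = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

def combined_loss(prob, target, w_dice=1.0, w_bce=1.0):
    """Weighted sum of the two terms; the weights are hypothetical defaults."""
    return w_dice * dice_loss(prob, target) + w_bce * bce_loss(prob, target)

target = np.array([[0.0, 1.0], [1.0, 1.0]])
perfect = combined_loss(target, target)                  # close to zero
uncertain = combined_loss(np.full((2, 2), 0.5), target)  # uninformative prediction
inverted = combined_loss(1.0 - target, target)           # worst case
```

The Dice term measures region overlap and is less sensitive to class imbalance, while the cross-entropy term supplies per-pixel gradients, which is one reason the two are commonly combined.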

Comparative results on medical segmentation benchmarks (ISIC17/18, Synapse, AMOS2022, Kvasir-SEG, ClinicDB, ColonDB, ETIS, CVC-300, ACDC) show that hybrid architectures consistently outperform or match state-of-the-art CNN, Transformer, and prior SSM models, often with fewer parameters and lower inference latency. For example, VM-UNetV2 achieves 82.34% mIoU/90.31% DSC on ISIC17, and 84.15%/91.34% on Kvasir-SEG (Zhang et al., 2024), while MM-UNet attains 91.0% Dice on AMOS2022 CT (vs. 87.8% for nnUNet) (Xie et al., 21 Mar 2025). HyM-UNet improved ISIC2018 DSC to 88.97% with just 8.2M parameters (Chen et al., 22 Nov 2025).
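For reference, the two headline metrics can be computed from binary masks as in the sketch below, where the function name `iou_and_dice` and the toy masks are hypothetical. For a single mask pair the metrics are linked by $\mathrm{DSC} = 2\,\mathrm{IoU} / (1 + \mathrm{IoU})$.

```python
import numpy as np

def iou_and_dice(pred, target):
    """Intersection-over-union and Dice coefficient for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    iou = inter / union if union else 1.0      # two empty masks count as a match
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)

pred = np.zeros((4, 4), dtype=int)
target = np.zeros((4, 4), dtype=int)
pred[0:2, :] = 1    # predicted region: rows 0-1 (8 pixels)
target[1:3, :] = 1  # reference region: rows 1-2 (8 pixels)

iou, dice = iou_and_dice(pred, target)  # overlap is row 1 (4 pixels)
dice_from_iou = 2 * iou / (1 + iou)
```

Here the intersection is 4 pixels and the union 12, giving an IoU of 1/3 and a Dice score of 0.5. Because the identity holds per mask pair rather than for dataset averages, reported mIoU and DSC values need not satisfy it exactly, although the ISIC17 figures quoted above are consistent with it ($2 \cdot 0.8234 / 1.8234 \approx 0.903$).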

5. Variants and Extensions

Numerous derivatives of the hybrid CNN-Mamba UNet paradigm have appeared, reflecting diverse strategies:

Adapters for CNN/VMamba backbone plug-in: ACM-UNet employs lightweight adapters to resolve channel mismatches and combines ResNet with VMamba SSMs, further sharpening features using wavelet transforms in the decoder (Huang et al., 30 May 2025).
Multi-view and axial attention integration: HCMA-UNet introduces a MISM block splitting feature channels across orthogonal planes and fuses VSSB with axial self-attention (Li et al., 1 Jan 2025).
Prompt-guided and evidence-based dual-branch systems: PGM-UNet uses a prompt-guided residual Mamba module, dynamically adjusting SSM behavior based on input image prompts; MambaEviScrib fuses two full encoder–decoder paths (CNN and Mamba) mediated by evidence-guided consistency and scribble-supervision (Zhang et al., 25 Mar 2025, Han et al., 2024).
Multi-scale and cross-modality applications: MS-UMamba incorporates MCAT bottlenecks for convolutional branches, cross-attention in SSMs, and multi-scale feature fusion for challenging modalities like fetal ultrasound (Xu et al., 14 Jun 2025). CM-UNet adapts the framework for large remote sensing images, using channel- and spatial-attention-gated SSMs and MSAA modules (Liu et al., 2024).
Super-resolution and synthesis: SMamba-UNet adapts the hybrid approach for MR image super-resolution, incorporating improved ISS2D modules and internal self-prior inpainting for learning detailed texture (Ji et al., 2024).

6. Limitations, Open Problems, and Future Directions

Despite their strong reported performance, hybrid CNN-Mamba UNets present several technical challenges and open research opportunities:

SSM hyperparameters: The choice of state dimension, gating networks, and scan axis weighting requires empirical tuning; performance can degrade if the SSM capacity is suboptimal for a given dataset or spatial context (Chen et al., 22 Nov 2025).
Complexity of fusing multiple modalities: Multi-branch and prompt-guided architectures increase implementation and computational complexity, requiring careful balancing of local/global paths in both training and inference (Zhang et al., 25 Mar 2025).
Scalability and hardware deployment: While SSMs are efficient asymptotically, implementations of selective 2D scan, adapters, and attention modules must be optimized for target hardware to realize theoretical speedups (Liao et al., 2024, Chen et al., 22 Nov 2025).
Boundary and fine detail: Overuse of SSMs at shallow layers can blunt fine edges and texture; hybridization must preserve shallow CNN processing for high-frequency content (Zhang et al., 2024).
Generalization: Domain adaptation, self-configuring topologies, and low-data regimes require further study, particularly for cross-modality and weakly supervised cases (Ma et al., 2024, Han et al., 2024).

Future research will likely address multi-scale SSM integration, adaptive scan weighting, domain transfer, and quantized/lightweight models for resource-constrained settings (Chen et al., 22 Nov 2025, Zhang et al., 2024).

7. Summary Table of Representative Architectures

Below is a summary categorizing prominent hybrid CNN-Mamba UNet architectures:

Architecture	Hybridization Strategy	Key Modules	Main Benchmarks
VM-UNetV2	CNN + VSS block in encoder/decoder	VSS, SDI	ISIC, Kvasir, ClinicDB
MM-UNet	CNN layers + Mamba in residual	MetaSSM, BiScan	AMOS2022, Synapse
HyM-UNet	CNN (shallow) + VSS (deep)	MGF-Skip	ISIC2018
ACM-UNet	ResNet + VMamba via adapters	VSS, MSWT decoder	Synapse, ACDC
MS-UMamba	Split branch, MCAT, VSS, ADFF	MCAT, ADFF	Fetal US (private)
PGM-UNet	Dual-path LIEM+PGRM	Prompt-guided Mamba, MAFM	ISIC, DRIVE, DIAS
HCMA-UNet	ResBlock + MISM (VSSB+ASA)	ASC, FRLoss	DCE-MRI, public/private
CM-UNet	CNN encoder + gated Mamba decoder	CSMamba, MSAA	ISPRS, LoveDA