Hybrid Mamba-Transformer UNet

Updated 1 April 2026

Hybrid Mamba-Transformer UNet is a neural architecture that combines state space models, Transformer self-attention, and convolution within a U-Net design for robust visual analysis.
It employs parallel and sequential fusion mechanisms, enabling efficient long-range dependency modeling and fine spatial detail extraction in applications like medical segmentation and remote sensing.
Empirical results demonstrate substantial parameter and FLOPs reduction while achieving competitive or superior performance compared to pure CNN, Transformer, or SSM-only models.

Hybrid Mamba-Transformer UNet refers to a class of neural network architectures that integrate State Space Models (specifically, variants of the Mamba operator), Transformer-style self-attention, and classic U-Net encoder–decoder topologies for high-fidelity, efficient visual processing. These architectures leverage the strengths of each component: linear-complexity global modeling via Mamba, fine spatial selectivity via convolution, and local or windowed attention for nuanced dependencies. Hybrid Mamba-Transformer UNets have been advanced across diverse domains, including 2D and 3D medical segmentation, remote sensing, and trajectory prediction, often exceeding the performance and efficiency of pure CNN, Transformer, or SSM-only variants (Zhang et al., 2024, Jia et al., 28 Jul 2025, Chen et al., 22 Nov 2025, Wang et al., 24 Jul 2025, Li et al., 1 Jan 2025, Cao et al., 2024, Liu et al., 2024).

1. Architectural Foundations

Hybrid Mamba-Transformer UNets are generally based on U-shaped encoder–decoder networks with skip connections, where convolutional stages emphasize local feature extraction in shallow layers, and Mamba SSM and/or Transformer self-attention dominate deeper layers to capture long-range dependencies.

In a representative architecture such as HMT-UNet (Zhang et al., 2024):

Input: $H\times W\times 3$ image; a “stem” of two $3\times3$ conv layers (stride 2 each) produces feature maps of size $\frac{H}{4}\times\frac{W}{4}\times C$.
Encoder: four stages.
- Stages 1–2: pure CNN, each reducing spatial resolution by a factor of 2.
- Stages 3–4: stacked “MambaVision Mixer” (hybrid) blocks; if a stage has $N$ layers, the first $N/2$ apply Mamba SSMs and the remainder use windowed Multi-Head Self-Attention (MHSA), typically with $7\times7$ or $8\times8$ windows so that cost stays linear in the number of tokens.
Decoder: symmetric upsampling; deep decoder stages apply MambaVision then Transformer, shallow decoder stages are pure convolution.
Skip connections: element-wise addition of encoder and decoder features at corresponding stages.
Output: final up-projection to full spatial resolution.
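The stage layout listed above can be traced with a small shape calculator. This is only a sketch: the base width of 64 channels, the doubling of channels at each downsampling step, and the choice of which stages downsample (the hypothetical `downsample` argument) are assumptions for illustration, not values taken from the cited papers.

```python
def encoder_shapes(height, width, base_channels=64,
                   downsample=(True, True, True, False)):
    """Return (name, H, W, C) for the stem and each encoder stage.

    Assumptions (not from the source): channels double whenever the
    resolution halves, and `downsample` says which stages halve it.
    """
    # Stem: two stride-2 3x3 convolutions give an overall stride of 4.
    h, w, c = height // 4, width // 4, base_channels
    shapes = [("stem", h, w, c)]
    for index, halve in enumerate(downsample, start=1):
        # Stages 1-2 are convolutional, stages 3-4 are hybrid.
        kind = "cnn" if index <= 2 else "mamba+attention"
        if halve:
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((f"stage{index}:{kind}", h, w, c))
    return shapes


def split_hybrid_layers(num_layers):
    """First half of a hybrid stage uses Mamba, the remainder attention."""
    num_mamba = num_layers // 2
    return ["mamba"] * num_mamba + ["attention"] * (num_layers - num_mamba)


shapes = encoder_shapes(256, 256)
for name, h, w, c in shapes:
    print(f"{name:24s} {h:3d} x {w:3d} x {c}")
print(split_hybrid_layers(8))
```

A symmetric decoder would walk the same list in reverse, adding the encoder feature of matching size at each step.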

Related designs such as RadioMamba (Jia et al., 28 Jul 2025), HyM-UNet (Chen et al., 22 Nov 2025), HybridTM (Wang et al., 24 Jul 2025), and CM-UNet (Liu et al., 2024) apply these motifs to segmentation of radio maps, 3D data, and remote sensing images, with modifications in the fusion and attention strategies.

2. State Space Model (Mamba) Integration

The Mamba component is a continuous-discrete State Space Model (SSM), employed to model long-range, sequential dependencies efficiently. Its core can be formalized as:

Continuous SSM:

$\frac{d\mathbf{h}(t)}{dt} = A\,\mathbf{h}(t) + B\,\mathbf{x}(t),\quad \mathbf{y}(t) = C\,\mathbf{h}(t)$

with learned parameters $A \in \mathbb{R}^{N\times N}$, $B$, and $C$, where $N$ is the state dimension.

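Discretized via Zero-Order Hold with step size $\Delta$:

$\bar{A} = \exp(\Delta A),\quad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$

Updating as:

$\mathbf{h}_t = \bar{A}\,\mathbf{h}_{t-1} + \bar{B}\,\mathbf{x}_t,\quad \mathbf{y}_t = C\,\mathbf{h}_t$

In vision practice, state propagation is applied in raster-scan or multiple diagonal scans (e.g., four directions in HyM-UNet), enabling $O(L)$ complexity across $L$ tokens/pixels.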
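The zero-order-hold step and the resulting recurrence can be sketched in a few lines for a single input channel. The sketch assumes a diagonal $A$ with non-zero entries, so the matrix exponential and inverse reduce to per-state scalars, and it uses fixed parameters; Mamba itself makes the step size and the $B$, $C$ terms input-dependent, which is not modelled here. All function names are illustrative.

```python
import math


def zoh_discretize(a_diag, b, delta):
    """Zero-order-hold discretization for a diagonal A (entries must be non-zero)."""
    a_bar = [math.exp(delta * a) for a in a_diag]
    # (delta*a)^-1 * (exp(delta*a) - 1) * delta*b simplifies to (exp(delta*a) - 1) / a * b
    b_bar = [(ab - 1.0) / a * bi for ab, a, bi in zip(a_bar, a_diag, b)]
    return a_bar, b_bar


def ssm_scan(xs, a_bar, b_bar, c):
    """Run h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t over a 1-D sequence."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:  # one constant-cost update per token: linear in sequence length
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


a_diag = [-1.0, -0.5]  # stable (negative) continuous-time dynamics
b = [1.0, 1.0]
c = [1.0, 1.0]
a_bar, b_bar = zoh_discretize(a_diag, b, delta=0.1)
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0], a_bar, b_bar, c)
step_response = ssm_scan([1.0] * 2000, a_bar, b_bar, c)
print(impulse_response)
print(step_response[-1])
```

For a constant input the discrete state settles at the same value as the continuous system, $-b/a$ per state, which is a convenient sanity check on the discretization.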

Architectures such as RadioMamba (Jia et al., 28 Jul 2025) adapt Mamba to 2D by flattening spatial features and employing bidirectional scans with parallel convolution. Other variants (e.g., HybridTM (Wang et al., 24 Jul 2025)) employ large-window BiMamba blocks in the hybrid fusion.
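Flattening a 2D feature map and scanning it in several directions can be illustrated with index permutations. The sketch below uses forward and backward row-major and column-major orders as a stand-in; the exact scan paths (for example the diagonal scans mentioned above) and the way directional outputs are merged differ between papers, so the averaging step and all names here are assumptions.

```python
def scan_orders(height, width):
    """Return four traversal orders over a grid as lists of flat (row-major) indices."""
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


def merge_scans(outputs_by_order, orders):
    """Scatter each direction's outputs back to pixel positions and average them."""
    num_tokens = len(next(iter(orders.values())))
    merged = [0.0] * num_tokens
    for name, order in orders.items():
        for position, flat_index in enumerate(order):
            merged[flat_index] += outputs_by_order[name][position] / len(orders)
    return merged


def cumulative_sum(sequence):
    """Toy causal operator standing in for an SSM scan."""
    total, out = 0.0, []
    for value in sequence:
        total += value
        out.append(total)
    return out


pixels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # a 2 x 3 grid in row-major order
orders = scan_orders(2, 3)
outputs = {
    name: cumulative_sum([pixels[i] for i in order])
    for name, order in orders.items()
}
merged = merge_scans(outputs, orders)
print(orders["col_forward"])
print(merged)
```

Because each direction is causal along its own path, combining opposite directions lets every pixel receive context from the whole grid while each scan stays linear in the number of tokens.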

3. Hybrid Fusion Mechanisms

The interaction between SSM, convolution, and self-attention is typically realized by:

Parallel branches: In HMT-UNet’s “MambaVision Mixer,” input features are split into two branches:
- SSM branch: Conv1D (via Linear projection to $C/2$), SiLU, Mamba scan;
- Convolution branch: Conv1D (Linear to $C/2$), SiLU;
- Outputs are concatenated and projected back to $C$ channels.
Sequential fusion: HybridTM (Wang et al., 24 Jul 2025) implements an "Inner-Layer Hybrid Strategy," partitioning features into small groups for windowed attention and large groups for (Bi)Mamba, interleaved and fused with FFN.
Attention gating: CM-UNet (Liu et al., 2024) deploys a CSMamba block, in which Mamba outputs are modulated by both channel and spatial gates derived from convolutional attention.
MGF-skip connections: HyM-UNet (Chen et al., 22 Nov 2025) uses decoder features as gating signals over encoder features to suppress background noise before fusion.
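The parallel-branch pattern in the first item can be sketched in plain Python to make the data flow and channel counts explicit. This is a simplified illustration, not the published block: the scan is a fixed-decay recurrence rather than a selective SSM, the 1D convolution shares one same-padded kernel across channels, and weights are random. Every function name is hypothetical.

```python
import math
import random


def silu(v):
    return v / (1.0 + math.exp(-v))


def apply_silu(tokens):
    return [[silu(v) for v in tok] for tok in tokens]


def linear(tokens, weight):
    """Apply weight (out_dim x in_dim) to every token vector."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


def depthwise_conv1d(tokens, kernel):
    """Same-padded convolution along the token axis, one kernel shared by all channels."""
    half = len(kernel) // 2
    length, dim = len(tokens), len(tokens[0])
    out = []
    for t in range(length):
        vec = [0.0] * dim
        for k, weight in enumerate(kernel):
            s = t + k - half
            if 0 <= s < length:
                for d in range(dim):
                    vec[d] += weight * tokens[s][d]
        out.append(vec)
    return out


def toy_scan(tokens, decay=0.9):
    """Stand-in for the Mamba scan: h_t = decay * h_{t-1} + x_t per channel."""
    state = [0.0] * len(tokens[0])
    out = []
    for tok in tokens:
        state = [decay * s + x for s, x in zip(state, tok)]
        out.append(list(state))
    return out


def mixer(tokens, w_ssm, w_conv, w_out, kernel=(0.25, 0.5, 0.25)):
    # SSM branch: Linear(C -> C/2), conv, SiLU, scan.
    ssm_branch = toy_scan(apply_silu(depthwise_conv1d(linear(tokens, w_ssm), kernel)))
    # Convolution branch: Linear(C -> C/2), conv, SiLU.
    conv_branch = apply_silu(depthwise_conv1d(linear(tokens, w_conv), kernel))
    # Concatenate along channels (C/2 + C/2 = C), then project back to C.
    fused = [a + b for a, b in zip(ssm_branch, conv_branch)]
    return linear(fused, w_out)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
channels, length = 8, 5
tokens = rand_matrix(length, channels)
out = mixer(tokens,
            rand_matrix(channels // 2, channels),
            rand_matrix(channels // 2, channels),
            rand_matrix(channels, channels))
print(len(out), len(out[0]))
```

Halving the channel width in each branch keeps the concatenated feature at the original width, so the block can replace a standard layer without changing tensor shapes.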

Self-attention modules generally use windowed MHSA for tractable complexity, following Swin-style local attention. In HybridTM, attention is computed on small spatial groups of voxels/tokens, while Mamba is computed over larger groups, supporting spatial scalability.
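Windowed attention starts by grouping tokens into non-overlapping windows so that attention is only computed inside each group. The sketch below shows that partitioning step on flat token indices; it assumes a 2D grid whose sides are divisible by the window size, and it leaves out window shifting and the attention computation itself.

```python
def window_partition(height, width, window):
    """Group flat token indices of a height x width grid into square windows."""
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                (top + dr) * width + (left + dc)
                for dr in range(window)
                for dc in range(window)
            ])
    return windows


windows = window_partition(14, 14, 7)
print(len(windows), len(windows[0]))
```

Each window is then treated as an independent short sequence, which is what keeps the attention cost proportional to the number of windows rather than to the square of the total token count.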

4. Computational Complexity and Parameter Efficiency

A critical advantage of Hybrid Mamba-Transformer UNet designs lies in scalability:

Mamba/SSM modules provide $O(L)$ cost per layer in the number of tokens $L$, contrasting with the quadratic $O(L^2)$ cost of full attention. Local window attention, with MHSA restricted to fixed-size windows, further preserves linear scaling.
By confining attention and SSM to specific stages or merging branches, these models can achieve parameter and FLOPs reductions of 80–90% versus transformer-only UNets. Reported comparisons against heavier baselines point the same way: 8.6 M params and 28 ms inference in RadioMamba vs. 297.7 M and 553 ms for a diffusion-based baseline (Jia et al., 28 Jul 2025), roughly 97% fewer parameters and 95% lower latency; and 47.9 M vs. 255.1 M params in the Mamba Policy comparison (Cao et al., 2024), roughly 81% fewer.
In CM-UNet (Liu et al., 2024), channel/spatial gating and attention augmentation keep memory and runtime competitive (6.01 G FLOPs and 12.9 M params at the reported input resolution).
For 3D processing, HybridTM maintains per-layer time complexity $O(Lw)$ vs. standard attention’s $O(L^2)$, where $w$ (attention window size) $\ll L$ (tokens/voxels).
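The scaling argument and the quoted size comparisons can be checked with simple arithmetic. The sketch below counts token-pair interactions rather than real FLOPs, ignores channel width and constant factors, and assumes 49-token windows and four scan directions purely for illustration.

```python
def full_attention_pairs(num_tokens):
    """Every token attends to every token: quadratic in sequence length."""
    return num_tokens ** 2


def windowed_attention_pairs(num_tokens, window_tokens):
    """Each token attends only to the tokens in its own window."""
    return num_tokens * window_tokens


def scan_steps(num_tokens, num_directions=1):
    """One recurrent update per token and scan direction."""
    return num_tokens * num_directions


def reduction_percent(small, large):
    return 100.0 * (1.0 - small / large)


for side in (56, 112, 224):
    num_tokens = side * side
    print(side,
          full_attention_pairs(num_tokens),
          windowed_attention_pairs(num_tokens, 49),
          scan_steps(num_tokens, 4))

# Figures quoted in the text above.
print(round(reduction_percent(8.6, 297.7), 1))   # RadioMamba parameters -> 97.1
print(round(reduction_percent(28, 553), 1))      # RadioMamba latency    -> 94.9
print(round(reduction_percent(47.9, 255.1), 1))  # Mamba Policy params   -> 81.2
```

Doubling the side length multiplies the full-attention count by 16 but the windowed and scan counts only by 4, which is the practical meaning of linear scaling in the number of tokens.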

5. Empirical Performance Across Applications

Hybrid Mamba-Transformer UNets achieve strong or state-of-the-art results in diverse domains:

Model	Task	Key Datasets	mIoU (%) unless noted	Dice / Other	Params / Latency or Memory
HMT-UNet (Zhang et al., 2024)	Medical Segmentation	ISIC17/18, Kvasir, CVC, ETIS	60.44–90.96	DSC: 72.86–95.26%	–
HyM-UNet (Chen et al., 22 Nov 2025)	Med. Seg.	ISIC18	81.82	Dice: 88.97%	Par.-efficient
RadioMamba (Jia et al., 28 Jul 2025)	Radio Map	RadioMapSeer (SRM/DRM)	NMSE: 0.0050	SSIM: 0.9673	8.6 M/28 ms
HybridTM (Wang et al., 24 Jul 2025)	3D Seg.	ScanNet/200, nuScenes, S3DIS	72–80.9	–	SOTA, memory O(NL)
CM-UNet (Liu et al., 2024)	Remote Sensing	Potsdam, Vaihingen, LoveDA	85.48–87.21	mF1 93%	12.9 M/366 MB

Across studies, hybrid models match or outperform prior CNN, Transformer, or pure Mamba baselines, particularly in metrics such as mean Intersection over Union (mIoU), Dice coefficient, and mF1, with marked improvements in runtime and parameter counts.
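For readers less familiar with the overlap metrics named above, the following sketch computes Dice and IoU for a single pair of binary masks. Treating two empty masks as perfect agreement is a convention chosen here, not something specified by the cited papers; mIoU is the mean of the per-class IoU values.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_area, target_area = sum(pred), sum(target)
    union = pred_area + target_area - intersection
    if union == 0:  # both masks empty: treated here as perfect agreement
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

On a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU); averages over many images or classes need not satisfy that identity, which is why papers usually report both.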

Ablation studies consistently demonstrate that:

Removing the Mamba/global branch degrades global structure and overall accuracy (e.g., NMSE increases or mIoU drops).
Self-attention or convolution-only variants underperform in long-range interaction, boundary delineation, or local detail (Wang et al., 24 Jul 2025, Jia et al., 28 Jul 2025, Chen et al., 22 Nov 2025).

6. Variations, Limitations, and Future Directions

Several variants tailor the hybrid principle:

HCMA-UNet (Li et al., 1 Jan 2025) introduces a Multi-view Inter-Slice Self-Attention Mamba (MISM) module for efficient tri-directional feature capture in 3D medical data with explicit Asymmetric Split-Channel strategies for anatomical priors and a custom Feature-guided Region-aware Loss (FRLoss).
HyM-UNet (Chen et al., 22 Nov 2025) proposes Mamba-Guided Fusion (MGF) skip connections.
CM-UNet (Liu et al., 2024) and RadioMamba (Jia et al., 28 Jul 2025) use SSM–conv branching or gating in segmentation decoders for remote sensing and radio mapping.

Identified limitations include:

Pure Mamba-only layers underperform for fine spatial/detail tasks.
The multiple 2D/3D scans in SSM blocks add computational cost, although the total remains linear in the number of tokens and well below that of quadratic attention.
Most architectures fix by hand the stage at which the network switches from CNN to SSM/attention blocks; adaptive schemes could plausibly improve efficiency further.

Potential future directions, as highlighted in the primary sources:

Learning or adapting hybrid boundaries/stage thresholds dynamically (Chen et al., 22 Nov 2025).
Extending these approaches to volumetric/temporal data or to resource-constrained platforms (Li et al., 1 Jan 2025, Cao et al., 2024).
Advanced gating and fusion, e.g., joint channel-spatial gating, topology-aware skip connections, or 3D SSM generalizations.

In summary, Hybrid Mamba-Transformer UNets provide a scalable paradigm for integrating local feature extraction, global context aggregation, and efficient modeling within U-Net-like backbones, establishing performance benchmarks and resource-efficient baselines across complex visual and spatiotemporal recognition tasks (Zhang et al., 2024, Jia et al., 28 Jul 2025, Chen et al., 22 Nov 2025, Wang et al., 24 Jul 2025, Li et al., 1 Jan 2025, Cao et al., 2024, Liu et al., 2024).