HyM-UNet: Hybrid U-Net Architectures
- HyM-UNet denotes a family of hybrid U-Net frameworks, spanning Hidden Markov Random Field, CNN–Mamba, and Mamba–Transformer variants, that deliver robust segmentation performance using hybrid loss functions.
- It employs advanced architectural components like residual convolution blocks and state-space model modules to capture both local textures and long-range semantic dependencies.
- Experimental results demonstrate that HyM-UNet variants consistently outperform standard U-Net models in metrics such as IoU, Dice coefficient, and boundary precision on diverse datasets.
HyM-UNet refers to several recent architectures unifying advanced neural and probabilistic techniques in image segmentation tasks. It primarily covers (1) Hidden Markov Random Field U-Net for unsupervised micro-CT segmentation (Grolig et al., 14 Nov 2025), (2) Hybrid CNN-Mamba UNet for medical imaging (Chen et al., 22 Nov 2025), and (3) Hybrid Mamba-Transformer UNet extensions (Zhang et al., 21 Aug 2024). These models target enhanced segmentation accuracy and efficiency across both supervised and unsupervised regimes, exploiting hybrid losses and architectural components for optimal performance.
1. Architectural Principles
Hidden Markov Random Field U-Net (HMRF-UNet)
In HMRF-UNet, the core design consists of a classic U-Net encoder–decoder with three levels (feature channel progression: 64–128–256), accepting normalized 2D grayscale μCT slices. The output at each voxel is a soft confidence vector of per-class probabilities, interpreted as fuzzy labels for each class (e.g., PU matrix or pore) (Grolig et al., 14 Nov 2025). These confidence maps parameterize Gaussian mixtures, serving as soft assignments in an HMRF generative framework, and are optimized via a hybrid loss without ground-truth labels.
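For concreteness, the following is a minimal sketch of such a three-level encoder–decoder with the 64–128–256 channel progression and a softmax head emitting per-voxel soft confidences; the block composition and layer choices here are illustrative assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Conv-BN-ReLU pair, as in a classic U-Net stage."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU())

class HMRFUNet(nn.Module):
    """Three-level U-Net (64-128-256) producing fuzzy per-class labels."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1, self.enc2 = conv_block(1, 64), conv_block(64, 128)
        self.bottleneck = conv_block(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                     # full resolution, 64 ch
        e2 = self.enc2(self.pool(e1))         # 1/2 resolution, 128 ch
        b = self.bottleneck(self.pool(e2))    # 1/4 resolution, 256 ch
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        # Softmax yields soft confidences u_k(x) in [0, 1] per class.
        return torch.softmax(self.head(d1), dim=1)
```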
Hybrid CNN–Mamba UNet
HyM-UNet architectures designed for medical segmentation use hierarchical encoders: shallow stages employ Residual Convolution Blocks for local texture, while deep stages deploy Visual Mamba (State Space Model, SSM) blocks to model global semantic dependencies at linear complexity. Critical to this design is the Mamba-Guided Fusion Skip Connection (MGF-Skip), which dynamically gates encoder features with decoder semantics to suppress background noise at ambiguous boundaries (Chen et al., 22 Nov 2025). The decoder mirrors the encoder with upsampling stages and gated skip fusion at each level.
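A minimal sketch of how such a gated skip might look, assuming a 1×1-conv sigmoid gate computed from decoder features; the exact MGF-Skip formulation in the paper may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGFSkip(nn.Module):
    """Decoder semantics gate encoder features before fusion (sketch)."""
    def __init__(self, c_enc, c_dec):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(c_dec, c_enc, 1), nn.Sigmoid())

    def forward(self, f_enc, f_dec):
        # Upsample decoder features to the encoder's spatial resolution.
        f_dec = F.interpolate(f_dec, size=f_enc.shape[-2:],
                              mode='bilinear', align_corners=False)
        g = self.gate(f_dec)       # per-pixel, per-channel gate in [0, 1]
        f_enc = g * f_enc          # suppress background responses
        return torch.cat([f_enc, f_dec], dim=1)
```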
Hybrid Mamba–Transformer UNet
The HMT-UNet variant introduces sequential stacking of MambaVision Mixer blocks (SSM-based) followed by Transformer self-attention layers in deep encoder and decoder stages. Skip connections use direct addition for encoder–decoder fusion (Zhang et al., 21 Aug 2024), targeting simultaneous local and long-range feature mixing at lower computational cost than full Transformer models.
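A hedged sketch of one deep stage, reusing the VSSBlock sketch from Section 6 below; the depths, head count, and mixer implementation are assumptions:

```python
import torch.nn as nn

class HMTStage(nn.Module):
    """SSM-style mixer blocks followed by Transformer self-attention (sketch).
    Assumes the VSSBlock class defined in the implementation section."""
    def __init__(self, dim, n_mixer=2, n_attn=2, heads=8):
        super().__init__()
        self.mixers = nn.ModuleList([VSSBlock(dim, dim) for _ in range(n_mixer)])
        self.attn = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(n_attn)])

    def forward(self, x):
        for m in self.mixers:
            x = m(x)                            # global mixing, linear cost
        B, C, H, W = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (B, HW, C)
        for a in self.attn:
            seq = a(seq)                        # full self-attention refinement
        return seq.transpose(1, 2).view(B, C, H, W)
```

In the full network, encoder and decoder stage outputs would then be fused by direct addition, per the paper's skip design.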
2. Mathematical Loss Formulations
HMRF-UNet Loss
The unsupervised objective for HMRF-UNet is

$$\mathcal{L}_{\text{HMRF}} = \mathcal{L}_{\text{data}} + \beta\,\mathcal{L}_{\text{nb}},$$

where $\mathcal{L}_{\text{data}}$ is the data fidelity (negative log-likelihood under fuzzy Gaussian mixtures) and $\mathcal{L}_{\text{nb}}$ denotes the chosen neighborhood term, which enforces spatial smoothness via fuzzy neighborhood penalties (either Potts-type, i.e., squared difference of soft labels, or Banerjee-type, i.e., variance-weighted mean differences) (Grolig et al., 14 Nov 2025).
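A minimal PyTorch sketch of this objective, assuming given per-class Gaussian parameters (in the full framework they are coupled to the soft assignments) and a 4-connected Potts neighborhood; the value of beta is illustrative:

```python
import torch

def hmrf_loss(u, x, mu, sigma, beta=0.1):
    """Sketch of data fidelity + Potts smoothness.

    u:  (B, K, H, W) soft class confidences from the U-Net
    x:  (B, 1, H, W) normalized intensities
    mu, sigma: per-class Gaussian parameters, shape (K,)
    """
    mu = mu.view(1, -1, 1, 1)
    sigma = sigma.view(1, -1, 1, 1)
    # Negative log-likelihood of each class's Gaussian, weighted by u.
    nll = 0.5 * ((x - mu) / sigma) ** 2 + torch.log(sigma)
    data_term = (u * nll).sum(1).mean()
    # Potts-type penalty: squared difference of soft labels between
    # horizontal and vertical neighbors.
    dh = ((u[..., :, 1:] - u[..., :, :-1]) ** 2).mean()
    dv = ((u[..., 1:, :] - u[..., :-1, :]) ** 2).mean()
    return data_term + beta * (dh + dv)
```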
Hybrid CNN–Mamba UNet Loss
Medical HyM-UNet models use a compound objective of the form

$$\mathcal{L} = \lambda_1\,\mathcal{L}_{\text{BCE}} + \lambda_2\,\mathcal{L}_{\text{Dice}} + \lambda_3\,\mathcal{L}_{\text{boundary}},$$

where $\mathcal{L}_{\text{boundary}}$ is the BCE loss restricted to mask boundaries (dilated with a 3×3 kernel) and the $\lambda_i$ are weighting coefficients (Chen et al., 22 Nov 2025).
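As a sketch, the boundary restriction can be implemented with max-pooling-based morphology; the dilation kernel size follows the paper, while the dilate/erode construction of the edge band is an assumption:

```python
import torch
import torch.nn.functional as F

def boundary_bce(logits, target, kernel=3):
    """BCE evaluated only in a band around ground-truth mask boundaries.
    target: float binary mask of shape (B, 1, H, W)."""
    pad = kernel // 2
    dilated = F.max_pool2d(target, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-target, kernel, stride=1, padding=pad)
    band = (dilated - eroded).clamp(0, 1)   # 1 near boundaries, 0 elsewhere
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction='none')
    return (bce * band).sum() / (band.sum() + 1e-6)
```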
Hybrid Mamba–Transformer UNet Loss
HMT-UNet employs

$$\mathcal{L}_{\text{HMT}} = \tfrac{1}{2}\,\mathcal{L}_{\text{BCE}} + \tfrac{1}{2}\,\mathcal{L}_{\text{Dice}},$$

with equal weights, targeting balanced calibration between pixel-wise classification (BCE) and set-wise overlap (Dice).
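An equal-weight BCE + Dice term is straightforward to sketch:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, eps=1e-6):
    """Equal-weight BCE + Dice for binary masks of shape (B, 1, H, W)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(-2, -1))
    dice = 1 - (2 * inter + eps) / (p.sum(dim=(-2, -1))
                                    + target.sum(dim=(-2, -1)) + eps)
    return 0.5 * bce + 0.5 * dice.mean()
```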
3. Training and Optimization Strategies
HMRF-UNet
Unsupervised training involves random initialization, the Adam optimizer, batch size 128, and up to 200 epochs on synthetic data. A Bayesian hyperparameter search finds the optimal U-Net depth, channel configuration, and neighborhood weighting. For supervised fine-tuning, pre-trained HMRF-UNet weights (Potts penalty) are transferred, then optimized on labeled subsets with a Dice loss. Pre-training drastically reduces the amount of ground-truth data needed for convergence: with 5 GT slices, DSC approaches 0.98 versus 0.85 from scratch (Grolig et al., 14 Nov 2025).
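The transfer step might look as follows; the checkpoint name, epoch budget, and data loader are illustrative assumptions, and HMRFUNet refers to the sketch in Section 1:

```python
import torch

model = HMRFUNet(num_classes=2)
model.load_state_dict(torch.load('hmrf_pretrained.pt'))  # hypothetical checkpoint
opt = torch.optim.Adam(model.parameters())

for epoch in range(50):                  # illustrative fine-tuning budget
    for x, y in labeled_loader:          # yields (image, float binary mask);
        u = model(x)                     # e.g., as few as 5 annotated slices
        inter = (u[:, 1] * y).sum()      # soft Dice on the foreground class
        dice = 1 - 2 * inter / (u[:, 1].sum() + y.sum() + 1e-6)
        opt.zero_grad()
        dice.backward()
        opt.step()
```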
Hybrid CNN–Mamba UNet
For segmentation on ISIC 2018, images are resized to a fixed resolution, normalized, and augmented (random flips, rotations). The AdamW optimizer with cosine annealing is applied over 200 epochs at batch size 24. The encoder combines local and global modules, with skip connections fused via learned gating (Chen et al., 22 Nov 2025).
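A minimal sketch of this schedule; the learning rate and loss function are assumptions, and model and train_loader come from the surrounding pipeline:

```python
import torch

opt = torch.optim.AdamW(model.parameters(), lr=1e-3)  # lr is an assumption
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)

for epoch in range(200):
    for x, y in train_loader:       # resized, normalized, flip/rotate-augmented
        loss = criterion(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()                    # cosine-annealed learning rate per epoch
```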
Hybrid Mamba–Transformer UNet
HMT-UNet leverages AdamW with cosine scheduling, batch size 80, and heavy spatial augmentations. Training proceeds for 200 epochs on multiple medical segmentation datasets (Zhang et al., 21 Aug 2024).
4. Experimental Performance
Quantitative Results
Empirical findings show HyM-UNet variants outperform established U-Net models. Representative results on ISIC 2018 (Chen et al., 22 Nov 2025):

| Method | IoU (%) | Dice (%) | HD95 ↓ | PRE (%) |
|---|---|---|---|---|
| U-Net | 79.32 | 87.03 | 4.74 | 88.98 |
| CE-Net | 80.29 | 87.66 | 4.53 | 89.50 |
| Attention U-Net | 79.64 | 87.32 | 4.56 | 90.68 |
| HyM-UNet | 81.82 | 88.97 | 4.03 | 90.91 |
On artificial micro-CT data, unsupervised HMRF-UNet with the Potts neighborhood yields DSC values closely matching supervised performance. Pre-training further elevates supervised segmentation quality at minimal GT annotation cost.
Generalization and Sample Efficiency
On ISIC-17 and Kvasir-SEG, HyM-UNet and HMT-UNet achieve Dice coefficients of up to 90.74% and 92.28%, respectively, exceeding pure CNN/Transformer baselines (Zhang et al., 21 Aug 2024). The pre-training paradigm significantly improves sample efficiency in supervised fine-tuning (DSC gains of up to +0.13 with only 5 labeled slices).
5. Insights, Limitations, and Extensions
Neighborhood Term Contributions
The Potts-type penalty (squared fuzzy-label difference) outperforms the Banerjee-type for binary segmentation. Custom weights can make Banerjee penalties competitive, though excess weighting can harm Potts performance. Low-contrast environments reduce the effectiveness of the Banerjee loss (Grolig et al., 14 Nov 2025).
Limitations
HMRF-UNet is currently limited to binary segmentation (two classes); extension to multi-class tasks remains open. Thin-wall segmentation remains challenging for the unsupervised loss, especially on real μCT data, due to contrast and border artifacts. The model does not exploit additional texture or multi-spectral input, relying on intensity alone.
For the Hybrid CNN–Mamba and Hybrid Mamba–Transformer UNets, the complexity of the skip-connection design and hybridization parameters demands careful tuning per dataset. Transformer elements (HMT-UNet) increase memory requirements and do not always provide a clear improvement over SSM-only hybridization.
Extensions
Potential directions include:
- Incorporation of 3D convolutions and neighborhoods (26-connectivity) for volume data.
- Hybrid losses integrating contrastive or autoencoder terms.
- Semi-supervised multi-task setups with mixed unsupervised and supervised objectives.
- Adaptive class-number selection via Dirichlet/Bayesian priors.
- Extending skip fusion strategies to learn more complex feature interactions in multi-modal domains.
6. Implementation and Practical Guidance
PyTorch code examples detail core architectural blocks:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualConvBlock(nn.Module):
    """Two 3x3 conv + BN layers with a residual shortcut for local texture."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))

    def forward(self, x):
        # Residual addition followed by ReLU preserves fine detail.
        return F.relu(self.conv(x) + x)
```
```python
class VSSBlock(nn.Module):
    """Simplified Visual State-Space block. The true SS2D selective scan is
    omitted; linear-attention-style global mixing stands in for illustration."""
    def __init__(self, c, d):
        super().__init__()
        self.proj1 = nn.Linear(c, d)
        self.proj2 = nn.Linear(c, d)
        self.conv1d = nn.Conv1d(d, d, 1)
        self.to_out = nn.Linear(d, c)

    def forward(self, x):
        B, C, H, W = x.shape
        x_flat = x.flatten(2).transpose(1, 2)        # (B, HW, C)
        q = F.silu(self.proj1(x_flat))               # (B, HW, d)
        k = self.proj2(x_flat)                       # (B, HW, d)
        v = self.conv1d(q.transpose(1, 2))           # (B, d, HW)
        # Linear-attention surrogate for the SS2D scan: global mixing
        # at O(HW * d^2) cost instead of quadratic attention.
        kv = torch.exp(k).transpose(-2, -1) @ v.transpose(1, 2)  # (B, d, d)
        attn = torch.exp(q) @ kv                                 # (B, HW, d)
        attn = attn / (attn.sum(-1, keepdim=True) + 1e-6)
        out = self.to_out(attn)                      # (B, HW, C)
        out = out.transpose(1, 2).view(B, C, H, W)
        return out + x                               # residual connection
```
For resource-limited scenarios, reduce channel dimensions or eliminate one visual Mamba stage. Adjust patch embedding and convolution kernel sizes for volumetric segmentation.
7. Context and Impact in Segmentation Research
HyM-UNet establishes an effective paradigm for robust unsupervised and hybrid supervised medical image segmentation. By bridging locality (CNNs), global semantics (SSMs/Transformers), and spatial regularity (HMRF principles), these models reconcile annotation scarcity with segmentation fidelity across diverse materials and medical contexts.
Reference designs and results from Grolig et al. (Grolig et al., 14 Nov 2025), Chen et al. (Chen et al., 22 Nov 2025), and Zhang et al. (Zhang et al., 21 Aug 2024) indicate the increasing utility of hybrid U-Net frameworks employing SSMs, neural attention, and probabilistic graphical models for advanced segmentation tasks.