HyM-UNet: Hybrid U-Net Architectures
- HyM-UNet denotes a family of hybrid U-Net frameworks, spanning Hidden Markov Random Field, CNN–Mamba, and Mamba–Transformer variants, that deliver robust segmentation performance using hybrid loss functions.
- It employs advanced architectural components like residual convolution blocks and state-space model modules to capture both local textures and long-range semantic dependencies.
- Experimental results demonstrate that HyM-UNet variants consistently outperform standard U-Net models in metrics such as IoU, Dice coefficient, and boundary precision on diverse datasets.
HyM-UNet refers to several recent architectures unifying advanced neural and probabilistic techniques in image segmentation tasks. It primarily covers (1) Hidden Markov Random Field U-Net for unsupervised micro-CT segmentation (Grolig et al., 14 Nov 2025), (2) Hybrid CNN-Mamba UNet for medical imaging (Chen et al., 22 Nov 2025), and (3) Hybrid Mamba-Transformer UNet extensions (Zhang et al., 21 Aug 2024). These models target enhanced segmentation accuracy and efficiency across both supervised and unsupervised regimes, exploiting hybrid losses and architectural components for optimal performance.
1. Architectural Principles
Hidden Markov Random Field U-Net (HMRF-UNet)
In HMRF-UNet, the core design consists of a classic U-Net encoder–decoder with three levels (feature channel progression: 64–128–256), accepting normalized 2D grayscale μCT slices. The output at each voxel is a soft confidence vector of per-class probabilities, interpreted as fuzzy labels for each class (e.g., PU matrix or pore) (Grolig et al., 14 Nov 2025). These confidence maps parameterize Gaussian mixtures, serving as soft assignments in an HMRF generative framework, and are optimized via a hybrid loss without ground-truth labels.
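For concreteness, the following is a minimal sketch of such a three-level encoder–decoder with the 64–128–256 channel progression and a softmax head emitting per-voxel soft confidences; the block composition and layer choices here are illustrative assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Conv-BN-ReLU pair, as in a classic U-Net stage."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU())

class HMRFUNet(nn.Module):
    """Three-level U-Net (64-128-256) producing fuzzy per-class labels."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1, self.enc2 = conv_block(1, 64), conv_block(64, 128)
        self.bottleneck = conv_block(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                     # full resolution, 64 ch
        e2 = self.enc2(self.pool(e1))         # 1/2 resolution, 128 ch
        b = self.bottleneck(self.pool(e2))    # 1/4 resolution, 256 ch
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        # Softmax yields soft confidences u_k(x) in [0, 1] per class.
        return torch.softmax(self.head(d1), dim=1)
```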
Hybrid CNN–Mamba UNet
HyM-UNet architectures designed for medical segmentation use hierarchical encoders: shallow stages employ Residual Convolution Blocks for local texture, while deep stages deploy Visual Mamba (State Space Model, SSM) blocks to model global semantic dependencies at linear complexity. Critical to this design is the Mamba-Guided Fusion Skip Connection (MGF-Skip), which dynamically gates encoder features with decoder semantics to suppress background noise at ambiguous boundaries (Chen et al., 22 Nov 2025). The decoder mirrors the encoder with upsampling stages and gated skip fusion at each level.
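A minimal sketch of how such a gated skip might look, assuming a 1×1-conv sigmoid gate computed from decoder features; the exact MGF-Skip formulation in the paper may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGFSkip(nn.Module):
    """Decoder semantics gate encoder features before fusion (sketch)."""
    def __init__(self, c_enc, c_dec):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(c_dec, c_enc, 1), nn.Sigmoid())

    def forward(self, f_enc, f_dec):
        # Upsample decoder features to the encoder's spatial resolution.
        f_dec = F.interpolate(f_dec, size=f_enc.shape[-2:],
                              mode='bilinear', align_corners=False)
        g = self.gate(f_dec)       # per-pixel, per-channel gate in [0, 1]
        f_enc = g * f_enc          # suppress background responses
        return torch.cat([f_enc, f_dec], dim=1)
```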
Hybrid Mamba–Transformer UNet
The HMT-UNet variant introduces sequential stacking of MambaVision Mixer blocks (SSM-based) followed by Transformer self-attention layers in deep encoder and decoder stages. Skip connections use direct addition for encoder–decoder fusion (Zhang et al., 21 Aug 2024), targeting simultaneous local and long-range feature mixing at lower computational cost than full Transformer models.
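A hedged sketch of one deep stage, reusing the VSSBlock sketch from Section 6 below; the depths, head count, and mixer implementation are assumptions:

```python
import torch.nn as nn

class HMTStage(nn.Module):
    """SSM-style mixer blocks followed by Transformer self-attention (sketch).
    Assumes the VSSBlock class defined in the implementation section."""
    def __init__(self, dim, n_mixer=2, n_attn=2, heads=8):
        super().__init__()
        self.mixers = nn.ModuleList([VSSBlock(dim, dim) for _ in range(n_mixer)])
        self.attn = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(n_attn)])

    def forward(self, x):
        for m in self.mixers:
            x = m(x)                            # global mixing, linear cost
        B, C, H, W = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (B, HW, C)
        for a in self.attn:
            seq = a(seq)                        # full self-attention refinement
        return seq.transpose(1, 2).view(B, C, H, W)
```

In the full network, encoder and decoder stage outputs would then be fused by direct addition, per the paper's skip design.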
2. Mathematical Loss Formulations
HMRF-UNet Loss
The unsupervised objective for HMRF-UNet is

$$\mathcal{L}_{\text{HMRF}} = \mathcal{L}_{\text{data}} + \beta\,\mathcal{L}_{\text{nb}},$$

where $\mathcal{L}_{\text{data}}$ is the data fidelity (negative log-likelihood under fuzzy Gaussian mixtures) and $\mathcal{L}_{\text{nb}}$ denotes the chosen neighborhood term, which enforces spatial smoothness via fuzzy neighborhood penalties (either Potts-type, i.e., squared difference of soft labels, or Banerjee-type, i.e., variance-weighted mean differences) (Grolig et al., 14 Nov 2025).
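A minimal PyTorch sketch of this objective, assuming given per-class Gaussian parameters (in the full framework they are coupled to the soft assignments) and a 4-connected Potts neighborhood; the value of beta is illustrative:

```python
import torch

def hmrf_loss(u, x, mu, sigma, beta=0.1):
    """Sketch of data fidelity + Potts smoothness.

    u:  (B, K, H, W) soft class confidences from the U-Net
    x:  (B, 1, H, W) normalized intensities
    mu, sigma: per-class Gaussian parameters, shape (K,)
    """
    mu = mu.view(1, -1, 1, 1)
    sigma = sigma.view(1, -1, 1, 1)
    # Negative log-likelihood of each class's Gaussian, weighted by u.
    nll = 0.5 * ((x - mu) / sigma) ** 2 + torch.log(sigma)
    data_term = (u * nll).sum(1).mean()
    # Potts-type penalty: squared difference of soft labels between
    # horizontal and vertical neighbors.
    dh = ((u[..., :, 1:] - u[..., :, :-1]) ** 2).mean()
    dv = ((u[..., 1:, :] - u[..., :-1, :]) ** 2).mean()
    return data_term + beta * (dh + dv)
```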
Hybrid CNN–Mamba UNet Loss
Medical HyM-UNet models use a compound objective of the form

$$\mathcal{L} = \lambda_1\,\mathcal{L}_{\text{BCE}} + \lambda_2\,\mathcal{L}_{\text{Dice}} + \lambda_3\,\mathcal{L}_{\text{boundary}},$$

where $\mathcal{L}_{\text{boundary}}$ is the BCE loss restricted to mask boundaries (dilated with a 3×3 kernel) and the $\lambda_i$ are weighting coefficients (Chen et al., 22 Nov 2025).
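As a sketch, the boundary restriction can be implemented with max-pooling-based morphology; the dilation kernel size follows the paper, while the dilate/erode construction of the edge band is an assumption:

```python
import torch
import torch.nn.functional as F

def boundary_bce(logits, target, kernel=3):
    """BCE evaluated only in a band around ground-truth mask boundaries.
    target: float binary mask of shape (B, 1, H, W)."""
    pad = kernel // 2
    dilated = F.max_pool2d(target, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-target, kernel, stride=1, padding=pad)
    band = (dilated - eroded).clamp(0, 1)   # 1 near boundaries, 0 elsewhere
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction='none')
    return (bce * band).sum() / (band.sum() + 1e-6)
```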
Hybrid Mamba–Transformer UNet Loss
HMT-UNet employs

$$\mathcal{L}_{\text{HMT}} = \tfrac{1}{2}\,\mathcal{L}_{\text{BCE}} + \tfrac{1}{2}\,\mathcal{L}_{\text{Dice}},$$

with equal weights, targeting balanced calibration between pixel-wise classification (BCE) and set-wise overlap (Dice).
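An equal-weight BCE + Dice term is straightforward to sketch:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, eps=1e-6):
    """Equal-weight BCE + Dice for binary masks of shape (B, 1, H, W)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(-2, -1))
    dice = 1 - (2 * inter + eps) / (p.sum(dim=(-2, -1))
                                    + target.sum(dim=(-2, -1)) + eps)
    return 0.5 * bce + 0.5 * dice.mean()
```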
3. Training and Optimization Strategies
HMRF-UNet
Unsupervised training involves random initialization, the Adam optimizer, batch size 128, and up to 200 epochs on synthetic data. A Bayesian hyperparameter search finds the optimal U-Net depth, channel configuration, and neighborhood weighting. For supervised fine-tuning, pre-trained HMRF-UNet weights (Potts penalty) are transferred, then optimized on labeled subsets with a Dice loss. Pre-training drastically reduces the amount of ground-truth data needed for convergence: with 5 GT slices, DSC approaches 0.98 versus 0.85 from scratch (Grolig et al., 14 Nov 2025).
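The transfer step might look as follows; the checkpoint name, epoch budget, and data loader are illustrative assumptions, and HMRFUNet refers to the sketch in Section 1:

```python
import torch

model = HMRFUNet(num_classes=2)
model.load_state_dict(torch.load('hmrf_pretrained.pt'))  # hypothetical checkpoint
opt = torch.optim.Adam(model.parameters())

for epoch in range(50):                  # illustrative fine-tuning budget
    for x, y in labeled_loader:          # yields (image, float binary mask);
        u = model(x)                     # e.g., as few as 5 annotated slices
        inter = (u[:, 1] * y).sum()      # soft Dice on the foreground class
        dice = 1 - 2 * inter / (u[:, 1].sum() + y.sum() + 1e-6)
        opt.zero_grad()
        dice.backward()
        opt.step()
```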
Hybrid CNN–Mamba UNet
For segmentation on ISIC 2018, images are resized to a fixed resolution, normalized, and augmented (random flips, rotations). The AdamW optimizer with cosine annealing is applied over 200 epochs at batch size 24. The encoder combines local and global modules, with skip connections fused via learned gating (Chen et al., 22 Nov 2025).
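A minimal sketch of this schedule; the learning rate and loss function are assumptions, and model and train_loader come from the surrounding pipeline:

```python
import torch

opt = torch.optim.AdamW(model.parameters(), lr=1e-3)  # lr is an assumption
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)

for epoch in range(200):
    for x, y in train_loader:       # resized, normalized, flip/rotate-augmented
        loss = criterion(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()                    # cosine-annealed learning rate per epoch
```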
Hybrid Mamba–Transformer UNet
HMT-UNet leverages AdamW with cosine scheduling, batch size 80, and heavy spatial augmentations. Training proceeds for 200 epochs on multiple medical segmentation datasets (Zhang et al., 21 Aug 2024).
4. Experimental Performance
Quantitative Results
Empirical findings show HyM-UNet variants outperform established U-Net models. Representative results on ISIC 2018 (Chen et al., 22 Nov 2025):

| Method | IoU (%) | Dice (%) | HD95 ↓ | PRE (%) |
|---|---|---|---|---|
| U-Net | 79.32 | 87.03 | 4.74 | 88.98 |
| CE-Net | 80.29 | 87.66 | 4.53 | 89.50 |
| Attention U-Net | 79.64 | 87.32 | 4.56 | 90.68 |
| HyM-UNet | 81.82 | 88.97 | 4.03 | 90.91 |
On artificial micro-CT data, unsupervised HMRF-UNet with the Potts neighborhood yields DSC values closely matching supervised performance. Pre-training further elevates supervised segmentation quality at minimal GT annotation cost.
Generalization and Sample Efficiency
On ISIC-17 and Kvasir-SEG, HyM-UNet and HMT-UNet achieve Dice coefficients of up to 90.74% and 92.28%, respectively, exceeding pure CNN/Transformer baselines (Zhang et al., 21 Aug 2024). The pre-training paradigm significantly improves sample efficiency in supervised fine-tuning (DSC gains of up to +0.13 with only 5 labeled slices).
5. Insights, Limitations, and Extensions
Neighborhood Term Contributions
The Potts-type penalty (squared fuzzy-label difference) outperforms the Banerjee-type for binary segmentation. Custom weights can make Banerjee penalties competitive, though excess weighting can harm Potts performance. Low-contrast environments reduce the effectiveness of the Banerjee loss (Grolig et al., 14 Nov 2025).
Limitations
HMRF-UNet is currently limited to binary segmentation (two classes); extension to multi-class tasks remains open. Thin-wall segmentation remains challenging for the unsupervised loss, especially on real μCT data, due to contrast and border artifacts. The model does not exploit additional texture or multi-spectral input, relying on intensity alone.
For the Hybrid CNN–Mamba and Hybrid Mamba–Transformer UNets, the complexity of the skip-connection design and hybridization parameters demands careful tuning per dataset. Transformer elements (HMT-UNet) increase memory requirements and do not always provide a clear improvement over SSM-only hybridization.
Extensions
Potential directions include:
- Incorporation of 3D convolutions and neighborhoods (26-connectivity) for volume data.
- Hybrid losses integrating contrastive or autoencoder terms.
- Semi-supervised multi-task setups with mixed unsupervised and supervised objectives.
- Adaptive class-number selection via Dirichlet/Bayesian priors.
- Extending skip fusion strategies to learn more complex feature interactions in multi-modal domains.
6. Implementation and Practical Guidance
PyTorch code examples detail core architectural blocks:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualConvBlock(nn.Module):
    """Two 3x3 conv + BN layers with a residual shortcut for local texture."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))

    def forward(self, x):
        # Residual addition followed by ReLU preserves fine detail.
        return F.relu(self.conv(x) + x)
```
```python
class VSSBlock(nn.Module):
    """Simplified Visual State-Space block. The true SS2D selective scan is
    omitted; linear-attention-style global mixing stands in for illustration."""
    def __init__(self, c, d):
        super().__init__()
        self.proj1 = nn.Linear(c, d)
        self.proj2 = nn.Linear(c, d)
        self.conv1d = nn.Conv1d(d, d, 1)
        self.to_out = nn.Linear(d, c)

    def forward(self, x):
        B, C, H, W = x.shape
        x_flat = x.flatten(2).transpose(1, 2)        # (B, HW, C)
        q = F.silu(self.proj1(x_flat))               # (B, HW, d)
        k = self.proj2(x_flat)                       # (B, HW, d)
        v = self.conv1d(q.transpose(1, 2))           # (B, d, HW)
        # Linear-attention surrogate for the SS2D scan: global mixing
        # at O(HW * d^2) cost instead of quadratic attention.
        kv = torch.exp(k).transpose(-2, -1) @ v.transpose(1, 2)  # (B, d, d)
        attn = torch.exp(q) @ kv                                 # (B, HW, d)
        attn = attn / (attn.sum(-1, keepdim=True) + 1e-6)
        out = self.to_out(attn)                      # (B, HW, C)
        out = out.transpose(1, 2).view(B, C, H, W)
        return out + x                               # residual connection
```
For resource-limited scenarios, reduce channel dimensions or eliminate one visual Mamba stage. Adjust patch embedding and convolution kernel sizes for volumetric segmentation.
7. Context and Impact in Segmentation Research
HyM-UNet establishes an effective paradigm for robust unsupervised and hybrid supervised medical image segmentation. By bridging locality (CNNs), global semantics (SSMs/Transformers), and spatial regularity (HMRF principles), these models reconcile annotation scarcity with segmentation fidelity across diverse materials and medical contexts.
Reference designs and results from Grolig et al. (Grolig et al., 14 Nov 2025), Chen et al. (Chen et al., 22 Nov 2025), and Zhang et al. (Zhang et al., 21 Aug 2024) indicate the increasing utility of hybrid U-Net frameworks employing SSMs, neural attention, and probabilistic graphical models for advanced segmentation tasks.