DoubleU-NetPlus: Dual U-Net Segmentation
- The paper introduces a dual U-Net design that integrates multi-contextual attention, multi-scale feature fusion, and an EfficientNetB7 backbone for enhanced medical image segmentation.
- It employs multi-kernel residual convolution, SE-ASPP, and hybrid triple attention modules to improve feature extraction and refine ambiguous boundaries.
- Experimental results show significant gains in Dice and mIoU metrics across several public datasets, validating its superior segmentation performance.
DoubleU-NetPlus is a dual U-Net-based architecture enhanced by multi-contextual attention mechanisms, multi-scale residual feature fusion, and a strong backbone feature extractor, specifically designed for semantic segmentation of medical images. The network addresses challenges with traditional and contemporary U-Net variants (e.g., CE-Net, DoubleU-Net) regarding multi-scale region modeling, texture complexity, and ambiguous boundaries by exploiting attention-guided modules and context refinement for improved discriminative feature representation (Ahmed et al., 2022).
1. Architectural Composition and Information Flow
DoubleU-NetPlus comprises two stacked U-Net encoder–decoder networks ("U1" and "U2"), forming an end-to-end cascade:
- U1: Receives input image, processes with an EfficientNetB7 encoder, contextual bridge modules, and a decoder to output an initial segmentation mask (Mask1).
- U2: Accepts the element-wise product of Mask1 and the raw input, utilizing its own encoder, identical bridge modules, and decoder, yielding the final mask (Mask2).
Skip connections in each U-Net incorporate a Triple Attention Gate (TAG) and multi-context fusion at each stage. The information flow is structured as follows:
| Stage | Input | Bridge (Context Modules) | Decoder (Skip Connections) | Output |
|---|---|---|---|---|
| U1 | Input image | MKRC → SE-ASPP → Hybrid TAM | TAG-gated encoder features | Mask1 |
| U2 | Input ⊙ Mask1 | MKRC → SE-ASPP → Hybrid TAM | TAG-gated skips from U1 & U2 encoders | Mask2 |
This design allows progressive refinement, focusing the second network on regions of interest highlighted by U1.
2. Feature Extraction and Context Modules
EfficientNetB7 Encoder Integration
The first U-Net’s encoder is EfficientNetB7, using all pretrained weights and MBConv blocks, producing feature maps at various fractional input resolutions (1/2, 1/4, 1/8, 1/16, 1/32). No architectural changes are made within EfficientNetB7; final MBConv outputs serve as inputs to bridge modules.
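The multi-resolution feature extraction can be reproduced with an off-the-shelf backbone. Below is a minimal sketch, assuming the `timm` library as a stand-in; the paper does not prescribe a particular toolkit:

```python
# Sketch: extracting EfficientNetB7 stage outputs for use as skip connections.
# The `timm` library is an assumption here, not the authors' implementation.
import timm
import torch

encoder = timm.create_model("efficientnet_b7", pretrained=True, features_only=True)

x = torch.randn(1, 3, 256, 256)   # dummy input image
feats = encoder(x)                # stage outputs at strides 2, 4, 8, 16, 32
for f in feats:
    print(f.shape)                # the deepest map feeds the bridge modules
```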
Multi-Kernel Residual Convolution (MKRC)
MKRC expands receptive fields and enables multi-context feature mapping via parallel convolution branches with different kernel sizes. Features from all branches are concatenated, channel-reduced via a $1\times1$ convolution, and merged with a residual identity mapping:

$$y = \text{Conv}_{1\times1}\big(\text{Concat}\big(\text{Conv}_{k_1}(x), \ldots, \text{Conv}_{k_n}(x)\big)\big) + x$$
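A minimal PyTorch sketch of this pattern follows; the specific branch kernel sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MKRC(nn.Module):
    """Multi-kernel residual convolution sketch: parallel kernels, concat,
    1x1 channel reduction, residual add. Kernel sizes (3, 5, 7) are assumed."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        # 1x1 convolution reduces the concatenated branches back to `channels`
        self.reduce = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return self.reduce(multi) + x   # residual identity mapping
```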
SE-ASPP Module
The bridge module applies Squeeze-and-Excitation Atrous Spatial Pyramid Pooling (SE-ASPP):
- Seven parallel atrous convolutions with distinct dilation rates, each followed by channel squeeze-and-excitation.
- Branch outputs are concatenated and reduced via a $1\times1$ convolution.
Specifically, squeeze-and-excitation recalibrates each branch as

$$s = \sigma\big(W_2\,\delta(W_1\,\text{GAP}(x))\big), \qquad \tilde{x} = s \odot x,$$

where $\text{GAP}$ denotes global average pooling, $\delta$ is the ReLU activation, and $\sigma$ is the sigmoid function.
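A compact sketch of the SE-ASPP pattern; the dilation rates below are placeholders, since only the branch count is fixed by the description above:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation: global pool -> FC -> ReLU -> FC -> sigmoid."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class SEASPP(nn.Module):
    """SE-ASPP sketch: seven parallel atrous convolutions, each SE-recalibrated,
    then concatenated and fused by a 1x1 convolution. Dilation rates are assumed."""

    def __init__(self, channels: int, rates=(1, 2, 4, 6, 8, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                SEBlock(channels),
            )
            for r in rates
        )
        self.reduce = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        return self.reduce(torch.cat([b(x) for b in self.branches], dim=1))
```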
Hybrid Triple Attention Module (TAM)
TAM refines ASPP outputs with three parallel branches:
- Squeeze-and-Excitation (SE)
- Channel Attention (CA): max and average pooling, followed by fully connected layers and a sigmoid activation
- Spatial Attention (SA): a convolution applied to channel-wise pooled features
TAM fuses the outputs of the three branches into a single refined feature map that reweights the input features; a minimal sketch follows.
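The sketch below reuses `SEBlock` from the SE-ASPP example above; the additive fusion of the three branch outputs is an assumption, as the exact combination rule is defined in the paper:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: shared MLP over max- and average-pooled maps."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: a convolution over channel-pooled maps."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat(
            [torch.mean(x, dim=1, keepdim=True), torch.amax(x, dim=1, keepdim=True)],
            dim=1,
        )
        return x * torch.sigmoid(self.conv(pooled))

class TAM(nn.Module):
    """Hybrid triple attention: SE, CA, and SA branches fused additively
    (the additive fusion is an assumption of this sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.se = SEBlock(channels)   # SEBlock from the SE-ASPP sketch above
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.se(x) + self.ca(x) + self.sa(x)
```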
3. Attention and Fusion Mechanisms
Triple Attention Gate (TAG)
TAG modulates all skip connections:
- Feature and gate signals are projected to matching dimensions via $1\times1$ convolutions and summed.
- The result passes through SE, CA, and SA branches, generating an attention coefficient $\alpha \in [0, 1]$.
- The gated skip is the element-wise product $x_{\text{skip}} \odot \alpha$ (see the sketch below).
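A minimal sketch of a TAG on one skip connection, reusing `TAM` from above; the additive attention-gate layout is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class TripleAttentionGate(nn.Module):
    """TAG sketch: project skip and gate signals, sum, refine with triple
    attention, and emit a [0, 1] coefficient that gates the skip features."""

    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.proj_x = nn.Conv2d(skip_ch, inter_ch, 1)   # project skip features
        self.proj_g = nn.Conv2d(gate_ch, inter_ch, 1)   # project gating signal
        self.tam = TAM(inter_ch)                        # triple attention (sketch above)
        self.coeff = nn.Sequential(nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, x, g):
        # g is assumed to be upsampled to x's spatial size before gating
        a = torch.relu(self.proj_x(x) + self.proj_g(g))
        alpha = self.coeff(self.tam(a))   # attention coefficient alpha
        return x * alpha                  # gated skip connection
```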
Attention-Guided Residual Convolution (AG-Residual)
AG-Residual blocks replace standard convolutions in the U2 encoder and both decoders. Each block concatenates a double-convolution branch with a convolutional identity branch, then applies TAM for selective refinement.
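A sketch of the AG-Residual pattern under the same assumptions; the kernel sizes and the doubled output width are illustrative choices:

```python
import torch
import torch.nn as nn

class AGResidual(nn.Module):
    """AG-Residual sketch: a double 3x3 convolution branch concatenated with
    a 1x1 identity branch, then refined by TAM (from the sketch above)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.identity = nn.Conv2d(in_ch, out_ch, 1)
        self.tam = TAM(out_ch * 2)   # concatenation doubles the channel count

    def forward(self, x):
        merged = torch.cat([self.double_conv(x), self.identity(x)], dim=1)
        return self.tam(merged)
```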
Multi-Scale Residual Feature Fusion
At each decoder stage, the upsampled decoder features are concatenated with the TAG-gated encoder skip features:

$$d_{\ell} = \text{Conv}\big(\text{Concat}\big(\text{Up}(d_{\ell+1}),\ \text{TAG}(e_{\ell})\big)\big)$$

This concatenation, gated by TAG, propagates high-resolution and context-enriched features throughout the decoder.
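One decoder stage can then be expressed as below, with `TripleAttentionGate` and `AGResidual` as sketched earlier; the bilinear upsampling choice is an assumption:

```python
import torch
import torch.nn.functional as F

def decoder_stage(d_deep, e_skip, tag, ag_residual):
    """One decoder stage: upsample the deeper decoder features, gate the
    encoder skip with TAG, concatenate, and refine with an AG-Residual block."""
    d_up = F.interpolate(d_deep, scale_factor=2, mode="bilinear", align_corners=False)
    gated = tag(e_skip, d_up)                    # TAG-gated skip features
    return ag_residual(torch.cat([d_up, gated], dim=1))
```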
4. Training Protocols and Optimization
- Loss Function: binary cross-entropy (BCE) plus Dice loss, $\mathcal{L} = \mathcal{L}_{\text{BCE}} + \mathcal{L}_{\text{Dice}}$, where $\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}$ (a runnable sketch follows this list).
- Optimizer: Adam, with the learning rate reduced by a factor of $0.1$ if validation loss stagnates for 10 epochs.
- Batch size: $4$
- Augmentations: random rotations, flips, intensity transformations, and grid distortions (22–25 variants per dataset); inputs resized to a fixed resolution.
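A minimal sketch of this training setup; the initial learning rate and the equal loss weighting are placeholders, as the exact values are not reproduced above:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(probs, target, eps=1e-6):
    """Combined BCE + Dice loss over sigmoid probabilities (equal weighting assumed)."""
    bce = F.binary_cross_entropy(probs, target)
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)
    return bce + dice.mean()

model = torch.nn.Conv2d(3, 1, 1)                          # stand-in for DoubleU-NetPlus
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr value is a placeholder
# Reduce the learning rate by a factor of 0.1 after 10 stagnant validation epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10
)
# Call scheduler.step(val_loss) once per epoch.
```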
5. Experimental Evaluation and Comparative Results
DoubleU-NetPlus was evaluated on six public datasets: DRIVE, LUNA, BUSI, CVC-ClinicDB, the 2018 Data Science Bowl, and ISBI 2012. Metrics include precision, recall, Dice coefficient, and mIoU.
For the DRIVE dataset:
| Method | Dice (%) | mIoU (%) |
|---|---|---|
| U-Net | 77.2 | 62.9 |
| DoubleU-NetPlus | 85.17 | 73.92 |
Ablation studies indicate that removing modules (MKRC, TAM, TAG) degrades Dice by $2\%$ or more. Qualitatively, DoubleU-NetPlus yields sharper edges and superior recovery of microstructures such as fine vessels and small lesions.
Comparisons across all datasets show DoubleU-NetPlus outperforms U-Net, U-Net++, Attention U-Net, MultiResU-Net, CE-Net, DoubleU-Net in terms of both quantitative metrics and qualitative boundary fidelity (Ahmed et al., 2022).
6. Algorithmic Summary and Implementation Details
Pseudocode for the end-to-end segmentation pipeline:
```python
def ForwardPass(x):
    # U1 branch
    E1_skips, E1_end = EfficientNetB7_Encoder(x)   # multi-scale features + bottleneck
    B1 = TAM(SE_ASPP(MKRC(E1_end)))                # contextual bridge
    D1_last = U_Decoder(B1, E1_skips)
    Mask1 = Sigmoid(Conv1x1(D1_last))

    # U2 branch: refocus on regions highlighted by Mask1
    x2 = x * Mask1
    E2_skips, E2_end = AG_ResEncoder(x2)
    B2 = TAM(SE_ASPP(MKRC(E2_end)))
    D2_last = U_Decoder(B2, [E1_skips, E2_skips])
    Mask2 = Sigmoid(Conv1x1(D2_last))
    return Mask1, Mask2
```
Training employs a combined BCE and Dice loss, iterative gradient updates via Adam, and validation-based learning rate scheduling. Each batch undergoes extensive augmentation for robust generalization.
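As a usage sketch, one training iteration might look as follows, building on `bce_dice_loss` from the earlier training-setup sketch; supervising Mask1 in addition to Mask2 is an assumption here, since the text above only fixes the combined objective and optimizer:

```python
def train_step(model, batch, optimizer):
    """One training iteration: forward both U-Nets, compute BCE + Dice, update weights.
    Deep supervision on Mask1 is an assumption of this sketch."""
    x, y = batch
    mask1, mask2 = model(x)          # both outputs are sigmoid probabilities
    loss = bce_dice_loss(mask1, y) + bce_dice_loss(mask2, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```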
7. Contextualization and Distinction from Related Models
DoubleU-NetPlus builds on challenges documented for U-Net and its advanced variants. Architectural choices such as EfficientNetB7 encoding, MKRC, SE-ASPP, hybrid TAM, and multi-scale feature fusion are empirically shown to enhance segmentation accuracy, boundary clarity, and feature discrimination, particularly in complex medical imaging scenarios with scale variance and texture ambiguity. Ablation studies highlight the necessity of each attention and fusion component. The systematic performance improvement and module significance are corroborated in (Ahmed et al., 2022).
A plausible implication is that further improvements may be achievable by refining attention gate mechanisms or increasing bridge context depth, but module removal distinctly degrades performance. No controversies or dissenting experimental reports are present in these references.
The DoubleU-NetPlus architecture constitutes a leading dual-U-Net-based pipeline for context- and attention-guided medical image segmentation, setting quantitative and qualitative state-of-the-art results as of its introduction (Ahmed et al., 2022).