DoubleU-NetPlus: Dual U-Net Segmentation
- The paper introduces a dual U-Net design that integrates multi-contextual attention, multi-scale feature fusion, and an EfficientNetB7 backbone for enhanced medical image segmentation.
- It employs multi-kernel residual convolution, SE-ASPP, and hybrid triple attention modules to improve feature extraction and refine ambiguous boundaries.
- Experimental results show significant gains in Dice and mIoU metrics across several public datasets, validating its superior segmentation performance.
DoubleU-NetPlus is a dual U-Net-based architecture enhanced by multi-contextual attention mechanisms, multi-scale residual feature fusion, and a strong backbone feature extractor, specifically designed for semantic segmentation of medical images. The network addresses challenges with traditional and contemporary U-Net variants (e.g., CE-Net, DoubleU-Net) regarding multi-scale region modeling, texture complexity, and ambiguous boundaries by exploiting attention-guided modules and context refinement for improved discriminative feature representation (Ahmed et al., 2022).
1. Architectural Composition and Information Flow
DoubleU-NetPlus comprises two stacked U-Net encoder–decoder networks ("U1" and "U2"), forming an end-to-end cascade:
- U1: Receives input image, processes with an EfficientNetB7 encoder, contextual bridge modules, and a decoder to output an initial segmentation mask (Mask1).
- U2: Accepts the element-wise product of Mask1 and the raw input, utilizing its own encoder, identical bridge modules, and decoder, yielding the final mask (Mask2).
Skip connections in each U-Net incorporate a Triple Attention Gate (TAG) and multi-context fusion at each stage. The information flow is structured as follows:
| Stage | Input | Bridge (Context Modules) | Decoder (Skip Connections) | Output |
|---|---|---|---|---|
| U1 | Input image | MKRC → SE-ASPP → Hybrid TAM | TAG-gated encoder features | Mask1 |
| U2 | Input ⊙ Mask1 | MKRC → SE-ASPP → Hybrid TAM | TAG-gated skips from U1 & U2 encoders | Mask2 |
This design allows progressive refinement, focusing the second network on regions of interest highlighted by U1.
2. Feature Extraction and Context Modules
EfficientNetB7 Encoder Integration
The first U-Net’s encoder is EfficientNetB7, using all pretrained weights and MBConv blocks, producing feature maps at various fractional input resolutions (1/2, 1/4, 1/8, 1/16, 1/32). No architectural changes are made within EfficientNetB7; final MBConv outputs serve as inputs to bridge modules.
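The multi-resolution feature extraction can be reproduced with an off-the-shelf backbone. Below is a minimal sketch, assuming the `timm` library as a stand-in; the paper does not prescribe a particular toolkit:

```python
# Sketch: extracting EfficientNetB7 stage outputs for use as skip connections.
# The `timm` library is an assumption here, not the authors' implementation.
import timm
import torch

encoder = timm.create_model("efficientnet_b7", pretrained=True, features_only=True)

x = torch.randn(1, 3, 256, 256)   # dummy input image
feats = encoder(x)                # stage outputs at strides 2, 4, 8, 16, 32
for f in feats:
    print(f.shape)                # the deepest map feeds the bridge modules
```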
Multi-Kernel Residual Convolution (MKRC)
MKRC expands receptive fields and enables multi-context feature mapping via parallel convolution branches with different kernel sizes. Features from all branches are concatenated, channel-reduced via a $1\times1$ convolution, and merged with a residual identity mapping:

$$y = \text{Conv}_{1\times1}\big(\text{Concat}\big(\text{Conv}_{k_1}(x), \ldots, \text{Conv}_{k_n}(x)\big)\big) + x$$
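A minimal PyTorch sketch of this pattern follows; the specific branch kernel sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MKRC(nn.Module):
    """Multi-kernel residual convolution sketch: parallel kernels, concat,
    1x1 channel reduction, residual add. Kernel sizes (3, 5, 7) are assumed."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        # 1x1 convolution reduces the concatenated branches back to `channels`
        self.reduce = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return self.reduce(multi) + x   # residual identity mapping
```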
SE-ASPP Module
The bridge module applies Squeeze-and-Excitation Atrous Spatial Pyramid Pooling (SE-ASPP):
- Seven parallel atrous convolutions with distinct dilation rates, each followed by channel squeeze-and-excitation.
- Branch outputs are concatenated and reduced via a $1\times1$ convolution.
Specifically, squeeze-and-excitation recalibrates each branch as

$$s = \sigma\big(W_2\,\delta(W_1\,\text{GAP}(x))\big), \qquad \tilde{x} = s \odot x,$$

where $\text{GAP}$ denotes global average pooling, $\delta$ is the ReLU activation, and $\sigma$ is the sigmoid function.
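A compact sketch of the SE-ASPP pattern; the dilation rates below are placeholders, since only the branch count is fixed by the description above:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation: global pool -> FC -> ReLU -> FC -> sigmoid."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class SEASPP(nn.Module):
    """SE-ASPP sketch: seven parallel atrous convolutions, each SE-recalibrated,
    then concatenated and fused by a 1x1 convolution. Dilation rates are assumed."""

    def __init__(self, channels: int, rates=(1, 2, 4, 6, 8, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                SEBlock(channels),
            )
            for r in rates
        )
        self.reduce = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        return self.reduce(torch.cat([b(x) for b in self.branches], dim=1))
```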
Hybrid Triple Attention Module (TAM)
TAM refines ASPP outputs with three parallel branches:
- Squeeze-and-Excitation (SE)
- Channel Attention (CA): max and average pooling, followed by fully connected layers and a sigmoid activation
- Spatial Attention (SA): a convolution applied to channel-wise pooled features
TAM fuses the outputs of the three branches into a single refined feature map that reweights the input features; a minimal sketch follows.
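The sketch below reuses `SEBlock` from the SE-ASPP example above; the additive fusion of the three branch outputs is an assumption, as the exact combination rule is defined in the paper:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: shared MLP over max- and average-pooled maps."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: a convolution over channel-pooled maps."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat(
            [torch.mean(x, dim=1, keepdim=True), torch.amax(x, dim=1, keepdim=True)],
            dim=1,
        )
        return x * torch.sigmoid(self.conv(pooled))

class TAM(nn.Module):
    """Hybrid triple attention: SE, CA, and SA branches fused additively
    (the additive fusion is an assumption of this sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.se = SEBlock(channels)   # SEBlock from the SE-ASPP sketch above
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.se(x) + self.ca(x) + self.sa(x)
```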
3. Attention and Fusion Mechanisms
Triple Attention Gate (TAG)
TAG modulates all skip connections:
- Feature and gate signals are projected to matching dimensions via $1\times1$ convolutions and summed.
- The result passes through SE, CA, and SA branches, generating an attention coefficient $\alpha \in [0, 1]$.
- The gated skip is the element-wise product $x_{\text{skip}} \odot \alpha$ (see the sketch below).
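A minimal sketch of a TAG on one skip connection, reusing `TAM` from above; the additive attention-gate layout is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class TripleAttentionGate(nn.Module):
    """TAG sketch: project skip and gate signals, sum, refine with triple
    attention, and emit a [0, 1] coefficient that gates the skip features."""

    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.proj_x = nn.Conv2d(skip_ch, inter_ch, 1)   # project skip features
        self.proj_g = nn.Conv2d(gate_ch, inter_ch, 1)   # project gating signal
        self.tam = TAM(inter_ch)                        # triple attention (sketch above)
        self.coeff = nn.Sequential(nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, x, g):
        # g is assumed to be upsampled to x's spatial size before gating
        a = torch.relu(self.proj_x(x) + self.proj_g(g))
        alpha = self.coeff(self.tam(a))   # attention coefficient alpha
        return x * alpha                  # gated skip connection
```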
Attention-Guided Residual Convolution (AG-Residual)
AG-Residual blocks replace standard convolutions in the U2 encoder and both decoders. Each block concatenates a double-convolution branch with a convolutional identity branch, then applies TAM for selective refinement.
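A sketch of the AG-Residual pattern under the same assumptions; the kernel sizes and the doubled output width are illustrative choices:

```python
import torch
import torch.nn as nn

class AGResidual(nn.Module):
    """AG-Residual sketch: a double 3x3 convolution branch concatenated with
    a 1x1 identity branch, then refined by TAM (from the sketch above)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.identity = nn.Conv2d(in_ch, out_ch, 1)
        self.tam = TAM(out_ch * 2)   # concatenation doubles the channel count

    def forward(self, x):
        merged = torch.cat([self.double_conv(x), self.identity(x)], dim=1)
        return self.tam(merged)
```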
Multi-Scale Residual Feature Fusion
At each decoder stage, the upsampled decoder features are concatenated with the TAG-gated encoder skip features:

$$d_{\ell} = \text{Conv}\big(\text{Concat}\big(\text{Up}(d_{\ell+1}),\ \text{TAG}(e_{\ell})\big)\big)$$

This concatenation, gated by TAG, propagates high-resolution and context-enriched features throughout the decoder.
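One decoder stage can then be expressed as below, with `TripleAttentionGate` and `AGResidual` as sketched earlier; the bilinear upsampling choice is an assumption:

```python
import torch
import torch.nn.functional as F

def decoder_stage(d_deep, e_skip, tag, ag_residual):
    """One decoder stage: upsample the deeper decoder features, gate the
    encoder skip with TAG, concatenate, and refine with an AG-Residual block."""
    d_up = F.interpolate(d_deep, scale_factor=2, mode="bilinear", align_corners=False)
    gated = tag(e_skip, d_up)                    # TAG-gated skip features
    return ag_residual(torch.cat([d_up, gated], dim=1))
```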
4. Training Protocols and Optimization
- Loss Function: binary cross-entropy (BCE) plus Dice loss, $\mathcal{L} = \mathcal{L}_{\text{BCE}} + \mathcal{L}_{\text{Dice}}$, where $\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}$ (a runnable sketch follows this list).
- Optimizer: Adam, with the learning rate reduced by a factor of $0.1$ if validation loss stagnates for 10 epochs.
- Batch size: $4$
- Augmentations: random rotations, flips, intensity transformations, and grid distortions (22–25 variants per dataset); inputs resized to a fixed resolution.
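A minimal sketch of this training setup; the initial learning rate and the equal loss weighting are placeholders, as the exact values are not reproduced above:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(probs, target, eps=1e-6):
    """Combined BCE + Dice loss over sigmoid probabilities (equal weighting assumed)."""
    bce = F.binary_cross_entropy(probs, target)
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)
    return bce + dice.mean()

model = torch.nn.Conv2d(3, 1, 1)                          # stand-in for DoubleU-NetPlus
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr value is a placeholder
# Reduce the learning rate by a factor of 0.1 after 10 stagnant validation epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10
)
# Call scheduler.step(val_loss) once per epoch.
```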
5. Experimental Evaluation and Comparative Results
DoubleU-NetPlus was evaluated on six public datasets: DRIVE, LUNA, BUSI, CVC-ClinicDB, the 2018 Data Science Bowl, and ISBI 2012. Metrics include precision, recall, Dice coefficient, and mIoU.
For the DRIVE dataset:
| Method | Dice (%) | mIoU (%) |
|---|---|---|
| U-Net | 77.2 | 62.9 |
| DoubleU-NetPlus | 85.17 | 73.92 |
Ablation studies indicate that removing modules (MKRC, TAM, TAG) degrades Dice by $2\%$ or more. Qualitatively, DoubleU-NetPlus yields sharper edges and superior recovery of microstructures such as fine vessels and small lesions.
Comparisons across all datasets show DoubleU-NetPlus outperforms U-Net, U-Net++, Attention U-Net, MultiResU-Net, CE-Net, DoubleU-Net in terms of both quantitative metrics and qualitative boundary fidelity (Ahmed et al., 2022).
6. Algorithmic Summary and Implementation Details
Pseudocode for the end-to-end segmentation pipeline:
```python
def ForwardPass(x):
    # U1 branch
    E1_skips, E1_end = EfficientNetB7_Encoder(x)   # multi-scale features + bottleneck
    B1 = TAM(SE_ASPP(MKRC(E1_end)))                # contextual bridge
    D1_last = U_Decoder(B1, E1_skips)
    Mask1 = Sigmoid(Conv1x1(D1_last))

    # U2 branch: refocus on regions highlighted by Mask1
    x2 = x * Mask1
    E2_skips, E2_end = AG_ResEncoder(x2)
    B2 = TAM(SE_ASPP(MKRC(E2_end)))
    D2_last = U_Decoder(B2, [E1_skips, E2_skips])
    Mask2 = Sigmoid(Conv1x1(D2_last))
    return Mask1, Mask2
```
Training employs a combined BCE and Dice loss, iterative gradient updates via Adam, and validation-based learning rate scheduling. Each batch undergoes extensive augmentation for robust generalization.
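As a usage sketch, one training iteration might look as follows, building on `bce_dice_loss` from the earlier training-setup sketch; supervising Mask1 in addition to Mask2 is an assumption here, since the text above only fixes the combined objective and optimizer:

```python
def train_step(model, batch, optimizer):
    """One training iteration: forward both U-Nets, compute BCE + Dice, update weights.
    Deep supervision on Mask1 is an assumption of this sketch."""
    x, y = batch
    mask1, mask2 = model(x)          # both outputs are sigmoid probabilities
    loss = bce_dice_loss(mask1, y) + bce_dice_loss(mask2, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```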
7. Contextualization and Distinction from Related Models
DoubleU-NetPlus builds on challenges documented for U-Net and its advanced variants. Architectural choices such as EfficientNetB7 encoding, MKRC, SE-ASPP, hybrid TAM, and multi-scale feature fusion are empirically shown to enhance segmentation accuracy, boundary clarity, and feature discrimination, particularly in complex medical imaging scenarios with scale variance and texture ambiguity. Ablation studies highlight the necessity of each attention and fusion component. The systematic performance improvement and module significance are corroborated in (Ahmed et al., 2022).
A plausible implication is that further improvements may be achievable by refining attention gate mechanisms or increasing bridge context depth, but module removal distinctly degrades performance. No controversies or dissenting experimental reports are present in these references.
The DoubleU-NetPlus architecture constitutes a leading dual-U-Net-based pipeline for context- and attention-guided medical image segmentation, setting quantitative and qualitative state-of-the-art results as of its introduction (Ahmed et al., 2022).