
U-CycleMLP: Efficient Medical Segmentation

Updated 13 December 2025
  • The paper introduces U-CycleMLP, a U-shaped encoder–decoder architecture that combines dense atrous convolutions with CycleMLP-based skip refinement to enhance segmentation precision.
  • It integrates spatial attention (PAWE) with global context modeling (CCM) to accurately delineate fine boundaries and reduce common segmentation errors in medical images.
  • U-CycleMLP achieves state-of-the-art performance with improved F1, IoU, and DSC metrics while maintaining computational efficiency compared to traditional UNet and Transformer models.

U-CycleMLP is a U-shaped encoder–decoder architecture for medical image segmentation that integrates spatially aware local feature extraction with efficient global context modeling. The architecture is designed to enhance segmentation accuracy, particularly in delineating fine-grained boundaries, while maintaining computational efficiency through a combination of novel attention mechanisms, dense atrous convolutions, linear-complexity MLP-based modules, and optimal sampling strategies (Alzu'bi et al., 6 Dec 2025).

1. Motivation and Core Challenges

In medical image segmentation, precise boundary detection (spatial awareness) and the integration of distant anatomical information (global dependencies) are critical. Standard convolutional networks often exhibit limited receptive fields, resulting in suboptimal modeling of long-range spatial dependencies and frequent boundary pixel loss, manifesting as over- or under-segmentation in challenging regions such as thin or low-contrast anatomical structures. Transformer-based models improve global context aggregation but incur high computational costs due to quadratic complexity. Lightweight MLP-based architectures, though computationally efficient, often lack spatial inductive bias, undermining edge precision.

U-CycleMLP addresses these issues by combining:

  • Effective local feature extraction via densely connected and atrous convolutions.
  • Linear-complexity global modeling through channel-wise CycleMLP modules along skip connections.
  • Attention mechanisms tailored for both spatial and channel dimensions.

2. Architecture Overview and Components

U-CycleMLP adheres to a five-stage U-shaped encoder–decoder configuration. Encoder and decoder stages are symmetric, with the following structure:

  • Encoder: At each stage $s \in \{0, \ldots, 5\}$, the encoder produces feature maps $F_s \in \mathbb{R}^{\frac{H}{2^s} \times \frac{W}{2^s} \times C_s}$, where $C_s = 2^s C$ and $C = 32$ (the resulting per-stage shapes are illustrated in the sketch after this list). The encoder incorporates Position Attention Weight Excitation (PAWE) blocks for spatial attention and Dense Atrous (DA) blocks for multiscale context, interleaved with max-pooling downsampling.
  • Decoder: Features are reconstructed to full resolution using transposed convolutions (upsampling), DA blocks for multiscale integration, and feature fusion. At each stage, the skip-connection features are refined via Channel CycleMLP (CCM) blocks before concatenation with the upsampled decoder output.
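The per-stage feature dimensions follow directly from the halving/doubling rule above. Below is a minimal sketch that only assumes the stated base width $C = 32$ and the stage indexing $s \in \{0, \ldots, 5\}$; it simply tabulates the resulting shapes and is not part of the published implementation.

```python
# Hypothetical helper: tabulate (H_s, W_s, C_s) for each encoder stage,
# assuming the halving/doubling rule and base width C = 32 stated above.
def stage_shapes(height: int, width: int, base_channels: int = 32, num_stages: int = 6):
    """Return (H_s, W_s, C_s) for stages s = 0, ..., num_stages - 1."""
    return [(height // 2**s, width // 2**s, base_channels * 2**s)
            for s in range(num_stages)]

if __name__ == "__main__":
    # For the 224x224 inputs used in the experiments:
    for s, (h, w, c) in enumerate(stage_shapes(224, 224)):
        print(f"stage {s}: {h} x {w} x {c}")
```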

Table 1. High-level architectural operations by stage

| Module              | Encoder                | Decoder                    |
|---------------------|------------------------|----------------------------|
| Attention Block     | PAWE                   | CCM on skip connection     |
| Context Aggregation | Dense Atrous (DA)      | Dense Atrous (DA)          |
| Sampling Operation  | Max-pooling (stride 2) | Transposed conv (stride 2) |
| Feature Fusion      | –                      | Concatenation + DA         |

3. Spatial and Contextual Feature Extraction

3.1 Position Attention Weight Excitation (PAWE)

PAWE combines spatial self-attention with weight excitation. The core operations are:

  • $Y$ (local features) is mapped to $Q$, $K$, $V$ via $3 \times 3$ convolutions.
  • The spatial affinity map is $S = \mathrm{softmax}(Q K^{\top}) \in \mathbb{R}^{N \times N}$, capturing pixel-wise contextual relevance.
  • The position-attention output is $F^{\mathrm{PA}} = \alpha S V + Y$, with $\alpha$ learnable.
  • Weight excitation, $F^{\mathrm{WE}}$, is generated using global average pooling followed by two fully connected layers and a sigmoid activation.
  • The fused output is $F_0 = F^{\mathrm{PA}} + F^{\mathrm{WE}}$ (a minimal code sketch of these operations follows this list).
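The following is a minimal PyTorch sketch of a PAWE-style block assembled from the operations listed above. The reduction ratio, layer widths, and the way the excitation weights are applied to the input are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAWE(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # 3x3 convolutions mapping the local features Y to Q, K, V
        self.q = nn.Conv2d(channels, channels, 3, padding=1)
        self.k = nn.Conv2d(channels, channels, 3, padding=1)
        self.v = nn.Conv2d(channels, channels, 3, padding=1)
        self.alpha = nn.Parameter(torch.zeros(1))             # learnable attention scale
        # Weight excitation: global average pooling -> two FC layers -> sigmoid
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        b, c, h, w = y.shape
        q = self.q(y).flatten(2).transpose(1, 2)              # B x N x C
        k = self.k(y).flatten(2)                               # B x C x N
        v = self.v(y).flatten(2).transpose(1, 2)               # B x N x C
        s = torch.softmax(q @ k, dim=-1)                       # B x N x N spatial affinity
        f_pa = self.alpha * (s @ v).transpose(1, 2).reshape(b, c, h, w) + y
        g = F.adaptive_avg_pool2d(y, 1).flatten(1)             # B x C pooled descriptor
        e = torch.sigmoid(self.fc2(F.relu(self.fc1(g)))).view(b, c, 1, 1)
        f_we = e * y                                           # assumed: excitation reweights Y
        return f_pa + f_we                                     # fused output F_0

# Example: refine a 64-channel feature map
x = torch.randn(1, 64, 56, 56)
out = PAWE(64)(x)    # same shape as the input
```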

3.2 Dense Atrous (DA) Blocks

Each DA block fuses:

  • Dense convolutions for local feature reuse: $F_0^{(\ell)} = \mathcal{C}^{(\ell)}([F_0^{(0)}, \ldots, F_0^{(\ell-1)}])$, stacking convolution, batch normalization, and activation over $L$ layers.
  • Atrous convolutions, which extend the receptive field without additional parameters: the effective field for a $k \times k$ kernel at dilation $r$ is $\mathrm{RF} = k + (k-1)(r-1)$.
  • The DA output at encoder stage $s$ is $F_s = F^{\mathrm{DC}} + F^{\mathrm{AC}} = \mathrm{DA}_e^{(s)}(F_{s-1})$ (a sketch follows this list).
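A minimal PyTorch sketch of a DA-style block, assembled from the description above, is given below. The number of dense layers, their width, and the dilation rates are illustrative assumptions rather than the paper's exact configuration; the helper simply evaluates the receptive-field formula.

```python
import torch
import torch.nn as nn

def effective_receptive_field(k: int, r: int) -> int:
    """Effective kernel extent of a k x k convolution at dilation r: k + (k-1)(r-1)."""
    return k + (k - 1) * (r - 1)

class DenseAtrousBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, num_layers: int = 3, dilations=(1, 2, 4)):
        super().__init__()
        # Dense branch: each layer sees the concatenation of all previous feature maps
        self.dense = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.dense.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True)))
            ch += out_ch
        self.dense_proj = nn.Conv2d(ch, out_ch, 1)   # fuse the dense concatenation
        # Atrous branch: parallel dilated 3x3 convolutions enlarging the receptive field
        self.atrous = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r),
                          nn.BatchNorm2d(out_ch),
                          nn.ReLU(inplace=True))
            for r in dilations])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.dense:
            feats.append(layer(torch.cat(feats, dim=1)))
        f_dc = self.dense_proj(torch.cat(feats, dim=1))   # F_DC
        f_ac = sum(branch(x) for branch in self.atrous)   # F_AC
        return f_dc + f_ac                                # F_s = F_DC + F_AC
```

For example, `effective_receptive_field(3, 4)` returns 9, matching $\mathrm{RF} = k + (k-1)(r-1)$ for a $3 \times 3$ kernel at dilation 4.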

Max-pooling is used for downsampling, based on an ablation showing a greater-than-12-point average DSC improvement over Patch-Merging (see Table 3). Decoder upsampling uses transposed convolutions.

4. CycleMLP-based Skip-Feature Fusion

4.1 Channel CycleMLP (CCM) Block

Introduced along each skip connection, CCM blocks refine feature integration with linear computational complexity. The CCM consists of:

  • Channel Attention Weight Excitation (CAWE):
    • Channel affinity: $C_s = \mathrm{softmax}(F_s F_s^{\top}) \in \mathbb{R}^{C_s \times C_s}$.
    • Channel-attention output: $F_s^{\mathrm{CA}} = \beta F_s C_s^{\top} + F_s$.
    • Weight excitation: $F_s^{\mathrm{WE}}$ is produced by a two-layer MLP applied to globally averaged features.
    • CAWE fusion: $F_s^{\mathrm{CAWE}} = F_s^{\mathrm{CA}} + F_s^{\mathrm{WE}}$.
  • CycleMLP Layer:
    • Feature maps $X$ undergo cyclic shifts along height and width. Linear projections $(\mathrm{LP}^{(h)}, \mathrm{LP}^{(w)})$ are applied to the flattened, shifted feature maps.
    • The outputs are combined as $Z = Z^{(h)} + Z^{(w)} + Z^{(p)}$, followed by a sigmoid.
  • Skip Feature Refinement:
    • The final skip feature is $F_s^{\mathrm{skip}} = \sigma(\mathrm{CycleMLP}(F_s^{\mathrm{CAWE}}))$ (a sketch of the full CCM block follows below).

Each CCM block operates in $O(N_s C_s)$, i.e., $O(HWC)$ for the entire network, ensuring scalability.
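Below is a minimal PyTorch sketch of a CCM-style skip-refinement block assembled from the bullet points above (CAWE followed by a CycleMLP-like layer and a sigmoid). The per-channel shift pattern, projection widths, reduction ratio, and how the weight-excitation term is applied are assumptions for illustration; the published CycleMLP layer may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cyclic_shift(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Shift each channel along `dim` (2 = height, 3 = width) by an offset that
    cycles with the channel index (here -1, 0, +1; the real pattern may differ)."""
    shifted = torch.empty_like(x)
    for c in range(x.shape[1]):
        shifted[:, c] = torch.roll(x[:, c], shifts=(c % 3) - 1, dims=dim - 1)
    return shifted

class CCM(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))      # learnable scale for channel attention
        # CAWE weight excitation: two-layer MLP on globally averaged features
        self.we = nn.Sequential(nn.Linear(channels, channels // reduction),
                                nn.ReLU(),
                                nn.Linear(channels // reduction, channels))
        # CycleMLP-like projections on height-shifted, width-shifted, and unshifted maps
        self.lp_h = nn.Conv2d(channels, channels, 1)
        self.lp_w = nn.Conv2d(channels, channels, 1)
        self.lp_p = nn.Conv2d(channels, channels, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        flat = f.flatten(2)                                            # B x C x N
        affinity = torch.softmax(flat @ flat.transpose(1, 2), dim=-1)  # B x C x C channel affinity
        f_ca = self.beta * (affinity @ flat).view(b, c, h, w) + f      # channel-attention output
        e = self.we(F.adaptive_avg_pool2d(f, 1).flatten(1)).view(b, c, 1, 1)
        f_cawe = f_ca + e * f                                          # assumed: excitation reweights F_s
        # CycleMLP-like mixing: shifted projections along H and W plus a point term
        z = (self.lp_h(cyclic_shift(f_cawe, dim=2))
             + self.lp_w(cyclic_shift(f_cawe, dim=3))
             + self.lp_p(f_cawe))
        return torch.sigmoid(z)                                        # refined skip feature

# Example: refine a stage-2 skip feature (128 channels at 56x56 for a 224x224 input)
skip = torch.randn(1, 128, 56, 56)
refined = CCM(128)(skip)    # same shape as the input
```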

5. Quantitative and Qualitative Evaluation

5.1 Benchmarks and Metrics

  • Datasets: ISIC (dermoscopic), BUSI (breast ultrasound), and ACDC (cardiac MRI).
  • Metrics: F1, IoU, and Dice Similarity Coefficient (DSC) across anatomical targets.
  • Training: AdamW optimizer, learning rate $2 \times 10^{-4}$, 50 epochs (ISIC/BUSI) or 129 epochs (ACDC), input size $224 \times 224$ (a minimal optimizer-setup sketch follows this list).
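As a concrete illustration of the reported setup, the sketch below uses the stated optimizer and learning rate; the stand-in model, synthetic batch, and loss function are placeholders and assumptions, not the paper's actual code or objective.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)            # stand-in for a U-CycleMLP model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)   # optimizer and LR as reported
criterion = nn.BCEWithLogitsLoss()               # assumed loss; the paper's objective may differ

images = torch.randn(2, 3, 224, 224)             # synthetic batch at the reported input size
masks = torch.randint(0, 2, (2, 1, 224, 224)).float()

for step in range(3):                            # the paper trains 50 (ISIC/BUSI) or 129 (ACDC) epochs
    optimizer.zero_grad()
    loss = criterion(model(images), masks)
    loss.backward()
    optimizer.step()
```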

Table 2. Segmentation accuracy: U-CycleMLP vs. recent baselines

| Method     | ISIC F1 | ISIC IoU | BUSI F1 | BUSI IoU | ACDC Avg DSC |
|------------|---------|----------|---------|----------|--------------|
| MD-UNet    | 91.58   | 84.81    | –       | –        | –            |
| FM-UNet    | 89.95   | 82.14    | 80.53   | 70.21    | –            |
| TransUNet  | –       | –        | –       | –        | 89.71        |
| Swin-UNet  | –       | –        | –       | –        | 90.00        |
| U-CycleMLP | 92.76   | 86.50    | 84.85   | 78.86    | 91.11        |

U-CycleMLP achieves the highest F1/IoU on ISIC and BUSI, as well as the best average DSC and myocardium segmentation accuracy on ACDC.

5.2 Qualitative Analysis

U-CycleMLP exhibits improved edge alignment, reducing false positives/negatives at irregular lesion boundaries (ISIC/BUSI) and maintaining precise ventricular and myocardial boundaries in low-contrast cardiac MRI slices (ACDC).

6. Ablation Studies and Component Effectiveness

Ablation experiments on ACDC (DSC) establish the importance of key components:

Table 3. Ablation: Impact of architectural modules (ACDC Avg DSC)

| Configuration       | RV    | Myo   | LV    | Avg   |
|---------------------|-------|-------|-------|-------|
| w/o CCM             | 88.08 | 87.55 | 95.07 | 90.23 |
| with CCM            | 89.28 | 88.44 | 95.63 | 91.11 |
| Patch-Merging down  | 79.59 | 82.19 | 75.04 | 78.94 |
| Max-pooling down    | 89.28 | 88.44 | 95.63 | 91.11 |
| Bilinear upsampling | 80.53 | 79.00 | 83.22 | 80.91 |
| Patch-Expanding up  | 88.74 | 89.61 | 94.18 | 90.84 |
| Transposed conv up  | 89.28 | 88.44 | 95.63 | 91.11 |

Findings:

  • Incorporating CCM yields an improvement of roughly 0.9 average DSC points.
  • Max-pooling surpasses Patch-Merging by more than 12 average DSC points.
  • Transposed convolution offers better accuracy than bilinear and patch-expanding upsampling.

7. Model Complexity and Efficiency

Table 4. Model complexity ($224 \times 224$ input)

| Model      | Parameters (M) | FLOPs (G) |
|------------|----------------|-----------|
| UNet       | 31             | 104       |
| TransUNet  | 96             | 250       |
| U-CycleMLP | 25             | 80        |

U-CycleMLP is more parameter- and compute-efficient than UNet and TransUNet, attributed to its linear-complexity CycleMLP and the use of efficient DA blocks.


U-CycleMLP demonstrates a principled approach to bridging the gap between spatially localized feature learning and global context aggregation in medical image segmentation. Its architectural innovations—particularly the combination of PAWE/DA blocks and CCM skip refinement—enable state-of-the-art segmentation performance across diverse modalities while maintaining a lightweight, computation-efficient profile (Alzu'bi et al., 6 Dec 2025).
