U-CycleMLP: Efficient Medical Segmentation
- The paper introduces U-CycleMLP, a U-shaped encoder–decoder architecture that combines dense atrous convolutions with CycleMLP-based skip refinement to enhance segmentation precision.
- It integrates spatial attention (PAWE) with global context modeling (CCM) to accurately delineate fine boundaries and reduce common segmentation errors in medical images.
- U-CycleMLP achieves state-of-the-art performance with improved F1, IoU, and DSC metrics while maintaining computational efficiency compared to traditional UNet and Transformer models.
U-CycleMLP is a U-shaped encoder–decoder architecture for medical image segmentation that integrates spatially aware local feature extraction with efficient global context modeling. The architecture is designed to enhance segmentation accuracy, particularly in delineating fine-grained boundaries, while maintaining computational efficiency through a combination of novel attention mechanisms, dense atrous convolutions, linear-complexity MLP-based modules, and optimal sampling strategies (Alzu'bi et al., 6 Dec 2025).
1. Motivation and Core Challenges
In medical image segmentation, precise boundary detection (spatial awareness) and the integration of distant anatomical information (global dependencies) are critical. Standard convolutional networks often exhibit limited receptive fields, resulting in suboptimal modeling of long-range spatial dependencies and frequent boundary pixel loss, manifesting as over- or under-segmentation in challenging regions such as thin or low-contrast anatomical structures. Transformer-based models improve global context aggregation but incur high computational costs due to quadratic complexity. Lightweight MLP-based architectures, though computationally efficient, often lack spatial inductive bias, undermining edge precision.
U-CycleMLP addresses these issues by combining:
- Effective local feature extraction via densely connected and atrous convolutions.
- Linear-complexity global modeling through channel-wise CycleMLP modules along skip connections.
- Attention mechanisms tailored for both spatial and channel dimensions.
2. Architecture Overview and Components
U-CycleMLP adheres to a five-stage U-shaped encoder–decoder configuration. Encoder and decoder stages are symmetric, with the following structure:
- Encoder: At each stage $i \in \{1, \dots, 5\}$, the encoder produces feature maps $F_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, where $H_i = H/2^{\,i-1}$ and $W_i = W/2^{\,i-1}$. The encoder incorporates Position Attention Weight Excitation (PAWE) blocks for spatial attention and Dense Atrous (DA) blocks for multiscale context, interleaved with max-pooling downsampling.
- Decoder: Features are reconstructed to full resolution using transposed convolutions (upsampling), DA blocks for multiscale integration, and feature fusion. At each stage, the skip connection features are refined via Channel CycleMLP (CCM) blocks before concatenation with the upsampled decoder output.
Table 1. High-level architectural operations by stage
| Module | Encoder | Decoder |
|---|---|---|
| Attention Block | PAWE | CCM on skip-connection |
| Context Aggregation | Dense Atrous (DA) | Dense Atrous (DA) |
| Sampling Operation | Max-Pooling (stride 2) | Transposed Conv (stride 2) |
| Feature Fusion | - | Concatenation + DA |
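The following is a minimal PyTorch sketch of this five-stage wiring under the operator choices in Table 1 (max-pooling down, transposed convolutions up, refined skips). It is a sketch under stated assumptions, not the paper's implementation: channel widths and class names are illustrative, plain double-conv stages stand in for the PAWE/DA pairs detailed in Section 3, and `nn.Identity` marks where the CCM blocks of Section 4 would slot in.

```python
import torch
import torch.nn as nn

def stage(c_in, c_out):
    # Stand-in for a PAWE + Dense Atrous (DA) stage; a plain double-conv
    # block keeps this skeleton runnable.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class UCycleMLPSkeleton(nn.Module):
    def __init__(self, in_ch=3, n_classes=1, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        self.encoders = nn.ModuleList()
        c = in_ch
        for w in widths:
            self.encoders.append(stage(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)  # ablation-preferred downsampling
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(widths[i], widths[i - 1], 2, stride=2)  # ablation-preferred upsampling
            for i in range(len(widths) - 1, 0, -1))
        self.decoders = nn.ModuleList(
            stage(2 * widths[i - 1], widths[i - 1])  # concat(skip, up) -> fusion stage
            for i in range(len(widths) - 1, 0, -1))
        # CCM skip-refinement blocks (Section 4) would replace these identities.
        self.refine = nn.ModuleList(nn.Identity() for _ in widths[:-1])
        self.head = nn.Conv2d(widths[0], n_classes, 1)

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):
            x = enc(x)
            if i < len(self.encoders) - 1:  # final stage is the bottleneck
                skips.append(x)
                x = self.pool(x)
        for up, dec, refine, skip in zip(self.ups, self.decoders, self.refine, reversed(skips)):
            x = dec(torch.cat([refine(skip), up(x)], dim=1))
        return self.head(x)

# Smoke test: UCycleMLPSkeleton()(torch.randn(1, 3, 256, 256)).shape -> (1, 1, 256, 256)
```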
3. Spatial and Contextual Feature Extraction
3.1 Position Attention Weight Excitation (PAWE)
PAWE combines spatial self-attention with weight excitation. The core operations are:
- Local features $F \in \mathbb{R}^{C \times H \times W}$ are mapped to query, key, and value tensors $Q$, $K$, $V$ via $1 \times 1$ convolutions.
- The spatial affinity map is $A = \mathrm{softmax}(Q^{\top} K) \in \mathbb{R}^{N \times N}$, with $N = HW$, capturing pixel-wise contextual relevance.
- The position-attention output is $E = \alpha\,(V A) + F$, with the scale $\alpha$ learnable.
- Weight excitation, $W = \sigma(\mathrm{FC}_2(\delta(\mathrm{FC}_1(\mathrm{GAP}(F)))))$, is generated using global average pooling followed by two fully connected layers (with intermediate nonlinearity $\delta$) and a sigmoid activation.
- The fused output is $F_{\mathrm{PAWE}} = W \odot E$.
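A PyTorch sketch of a PAWE block following the equations above; the $1 \times 1$ projections match the description, while the channel-reduction ratio and zero-initialized residual scale are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAWEBlock(nn.Module):
    """Position Attention Weight Excitation sketch; reduction ratio assumed."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // reduction, 1)  # query projection
        self.k = nn.Conv2d(channels, channels // reduction, 1)  # key projection
        self.v = nn.Conv2d(channels, channels, 1)               # value projection
        self.alpha = nn.Parameter(torch.zeros(1))               # learnable residual scale
        self.fc1 = nn.Linear(channels, channels // reduction)   # weight-excitation MLP
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)      # (B, N, C'), N = H*W
        k = self.k(x).flatten(2)                      # (B, C', N)
        v = self.v(x).flatten(2)                      # (B, C, N)
        attn = torch.softmax(q @ k, dim=-1)           # (B, N, N) spatial affinity A
        e = self.alpha * (v @ attn.transpose(1, 2)) + x.flatten(2)  # E = alpha * V A + F
        e = e.view(b, c, h, w)
        # Weight excitation: GAP -> FC -> ReLU -> FC -> sigmoid.
        wexc = torch.sigmoid(self.fc2(F.relu(self.fc1(x.mean(dim=(2, 3))))))
        return e * wexc.view(b, c, 1, 1)              # fused output W ⊙ E
```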
3.2 Dense Atrous (DA) Blocks
Each DA block fuses:
- Dense convolutions for local feature reuse: $x_\ell = H_\ell([x_0, x_1, \dots, x_{\ell-1}])$, stacking convolution, batch normalization, and activation over $L$ layers, where $[\cdot]$ denotes channel concatenation.
- Atrous convolutions extend the receptive field without additional parameters: the effective field for a $k \times k$ kernel at dilation $r$ is $k + (k-1)(r-1)$; e.g., a $3 \times 3$ kernel at $r = 2$ covers a $5 \times 5$ field.
- The DA output at encoder stage $i$ fuses the dense and atrous branches into the multiscale feature map forwarded to downsampling and the skip connection.
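A runnable sketch of a DA block under these definitions; the layer count, growth rate, and dilation schedule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DenseAtrousBlock(nn.Module):
    """Dense Atrous (DA) sketch: dense feature reuse with dilated convolutions.
    Growth rate and dilation rates are assumptions, not the paper's values."""
    def __init__(self, in_ch, growth=32, dilations=(1, 2, 4)):
        super().__init__()
        self.layers = nn.ModuleList()
        c = in_ch
        for d in dilations:
            # Effective kernel of a 3x3 conv at dilation d is 3 + 2*(d - 1),
            # e.g. d = 2 gives a 5x5 field with no extra parameters.
            self.layers.append(nn.Sequential(
                nn.Conv2d(c, growth, 3, padding=d, dilation=d),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True)))
            c += growth  # dense connectivity: each layer sees all previous outputs

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # x_l = H_l([x_0, ..., x_{l-1}])
        return torch.cat(feats, dim=1)
```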
Max-pooling is used for downsampling, based on an ablation showing a >12-point average DSC improvement over Patch-Merging (see Table 3); decoder upsampling uses transposed convolutions.
4. CycleMLP-based Skip-Feature Fusion
4.1 Channel CycleMLP (CCM) Block
Introduced along each skip connection, CCM blocks refine feature integration with linear computational complexity. The CCM consists of:
- Channel Attention Weight Excitation (CAWE):
  - Channel affinity: $A_c = \mathrm{softmax}(F F^{\top}) \in \mathbb{R}^{C \times C}$, computed over the flattened features $F \in \mathbb{R}^{C \times N}$ with $N = HW$.
  - Channel-attention output: $E_c = \beta\,(A_c F) + F$, with the scale $\beta$ learnable.
  - Weight excitation: a two-layer MLP applied to globally averaged features, followed by a sigmoid, yielding channel weights $W_c$.
  - CAWE fusion: $F_{\mathrm{CAWE}} = W_c \odot E_c$.
- CycleMLP Layer:
  - Feature maps undergo cyclic shifts along height and width; linear projections are applied to the flattened, shifted feature maps.
  - The height- and width-branch outputs are combined and passed through a sigmoid, producing a gate over the skip features.
- Skip Feature Refinement:
  - The final skip feature is the CAWE output modulated by the CycleMLP gate; it is then concatenated with the upsampled decoder output.
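A hedged PyTorch sketch of this pipeline: CAWE channel attention followed by a cyclic-shift MLP gate. `torch.roll` over channel groups is a coarse stand-in for CycleMLP's per-channel sampling offsets, and summing the height/width branches before the sigmoid is an assumption about how the branch outputs are combined:

```python
import torch
import torch.nn as nn

class CAWE(nn.Module):
    """Channel Attention Weight Excitation sketch; reduction ratio assumed."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # learnable residual scale
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        f = x.flatten(2)                                     # (B, C, N), N = H*W
        attn = torch.softmax(f @ f.transpose(1, 2), dim=-1)  # (B, C, C) channel affinity A_c
        e = (self.beta * (attn @ f) + f).view(b, c, h, w)    # E_c = beta * A_c F + F
        w_c = self.mlp(x.mean(dim=(2, 3)))                   # two-layer MLP on GAP features
        return e * w_c.view(b, c, 1, 1)                      # F_CAWE = W_c ⊙ E_c

class ChannelCycleMLP(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.cawe = CAWE(channels)
        # 1x1 convs act as channel-wise linear projections of the shifted maps.
        self.proj_h = nn.Conv2d(channels, channels, 1)
        self.proj_w = nn.Conv2d(channels, channels, 1)

    @staticmethod
    def cyclic_shift(x, dim):
        # Roll channel groups by cycling offsets -1, 0, +1 along the given axis;
        # a simple stand-in for CycleMLP's deformable per-channel sampling.
        groups = torch.chunk(x, 3, dim=1)
        return torch.cat([torch.roll(g, s, dims=dim) for g, s in zip(groups, (-1, 0, 1))], dim=1)

    def forward(self, x):
        f = self.cawe(x)
        gate = torch.sigmoid(self.proj_h(self.cyclic_shift(f, 2))
                             + self.proj_w(self.cyclic_shift(f, 3)))  # sum-combine is assumed
        return f * gate  # refined skip feature, concatenated with the decoder path
```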
Each CCM block operates in $O(HWC)$, i.e., linear in the number of pixels, so skip refinement adds only linear cost across the entire network, ensuring scalability.
5. Quantitative and Qualitative Evaluation
5.1 Benchmarks and Metrics
- Datasets: ISIC (dermoscopic), BUSI (breast ultrasound), and ACDC (cardiac MRI).
- Metrics: F1, IoU, and Dice Similarity Coefficient (DSC) across anatomical targets.
- Training: AdamW optimizer; 50 epochs for ISIC/BUSI and 129 epochs for ACDC.
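For reference, the reported overlap metrics can be computed as in the following NumPy sketch for binary masks; note that for binary segmentation the pixel-wise F1 score coincides with the DSC:

```python
import numpy as np

def overlap_metrics(pred, target, eps=1e-7):
    """DSC/F1 and IoU for binary masks (arrays of 0/1 or bool)."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    tp = np.logical_and(pred, target).sum()   # true positives
    fp = np.logical_and(pred, ~target).sum()  # false positives
    fn = np.logical_and(~pred, target).sum()  # false negatives
    dice = 2 * tp / (2 * tp + fp + fn + eps)  # DSC == pixel-wise F1
    iou = tp / (tp + fp + fn + eps)           # Jaccard index
    return dice, iou
```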
Table 2. Segmentation accuracy: U-CycleMLP vs. recent baselines
| Method | ISIC F1 | ISIC IoU | BUSI F1 | BUSI IoU | ACDC Avg DSC |
|---|---|---|---|---|---|
| MD-UNet | 91.58 | 84.81 | - | - | - |
| FM-UNet | 89.95 | 82.14 | 80.53 | 70.21 | - |
| TransUNet | - | - | - | - | 89.71 |
| Swin-UNet | - | - | - | - | 90.00 |
| U-CycleMLP | 92.76 | 86.50 | 84.85 | 78.86 | 91.11 |
U-CycleMLP achieves the highest F1/IoU on ISIC and BUSI, as well as best average DSC and myocardium segmentation accuracy on ACDC.
5.2 Qualitative Analysis
U-CycleMLP exhibits improved edge alignment, reducing false positives/negatives at irregular lesion boundaries (ISIC/BUSI) and maintaining precise ventricular and myocardial boundaries in low-contrast cardiac MRI slices (ACDC).
6. Ablation Studies and Component Effectiveness
Ablation experiments on ACDC (DSC) establish the importance of key components:
Table 3. Ablation: Impact of architectural modules (ACDC Avg DSC)
| Configuration | RV | Myo | LV | Avg |
|---|---|---|---|---|
| w/o CCM | 88.08 | 87.55 | 95.07 | 90.23 |
| with CCM | 89.28 | 88.44 | 95.63 | 91.11 |
| Patch-Merging down | 79.59 | 82.19 | 75.04 | 78.94 |
| Max-pooling down | 89.28 | 88.44 | 95.63 | 91.11 |
| Bilinear upsampling | 80.53 | 79.00 | 83.22 | 80.91 |
| Patch-Expanding up | 88.74 | 89.61 | 94.18 | 90.84 |
| Transposed Conv up | 89.28 | 88.44 | 95.63 | 91.11 |
Findings:
- Incorporating CCM yields an ≈0.9-point improvement in average DSC (90.23 → 91.11).
- Max-pooling surpasses Patch-Merging by more than 12 DSC points on average (78.94 → 91.11).
- Transposed convolution outperforms both bilinear upsampling (80.91 avg DSC) and Patch-Expanding (90.84 avg DSC).
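To make the downsampling comparison concrete, the sketch below contrasts parameter-free max-pooling with a Swin-style patch-merging layer (adapted here to NCHW tensors as an illustrative assumption):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style patch merging: fold each 2x2 window into channels, then
    linearly reduce 4C -> 2C. Requires even spatial dimensions."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(4 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):
        tl, tr = x[..., 0::2, 0::2], x[..., 0::2, 1::2]  # top-left, top-right
        bl, br = x[..., 1::2, 0::2], x[..., 1::2, 1::2]  # bottom-left, bottom-right
        return self.reduce(torch.cat([tl, tr, bl, br], dim=1))

# The ablation-preferred alternative is parameter-free and channel-preserving:
maxpool_down = nn.MaxPool2d(kernel_size=2)
```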
7. Model Complexity and Efficiency
Table 4. Model complexity (parameters and FLOPs at the evaluated input resolution)
| Model | Parameters (M) | FLOPs (G) |
|---|---|---|
| UNet | 31 | 104 |
| TransUNet | 96 | 250 |
| U-CycleMLP | 25 | 80 |
U-CycleMLP is more parameter- and compute-efficient than UNet and TransUNet, attributed to its linear-complexity CycleMLP and the use of efficient DA blocks.
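Parameter counts such as those in Table 4 can be reproduced for any PyTorch model with a one-liner; FLOPs additionally require a profiler and a fixed input resolution:

```python
def count_parameters_m(model):
    # Trainable parameter count in millions (the "Parameters (M)" column).
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# e.g., using the illustrative skeleton from Section 2:
# print(f"{count_parameters_m(UCycleMLPSkeleton()):.1f} M parameters")
```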
U-CycleMLP demonstrates a principled approach to bridging the gap between spatially localized feature learning and global context aggregation in medical image segmentation. Its architectural innovations—particularly the combination of PAWE/DA blocks and CCM skip refinement—enable state-of-the-art segmentation performance across diverse modalities while maintaining a lightweight, computation-efficient profile (Alzu'bi et al., 6 Dec 2025).