U-CycleMLP: Efficient Medical Segmentation
- The paper introduces U-CycleMLP, a U-shaped encoder–decoder architecture that combines dense atrous convolutions with CycleMLP-based skip refinement to enhance segmentation precision.
- It integrates spatial attention (PAWE) with global context modeling (CCM) to accurately delineate fine boundaries and reduce common segmentation errors in medical images.
- U-CycleMLP achieves state-of-the-art performance with improved F1, IoU, and DSC metrics while maintaining computational efficiency compared to traditional UNet and Transformer models.
U-CycleMLP is a U-shaped encoder–decoder architecture for medical image segmentation that integrates spatially aware local feature extraction with efficient global context modeling. The architecture is designed to enhance segmentation accuracy, particularly in delineating fine-grained boundaries, while maintaining computational efficiency through a combination of novel attention mechanisms, dense atrous convolutions, linear-complexity MLP-based modules, and optimal sampling strategies (Alzu'bi et al., 6 Dec 2025).
1. Motivation and Core Challenges
In medical image segmentation, precise boundary detection (spatial awareness) and the integration of distant anatomical information (global dependencies) are critical. Standard convolutional networks often exhibit limited receptive fields, resulting in suboptimal modeling of long-range spatial dependencies and frequent boundary pixel loss, manifesting as over- or under-segmentation in challenging regions such as thin or low-contrast anatomical structures. Transformer-based models improve global context aggregation but incur high computational costs due to quadratic complexity. Lightweight MLP-based architectures, though computationally efficient, often lack spatial inductive bias, undermining edge precision.
U-CycleMLP addresses these issues by combining:
- Effective local feature extraction via densely connected and atrous convolutions.
- Linear-complexity global modeling through channel-wise CycleMLP modules along skip connections.
- Attention mechanisms tailored for both spatial and channel dimensions.
2. Architecture Overview and Components
U-CycleMLP adheres to a five-stage U-shaped encoder–decoder configuration. Encoder and decoder stages are symmetric, with the following structure:
- Encoder: At each stage $i \in \{1, \dots, 5\}$, the encoder produces feature maps $F_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, where $H_i = H/2^{\,i-1}$ and $W_i = W/2^{\,i-1}$. The encoder incorporates Position Attention Weight Excitation (PAWE) blocks for spatial attention and Dense Atrous (DA) blocks for multiscale context, interleaved with max-pooling downsampling.
- Decoder: Features are reconstructed to full resolution using transposed convolutions (upsampling), DA blocks for multiscale integration, and feature fusion. At each stage, the skip connection features are refined via Channel CycleMLP (CCM) blocks before concatenation with the upsampled decoder output.
Table 1. High-level architectural operations by stage
| Module | Encoder | Decoder |
|---|---|---|
| Attention Block | PAWE | CCM on skip-connection |
| Context Aggregation | Dense Atrous (DA) | Dense Atrous (DA) |
| Sampling Operation | Max-Pooling (stride 2) | Transposed Conv (stride 2) |
| Feature Fusion | - | Concatenation + DA |
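The following is a minimal PyTorch sketch of this five-stage wiring under the operator choices in Table 1 (max-pooling down, transposed convolutions up, refined skips). It is a sketch under stated assumptions, not the paper's implementation: channel widths and class names are illustrative, plain double-conv stages stand in for the PAWE/DA pairs detailed in Section 3, and `nn.Identity` marks where the CCM blocks of Section 4 would slot in.

```python
import torch
import torch.nn as nn

def stage(c_in, c_out):
    # Stand-in for a PAWE + Dense Atrous (DA) stage; a plain double-conv
    # block keeps this skeleton runnable.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class UCycleMLPSkeleton(nn.Module):
    def __init__(self, in_ch=3, n_classes=1, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        self.encoders = nn.ModuleList()
        c = in_ch
        for w in widths:
            self.encoders.append(stage(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)  # ablation-preferred downsampling
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(widths[i], widths[i - 1], 2, stride=2)  # ablation-preferred upsampling
            for i in range(len(widths) - 1, 0, -1))
        self.decoders = nn.ModuleList(
            stage(2 * widths[i - 1], widths[i - 1])  # concat(skip, up) -> fusion stage
            for i in range(len(widths) - 1, 0, -1))
        # CCM skip-refinement blocks (Section 4) would replace these identities.
        self.refine = nn.ModuleList(nn.Identity() for _ in widths[:-1])
        self.head = nn.Conv2d(widths[0], n_classes, 1)

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):
            x = enc(x)
            if i < len(self.encoders) - 1:  # final stage is the bottleneck
                skips.append(x)
                x = self.pool(x)
        for up, dec, refine, skip in zip(self.ups, self.decoders, self.refine, reversed(skips)):
            x = dec(torch.cat([refine(skip), up(x)], dim=1))
        return self.head(x)

# Smoke test: UCycleMLPSkeleton()(torch.randn(1, 3, 256, 256)).shape -> (1, 1, 256, 256)
```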
3. Spatial and Contextual Feature Extraction
3.1 Position Attention Weight Excitation (PAWE)
PAWE combines spatial self-attention with weight excitation. The core operations are:
- Local features $F \in \mathbb{R}^{C \times H \times W}$ are mapped to query, key, and value tensors $Q$, $K$, $V$ via $1 \times 1$ convolutions.
- The spatial affinity map is $A = \mathrm{softmax}(Q^{\top} K) \in \mathbb{R}^{N \times N}$, with $N = HW$, capturing pixel-wise contextual relevance.
- The position-attention output is $E = \alpha\,(V A) + F$, with the scale $\alpha$ learnable.
- Weight excitation, $W = \sigma(\mathrm{FC}_2(\delta(\mathrm{FC}_1(\mathrm{GAP}(F)))))$, is generated using global average pooling followed by two fully connected layers (with intermediate nonlinearity $\delta$) and a sigmoid activation.
- The fused output is $F_{\mathrm{PAWE}} = W \odot E$.
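A PyTorch sketch of a PAWE block following the equations above; the $1 \times 1$ projections match the description, while the channel-reduction ratio and zero-initialized residual scale are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAWEBlock(nn.Module):
    """Position Attention Weight Excitation sketch; reduction ratio assumed."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // reduction, 1)  # query projection
        self.k = nn.Conv2d(channels, channels // reduction, 1)  # key projection
        self.v = nn.Conv2d(channels, channels, 1)               # value projection
        self.alpha = nn.Parameter(torch.zeros(1))               # learnable residual scale
        self.fc1 = nn.Linear(channels, channels // reduction)   # weight-excitation MLP
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)      # (B, N, C'), N = H*W
        k = self.k(x).flatten(2)                      # (B, C', N)
        v = self.v(x).flatten(2)                      # (B, C, N)
        attn = torch.softmax(q @ k, dim=-1)           # (B, N, N) spatial affinity A
        e = self.alpha * (v @ attn.transpose(1, 2)) + x.flatten(2)  # E = alpha * V A + F
        e = e.view(b, c, h, w)
        # Weight excitation: GAP -> FC -> ReLU -> FC -> sigmoid.
        wexc = torch.sigmoid(self.fc2(F.relu(self.fc1(x.mean(dim=(2, 3))))))
        return e * wexc.view(b, c, 1, 1)              # fused output W ⊙ E
```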
3.2 Dense Atrous (DA) Blocks
Each DA block fuses:
- Dense convolutions for local feature reuse: $x_\ell = H_\ell([x_0, x_1, \dots, x_{\ell-1}])$, stacking convolution, batch normalization, and activation over $L$ layers, where $[\cdot]$ denotes channel concatenation.
- Atrous convolutions extend the receptive field without additional parameters: the effective field for a $k \times k$ kernel at dilation $r$ is $k + (k-1)(r-1)$; e.g., a $3 \times 3$ kernel at $r = 2$ covers a $5 \times 5$ field.
- The DA output at encoder stage $i$ fuses the dense and atrous branches into the multiscale feature map forwarded to downsampling and the skip connection.
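A runnable sketch of a DA block under these definitions; the layer count, growth rate, and dilation schedule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DenseAtrousBlock(nn.Module):
    """Dense Atrous (DA) sketch: dense feature reuse with dilated convolutions.
    Growth rate and dilation rates are assumptions, not the paper's values."""
    def __init__(self, in_ch, growth=32, dilations=(1, 2, 4)):
        super().__init__()
        self.layers = nn.ModuleList()
        c = in_ch
        for d in dilations:
            # Effective kernel of a 3x3 conv at dilation d is 3 + 2*(d - 1),
            # e.g. d = 2 gives a 5x5 field with no extra parameters.
            self.layers.append(nn.Sequential(
                nn.Conv2d(c, growth, 3, padding=d, dilation=d),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True)))
            c += growth  # dense connectivity: each layer sees all previous outputs

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # x_l = H_l([x_0, ..., x_{l-1}])
        return torch.cat(feats, dim=1)
```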
Max-pooling is used for downsampling, based on an ablation showing a >12-point average DSC improvement over Patch-Merging (see Table 3); decoder upsampling uses transposed convolutions.
4. CycleMLP-based Skip-Feature Fusion
4.1 Channel CycleMLP (CCM) Block
Introduced along each skip connection, CCM blocks refine feature integration with linear computational complexity. The CCM consists of:
- Channel Attention Weight Excitation (CAWE):
  - Channel affinity: $A_c = \mathrm{softmax}(F F^{\top}) \in \mathbb{R}^{C \times C}$, computed over the flattened features $F \in \mathbb{R}^{C \times N}$ with $N = HW$.
  - Channel-attention output: $E_c = \beta\,(A_c F) + F$, with the scale $\beta$ learnable.
  - Weight excitation: a two-layer MLP applied to globally averaged features, followed by a sigmoid, yielding channel weights $W_c$.
  - CAWE fusion: $F_{\mathrm{CAWE}} = W_c \odot E_c$.
- CycleMLP Layer:
  - Feature maps undergo cyclic shifts along height and width; linear projections are applied to the flattened, shifted feature maps.
  - The height- and width-branch outputs are combined and passed through a sigmoid, producing a gate over the skip features.
- Skip Feature Refinement:
  - The final skip feature is the CAWE output modulated by the CycleMLP gate; it is then concatenated with the upsampled decoder output.
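A hedged PyTorch sketch of this pipeline: CAWE channel attention followed by a cyclic-shift MLP gate. `torch.roll` over channel groups is a coarse stand-in for CycleMLP's per-channel sampling offsets, and summing the height/width branches before the sigmoid is an assumption about how the branch outputs are combined:

```python
import torch
import torch.nn as nn

class CAWE(nn.Module):
    """Channel Attention Weight Excitation sketch; reduction ratio assumed."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # learnable residual scale
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        f = x.flatten(2)                                     # (B, C, N), N = H*W
        attn = torch.softmax(f @ f.transpose(1, 2), dim=-1)  # (B, C, C) channel affinity A_c
        e = (self.beta * (attn @ f) + f).view(b, c, h, w)    # E_c = beta * A_c F + F
        w_c = self.mlp(x.mean(dim=(2, 3)))                   # two-layer MLP on GAP features
        return e * w_c.view(b, c, 1, 1)                      # F_CAWE = W_c ⊙ E_c

class ChannelCycleMLP(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.cawe = CAWE(channels)
        # 1x1 convs act as channel-wise linear projections of the shifted maps.
        self.proj_h = nn.Conv2d(channels, channels, 1)
        self.proj_w = nn.Conv2d(channels, channels, 1)

    @staticmethod
    def cyclic_shift(x, dim):
        # Roll channel groups by cycling offsets -1, 0, +1 along the given axis;
        # a simple stand-in for CycleMLP's deformable per-channel sampling.
        groups = torch.chunk(x, 3, dim=1)
        return torch.cat([torch.roll(g, s, dims=dim) for g, s in zip(groups, (-1, 0, 1))], dim=1)

    def forward(self, x):
        f = self.cawe(x)
        gate = torch.sigmoid(self.proj_h(self.cyclic_shift(f, 2))
                             + self.proj_w(self.cyclic_shift(f, 3)))  # sum-combine is assumed
        return f * gate  # refined skip feature, concatenated with the decoder path
```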
Each CCM block operates in $O(HWC)$, i.e., linear in the number of pixels, so skip refinement adds only linear cost across the entire network, ensuring scalability.
5. Quantitative and Qualitative Evaluation
5.1 Benchmarks and Metrics
- Datasets: ISIC (dermoscopic), BUSI (breast ultrasound), and ACDC (cardiac MRI).
- Metrics: F1, IoU, and Dice Similarity Coefficient (DSC) across anatomical targets.
- Training: AdamW optimizer; 50 epochs for ISIC/BUSI and 129 epochs for ACDC.
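For reference, the reported overlap metrics can be computed as in the following NumPy sketch for binary masks; note that for binary segmentation the pixel-wise F1 score coincides with the DSC:

```python
import numpy as np

def overlap_metrics(pred, target, eps=1e-7):
    """DSC/F1 and IoU for binary masks (arrays of 0/1 or bool)."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    tp = np.logical_and(pred, target).sum()   # true positives
    fp = np.logical_and(pred, ~target).sum()  # false positives
    fn = np.logical_and(~pred, target).sum()  # false negatives
    dice = 2 * tp / (2 * tp + fp + fn + eps)  # DSC == pixel-wise F1
    iou = tp / (tp + fp + fn + eps)           # Jaccard index
    return dice, iou
```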
Table 2. Segmentation accuracy: U-CycleMLP vs. recent baselines
| Method | ISIC F1 | ISIC IoU | BUSI F1 | BUSI IoU | ACDC Avg DSC |
|---|---|---|---|---|---|
| MD-UNet | 91.58 | 84.81 | - | - | - |
| FM-UNet | 89.95 | 82.14 | 80.53 | 70.21 | - |
| TransUNet | - | - | - | - | 89.71 |
| Swin-UNet | - | - | - | - | 90.00 |
| U-CycleMLP | 92.76 | 86.50 | 84.85 | 78.86 | 91.11 |
U-CycleMLP achieves the highest F1/IoU on ISIC and BUSI, as well as best average DSC and myocardium segmentation accuracy on ACDC.
5.2 Qualitative Analysis
U-CycleMLP exhibits improved edge alignment, reducing false positives/negatives at irregular lesion boundaries (ISIC/BUSI) and maintaining precise ventricular and myocardial boundaries in low-contrast cardiac MRI slices (ACDC).
6. Ablation Studies and Component Effectiveness
Ablation experiments on ACDC (DSC) establish the importance of key components:
Table 3. Ablation: Impact of architectural modules (ACDC Avg DSC)
| Configuration | RV | Myo | LV | Avg |
|---|---|---|---|---|
| w/o CCM | 88.08 | 87.55 | 95.07 | 90.23 |
| with CCM | 89.28 | 88.44 | 95.63 | 91.11 |
| Patch-Merging down | 79.59 | 82.19 | 75.04 | 78.94 |
| Max-pooling down | 89.28 | 88.44 | 95.63 | 91.11 |
| Bilinear upsampling | 80.53 | 79.00 | 83.22 | 80.91 |
| Patch-Expanding up | 88.74 | 89.61 | 94.18 | 90.84 |
| Transposed Conv up | 89.28 | 88.44 | 95.63 | 91.11 |
Findings:
- Incorporating CCM yields an ≈0.9-point improvement in average DSC (90.23 → 91.11).
- Max-pooling surpasses Patch-Merging by more than 12 DSC points on average (78.94 → 91.11).
- Transposed convolution outperforms both bilinear upsampling (80.91 avg DSC) and Patch-Expanding (90.84 avg DSC).
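To make the downsampling comparison concrete, the sketch below contrasts parameter-free max-pooling with a Swin-style patch-merging layer (adapted here to NCHW tensors as an illustrative assumption):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style patch merging: fold each 2x2 window into channels, then
    linearly reduce 4C -> 2C. Requires even spatial dimensions."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(4 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):
        tl, tr = x[..., 0::2, 0::2], x[..., 0::2, 1::2]  # top-left, top-right
        bl, br = x[..., 1::2, 0::2], x[..., 1::2, 1::2]  # bottom-left, bottom-right
        return self.reduce(torch.cat([tl, tr, bl, br], dim=1))

# The ablation-preferred alternative is parameter-free and channel-preserving:
maxpool_down = nn.MaxPool2d(kernel_size=2)
```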
7. Model Complexity and Efficiency
Table 4. Model complexity (parameters and FLOPs at the evaluated input resolution)
| Model | Parameters (M) | FLOPs (G) |
|---|---|---|
| UNet | 31 | 104 |
| TransUNet | 96 | 250 |
| U-CycleMLP | 25 | 80 |
U-CycleMLP is more parameter- and compute-efficient than UNet and TransUNet, attributed to its linear-complexity CycleMLP and the use of efficient DA blocks.
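Parameter counts such as those in Table 4 can be reproduced for any PyTorch model with a one-liner; FLOPs additionally require a profiler and a fixed input resolution:

```python
def count_parameters_m(model):
    # Trainable parameter count in millions (the "Parameters (M)" column).
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# e.g., using the illustrative skeleton from Section 2:
# print(f"{count_parameters_m(UCycleMLPSkeleton()):.1f} M parameters")
```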
U-CycleMLP demonstrates a principled approach to bridging the gap between spatially localized feature learning and global context aggregation in medical image segmentation. Its architectural innovations—particularly the combination of PAWE/DA blocks and CCM skip refinement—enable state-of-the-art segmentation performance across diverse modalities while maintaining a lightweight, computation-efficient profile (Alzu'bi et al., 6 Dec 2025).