GCA-ResUNet: Efficient Segmentation Framework
- GCA-ResUNet is a hybrid medical image segmentation framework that integrates grouped coordinate attention with a ResNet-50 U-Net to capture both local and global features.
- The architecture replaces standard convolution blocks with ResNet bottleneck modules augmented by GCA, enhancing boundary detection and long-range context modeling with minimal overhead.
- Experimental evaluations report a significant DSC gain over traditional methods, achieving state-of-the-art performance on Synapse and ACDC datasets.
GCA-ResUNet is a hybrid medical image segmentation framework that incorporates the Grouped Coordinate Attention (GCA) mechanism into a ResNet-50-based U-Net architecture. Designed for efficient, high-precision segmentation, GCA-ResUNet addresses the challenge of modeling long-range dependencies and boundary details in medical images, achieving significant accuracy improvements with minimal parameter and computational overhead compared to transformer-based or conventional self-attention approaches (Ding et al., 18 Nov 2025).
1. Architectural Overview
GCA-ResUNet maintains the encoder–decoder topology of classical U-Net but replaces conventional convolutional encoder blocks with ResNet-50 bottleneck modules augmented by GCA. The decoder uses simple bilinear upsampling and lightweight convolutional layers. The key architectural features are:
- Encoder: Begins with a 7×7 convolution (“stem”), followed by batch normalization, ReLU activation, max pooling, and four sequential ResNet-50 bottleneck stages (Layer1–Layer4). Each bottleneck integrates a GCA module directly after the final batch normalization and before residual addition.
- Skip Connections: Multi-scale hierarchical features are extracted after the stem and at the output of each encoder stage (feat1–feat5).
- Decoder: Implements four “UnetUp” modules. Each entails upsampling by a factor of two, concatenation with corresponding encoder features, and two 3×3 convolutional layers with ReLU activations. The output is produced by a 1×1 convolution projecting to segmentation classes.
| Stage | Operation | Output Shape |
|---|---|---|
| Input | Conv7×7, stride-2 → BN → ReLU → MaxPool | B×64×112×112 |
| Layer1 | 3× Bottleneck(64→256), +GCA | B×256×112×112 |
| Layer2 | 4× Bottleneck(256→512, stride-2), +GCA | B×512×56×56 |
| Layer3 | 6× Bottleneck(512→1024, stride-2), +GCA | B×1024×28×28 |
| Layer4 | 3× Bottleneck(1024→2048, stride-2), +GCA | B×2048×14×14 |
| DecoderUp1–4 | Upsample×2, Concat, 2×(3×3 Conv+ReLU) | B×(1024→128)×(28×28→112×112) across stages |
| Output | 1×1 Conv | B×K×224×224 |
This design combines strong residual local feature extraction with explicit large-range context awareness, enabling precise segmentation even in the presence of elongated or fine structures.
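The decoder's "UnetUp" step described above can be sketched in PyTorch. This is a minimal illustration under stated assumptions (layer names, channel counts, and `align_corners` choice are placeholders, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnetUp(nn.Module):
    """One decoder step: upsample x2, concatenate the skip feature, two 3x3 conv + ReLU."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        # bilinear upsampling by a factor of two, then channel-wise concatenation
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)
        x = torch.cat([x, skip], dim=1)
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

up = UnetUp(in_ch=2048, skip_ch=1024, out_ch=512)
deep = torch.randn(1, 2048, 14, 14)   # deepest encoder output (Layer4)
skip = torch.randn(1, 1024, 28, 28)   # corresponding encoder skip feature
out = up(deep, skip)                  # upsampled, fused, refined feature map
```

Stacking four such modules, followed by a 1×1 convolution to K classes, reproduces the decoder path outlined above.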
2. Grouped Coordinate Attention Mechanism
Grouped Coordinate Attention (GCA) is central to the model's performance and efficiency. For an input tensor X of shape B×C×H×W, the C channels are evenly split into G groups (C/G channels per group). For each group:
- Directional Pooling: Extracts four spatial descriptors per group:
  - z_h^avg: average pooled along the width axis (shape B×C/G×H×1).
  - z_h^max: max pooled along the width axis (same shape).
  - z_w^avg: average pooled along the height axis (shape B×C/G×1×W).
  - z_w^max: max pooled along the height axis (same shape).
- Shared Bottleneck: The horizontal descriptors (z_h^avg, z_h^max) and the vertical descriptors (z_w^avg, z_w^max) are concatenated into a horizontal and a vertical branch. Each branch passes through a shared "bottleneck" of two 1×1 convolutions (the first reducing, the second restoring the channel count), batch normalization, ReLU, and a sigmoid activation.
- Attention Application: The two branches yield horizontal and vertical attention maps, which are broadcast-multiplied with the group's feature tensor to produce the recalibrated group output. The outputs of all G groups are concatenated to form the full output tensor.
This process injects explicit global context by enabling spatially- and channel-wise selective feature recalibration. The grouping ensures parameter and FLOP efficiency.
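The grouped pooling, shared bottleneck, and broadcast recalibration can be sketched as follows. This is an illustrative PyTorch module, not the authors' implementation; the group count, reduction ratio `r`, and all names are assumptions:

```python
import torch
import torch.nn as nn

class GroupedCoordinateAttention(nn.Module):
    """Sketch of grouped coordinate attention (group count and ratio r are illustrative)."""
    def __init__(self, channels, groups=4, r=8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        cg = channels // groups
        mid = max(cg // r, 4)
        # shared 1x1 bottleneck: reduce -> BN -> ReLU -> restore -> sigmoid
        self.reduce = nn.Conv2d(2 * cg, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU()
        self.restore = nn.Conv2d(mid, cg, kernel_size=1)

    def _branch(self, z):
        return torch.sigmoid(self.restore(self.act(self.bn(self.reduce(z)))))

    def forward(self, x):
        outs = []
        for g in x.chunk(self.groups, dim=1):
            # four directional descriptors: avg/max pooled along width and along height
            zh = torch.cat([g.mean(3, keepdim=True), g.amax(3, keepdim=True)], dim=1)  # B x 2Cg x H x 1
            zw = torch.cat([g.mean(2, keepdim=True), g.amax(2, keepdim=True)], dim=1)  # B x 2Cg x 1 x W
            ah = self._branch(zh)                                   # horizontal attention, B x Cg x H x 1
            aw = self._branch(zw.transpose(2, 3)).transpose(2, 3)   # vertical attention,  B x Cg x 1 x W
            outs.append(g * ah * aw)  # broadcast recalibration of the group's features
        return torch.cat(outs, dim=1)

gca = GroupedCoordinateAttention(channels=256, groups=4)
x = torch.randn(2, 256, 28, 28)
y = gca(x)  # output has the same shape as the input
```

Because the bottleneck operates on per-group descriptors of length H or W rather than on the full H×W map, the added parameter and FLOP cost stays small.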
3. Training, Loss Functions, and Preprocessing
The model is trained from random initialization using the Adam optimizer with a batch size of 8. The loss combines Dice loss (L_Dice) with standard pixelwise cross-entropy (L_CE).
- Preprocessing: All images/slices are rescaled to 224×224 via bilinear interpolation, with intensities either normalized to a fixed range or standardized. No explicit data augmentation is detailed.
- Resource Profile: Training and inference are conducted on 224×224 inputs with peak GPU memory usage below 4 GB; inference exceeds 20 FPS on an RTX 4060 Ti.
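A combined Dice plus cross-entropy objective of the kind described above can be sketched as follows. The equal weighting and the class count are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, num_classes, eps=1e-5, w_dice=0.5, w_ce=0.5):
    """Combined Dice + cross-entropy loss (weights w_dice/w_ce are illustrative)."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    # one-hot encode the integer label map to B x K x H x W
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    denom = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    dice = (2 * inter + eps) / (denom + eps)       # per-class soft Dice
    return w_ce * ce + w_dice * (1 - dice.mean())  # scalar training loss

logits = torch.randn(2, 9, 64, 64)            # K = 9 classes, illustrative
target = torch.randint(0, 9, (2, 64, 64))
loss = dice_ce_loss(logits, target, num_classes=9)
```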
4. Experimental Evaluation and Ablation
GCA-ResUNet is benchmarked on Synapse (multi-organ CT, 3,779 slices) and ACDC (cardiac MRI, 100 subjects). Metrics are average Dice Similarity Coefficient (DSC) per structure.
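The per-structure DSC metric used above can be computed with a few lines of NumPy. This is a generic definition of the metric, not the benchmark's evaluation script:

```python
import numpy as np

def dsc(pred, gt, label):
    """Dice similarity coefficient for one structure label in two label maps."""
    p = (pred == label)
    g = (gt == label)
    denom = p.sum() + g.sum()
    # convention: empty prediction and ground truth count as a perfect match
    return 1.0 if denom == 0 else 2.0 * np.logical_and(p, g).sum() / denom

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
score = dsc(pred, gt, 1)  # overlap 2, region sizes 3 and 2 -> 2*2/(3+2) = 0.8
```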
Quantitative Results
| Model | Synapse (DSC) | ACDC (DSC) |
|---|---|---|
| ResNet50 U-Net | 77.61 | 87.55 |
| MT-UNet | 78.59 | 90.43 |
| Swin-UNet | 79.13 | 90.00 |
| VM-UNet | 81.08 | -- |
| SelfReg-UNet | 80.54 | 91.49 |
| GCA-ResUNet | 86.11 | 92.64 |
GCA-ResUNet achieves an 8.5-point absolute DSC gain over the vanilla convolutional baseline (ResNet-50 U-Net) on Synapse (and about 5 points over the strongest competing method, VM-UNet), and delivers superior segmentation of small or elongated anatomies.
Ablation Analyses
- Attention Type: GCA provides a 2–3% DSC improvement over SE (Squeeze-and-Excitation) and CBAM (Convolutional Block Attention Module).
- Group Count: A moderate number of groups is optimal; subdividing further increases computational cost with marginal or negative returns.
- Placement: Highest global DSC is achieved when GCA is applied to every bottleneck stage; limiting application to only shallow or deep blocks degrades boundary performance.
- Statistical Significance: Improvements are statistically significant under a paired t-test across cross-validation folds; boundary F1-scores rise by 4–6%.
5. Comparative Computational Efficiency
GCA-ResUNet introduces approximately 0.5 million additional parameters and less than 10 GFLOPs overhead relative to ResNet-50 U-Net. Its grouped, coordinate-aware attention enables explicit long-range modeling at a fraction of the computational cost incurred by transformer-based or full self-attention architectures, making the model suitable for resource-constrained medical deployments.
6. Strengths, Limitations, and Prospects
The integration of GCA with ResUNet confers several properties:
- Strengths:
- Encodes spatial and channel-wise context, excelling at delineation of fine or poorly-contrasted boundaries.
- Robustness on small datasets, due to retained local convolutional bias.
- Deployability in modest hardware settings.
- Limitations:
- Attention is limited to coordinate axes; diagonal/curved-pooling could further enhance geometric modeling.
- The design targets 2D slices; extension to 3D volumes (3D-GCA) is an open direction.
- Data augmentation and semi/self-supervised strategies are not exploited in the baseline.
- Future Directions:
- Exploring richer pooling paths, unsupervised pretraining, and volumetric (3D) GCA modules may provide further improvements, especially for under-represented structures or limited data regimes.
7. Context and Significance
GCA-ResUNet advances semantic medical image segmentation by marrying efficient grouped attention with residual U-Net architectures. It demonstrates that lightweight, coordinate-aware attention can substitute for computationally intensive self-attention without forfeiting accuracy or deployment viability, as evidenced by state-of-the-art performance on major public datasets. A plausible implication is that similar grouped, axis-aware mechanisms could be adapted for other vision tasks demanding both global and local modeling under tight resource budgets (Ding et al., 18 Nov 2025).