SA-UNetv2: Efficient Retinal Segmentation
- The paper introduces cross-scale spatial attention in all skip connections, boosting segmentation accuracy on DRIVE and STARE datasets.
- SA-UNetv2 refines the convolutional unit with GroupNorm and SiLU activation and compresses the feature channels, reducing parameters by over 50% for efficient deployment.
- A compound loss combining weighted BCE and differentiable MCC effectively addresses foreground-background imbalance, improving vessel sensitivity.
SA-UNetv2 is a lightweight convolutional neural network architecture for retinal vessel segmentation, designed for robust performance and deployability in resource-constrained, CPU-only clinical environments. To address limitations identified in SA-UNet, including underused spatial attention and inadequate handling of foreground-background class imbalance, SA-UNetv2 introduces cross-scale spatial attention in all skip connections and a compound loss function tailored to the vessel segmentation task. The method achieves state-of-the-art segmentation accuracy on the DRIVE and STARE datasets with substantially reduced memory and computation requirements, enabling practical deployment in clinical diagnostics.
1. Architectural Innovations
SA-UNetv2 makes several architectural changes aimed at improving both the accuracy and the computational efficiency of retinal vessel segmentation:
- Core Convolutional Unit Modification: The basic convolutional building block transitions from the SA-UNet configuration (Conv 3×3 → DropBlock → BatchNorm → ReLU) to Conv 3×3 → DropBlock → GroupNorm → SiLU activation. Group Normalization mitigates issues associated with small batch sizes commonly encountered in medical imaging, whereas the SiLU (sigmoid-weighted linear unit) improves gradient flow for fine vessel representation.
- Feature Channel Compression: The channel progression [16, 32, 64, 128] is restructured to [16, 32, 48, 64], reducing the parameter count from 0.54M (SA-UNet) to 0.26M (about 48% of SA-UNet). This compression roughly halves the memory footprint (1.2MB model size) without discernibly degrading expressivity for multi-scale features.
- Cross-scale Spatial Attention (CSA) in Skip Connections: Unlike SA-UNet, which restricted spatial attention to the bottleneck, SA-UNetv2 introduces CSA modules on every skip pathway. Each CSA module fuses encoder ($F_e$) and decoder ($F_d$) features by channel-wise average pooling, concatenation, a 7×7 convolution, and sigmoid activation to produce an attention map:
$$M = \sigma\left(\mathrm{Conv}_{7\times 7}\left(\left[\mathrm{AvgPool}(F_e);\ \mathrm{AvgPool}(F_d)\right]\right)\right)$$
This mechanism produces skip connection features that are adaptively reweighted for both global and fine-grained vessel information transfer throughout the network hierarchy.
Feature | SA-UNet | SA-UNetv2 |
---|---|---|
Attention Use | Bottleneck only | All skip connections |
Norm+Activation | BatchNorm + ReLU | GroupNorm + SiLU |
Channel config | 16/32/64/128 | 16/32/48/64 |
Params (M) | 0.54 | 0.26 |
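The normalization and activation steps of the revised convolutional unit can be illustrated with a minimal NumPy sketch. It covers only GroupNorm and SiLU; the 3×3 convolution and DropBlock are omitted, the learned scale and shift of GroupNorm are left out, and the choice of 4 groups is a hypothetical value for illustration, not a setting taken from the paper.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize a (C, H, W) feature map within channel groups (no learned scale/shift)."""
    c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    g = x.reshape(num_groups, -1)              # each row holds all values of one group
    mean = g.mean(axis=1, keepdims=True)
    var = g.var(axis=1, keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(c, h, w)

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Stand-in for the output of a Conv 3x3 layer with 16 channels (hypothetical data).
features = rng.normal(loc=3.0, scale=2.0, size=(16, 8, 8))
normalized = group_norm(features, num_groups=4)   # num_groups=4 is an assumption
activated = silu(normalized)
print(activated.shape)
```

Because the statistics are computed per sample over channel groups, the result does not depend on how many images share a batch, which is the property that makes GroupNorm attractive for small-batch medical imaging.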
2. Cross-Scale Spatial Attention Mechanism
The CSA module in every skip connection enhances both multi-scale and semantic feature fusion. Encoder and decoder representations undergo channel-wise average pooling ($\mathrm{AvgPool}(F_e)$ and $\mathrm{AvgPool}(F_d)$), concatenation, convolution ($\mathrm{Conv}_{7\times 7}$), and nonlinearity ($\sigma$) to form a spatial attention map $M$. Multiplication with the original encoder feature ($F_e$) delivers contextually adaptive skip features ($\hat{F}_e = M \otimes F_e$).
This approach enables the network to prioritize vascular regions and suppress distractors across scales, improving the delineation of thin and faint vessels that standard skip connections in vanilla U-Nets or SA-UNet struggle to represent. CSA plausibly improves vessel continuity and completeness in segmentation maps, which is consistent with the higher F1 and Jaccard scores on benchmark datasets.
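The CSA steps described above (channel-wise average pooling, concatenation, 7×7 convolution, sigmoid, multiplication with the encoder feature) can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the convolution has no bias, the kernel is random instead of learned, and the decoder feature is assumed to have already been brought to the encoder's spatial resolution.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Zero-padded 'same' cross-correlation of a (C, H, W) input with a (C, k, k)
    kernel, summed over channels to give one (H, W) map."""
    c, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of kernel tap (i, j) across channels.
            out += np.einsum("c,chw->hw", kernel[:, i, j], xp[:, i:i + h, j:j + w])
    return out

def cross_scale_spatial_attention(f_enc, f_dec, kernel):
    """Reweight encoder features with a spatial map built from both feature sets."""
    avg_enc = f_enc.mean(axis=0, keepdims=True)            # (1, H, W)
    avg_dec = f_dec.mean(axis=0, keepdims=True)            # (1, H, W)
    stacked = np.concatenate([avg_enc, avg_dec], axis=0)   # (2, H, W)
    attn = 1.0 / (1.0 + np.exp(-conv2d_same(stacked, kernel)))  # sigmoid, (H, W)
    return attn[None] * f_enc, attn

rng = np.random.default_rng(0)
f_enc = rng.normal(size=(16, 12, 12))   # hypothetical encoder feature
f_dec = rng.normal(size=(32, 12, 12))   # hypothetical decoder feature, same H and W
kernel = rng.normal(scale=0.1, size=(2, 7, 7))
skip, attn = cross_scale_spatial_attention(f_enc, f_dec, kernel)
print(skip.shape, attn.shape)
```

Note that the attention map has a single channel, so the module adds only a 2×7×7 kernel per skip connection, which keeps its parameter cost small.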
3. Loss Functions for Foreground-Background Imbalance
To address foreground-background imbalance prevalent in retinal vessel segmentation, SA-UNetv2 employs a compound loss:
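- Weighted Binary Cross-Entropy (BCE):
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$
with $y_i$ as the ground truth and $p_i$ as the predicted probability for pixel $i$; $N$ is the pixel count.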
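- Differentiable Matthews Correlation Coefficient (MCC):
$$\mathcal{L}_{\mathrm{MCC}} = 1 - \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)} + \epsilon}$$
where the confusion matrix terms are defined over soft-predicted probabilities (for example, $TP = \sum_i p_i y_i$ for predicted probabilities $p_i$ and labels $y_i$), with $\epsilon$ as a small stabilization constant.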
The total loss is a weighted sum of the BCE and MCC terms. The weighting was selected empirically for the best trade-off between vessel sensitivity and general segmentation accuracy. This dual objective regularizes both pixel-wise predictions and global structure, improving robustness for small, thin vessels that are easily overwhelmed by the dominant background class.
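A compound loss of this kind can be sketched in NumPy as shown here. The sketch makes several assumptions: the BCE term is shown without any per-class weights, the MCC loss is taken as one minus the soft MCC, and the combination weights `w_bce` and `w_mcc` are hypothetical placeholders set to 0.5 each, not the values reported in the paper.

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Mean binary cross-entropy between labels y and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def soft_mcc_loss(y, p, eps=1e-7):
    """1 - MCC, with confusion-matrix terms computed from soft probabilities."""
    tp = np.sum(p * y)
    tn = np.sum((1.0 - p) * (1.0 - y))
    fp = np.sum(p * (1.0 - y))
    fn = np.sum((1.0 - p) * y)
    numerator = tp * tn - fp * fn
    denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return float(1.0 - numerator / denominator)

def compound_loss(y, p, w_bce=0.5, w_mcc=0.5):
    """Weighted sum of both terms; the default weights are placeholders."""
    return w_bce * bce_loss(y, p) + w_mcc * soft_mcc_loss(y, p)

y = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # sparse "vessel" pixels
p = np.array([0.9, 0.2, 0.1, 0.1, 0.6, 0.3, 0.1, 0.2])   # hypothetical predictions
print(compound_loss(y, p))
```

The MCC term depends on all four confusion-matrix entries at once, so a model that predicts background everywhere scores poorly on it even though its pixel-wise BCE may look acceptable.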
4. Performance Analysis
SA-UNetv2 was evaluated on the DRIVE and STARE datasets. Key results (all values in %) include:
Metric | DRIVE | STARE |
---|---|---|
F1 Score | 82.82 | 82.81 |
Jaccard Index | 70.69 | 70.82 |
Sensitivity | 83.64 | --- |
Specificity | 98.28 | --- |
Accuracy | 96.98 | --- |
MCC | 81.27 | 81.79 |
AUC | 98.71 | --- |
Compared to SA-UNet and alternative U-Net variants (e.g., PA-Filter), SA-UNetv2 achieves consistent gains in main segmentation metrics including F1 score, MCC, and Jaccard Index, establishing state-of-the-art performance. The combination of more targeted spatial attention and loss-induced class balance yields improved sensitivity in tiny vessel regions, addressing persistent limitations in prior architectures.
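For reference, the threshold-based metrics in the table follow from the pixel-level confusion matrix of a binary prediction. The sketch below uses the standard definitions; the function name and the toy masks are hypothetical, AUC is left out because it requires probabilities rather than a binary mask, and both classes are assumed to be present so that no denominator is zero.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred):
    """Standard binary segmentation metrics from two masks of equal shape."""
    t = np.asarray(y_true).astype(bool).ravel()
    q = np.asarray(y_pred).astype(bool).ravel()
    tp = float(np.sum(t & q))
    tn = float(np.sum(~t & ~q))
    fp = float(np.sum(~t & q))
    fn = float(np.sum(t & ~q))
    mcc_den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "f1": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": float((tp * tn - fp * fn) / mcc_den),
    }

# Toy masks: 3 vessel pixels, of which 2 are found, plus 1 false alarm.
truth = [1, 1, 0, 0, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 0, 1, 0]
metrics = segmentation_metrics(truth, pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

On this toy example accuracy is 0.75 while the Jaccard index is only 0.5, which illustrates why overlap-based metrics and MCC are more informative than accuracy when background pixels dominate.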
5. Computational Efficiency and Clinical Deployability
SA-UNetv2 is distinguished by its suitability for deployment without GPU acceleration:
- Memory Footprint: 1.2MB (0.26M parameters), less than half of SA-UNet.
- Computation: 21.19 GFLOPs per inference, down from 26.54 GFLOPs in SA-UNet.
- Inference Speed: 0.95 seconds per image on standard CPU hardware.
At roughly one image per second on a CPU, this efficiency enables near-real-time segmentation in clinical settings, including portable screening devices and point-of-care tools. These properties make SA-UNetv2 well suited to low-resource contexts where GPU hardware may be unavailable.
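The figures above can be cross-checked with simple arithmetic. The sketch below assumes weights are stored as 32-bit floats (4 bytes each), which is an assumption rather than something stated in the source; all other numbers are the ones reported above.

```python
params_sa_unet = 0.54e6      # reported SA-UNet parameter count
params_v2 = 0.26e6           # reported SA-UNetv2 parameter count
gflops_sa_unet = 26.54       # reported SA-UNet cost per inference
gflops_v2 = 21.19            # reported SA-UNetv2 cost per inference
seconds_per_image = 0.95     # reported CPU inference time

bytes_per_param = 4          # assumption: float32 weights
weights_mb = params_v2 * bytes_per_param / 1e6
param_reduction = 1.0 - params_v2 / params_sa_unet
flop_reduction = 1.0 - gflops_v2 / gflops_sa_unet
images_per_second = 1.0 / seconds_per_image

print(f"raw weight storage: {weights_mb:.2f} MB")
print(f"parameter reduction: {param_reduction:.1%}")
print(f"FLOP reduction: {flop_reduction:.1%}")
print(f"throughput: {images_per_second:.2f} images/s")
```

Under the float32 assumption the raw weights take about 1.04 MB, in the same range as the reported 1.2MB model size. The parameter count drops by about 52%, whereas the compute cost drops by only about 20%, so the savings are considerably larger in memory than in arithmetic.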
6. Applications in Medical Image Analysis
The architecture’s principal clinical role is precise retinal vessel segmentation, pivotal for:
- Early diagnosis of diabetic retinopathy, hypertension, and neurodegenerative disorders: Accurate segmentation allows for longitudinal tracking of vessel morphology, supporting pre-symptomatic screening and risk stratification.
- Automated extraction of geometric/morphological features: Enables quantitative assessment of vessel width, tortuosity, and branching for diagnostic decision support.
- Integration into portable health devices: Its lightweight and efficient design suits use in mobile screening applications and environments with limited computational infrastructure.
The same design may also be useful for other segmentation tasks that suffer from foreground-background imbalance, multi-scale structure, or limited training data, subject to adaptation of the core architecture and loss design.
7. Context and Significance Within SA-UNet Developments
SA-UNetv2 is a direct evolution of SA-UNet (Guo et al., 2020), which introduced spatial attention and structured dropout for lightweight segmentation. Earlier improvements, including StyleGAN2-based synthetic data augmentation (Potesman et al., 2023), established the benefit of attention modules and robustness enhancements. By explicitly redefining skip connection attention and loss composition, SA-UNetv2 overcomes SA-UNet’s scope limitations and advances the state of the art for both accuracy and efficiency in retinal vessel segmentation (Guo et al., 15 Sep 2025).
A plausible future implication is that the paradigm established by SA-UNetv2’s CSA and MCC-regularized segmentation could generalize to other domains where robust, interpretable, and efficient image segmentation is required under operational and annotation constraints.