
SA-UNetv2: Efficient Retinal Segmentation

Updated 21 September 2025
  • The paper introduces cross-scale spatial attention in all skip connections, boosting segmentation accuracy on DRIVE and STARE datasets.
  • SA-UNetv2 refines the convolutional unit with GroupNorm and SiLU activation, reducing parameters by over 50% for efficient deployment.
  • A compound loss combining weighted BCE and differentiable MCC effectively addresses foreground-background imbalance, improving vessel sensitivity.

SA-UNetv2 is a lightweight convolutional neural network architecture for retinal vessel segmentation, designed for robust performance and deployability in resource-constrained, CPU-only clinical environments. Building upon limitations identified in SA-UNet, including spatial attention underutilization and inadequate foreground-background class balance, SA-UNetv2 introduces cross-scale spatial attention in all skip connections and a compound loss function that targets the challenges inherent in the vessel segmentation task. The method achieves state-of-the-art segmentation accuracy on the DRIVE and STARE datasets with greatly reduced memory and computation requirements, enabling practical deployment in clinical diagnostics.

1. Architectural Innovations

SA-UNetv2 incorporates several architectural changes with the goal of optimizing retinal vessel segmentation in terms of effectiveness and computational efficiency:

  • Core Convolutional Unit Modification: The basic convolutional building block transitions from the SA-UNet configuration (Conv 3×3 → DropBlock → BatchNorm → ReLU) to Conv 3×3 → DropBlock → GroupNorm → SiLU activation. Group Normalization mitigates issues associated with small batch sizes commonly encountered in medical imaging, whereas the SiLU (sigmoid-weighted linear unit) improves gradient flow for fine vessel representation.
  • Feature Channel Compression: The channel progression [16, 32, 64, 128] is restructured to [16, 32, 48, 64], yielding a parameter count reduction from 0.54M (SA-UNet) to 0.26M (<50% of SA-UNet). This compression halves overall memory overhead (1.2MB model size) without discernibly degrading expressivity for multi-scale features.
  • Cross-scale Spatial Attention (CSA) in Skip Connections: Unlike SA-UNet, which restricted spatial attention to the bottleneck, SA-UNetv2 introduces CSA modules on every skip pathway. Each CSA module fuses encoder ($F^e$) and decoder ($F^d$) features by channel-wise average pooling, concatenation, a 7×7 convolution, and sigmoid activation to produce an attention map:

$F^{(out)} = F^e \cdot \sigma\left( f^{7\times 7}\left( [\mathrm{AvgPool}(F^e); \mathrm{AvgPool}(F^d)] \right) \right)$

This mechanism produces skip connection features that are adaptively reweighted for both global and fine-grained vessel information transfer throughout the network hierarchy.

| Feature         | SA-UNet         | SA-UNetv2            |
|-----------------|-----------------|----------------------|
| Attention use   | Bottleneck only | All skip connections |
| Norm + activation | BatchNorm + ReLU | GroupNorm + SiLU  |
| Channel config  | 16/32/64/128    | 16/32/48/64          |
| Params (M)      | 0.54            | 0.26                 |
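The normalization and activation of the revised convolutional unit can be sketched in plain NumPy. This is a hypothetical standalone re-implementation for illustration; the paper's layers would come from a deep-learning framework, and `group_norm`/`silu` are our own function names:

```python
import numpy as np

def group_norm(x, num_groups=8, eps=1e-5):
    """GroupNorm over a (C, H, W) feature map: normalize within channel groups.

    Unlike BatchNorm, statistics are computed per sample, so behavior does
    not degrade with the small batch sizes common in medical imaging.
    """
    c, h, w = x.shape
    g = x.reshape(num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(c, h, w)

def silu(x):
    """SiLU (sigmoid-weighted linear unit): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))
```

Because SiLU is smooth and non-monotonic near zero, it passes small negative activations through with attenuated weight rather than zeroing them as ReLU does, which is the gradient-flow benefit cited for fine vessel representation.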

2. Cross-Scale Spatial Attention Mechanism

The CSA module in every skip connection enhances both multi-scale and semantic feature fusion. Encoder and decoder representations undergo channel-wise average pooling ($\mathrm{AvgPool}(F^e)$ and $\mathrm{AvgPool}(F^d)$), concatenation, convolution ($f^{7\times 7}$), and a sigmoid nonlinearity ($\sigma$) to form a spatial attention map. Multiplication with the original encoder feature ($F^e$) delivers contextually adaptive skip features ($F^{(out)}$).

This approach enables the network to prioritize vascular regions and suppress distractors across scales, improving performance in delineating thin and faint vessels that standard skip connections in vanilla U-Nets or SA-UNet struggle to represent. A plausible implication is that CSA yields heightened vessel continuity and completeness in segmentation maps, as evidenced by superior F1 and Jaccard scores on benchmark datasets.
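The CSA computation described above can be sketched in NumPy. The 7×7 kernel is random here purely for illustration (in the network it is learned), and the naive convolution stands in for a framework convolution layer; all function names are ours:

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def _conv2d_same(x, kernel):
    """Naive 2-D convolution with zero padding ('same' output size)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def csa(f_enc, f_dec, kernel=None):
    """Cross-scale spatial attention on one skip connection.

    f_enc, f_dec: (C, H, W) encoder/decoder features at the same spatial size.
    """
    if kernel is None:
        rng = np.random.default_rng(0)
        kernel = rng.standard_normal((2, 7, 7)) * 0.1  # illustrative, not learned
    # Channel-wise average pooling: one (H, W) map per stream.
    pe = f_enc.mean(axis=0)
    pd = f_dec.mean(axis=0)
    # A 7x7 conv over the 2-channel concatenation equals the sum of
    # per-channel convolutions with the corresponding kernel slices.
    attn = _sigmoid(_conv2d_same(pe, kernel[0]) + _conv2d_same(pd, kernel[1]))
    # Reweight the encoder features with the attention map.
    return f_enc * attn[None, :, :]
```

Since the attention map lies in (0, 1), the module can only attenuate encoder features, which is how background distractors are suppressed while vascular regions pass through largely unchanged.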

3. Loss Functions for Foreground-Background Imbalance

To address foreground-background imbalance prevalent in retinal vessel segmentation, SA-UNetv2 employs a compound loss:

  • Weighted Binary Cross-Entropy (BCE):

$L_{BCE} = - \frac{1}{N} \sum_i \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]$

where $y_i$ is the ground-truth label, $p_i$ the predicted probability, and $N$ the pixel count.

  • Differentiable Matthews Correlation Coefficient (MCC):

$L_{MCC} = 1 - \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})} + \varepsilon}$

where the confusion-matrix terms are computed from soft predicted probabilities, and $\varepsilon$ is a small stabilization constant.

The total loss is:

$L_{total} = \lambda_1 L_{BCE} + \lambda_2 L_{MCC}$

Empirical results favored $\lambda_1 = \lambda_2 = 0.5$ as the best trade-off between vessel sensitivity and overall segmentation accuracy. This dual objective regularizes both pixel-wise and global structure, improving robustness for small, thin vessels that are easily overwhelmed by the dominant background class.
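A minimal NumPy sketch of the compound loss follows. The soft confusion-matrix terms are computed directly from probabilities as described above; since the weighting scheme for the BCE term is not given in this excerpt, the sketch uses the unweighted form shown in the equation, and all function names are ours:

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Mean binary cross-entropy over all pixels (unweighted form)."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mcc_loss(y, p, eps=1e-7):
    """Differentiable MCC loss: confusion-matrix terms from soft probabilities."""
    tp = np.sum(p * y)
    tn = np.sum((1 - p) * (1 - y))
    fp = np.sum(p * (1 - y))
    fn = np.sum((1 - p) * y)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return 1.0 - (tp * tn - fp * fn) / denom

def total_loss(y, p, lam1=0.5, lam2=0.5):
    """Compound loss with the paper's reported lambda_1 = lambda_2 = 0.5."""
    return lam1 * bce_loss(y, p) + lam2 * mcc_loss(y, p)
```

Because the MCC term is computed over all four confusion-matrix entries, a model that predicts only background earns no credit from the dominant negative class, which is what counteracts the foreground-background imbalance.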

4. Performance Analysis

SA-UNetv2 was evaluated on the DRIVE and STARE datasets using images of size $592 \times 592 \times 3$ pixels. Key results include:

| Metric        | DRIVE | STARE |
|---------------|-------|-------|
| F1 score      | 82.82 | 82.81 |
| Jaccard index | 70.69 | 70.82 |
| Sensitivity   | 83.64 | ---   |
| Specificity   | 98.28 | ---   |
| Accuracy      | 96.98 | ---   |
| MCC           | 81.27 | 81.79 |
| AUC           | 98.71 | ---   |

Compared to SA-UNet and alternative U-Net variants (e.g., PA-Filter), SA-UNetv2 achieves consistent gains in main segmentation metrics including F1 score, MCC, and Jaccard Index, establishing state-of-the-art performance. The combination of more targeted spatial attention and loss-induced class balance yields improved sensitivity in tiny vessel regions, addressing persistent limitations in prior architectures.

5. Computational Efficiency and Clinical Deployability

SA-UNetv2 is distinguished by its suitability for deployment without GPU acceleration:

  • Memory Footprint: 1.2MB (0.26M parameters), less than half that of SA-UNet.
  • Computation: 21.19 GFLOPs per inference, down from 26.54 GFLOPs in SA-UNet.
  • Inference Speed: ~0.95 seconds per $592 \times 592 \times 3$ image on standard CPU hardware.

This efficiency enables real-time or near-real-time segmentation in clinical settings, including portable screening devices and point-of-care tools. These properties make SA-UNetv2 highly adaptable to low-resource contexts where GPU hardware may be unavailable.

6. Applications in Medical Image Analysis

The architecture’s principal clinical role is precise retinal vessel segmentation, pivotal for:

  • Early diagnosis of diabetic retinopathy, hypertension, and neurodegenerative disorders: Accurate segmentation allows for longitudinal tracking of vessel morphology, supporting pre-symptomatic screening and risk stratification.
  • Automated extraction of geometric/morphological features: Enables quantitative assessment of vessel width, tortuosity, and branching for diagnostic decision support.
  • Integration into portable health devices: Its lightweight and efficient design suits use in mobile screening applications and environments with limited computational infrastructure.

This suggests broader utility for segmentation tasks suffering from foreground-background imbalance, multi-scale structure, or limited training data—subject to adaptation of the core architecture and loss design.

7. Context and Significance Within SA-UNet Developments

SA-UNetv2 is a direct evolution of SA-UNet (Guo et al., 2020), which introduced spatial attention and structured dropout for lightweight segmentation. Earlier improvements, including StyleGAN2-based synthetic data augmentation (Potesman et al., 2023), established the benefit of attention modules and robustness enhancements. By explicitly redefining skip connection attention and loss composition, SA-UNetv2 overcomes SA-UNet’s scope limitations and advances the state of the art for both accuracy and efficiency in retinal vessel segmentation (Guo et al., 15 Sep 2025).

A plausible future implication is that the paradigm established by SA-UNetv2’s CSA and MCC-regularized segmentation could generalize to other domains where robust, interpretable, and efficient image segmentation is required under operational and annotation constraints.
