Prior-AttUNet: OCT Fluid Segmentation

Updated 1 January 2026
  • The paper introduces a dual-path architecture that combines an IntroVAE-based normative prior pathway with a U-Net segmentation backbone to enhance boundary delineation.
  • It employs a novel triple-attention mechanism that fuses multi-scale encoder, decoder, and prior features to improve segmentation accuracy and cross-device robustness.
  • Experimental results on the RETOUCH benchmark demonstrate superior mDSC performance and reliable fluid segmentation, outperforming prior architectures like DAA-UNet.

Prior-AttUNet is a retinal optical coherence tomography (OCT) fluid segmentation architecture that integrates normative anatomical priors within a dual-path, attention-gated design. The model targets two persistent challenges: delineating ambiguous fluid boundaries and generalizing across imaging devices for critical pathologies such as macular edema. To this end, it pairs a generative normal-anatomy pathway, implemented as a variational autoencoder, with a U-Net–style segmentation network, and fuses their multi-scale representations through a novel triple-attention mechanism.

1. Architectural Foundations

Prior-AttUNet adopts a hybrid dual-path architecture (Fig. 2 in (Yang et al., 25 Dec 2025)):

  • Generative Prior Pathway (“NORMNET”): Implements an IntroVAE trained exclusively on fluid-free OCT images. Given $x$, the encoder $q_\phi(z|x)$ produces mean $\mu$ and log variance $\log\sigma^2$, from which a latent $z$ is sampled. The decoder $p_\theta(x|z)$ reconstructs $x_\text{prior}$, a high-quality normative OCT. Multi-scale feature extraction from $x_\text{prior}$ yields a set $\{p^\ell\}_{\ell=1..4}$ via a 4-stage encoder, with the features aligned spatially through a symmetric decoder.
  • Segmentation Backbone: An encoder–decoder topology inspired by U-Net, incorporating:
    • DenseDepthSepBlocks at each level: $L=3$ depthwise-separable convolutions with dense connectivity and 32 channels per layer (see Eq. (2) in (Yang et al., 25 Dec 2025)).
    • Atrous Spatial Pyramid Pooling (ASPP) bottleneck: Four parallel $3\times3$ atrous convolutions with dilation rates $r\in\{1,6,12,18\}$ and a global max-pooling branch. Outputs are concatenated, compressed, and regularized (Fig. 4, Eq. (3–6)).
    • Triple Attention Gates replacing standard skip connections, fusing encoder, decoder, and prior features (see §3.4, Fig. 6).

The decoder reconstructs the segmentation mask through upsampling, skip-attention fusion, concatenation, and final dense-block processing, outputting via a sigmoid-activated $1\times1$ convolution.
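
The two custom encoder components admit a compact PyTorch sketch. The following is illustrative only: the depthwise-then-pointwise ordering, BatchNorm/ReLU placement, and the ASPP dropout rate are assumptions not specified above, while the 32-channel growth, $L=3$ layers, and dilation rates $r\in\{1,6,12,18\}$ come from the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv, then 1x1 pointwise conv, then BN + ReLU (ordering assumed)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))

class DenseDepthSepBlock(nn.Module):
    """L=3 depthwise-separable convs with dense connectivity; each layer adds 32 channels."""
    def __init__(self, in_ch, growth=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            DepthwiseSeparableConv(in_ch + i * growth, growth) for i in range(num_layers)
        ])

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            # Dense connectivity: each layer sees the concatenation of all earlier outputs.
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class ASPP(nn.Module):
    """Four parallel 3x3 atrous convs (r = 1, 6, 12, 18) plus a global max-pooling branch;
    outputs are concatenated, compressed by a 1x1 conv, and regularized by dropout."""
    def __init__(self, in_ch, out_ch, drop=0.5):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in (1, 6, 12, 18)
        ])
        self.pool = nn.AdaptiveMaxPool2d(1)          # global max pooling
        self.pool_proj = nn.Conv2d(in_ch, out_ch, 1)
        self.compress = nn.Conv2d(5 * out_ch, out_ch, 1)
        self.drop = nn.Dropout2d(drop)

    def forward(self, x):
        outs = [b(x) for b in self.branches]
        pooled = F.interpolate(self.pool_proj(self.pool(x)), size=x.shape[-2:])
        return self.drop(self.compress(torch.cat(outs + [pooled], dim=1)))
```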

2. Generative Anatomical Priors: IntroVAE

The normal anatomical prior branch is implemented as an IntroVAE:

  • Encoder: Learns $q_\phi(z|x)$, outputting $\mu$ and $\log\sigma^2$ (Eq. (7)).
  • Latent Sampling: $z = \mu + \epsilon \cdot \exp(0.5\log\sigma^2)$, $\epsilon\sim\mathcal{N}(0,I)$ (Eq. (8)).
  • Decoder: Reconstructs $x_\text{rec} = p_\theta(x|z)$ (Eq. (9)).
  • Prior: Assumed $p(z)=\mathcal{N}(0,I)$.
  • ELBO loss: $\mathcal{L}_\text{VAE} = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \mathrm{KL}[q_\phi(z|x)\,\|\,p(z)]$ (Eq. (12′)).

Multi-scale prior features $p^\ell$ are extracted using four-stage encoding (stride-2 convolutions) and symmetric decoding, spatially aligned to match the encoder–decoder skip stages. These priors are supplied to the segmentation network as guidance in attention fusion.
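
As a minimal sketch, the sampling step (Eq. (8)) and the ELBO (Eq. (12′)) map directly to a few lines of PyTorch. The per-pixel MSE reconstruction term and the placeholder encoder/decoder are assumptions, and IntroVAE's additional adversarial training terms are omitted:

```python
import torch
import torch.nn.functional as F

def sample_latent(mu, logvar):
    """Reparameterized sampling (Eq. (8)): z = mu + eps * exp(0.5 * logvar), eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + eps * torch.exp(0.5 * logvar)

def vae_loss(x, x_rec, mu, logvar):
    """ELBO-style loss (Eq. (12')): -E_q[log p(x|z)] + KL[q(z|x) || N(0, I)].

    The reconstruction term is per-pixel MSE here (an assumption; the text gives
    only the log-likelihood form). The KL term uses the closed form for a
    diagonal Gaussian posterior against a standard normal prior.
    """
    rec = F.mse_loss(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# Usage with hypothetical encoder/decoder modules:
#   mu, logvar = encoder(x)            # q_phi(z|x)
#   z = sample_latent(mu, logvar)
#   x_rec = decoder(z)                 # p_theta(x|z)
#   loss = vae_loss(x, x_rec, mu, logvar)
```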

3. Segmentation Pathway and Losses

The segmentation backbone, instantiated as a U-Net–style encoder–decoder, incorporates the following:

  • DenseDepthSepBlock: Each block contains $L=3$ depthwise-separable convolutions with dense inter-layer concatenation (Fig. 3, Eq. (1), (2)). Each layer adds 32 channels, facilitating feature reuse with minimal computational overhead.
  • ASPP: Four atrous convolutions and a global max-pooling path (Fig. 4, Eq. (3–6)) capture multi-scale cues critical for fluid segmentation across imaging devices.
  • Segmentation Loss Functions:
    • Dice Loss:

    $$\mathcal{L}_\text{Dice} = 1 - \frac{2\sum_i P_i G_i + \epsilon}{\sum_i P_i + \sum_i G_i + \epsilon}$$

    (Eq. (17)), where $P_i$ and $G_i$ are the prediction and ground truth at voxel $i$.
    • Lovász Loss: A surrogate for mean Intersection-over-Union (mIoU) across classes, $\mathcal{L}_\text{Lovász} = \sum_{c=1}^C \mathcal{L}_\text{Lovász}(m(c))$ (Eq. (18)).
    • Total Loss: $\mathcal{L}_\text{seg} = \mathcal{L}_\text{Dice} + \mathcal{L}_\text{Lovász}$.
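
A sketch of the combined objective, assuming sigmoid probabilities and binary masks. The Dice term follows Eq. (17) directly, while the Lovász surrogate is taken as an external function (e.g. a reference implementation) rather than reimplemented here:

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss (Eq. (17)) over sigmoid probabilities `pred` and binary
    masks `target` of identical shape (N, C, H, W)."""
    p = pred.flatten(1)
    g = target.flatten(1)
    dice = (2 * (p * g).sum(dim=1) + eps) / (p.sum(dim=1) + g.sum(dim=1) + eps)
    return (1 - dice).mean()

def segmentation_loss(pred, target, lovasz_fn):
    """Total objective L_seg = L_Dice + L_Lovasz. `lovasz_fn` is an external
    Lovász surrogate implementation (its sorted-error construction is not
    reimplemented here)."""
    return dice_loss(pred, target) + lovasz_fn(pred, target)
```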

4. Triple-Attention Mechanism

Prior-AttUNet’s skip fusion is governed by a triple-attention gate per decoder stage:

  • Inputs: Encoder skip $f_e = f_\text{enc}^{\ell-1}$, decoder state $f_d = f_\text{dec}^{\ell}$, and anatomical prior $p = p^{\ell-1}$.

  • Linear Projection: $X_1 = \text{BN}(\text{Conv}_{3\times3}(f_e))$, $X_2 = \text{BN}(\text{Conv}_{3\times3}(f_d))$, $X_3 = \text{BN}(\text{Conv}_{3\times3}(p))$ (Eq. (12)).

  • Fusion and Attention:

    • $V = \text{ReLU}(X_1 + X_2 + X_3)$ (Eq. (13))
    • Attention map $A = \sigma(V)$ (Eq. (14a))
    • Decoder features modulated: $f_\text{out} = f_d \odot A$ (Eq. (14b))

This implements spatial, channel, and prior-guided attention, amplifying contrast at fluid–tissue borders and improving lesion delineation, especially at ambiguous boundaries. The differential emphasis the gate places on pathological regions relative to the normative prior is visualized in attention heatmaps (see Fig. 8 in (Yang et al., 25 Dec 2025)).
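
One gate maps almost line-for-line onto Eqs. (12)–(14). In the sketch below, projecting all three inputs to the decoder's channel width (so the elementwise product in Eq. (14b) is shape-consistent) and bilinearly upsampling the decoder state to the skip resolution are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleAttentionGate(nn.Module):
    """One triple-attention gate: X_i = BN(Conv3x3(.)) for encoder skip f_e,
    decoder state f_d, and prior p (Eq. (12)); V = ReLU(X1 + X2 + X3) (Eq. (13));
    A = sigmoid(V) (Eq. (14a)); f_out = f_d * A elementwise (Eq. (14b))."""
    def __init__(self, enc_ch, dec_ch, prior_ch):
        super().__init__()
        def proj(in_ch):
            # Project to dec_ch channels so the final product is shape-consistent.
            return nn.Sequential(nn.Conv2d(in_ch, dec_ch, 3, padding=1),
                                 nn.BatchNorm2d(dec_ch))
        self.proj_e = proj(enc_ch)
        self.proj_d = proj(dec_ch)
        self.proj_p = proj(prior_ch)

    def forward(self, f_e, f_d, p):
        # Bring the decoder state to the skip resolution (assumed upsampling step).
        f_d = F.interpolate(f_d, size=f_e.shape[-2:], mode="bilinear",
                            align_corners=False)
        v = F.relu(self.proj_e(f_e) + self.proj_d(f_d) + self.proj_p(p))  # Eq. (13)
        a = torch.sigmoid(v)                                               # Eq. (14a)
        return f_d * a                                                     # Eq. (14b)
```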

5. Training Protocols

  • VAE Training: Conducted on normal OCT slices; $\mathcal{L}_\text{VAE}$ is minimized over 150 epochs, with the best weights retained based on reconstruction fidelity. The learned weights are frozen before segmentation training.
  • Segmentation Training: Performed on all annotated slices (input $256\times256$, batch size 16, AdamW with lr $=10^{-4}$, $\beta_1=0.9$, $\beta_2=0.99$, $\epsilon=10^{-8}$) for 150 epochs. No heavy augmentation is reported.
  • Total Parameters: 47.04M; Compute: 0.37 TFLOPs per inference.
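
The reported hyperparameters translate directly into an optimizer configuration. In the sketch below, the two single-convolution modules are placeholders standing in for the pre-trained IntroVAE and the segmentation backbone:

```python
import torch
import torch.nn as nn

# Placeholders standing in for the pre-trained IntroVAE prior and the segmentation net.
prior_vae = nn.Conv2d(1, 1, 3, padding=1)
seg_net = nn.Conv2d(1, 1, 3, padding=1)

# Freeze the prior pathway after VAE pre-training, as described above.
for param in prior_vae.parameters():
    param.requires_grad = False

# AdamW exactly as reported: lr = 1e-4, betas = (0.9, 0.99), eps = 1e-8.
# Only parameters that remain trainable (the segmentation network) are optimized.
trainable = [p for p in seg_net.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, betas=(0.9, 0.99), eps=1e-8)
```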

6. Experimental Evaluation

Evaluation on the public RETOUCH benchmark demonstrates:

Device       mDSC (%)   SD (mDSC)
Cirrus       93.93      ±1.6
Spectralis   95.18      ±0.3
Topcon       93.47      ±0.3
  • Outperforms DAA-UNet by ≈3 percentage points in mDSC at significantly lower FLOPs.
  • Ablations (Spectralis, Table 3):
    • Removal of normal prior: mDSC drops by 2.38%
    • Removal of triple attention: mDSC drops by 1.11%
    • Removal of ASPP: mDSC drops by 1.12%
    • Removal of dense blocks: mDSC drops by 1.88%
  • Further ablations (Tables 4–7): the IntroVAE prior cannot be replaced without loss; alternative feature extractors and depthwise-separable block variants underperform or are less efficient.
  • Inference: Real-time on an RTX 4090.

7. Context, Robustness, and Significance

  • Boundary Delineation: The triple attention gate accentuates differential features between pathological and normative anatomy, critical in challenging fluid-vs-tissue transitions.
  • Cross-Device Robustness: Consistent mDSC and low inter-device variance across Cirrus, Spectralis, and Topcon are observed, verifying generalizability to intensity and structural differences common in multi-vendor clinical OCT.
  • Clinical Relevance: Achieves accuracy and efficiency suitable for integration into automated diagnostic pipelines.
  • Comparative Significance: By integrating multi-scale anatomical priors through novel attention fusion, Prior-AttUNet advances performance and efficiency on the RETOUCH OCT segmentation benchmark, providing a robust and generalizable solution for retina fluid analysis (Yang et al., 25 Dec 2025).