Prior-AttUNet: OCT Fluid Segmentation

Updated 1 January 2026
  • The paper introduces a dual-path architecture that combines an IntroVAE-based normative prior pathway with a U-Net segmentation backbone to enhance boundary delineation.
  • It employs a novel triple-attention mechanism that fuses multi-scale encoder, decoder, and prior features to improve segmentation accuracy and cross-device robustness.
  • Experimental results on the RETOUCH benchmark demonstrate superior mDSC performance and reliable fluid segmentation, outperforming prior architectures like DAA-UNet.

Prior-AttUNet is a retinal optical coherence tomography (OCT) fluid segmentation architecture that integrates normative anatomical priors within a dual-path, attention-gated design. The model targets two persistent challenges: delineating ambiguous fluid boundaries and generalizing across imaging devices for critical pathologies such as macular edema. To this end, it pairs a generative normal-anatomy pathway, implemented as a variational autoencoder, with a U-Net–style segmentation network, and fuses their multi-scale representations through a novel triple-attention mechanism.

1. Architectural Foundations

Prior-AttUNet adopts a hybrid dual-path architecture (Fig. 2 in (Yang et al., 25 Dec 2025)):

  • Generative Prior Pathway (“NORMNET”): Implements an IntroVAE trained exclusively on fluid-free OCT images. Given $x$, the encoder $q_\phi(z|x)$ produces mean $\mu$ and log variance $\log\sigma^2$, from which a latent $z$ is sampled. The decoder $p_\theta(x|z)$ reconstructs $x_\text{prior}$, a high-quality normative OCT. Multi-scale feature extraction from $x_\text{prior}$ yields a set $\{p^\ell\}_{\ell=1..4}$ via a 4-stage encoder, with the features aligned spatially through a symmetric decoder.
  • Segmentation Backbone: An encoder–decoder topology inspired by U-Net, incorporating:
    • DenseDepthSepBlocks at each level: $L=3$ depthwise-separable convolutions with dense connectivity and 32 channels per layer (see Eq. (2) in (Yang et al., 25 Dec 2025)).
    • Atrous Spatial Pyramid Pooling (ASPP) bottleneck: Four parallel $3\times3$ atrous convolutions with dilation rates $r\in\{1,6,12,18\}$ and a global max-pooling branch. Outputs are concatenated, compressed, and regularized (Fig. 4, Eq. (3–6)).
    • Triple Attention Gates replacing standard skip connections, fusing encoder, decoder, and prior features (see §3.4, Fig. 6).

The decoder reconstructs the segmentation mask through upsampling, skip-attention fusion, concatenation, and final dense-block processing, outputting via a sigmoid-activated $1\times1$ convolution.
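
The two custom encoder components admit a compact PyTorch sketch. The following is illustrative only: the depthwise-then-pointwise ordering, BatchNorm/ReLU placement, and the ASPP dropout rate are assumptions not specified above, while the 32-channel growth, $L=3$ layers, and dilation rates $r\in\{1,6,12,18\}$ come from the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv, then 1x1 pointwise conv, then BN + ReLU (ordering assumed)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.pointwise(self.depthwise(x))))

class DenseDepthSepBlock(nn.Module):
    """L=3 depthwise-separable convs with dense connectivity; each layer adds 32 channels."""
    def __init__(self, in_ch, growth=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            DepthwiseSeparableConv(in_ch + i * growth, growth) for i in range(num_layers)
        ])

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            # Dense connectivity: each layer sees the concatenation of all earlier outputs.
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class ASPP(nn.Module):
    """Four parallel 3x3 atrous convs (r = 1, 6, 12, 18) plus a global max-pooling branch;
    outputs are concatenated, compressed by a 1x1 conv, and regularized by dropout."""
    def __init__(self, in_ch, out_ch, drop=0.5):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in (1, 6, 12, 18)
        ])
        self.pool = nn.AdaptiveMaxPool2d(1)          # global max pooling
        self.pool_proj = nn.Conv2d(in_ch, out_ch, 1)
        self.compress = nn.Conv2d(5 * out_ch, out_ch, 1)
        self.drop = nn.Dropout2d(drop)

    def forward(self, x):
        outs = [b(x) for b in self.branches]
        pooled = F.interpolate(self.pool_proj(self.pool(x)), size=x.shape[-2:])
        return self.drop(self.compress(torch.cat(outs + [pooled], dim=1)))
```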

2. Generative Anatomical Priors: IntroVAE

The normal anatomical prior branch is implemented as an IntroVAE:

  • Encoder: Learns $q_\phi(z|x)$, outputting $\mu$ and $\log\sigma^2$ (Eq. (7)).
  • Latent Sampling: $z = \mu + \epsilon \cdot \exp(0.5\log\sigma^2)$, $\epsilon\sim\mathcal{N}(0,I)$ (Eq. (8)).
  • Decoder: Reconstructs $x_\text{rec} = p_\theta(x|z)$ (Eq. (9)).
  • Prior: Assumed $p(z)=\mathcal{N}(0,I)$.
  • ELBO loss: $\mathcal{L}_\text{VAE} = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \mathrm{KL}[q_\phi(z|x)\,\|\,p(z)]$ (Eq. (12′)).

Multi-scale prior features $p^\ell$ are extracted using four-stage encoding (stride-2 convolutions) and symmetric decoding, spatially aligned to match the encoder–decoder skip stages. These priors are supplied to the segmentation network as guidance in attention fusion.
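
As a minimal sketch, the sampling step (Eq. (8)) and the ELBO (Eq. (12′)) map directly to a few lines of PyTorch. The per-pixel MSE reconstruction term and the placeholder encoder/decoder are assumptions, and IntroVAE's additional adversarial training terms are omitted:

```python
import torch
import torch.nn.functional as F

def sample_latent(mu, logvar):
    """Reparameterized sampling (Eq. (8)): z = mu + eps * exp(0.5 * logvar), eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + eps * torch.exp(0.5 * logvar)

def vae_loss(x, x_rec, mu, logvar):
    """ELBO-style loss (Eq. (12')): -E_q[log p(x|z)] + KL[q(z|x) || N(0, I)].

    The reconstruction term is per-pixel MSE here (an assumption; the text gives
    only the log-likelihood form). The KL term uses the closed form for a
    diagonal Gaussian posterior against a standard normal prior.
    """
    rec = F.mse_loss(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# Usage with hypothetical encoder/decoder modules:
#   mu, logvar = encoder(x)            # q_phi(z|x)
#   z = sample_latent(mu, logvar)
#   x_rec = decoder(z)                 # p_theta(x|z)
#   loss = vae_loss(x, x_rec, mu, logvar)
```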

3. Segmentation Pathway and Losses

The segmentation backbone, instantiated as a U-Net–style encoder–decoder, incorporates the following:

  • DenseDepthSepBlock: Each block contains $L=3$ depthwise-separable convolutions with dense inter-layer concatenation (Fig. 3, Eq. (1), (2)). Each layer adds 32 channels, facilitating feature reuse with minimal computational overhead.
  • ASPP: Four atrous convolutions and a global max-pooling path (Fig. 4, Eq. (3–6)) capture multi-scale cues critical for fluid segmentation across imaging devices.
  • Segmentation Loss Functions:
    • Dice Loss:

    $$\mathcal{L}_\text{Dice} = 1 - \frac{2\sum_i P_i G_i + \epsilon}{\sum_i P_i + \sum_i G_i + \epsilon}$$

    (Eq. (17)), where $P_i$ and $G_i$ are the prediction and ground truth at voxel $i$.
    • Lovász Loss: A surrogate for mean Intersection-over-Union (mIoU) across classes, $\mathcal{L}_\text{Lovász} = \sum_{c=1}^C \mathcal{L}_\text{Lovász}(m(c))$ (Eq. (18)).
    • Total Loss: $\mathcal{L}_\text{seg} = \mathcal{L}_\text{Dice} + \mathcal{L}_\text{Lovász}$.
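
A sketch of the combined objective, assuming sigmoid probabilities and binary masks. The Dice term follows Eq. (17) directly, while the Lovász surrogate is taken as an external function (e.g. a reference implementation) rather than reimplemented here:

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss (Eq. (17)) over sigmoid probabilities `pred` and binary
    masks `target` of identical shape (N, C, H, W)."""
    p = pred.flatten(1)
    g = target.flatten(1)
    dice = (2 * (p * g).sum(dim=1) + eps) / (p.sum(dim=1) + g.sum(dim=1) + eps)
    return (1 - dice).mean()

def segmentation_loss(pred, target, lovasz_fn):
    """Total objective L_seg = L_Dice + L_Lovasz. `lovasz_fn` is an external
    Lovász surrogate implementation (its sorted-error construction is not
    reimplemented here)."""
    return dice_loss(pred, target) + lovasz_fn(pred, target)
```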

4. Triple-Attention Mechanism

Prior-AttUNet’s skip fusion is governed by a triple-attention gate per decoder stage:

  • Inputs: Encoder skip $f_e = f_\text{enc}^{\ell-1}$, decoder state $f_d = f_\text{dec}^{\ell}$, and anatomical prior $p = p^{\ell-1}$.

  • Linear Projection: $X_1 = \text{BN}(\text{Conv}_{3\times3}(f_e))$, $X_2 = \text{BN}(\text{Conv}_{3\times3}(f_d))$, $X_3 = \text{BN}(\text{Conv}_{3\times3}(p))$ (Eq. (12)).

  • Fusion and Attention:

    • $V = \text{ReLU}(X_1 + X_2 + X_3)$ (Eq. (13))
    • Attention map $A = \sigma(V)$ (Eq. (14a))
    • Decoder features modulated: $f_\text{out} = f_d \odot A$ (Eq. (14b))

This implements spatial, channel, and prior-guided attention, amplifying contrast at fluid–tissue borders and improving lesion delineation, especially at ambiguous boundaries. The differential emphasis the gate places on pathological regions relative to the normative prior is visualized in attention heatmaps (see Fig. 8 in (Yang et al., 25 Dec 2025)).
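
One gate maps almost line-for-line onto Eqs. (12)–(14). In the sketch below, projecting all three inputs to the decoder's channel width (so the elementwise product in Eq. (14b) is shape-consistent) and bilinearly upsampling the decoder state to the skip resolution are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripleAttentionGate(nn.Module):
    """One triple-attention gate: X_i = BN(Conv3x3(.)) for encoder skip f_e,
    decoder state f_d, and prior p (Eq. (12)); V = ReLU(X1 + X2 + X3) (Eq. (13));
    A = sigmoid(V) (Eq. (14a)); f_out = f_d * A elementwise (Eq. (14b))."""
    def __init__(self, enc_ch, dec_ch, prior_ch):
        super().__init__()
        def proj(in_ch):
            # Project to dec_ch channels so the final product is shape-consistent.
            return nn.Sequential(nn.Conv2d(in_ch, dec_ch, 3, padding=1),
                                 nn.BatchNorm2d(dec_ch))
        self.proj_e = proj(enc_ch)
        self.proj_d = proj(dec_ch)
        self.proj_p = proj(prior_ch)

    def forward(self, f_e, f_d, p):
        # Bring the decoder state to the skip resolution (assumed upsampling step).
        f_d = F.interpolate(f_d, size=f_e.shape[-2:], mode="bilinear",
                            align_corners=False)
        v = F.relu(self.proj_e(f_e) + self.proj_d(f_d) + self.proj_p(p))  # Eq. (13)
        a = torch.sigmoid(v)                                               # Eq. (14a)
        return f_d * a                                                     # Eq. (14b)
```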

5. Training Protocols

  • VAE Training: Conducted on normal OCT slices; $\mathcal{L}_\text{VAE}$ is minimized over 150 epochs, with the best weights retained based on reconstruction fidelity. The learned weights are frozen before segmentation training.
  • Segmentation Training: Performed on all annotated slices (input $256\times256$, batch size 16, AdamW with lr $=10^{-4}$, $\beta_1=0.9$, $\beta_2=0.99$, $\epsilon=10^{-8}$) for 150 epochs. No heavy augmentation is reported.
  • Total Parameters: 47.04M; Compute: 0.37 TFLOPs per inference.
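
The reported hyperparameters translate directly into an optimizer configuration. In the sketch below, the two single-convolution modules are placeholders standing in for the pre-trained IntroVAE and the segmentation backbone:

```python
import torch
import torch.nn as nn

# Placeholders standing in for the pre-trained IntroVAE prior and the segmentation net.
prior_vae = nn.Conv2d(1, 1, 3, padding=1)
seg_net = nn.Conv2d(1, 1, 3, padding=1)

# Freeze the prior pathway after VAE pre-training, as described above.
for param in prior_vae.parameters():
    param.requires_grad = False

# AdamW exactly as reported: lr = 1e-4, betas = (0.9, 0.99), eps = 1e-8.
# Only parameters that remain trainable (the segmentation network) are optimized.
trainable = [p for p in seg_net.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, betas=(0.9, 0.99), eps=1e-8)
```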

6. Experimental Evaluation

Evaluation on the public RETOUCH benchmark demonstrates:

Device       mDSC (%)   SD (mDSC)
Cirrus       93.93      ±1.6
Spectralis   95.18      ±0.3
Topcon       93.47      ±0.3
  • Outperforms DAA-UNet by ≈3 percentage points in mDSC at significantly lower FLOPs.
  • Ablations (Spectralis, Table 3):
    • Removal of normal prior: mDSC drops by 2.38%
    • Removal of triple attention: mDSC drops by 1.11%
    • Removal of ASPP: mDSC drops by 1.12%
    • Removal of dense blocks: mDSC drops by 1.88%
  • Further ablations (Tables 4–7): the IntroVAE prior cannot be replaced without loss; alternative feature extractors and depthwise-separable block variants underperform or are less efficient.
  • Inference: Real-time on an RTX 4090.

7. Context, Robustness, and Significance

  • Boundary Delineation: The triple attention gate accentuates differential features between pathological and normative anatomy, critical in challenging fluid-vs-tissue transitions.
  • Cross-Device Robustness: Consistent mDSC and low inter-device variance across Cirrus, Spectralis, and Topcon are observed, verifying generalizability to intensity and structural differences common in multi-vendor clinical OCT.
  • Clinical Relevance: Achieves accuracy and efficiency suitable for integration into automated diagnostic pipelines.
  • Comparative Significance: By integrating multi-scale anatomical priors through novel attention fusion, Prior-AttUNet advances performance and efficiency on the RETOUCH OCT segmentation benchmark, providing a robust and generalizable solution for retina fluid analysis (Yang et al., 25 Dec 2025).