Recurrent Residual 3D U-Net (R2U3D)

Updated 10 February 2026
  • The paper introduces R2U3D, a novel volumetric segmentation architecture that integrates recurrent, residual, and channel recalibration mechanisms to capture long-range spatial dependencies in 3D CT scans.
  • It achieves state-of-the-art segmentation results with high Dice scores on lung CT benchmarks, even under limited training data conditions.
  • The dynamic variant reduces parameter count and enhances training efficiency through inception-like downsampling and SE blocks.

The Recurrent Residual 3D U-Net (R2U3D) is a volumetric segmentation architecture designed to leverage recurrent, residual, and Squeeze-and-Excitation mechanisms atop the U-Net paradigm with full 3D convolutions. It is motivated by the need to capture long-range spatial dependencies in volumetric medical scans, while maintaining parameter efficiency and robust training behavior. R2U3D achieves state-of-the-art segmentation performance for lung structures in chest computed tomography (CT) volumes, demonstrating high data efficiency and generalization in contexts with limited labeled samples (Kadia et al., 2021).

1. Network Architecture

R2U3D extends the standard U-Net topology into three dimensions and augments its constituent blocks with recurrent residual mechanisms. The architecture consists of an encoder (contracting path) and a decoder (expanding path), each with four resolution levels. Feature-map spatial resolution is halved at each encoder stage using an Inception-like down-sampling module, in which the outputs of parallel 3D convolutions (kernels $1\times1\times1$, $3\times3\times3$, and $5\times5\times5$, each with stride 2) and a $2\times2\times2$ max-pool are concatenated.
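Concatenating the four branches only works if they all produce the same spatial size. The short sketch below checks this with the usual output-size formula for strided convolutions. The padding of $(k-1)/2$ per convolution branch and the helper names `conv_out_size` and `branch_sizes` are assumptions for illustration; the text does not state the padding scheme.

```python
def conv_out_size(n, kernel, stride, padding):
    """Standard output length of a strided convolution or pooling along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


def branch_sizes(n):
    """Output length of each down-sampling branch for an input axis of length n.

    Assumption: each k x k x k convolution uses padding (k - 1) // 2, and the
    2 x 2 x 2 max-pool uses no padding. All branches use stride 2.
    """
    sizes = {}
    for k in (1, 3, 5):
        sizes[f"conv{k}"] = conv_out_size(n, k, stride=2, padding=(k - 1) // 2)
    sizes["maxpool2"] = conv_out_size(n, 2, stride=2, padding=0)
    return sizes


if __name__ == "__main__":
    print(branch_sizes(64))  # even input: all branches agree
    print(branch_sizes(65))  # odd input: the pooling branch differs
```

Under these assumptions every branch halves an even-length axis exactly, so the branch outputs can be concatenated along the channel axis; odd lengths would need cropping or extra padding.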

Each level is processed by stacked Recurrent Residual Convolutional Units (RRCUs). These units are parameterized by the recurrent "depth" $D$, which determines the number of sequential 3D convolutional passes applied to each input feature map at a given resolution. The network supports two principal variants:

  • Default R2U3D: Channel counts $(f_1, f_2, f_3, f_4) = (40, 80, 160, 320)$, recurrent depth set to 2 at all levels.
  • Dynamic R2U3D: Channel counts $(20, 60, 120, 240)$, with recurrent depth increasing from $d_1 = 1$ (top) to $d_4 = 4$ (bottom). Squeeze-and-Excitation (SE) blocks are placed after the recurrent convolution units.
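The two variants can be written down as a small configuration sketch. The class `R2U3DConfig` and its field names are hypothetical. Two values are assumptions rather than statements from the text: the intermediate depths $d_2 = 2$ and $d_3 = 3$ (only $d_1$ and $d_4$ are given), and the absence of SE blocks in the default variant.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class R2U3DConfig:
    """Illustrative container; class and field names are assumptions."""
    channels: Tuple[int, int, int, int]         # feature channels per level, top to bottom
    recurrent_depth: Tuple[int, int, int, int]  # recurrent passes per level, top to bottom
    use_se: bool                                # SE blocks after the recurrent units


# Values stated in the text; use_se=False is an assumption.
DEFAULT = R2U3DConfig(channels=(40, 80, 160, 320),
                      recurrent_depth=(2, 2, 2, 2),
                      use_se=False)

# Depths for levels 2 and 3 are assumed to rise linearly between the
# stated end points d_1 = 1 and d_4 = 4.
DYNAMIC = R2U3DConfig(channels=(20, 60, 120, 240),
                      recurrent_depth=(1, 2, 3, 4),
                      use_se=True)
```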

Decoder up-sampling is implemented via $2\times2\times2$ transposed convolutions (stride 2), with skip connections from encoder to decoder levels, followed by $1\times1\times1$ projection convolutions and ReLU activations.

2. Recurrent Residual and Channel-wise Recalibration Mechanisms

Each RRCU processes the input feature maps $X^\ell$ at level $\ell$ (written $X$ below) over $D$ recurrent steps via

$$H_t = \sigma\left( W * X + U * H_{t-1} + b \right),\quad t=1,\ldots,D$$

where $W$ and $U$ are 3D convolutional kernels, $b$ is a bias term, $*$ denotes 3D convolution, $\sigma$ is the ReLU activation, and $H_0=0$. The block output is obtained by a residual addition:

$$Y = \text{ReLU}(X + H_D)$$
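The recurrence and the residual merge can be traced with a minimal NumPy sketch. It is an illustration of the two equations only, not the authors' implementation: the function names `conv3d_same` and `rrcu`, the zero-padded "same" convolution, and the requirement that input and output channel counts match are all assumptions.

```python
import numpy as np


def conv3d_same(x, w):
    """Zero-padded 'same' 3D convolution (cross-correlation, as in deep-learning libraries).

    x: (C_in, nz, ny, nx); w: (C_out, C_in, k, k, k) with odd k.
    """
    c_out, _, k, _, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (p, p)))
    _, nz, ny, nx = x.shape
    out = np.zeros((c_out, nz, ny, nx), dtype=float)
    # Accumulate one kernel offset at a time; each offset is a channel-mixing matmul.
    for dz in range(k):
        for dy in range(k):
            for dx in range(k):
                patch = xp[:, dz:dz + nz, dy:dy + ny, dx:dx + nx]
                out += np.einsum("oc,cdhw->odhw", w[:, :, dz, dy, dx], patch)
    return out


def rrcu(x, W, U, b, depth):
    """H_t = ReLU(W*X + U*H_{t-1} + b) for t = 1..depth, then Y = ReLU(X + H_depth)."""
    wx = conv3d_same(x, W)   # feed-forward term, identical at every step
    h = np.zeros_like(wx)    # H_0 = 0
    for _ in range(depth):
        h = np.maximum(wx + conv3d_same(h, U) + b[:, None, None, None], 0.0)
    return np.maximum(x + h, 0.0)  # residual addition assumes C_out == C_in


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 4, 4, 4))
    W = rng.normal(size=(2, 2, 3, 3, 3)) * 0.1
    U = rng.normal(size=(2, 2, 3, 3, 3)) * 0.1
    print(rrcu(x, W, U, np.zeros(2), depth=2).shape)
```

Because the same kernels $W$ and $U$ are reused at every step, raising the depth enlarges the effective receptive field without adding parameters.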

The Dynamic-Recurrent variant (DRRCU) inserts SE blocks after the recurrent convolutions. Each SE block applies global spatial average pooling to extract a channel descriptor, passes it through two fully-connected layers with ReLU and sigmoid activations to generate per-channel weights, and rescales the feature maps before a residual merge:

$$z_k = \frac{1}{N} \sum_{i,j,l} X_k(i,j,l),$$

$$s = \sigma_2(W_2(\sigma_1(W_1 z))),$$

$$\tilde{X}_k = s_k \cdot X_k,\quad Y = X + \tilde{X}$$

Here $k$ indexes channels, $(i,j,l)$ runs over the $N$ voxels of channel $k$, $\sigma_1$ is ReLU, and $\sigma_2$ is the sigmoid.
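The squeeze, excitation, and residual-merge steps map directly onto a few array operations. The following NumPy sketch follows the three equations as written (no bias terms); the function name `se_residual` and the hidden width of the first fully-connected layer are assumptions, since the text does not give a reduction ratio.

```python
import numpy as np


def se_residual(x, W1, W2):
    """Squeeze-and-Excitation with a residual merge.

    x:  (C, nz, ny, nx) feature maps
    W1: (C_r, C) first fully-connected layer (C_r is a hypothetical hidden width)
    W2: (C, C_r) second fully-connected layer
    Returns (Y, s) where s holds the per-channel weights in (0, 1).
    """
    z = x.mean(axis=(1, 2, 3))                  # squeeze: one descriptor per channel
    hidden = np.maximum(W1 @ z, 0.0)            # sigma_1 = ReLU
    s = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))    # sigma_2 = sigmoid
    x_tilde = s[:, None, None, None] * x        # channel-wise rescaling
    return x + x_tilde, s                       # residual merge Y = X + X~


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 3, 3, 3))
    y, s = se_residual(x, rng.normal(size=(2, 4)), rng.normal(size=(4, 2)))
    print(y.shape, s)
```

Since each $s_k$ lies strictly between 0 and 1, the merge scales channel $k$ by a factor between 1 and 2, so recalibration re-weights channels relative to one another without ever suppressing the identity path.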

These mechanisms enable volumetric context accumulation and dynamic channel-wise recalibration without increasing model depth or parameter count proportionally.

3. Mathematical Objectives and Loss Formulation

Segmentation accuracy is primarily measured using the Dice Similarity Coefficient (DSC), and the model is trained with a differentiable "Soft-DSC" objective:

$$\text{DSC} = \frac{2 \sum_i p_i g_i}{\sum_i p_i + \sum_i g_i}$$

where $p_i$ is the predicted probability at voxel $i$ and $g_i$ is the ground-truth label. The loss combines an Exponential Logarithmic variant of the DSC with weighted cross-entropy (WCE):

$$\text{Total Loss} = w_{DSC} L_{DSC} + w_{CE} L_{CE}$$

$$L_{DSC} = \bigl(-\ln(\text{DSC})\bigr)^\gamma, \quad L_{CE} = \text{WCE}(\text{Logit}(p), g)$$

Default hyperparameters: $w_{DSC}=0.8$, $w_{CE}=0.2$, $\gamma=0.3$.
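A minimal NumPy sketch of the combined objective is given below, using the default weights from the text. Several details are assumptions: the function names, the small `eps` stabilizer, the binary form of the weighted cross-entropy with a hypothetical `pos_weight` (the class weights are not stated), and the choice to take probabilities rather than logits as input.

```python
import numpy as np


def soft_dsc(p, g, eps=1e-7):
    """Soft Dice coefficient over all voxels; eps guards against empty masks."""
    p, g = np.ravel(p), np.ravel(g)
    return (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)


def weighted_bce(p, g, pos_weight=1.0, eps=1e-7):
    """Binary cross-entropy on probabilities with a foreground weight (assumed form of WCE)."""
    p = np.clip(np.ravel(p), eps, 1.0 - eps)
    g = np.ravel(g)
    return float(np.mean(-(pos_weight * g * np.log(p) + (1.0 - g) * np.log(1.0 - p))))


def total_loss(p, g, w_dsc=0.8, w_ce=0.2, gamma=0.3, pos_weight=1.0):
    """w_dsc * (-ln DSC)^gamma + w_ce * WCE."""
    # Clamp at zero so rounding error cannot make the base of the power negative.
    neg_log_dsc = max(-np.log(soft_dsc(p, g)), 0.0)
    return w_dsc * neg_log_dsc ** gamma + w_ce * weighted_bce(p, g, pos_weight)


if __name__ == "__main__":
    g = np.array([1.0, 1.0, 0.0, 0.0])
    print(total_loss(np.array([0.9, 0.8, 0.2, 0.1]), g))  # good prediction
    print(total_loss(np.full(4, 0.5), g))                  # uninformative prediction
```

A perfect prediction gives DSC = 1, so the Dice term vanishes and only the (near-zero) cross-entropy term remains.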

4. Training Protocol and Dataset Regimen

The model is evaluated using two public datasets:

  • LUNA16: 888 CT volumes with lung masks (876 used; 700 train, 176 test).
  • VESSEL12: 20 CT volumes (testing only).

Preprocessing includes isotropic resampling to $256\times512\times512$, slice repetition/subsampling along the $z$-axis, and intensity normalization to $[0,1]$. Batch size is 1 (GPU-constrained), and training uses the Adam optimizer. The learning rate follows two phases: $1\mathrm{e}{-3}$ for 400 iterations, then $1\mathrm{e}{-4}$ for an additional 100. No explicit data augmentation or weight regularization is applied apart from batch normalization inside SE blocks.
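The two-phase learning-rate schedule amounts to a simple step function. The sketch below assumes zero-based iteration indices and a hypothetical function name `learning_rate`; it covers only the schedule, not the optimizer itself.

```python
def learning_rate(iteration):
    """Two-phase schedule: 1e-3 for iterations 0..399, then 1e-4 for 400..499.

    Zero-based indexing is an assumption; the text only gives the phase lengths.
    """
    if not 0 <= iteration < 500:
        raise ValueError("schedule is defined for 500 iterations")
    return 1e-3 if iteration < 400 else 1e-4


if __name__ == "__main__":
    print([learning_rate(i) for i in (0, 399, 400, 499)])
```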

For scarce data, a "round-robin" sampling regime is adopted: train for 5 epochs on a random set of 5 scans and repeat for 500 iterations. This increases exposure to data diversity without increasing dataset size.
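One possible reading of this regime is sketched below: each of the 500 iterations draws a fresh random subset of 5 scans and passes over it 5 times. The generator name `round_robin_schedule`, its parameters, the fixed seed, and the per-epoch reshuffling are assumptions for illustration.

```python
import random


def round_robin_schedule(scan_ids, iterations=500, subset_size=5,
                         epochs_per_subset=5, seed=0):
    """Yield (iteration, epoch, scan_id) in training order.

    Each iteration samples `subset_size` distinct scans, then visits that
    subset `epochs_per_subset` times before a new subset is drawn.
    """
    rng = random.Random(seed)
    for it in range(iterations):
        subset = rng.sample(scan_ids, subset_size)
        for epoch in range(epochs_per_subset):
            order = subset[:]
            rng.shuffle(order)  # assumption: reshuffle within each epoch
            for scan in order:
                yield it, epoch, scan


if __name__ == "__main__":
    steps = list(round_robin_schedule(list(range(100)), iterations=3))
    print(len(steps), steps[:5])
```

With the stated values this reading gives 500 × 5 × 5 = 12,500 single-scan training steps, consistent with the batch size of 1.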

5. Experimental Evaluation and Comparisons

Performance is benchmarked against the extended V-Net baseline. R2U3D achieves superior Dice scores:

| Test Set | Method | DSC |
|---|---|---|
| VESSEL12 (20 scans) | Extended V-Net [6] | 0.9870 |
| VESSEL12 (20 scans) | R2U3D (default) | 0.9881 |
| VESSEL12 (20 scans) | R2U3D (dynamic) | 0.9920 |
| LUNA16 (176 test) | R2U3D (default) | 0.9831 |
| LUNA16 (176 test) | R2U3D (dynamic) | 0.9859 |
| LUNA16 (all, n=776) | R2U3D (dynamic) | 0.9828 |

Notably, with only 100 training scans (no data augmentation), R2U3D (dynamic) obtains DSC = 0.9920 on VESSEL12. The network shows consistent gains in DSC as more data is added, e.g., $0.9813 \to 0.9827$ as the LUNA16 training subset increases from 100 to 700 scans.

Parameter efficiency is highlighted: the default variant has $20.3$M parameters, while the dynamic variant reduces this to $12.95$M and still achieves higher segmentation accuracy, with improved training speed and memory utilization.
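As a quick worked check of the size of this reduction, using only the two parameter counts quoted above:

$$1 - \frac{12.95}{20.3} \approx 1 - 0.638 = 0.362$$

That is, the dynamic variant uses roughly 36% fewer parameters than the default variant.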

6. Significance and Limitations

R2U3D demonstrates that integrating recurrent and residual architectures within a 3D U-Net framework effectively enlarges the volumetric receptive field and supports gradient stability. Channel-wise recalibration via SE blocks yields further gains and parameter reduction. The architecture achieves state-of-the-art Soft-DSC on lung CT segmentation benchmarks without reliance on data augmentation or excessive regularization, suggesting strong robustness and generalization (Kadia et al., 2021).

No explicit inference speed or wall-clock timing results are reported. While the model is validated on chest CT datasets, generalization to other volumetric segmentation tasks is a plausible area for further investigation.

Compared to prior 3D segmentation architectures such as V-Net, R2U3D demonstrates improved data efficiency and representational power by explicitly modeling spatial dependencies through recurrent convolutions. The use of inception-inspired down-sampling and dynamic channel recalibration distinguishes the architecture from U-Net and V-Net derivatives and aligns with contemporary trends in efficient 3D medical imaging networks.

For full methodological and implementation details, refer to "R2U3D: Recurrent Residual 3D U-Net for Lung Segmentation" (Kadia et al., 2021).
