
Recurrent Residual Blocks (RRCU/R2CL)

Updated 10 December 2025
  • Recurrent Residual Blocks (RRCU/R2CL) are convolutional units that combine recurrence and residual connections to enhance context aggregation and gradient flow.
  • They integrate feed-forward and recurrent convolutions in both 2D and 3D settings, enabling iterative feature refinement in segmentation and video analysis tasks.
  • Empirical studies show that using RRCU blocks in U-Net variants leads to measurable improvements, such as higher IoU and reduced error rates in action recognition and medical image segmentation.

Recurrent Residual Blocks (RRCU/R2CL) are architectural primitives that fuse recurrence and residual learning within a convolutional framework. These blocks generalize spatial residual learning to spatio-temporal or iterative contexts by introducing recurrent computations and explicit identity short-cuts. Designed for both video modeling and medical image segmentation, RRCU/R2CL blocks enhance representational power, facilitate deeper networks without gradient degradation, and enable efficient context aggregation across time or iterative steps. Their implementations span 2D and 3D ConvNets, frequently within U-Net–derived architectures, and have demonstrated empirical improvements over classical convolutional, recurrent, and standard residual models.

1. Mathematical Formulation of RRCU/R2CL

A Recurrent Residual Convolutional Unit (RRCU) combines two mechanisms: recurrence over time or pseudo-time, and a residual identity shortcut. The canonical RRCU unfolds for a fixed number of steps $T$, sharing weights across iterations.

For a 2D RRCU as used in segmentation networks (Mubashar et al., 2022, Dutta, 2021):

$$
\begin{aligned}
h^{(0)} &= W^f * x \\
h^{(t)} &= W^f * x + W^r * h^{(t-1)}, \quad t = 1, \ldots, T \\
o^{(t)} &= \mathrm{ReLU}\left(h^{(t)}\right)
\end{aligned}
$$

The final output of the block is

$$y = x + o^{(T)}$$
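
A minimal PyTorch sketch of this 2D formulation is given below; the module and argument names (`RRCU2d`, `n_steps`) are illustrative rather than taken from the cited implementations, and BatchNorm is omitted for brevity.

```python
import torch
import torch.nn as nn

class RRCU2d(nn.Module):
    """2D Recurrent Residual Convolutional Unit, following the equations above:
    h^(0) = W^f * x;  h^(t) = W^f * x + W^r * h^(t-1);  y = x + ReLU(h^(T))."""

    def __init__(self, channels: int, n_steps: int = 2):
        super().__init__()
        # Feed-forward convolution W^f and recurrent convolution W^r.
        # The recurrent weights are shared across all n_steps iterations.
        self.conv_f = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_r = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.n_steps = n_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ff = self.conv_f(x)            # W^f * x, computed once and reused
        h = ff                         # h^(0)
        for _ in range(self.n_steps):  # t = 1, ..., T with shared weights
            h = ff + self.conv_r(h)    # h^(t) = W^f * x + W^r * h^(t-1)
        return x + torch.relu(h)       # y = x + o^(T), the residual addition
```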

Similarly, in 3D settings (Kadia et al., 2021), the block extends to volumetric tensors:

$$
\begin{aligned}
h^{0} &= 0 \\
h^{t} &= \sigma\left(W_x * x + W_h * h^{t-1} + b\right), \quad t = 1, \ldots, T \\
y &= h^{T} + x
\end{aligned}
$$
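
The 3D formulation can be sketched analogously (again with illustrative names, taking σ to be ReLU); the main differences from the 2D unit are the zero-initialized hidden state, the activation applied inside each step, and the volumetric convolutions:

```python
import torch
import torch.nn as nn

class RRCU3d(nn.Module):
    """3D variant: h^0 = 0; h^t = sigma(W_x * x + W_h * h^(t-1) + b); y = h^T + x."""

    def __init__(self, channels: int, n_steps: int = 3):
        super().__init__()
        self.conv_x = nn.Conv3d(channels, channels, kernel_size=3, padding=1)               # W_x (with bias b)
        self.conv_h = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)   # W_h
        self.n_steps = n_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ff = self.conv_x(x)                      # W_x * x + b, computed once
        h = torch.zeros_like(ff)                 # h^0 = 0
        for _ in range(self.n_steps):            # t = 1, ..., T
            h = torch.relu(ff + self.conv_h(h))  # sigma(W_x * x + W_h * h^(t-1) + b)
        return h + x                             # y = h^T + x
```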

In video action recognition (Iqbal et al., 2017), a temporal skip connection adds previous frame activations, optionally transformed by 1×1 convolutions, to the current block output:

$$
\begin{aligned}
y_t &= \sigma(x_t * W) + x_t + x_{t-1} \\
y_t &= \sigma(x_t * W) + x_t + x_{t-1} * W_s \\
y_t &= \sigma(x_t * W) + x_t + \sigma(x_{t-1} * W_s)
\end{aligned}
$$

where the variants reflect identity, linear, and non-linear temporal skip connections.
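
A sketch of the three temporal-skip variants, assuming σ = ReLU and illustrative names (`temporal_residual`, `skip_conv`); this is not the authors' released code:

```python
import torch
import torch.nn as nn

def temporal_residual(x_t, x_prev, conv, skip_conv=None, nonlinear_skip=False):
    """Temporal residual skip variants (identity / linear / non-linear), taking sigma = ReLU."""
    out = torch.relu(conv(x_t)) + x_t      # sigma(x_t * W) + x_t : the spatial residual part
    if skip_conv is None:
        return out + x_prev                # variant 1: identity temporal skip
    skip = skip_conv(x_prev)               # x_{t-1} * W_s via a 1x1 convolution
    if nonlinear_skip:
        skip = torch.relu(skip)            # variant 3: sigma(x_{t-1} * W_s)
    return out + skip                      # variant 2 (linear) or variant 3 (non-linear)

# Example wiring for a 64-channel block on 56x56 feature maps.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
skip_conv = nn.Conv2d(64, 64, kernel_size=1)
x_t, x_prev = torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56)
y_t = temporal_residual(x_t, x_prev, conv, skip_conv, nonlinear_skip=True)
```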

2. Internal Architecture and Recurrence

Each RRCU block typically consists of:

  • Feed-forward convolution: $W^f$ mapping the input $x$ spatially (kernel size $3\times3$ or $3\times3\times3$)
  • Recurrent convolution: $W^r$ mapping previous hidden activations, sharing weights across the $T$ steps
  • Nonlinearity: ReLU activation after each step; BatchNorm may be included post-convolution (mandatory in some segmentation models)
  • Residual connection: addition of the block's initial input feature map to the final output after $T$ recurrent steps

Most implementations use $T=2$ for segmentation (Mubashar et al., 2022, Dutta, 2021) and $T=3$ for volumetric tasks (Kadia et al., 2021). Channel dimension alignment throughout ensures that addition operations are dimensionally valid.

In segmentation U-Nets, the RRCU replaces the traditional stack of two independent convolutional layers, deepening the effective path without multiplying parameters (weights are shared across recurrent steps) and stabilizing training via explicit residual additions (Mubashar et al., 2022, Dutta, 2021, Kadia et al., 2021).
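
For illustration, one way a U-Net encoder stage could adopt this substitution is sketched below, assuming the `RRCU2d` module from Section 1; the channel projection and pooling choices are assumptions for this sketch, not the exact layer layout of R2U++ or Dense R2UNet.

```python
import torch.nn as nn

class EncoderStage(nn.Module):
    """One U-Net encoder stage: channel projection + RRCU block + downsampling."""

    def __init__(self, in_ch: int, out_ch: int, n_steps: int = 2):
        super().__init__()
        self.project = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # align channels for the residual add
        self.rrcu = RRCU2d(out_ch, n_steps=n_steps)              # replaces the usual conv-conv pair
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.rrcu(self.project(x))   # feature map forwarded to the decoder
        return self.pool(skip), skip
```

The returned `skip` tensor would feed the corresponding decoder level, mirroring the standard U-Net skip connection.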

3. Integration Into Network Architectures

RRCU/R2CL blocks are deployed in several network families:

  • Segmentation U-Nets: Replace vanilla convolutional blocks in encoder and decoder with RRCUs, integrate dense skip connections for semantic gap reduction (e.g., R2U++, Dense R2UNet) (Mubashar et al., 2022, Dutta, 2021)
  • 3D Segmentation Networks: RRCUs extend to three spatial dimensions for volumetric segmentation tasks, with skip connections operating across corresponding encoder-decoder levels (Kadia et al., 2021)
  • Action Recognition: Recurrent residual blocks add temporal context via skips from previous frames inside ResNet blocks, without introducing gating mechanisms (LSTM/GRU) (Iqbal et al., 2017)
  • Sequence Modeling (Optical Music Recognition): Stacked RRCU blocks precede sequence heads (Bi-LSTM), enriching spatial context representations before temporal decoding (Liu et al., 2020)

Dense concatenations may be used both within each block (accumulating outputs of all recurrent steps) and across blocks at the same spatial resolution, further strengthening gradient flow and feature propagation (Dutta, 2021, Mubashar et al., 2022).
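
A minimal sketch of within-block dense concatenation, assuming the step outputs are fused by a 1×1 convolution before the residual addition (an assumption for illustration, not the precise wiring of the cited papers):

```python
import torch
import torch.nn as nn

class DenseRRCU2d(nn.Module):
    """RRCU variant that concatenates the outputs of all recurrent steps before the residual add."""

    def __init__(self, channels: int, n_steps: int = 2):
        super().__init__()
        self.conv_f = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_r = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # 1x1 convolution fuses the concatenated step outputs back to `channels`.
        self.fuse = nn.Conv2d(channels * (n_steps + 1), channels, kernel_size=1)
        self.n_steps = n_steps

    def forward(self, x):
        ff = self.conv_f(x)
        h, states = ff, [ff]                      # keep h^(0) and every later step
        for _ in range(self.n_steps):
            h = torch.relu(ff + self.conv_r(h))
            states.append(h)
        return x + self.fuse(torch.cat(states, dim=1))  # dense reuse of all step outputs
```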

Implementation details:

| Paper | Recurrence T | Kernel Size | BatchNorm | Residual Skip | Dense Connections |
|---|---|---|---|---|---|
| R2U++ (Mubashar et al., 2022) | 2 | 3×3 | Yes | Yes | Yes |
| Dense R2UNet (Dutta, 2021) | 2 | 3×3 | Optional | Yes | Yes |
| R2U3D (Kadia et al., 2021) | 3 | 3×3×3 | No | Yes | No |
| R2-CRNN (Liu et al., 2020) | 2 (per RCU) | 1×1, 3×3 | Yes | Yes | No |
| Action Recog. (Iqbal et al., 2017) | 1–5 skips | 1×1, 3×3 | Yes | Temporal | No |

4. Relation to Residual and Recurrent Architectures

RRCUs generalize several established primitives:

  • Residual Blocks (ResNet): Classic block adds input to output after two spatial convolutions. RRCU introduces recurrent substructure and additional skip mechanisms, allowing iterative refinement.
  • Recurrent Convolutional Layers (RCL): Implement recurrence per spatial location, but lack explicit residual connection, making training deeper RCL stacks more prone to vanishing gradients (Mubashar et al., 2022, Dutta, 2021).
  • Fully Recurrent Networks (LSTM/GRU): Gating and memory mechanisms are replaced by direct skip connections and convolutional updates; the temporal context window is bounded by explicit skip length (Iqbal et al., 2017).

The addition of a residual connection after the recurrent steps is shown to enhance gradient flow, mitigate degradation, and allow stacking without instability or parameter explosion (Liu et al., 2020, Dutta, 2021). Dense connection patterns promote multiple information pathways, enabling stronger feature reuse and efficient long-range propagation.

5. Empirical Impact and Evaluation

RRCU/R2CL blocks have consistently yielded measurable improvements in multiple benchmarks:

  • Action Recognition (Iqbal et al., 2017): A single temporal identity skip in Block 4 of ResNet-50 lowered test error to 0.197 (versus baseline 0.236), outperforming GRU and purely spatial architectures. Best results occurred with identity mapping and moderate temporal context (T=2–3), giving a 17% relative gain.
  • Segmentation (2D, R2U++ (Mubashar et al., 2022), Dense R2UNet (Dutta, 2021)): R2U++ improved mean IoU by 1.5±0.37% and mean Dice by 0.9±0.33% over U-Net++. Dense R2UNet further promoted feature propagation, enhancing segmentation accuracy, especially for thin structures (blood vessels, lung borders).
  • Volumetric Segmentation (R2U3D (Kadia et al., 2021)): Soft-DSC improved from 0.972 (V-Net) to 0.9920 on VESSEL12, and to 0.9859 on LUNA16, demonstrating superior 3D context agglomeration.
  • Optical Music Recognition (R2-CRNN (Liu et al., 2020)): Sequence Error Rate dropped from 95.1% to 20.9%, and Symbol Error Rate from 44.3% to 7.63%, when substituting classic CRNN with RRCU-based blocks.

Visualization experiments indicated sharper object boundaries, fewer false positives, and effective learning of multi-scale structures, with validation loss convergence accelerated by RRCU integration.

6. Implementation Characteristics and Computational Cost

RRCU/R2CL blocks maintain parameter efficiency by sharing weights across recurrence steps. For each block:

  • Parameters per block: $2 \times C^2 \times k^2$ for 2D (kernel $3\times3$), $2 \times C^2 \times 3^3$ for 3D (kernel $3\times3\times3$)
  • FLOPs per block: scale as $T \times 2 \times C^2 \times k^2 \times H \times W$ (2D) or $T \times 2 \times C^2 \times 27 \times D \times H \times W$ (3D); a quick numerical check of these expressions follows this list
  • Most papers use Adam optimization and task-specific hybrid losses (cross-entropy plus Dice; see (Kadia et al., 2021) for full soft Dice coefficient and exponential logarithmic loss formulation)
  • RRCU-based U-Nets usually exhibit parameter counts comparable to or slightly above their non-recurrent counterparts (e.g., R2U++: ~18M vs. U-Net++: 9M) but yield superior performance
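
As a quick numerical check of the cost expressions above (function name and values are illustrative; bias terms and BatchNorm are ignored):

```python
def rrcu2d_cost(channels: int, k: int = 3, T: int = 2, H: int = 256, W: int = 256):
    """Approximate parameters and multiply-accumulate count for one 2D RRCU block."""
    params = 2 * channels**2 * k**2                 # W^f and W^r, shared across the T steps
    flops = T * 2 * channels**2 * k**2 * H * W      # counts both convolutions at every step, per the expression above
    return params, flops

# Example: a 64-channel block with T=2 on a 256x256 feature map.
params, flops = rrcu2d_cost(channels=64)
print(f"params = {params:,}, MACs = {flops:,}")     # params = 73,728, MACs = 9,663,676,416
```

Note that an implementation which caches the feed-forward term $W^f * x$ across steps performs slightly fewer operations than this estimate.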

7. Context and Research Directions

RRCU/R2CL blocks have seen widespread adoption in advanced segmentation architectures, sequence modeling pipelines, and spatio-temporal networks, particularly where context aggregation and gradient stability are critical. Gating modules beyond skip connections are not commonly employed; a plausible implication is that direct residual and recurrent mechanisms suffice for most practical tasks at moderate recurrence depth. Open directions include optimizing recurrence depth per layer, integrating attention or gating with RRCU for long-range context, and extending dense connectivity patterns to cross-modality tasks.

Misconceptions regarding RRCU involve conflating them with fully recurrent gated blocks (LSTM/GRU) or assuming parameter growth with increased recurrence; published implementations share weights across iterations and restrict temporal context for efficiency.

In sum, the fusion of recurrence and residual learning in convolutional units delivers effective context integration, deeper receptive fields, and stable training for both spatial and spatio-temporal deep networks, with empirical superiority over traditional convolutional and recurrent architectures in diverse applications (Iqbal et al., 2017, Mubashar et al., 2022, Liu et al., 2020, Dutta, 2021, Kadia et al., 2021).
