Recurrent Layer Aggregation in CNNs

Updated 31 May 2026

Recurrent Layer Aggregation is a feature reuse mechanism that integrates outputs from previous CNN layers using a compact recurrent state.
It achieves linear parameter growth and controlled lag by sharing weights and summarizing past features, enhancing efficiency.
Empirical evaluations demonstrate improved performance in classification, detection, and segmentation with only marginal computational overhead.

Recurrent Layer Aggregation (RLA) is a mechanism for feature reuse in deep convolutional neural networks (CNNs) that introduces a parameter-efficient, recurrent aggregation path alongside existing feedforward architectures. By incorporating a compact recurrent state that summarizes information across all previous layers within each resolution stage, RLA achieves effective feature aggregation with linear parameter growth and controlled lag, addressing critical inefficiencies in prior approaches such as DenseNet. RLA modules are compatible with mainstream CNN backbones (ResNet, Xception, MobileNetV2) and have demonstrated empirical improvements on standard benchmarks in image classification, object detection, and instance segmentation (Zhao et al., 2021).

1. Motivation and Background

Layer aggregation refers to the reuse of activations from earlier layers to inform computation at the current layer. It is formalized as an aggregation $A^t = g^t(x^{t-1}, x^{t-2}, \ldots, x^0)$ of all earlier outputs, followed by the layer update $x^t = f^t(A^{t-1}, x^{t-1})$. DenseNet exemplifies this mechanism via concatenation: each layer receives the features of all preceding layers and processes them through learned convolutions. However, DenseNet's approach incurs $\mathcal{O}(L^2)$ parameter growth per $L$-layer stage and leads to substantial redundancy: empirically, low-lag connections dominate, and the contribution of more distant layers diminishes.

RLA was developed to resolve this by:

Replacing dense skip-connections with a single compact hidden state $h^t$ (the "recurrent aggregator") that summarizes all prior layer outputs,
Employing weight sharing (parameter tying) across depth, and
Achieving $\mathcal{O}(L)$ parameter and computational complexity per stage.
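The gap between quadratic and linear growth can be made concrete with a rough weight count. The sketch below is illustrative only: the function names, the channel widths, the growth rate, and the assumed layout (a 1x1 first convolution in each block that sees the hidden state as extra input channels) are assumptions chosen for the example, and biases and normalization parameters are ignored.

```python
def dense_stage_params(num_layers, in_channels, growth, kernel=3):
    """Conv weights in a DenseNet-style stage (biases and BN ignored).

    Layer t convolves the concatenation of the stage input and the t earlier
    outputs, so its fan-in grows linearly and the total grows quadratically.
    """
    total = 0
    for t in range(num_layers):
        fan_in = in_channels + t * growth
        total += kernel * kernel * fan_in * growth
    return total


def rla_extra_params(num_layers, block_channels, mid_channels, hidden, kernel=3):
    """Extra conv weights added by a shared recurrent aggregator (assumed layout).

    Each block pays for `hidden` extra input channels in its first 1x1 conv.
    The 1x1 compression and the kernel x kernel state update are counted once
    per stage because they are shared across blocks.
    """
    per_block = hidden * mid_channels
    shared = block_channels * hidden + kernel * kernel * hidden * hidden
    return num_layers * per_block + shared


# Hypothetical widths, chosen only to show the trend with depth.
for depth in (4, 8, 16):
    dense = dense_stage_params(depth, in_channels=64, growth=32)
    extra = rla_extra_params(depth, block_channels=256, mid_channels=64, hidden=32)
    print(f"L={depth:2d}  dense={dense:>9,d}  rla_extra={extra:>7,d}")
```

Doubling the depth more than doubles the dense count, while the aggregator's extra cost grows by the same fixed amount per added block.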

This design yields an aggregation effect mathematically analogous to an ARMA(1,1) process along the network depth axis, giving the RLA module better control over historical information decay while maintaining efficiency (Zhao et al., 2021).

2. Structural Design and Layerwise Operation

Within a typical residual block augmented with RLA, two parallel computational paths operate:

Residual path: A standard two- or three-convolution residual unit produces $y^t$, yielding $x^t = x^{t-1} + y^t$.
Recurrent aggregator path:
- $g_1$: A shared $1\times1$ convolution compresses $y^t$ to $k$ channels.
- $g_2$: A shared $3\times3$ convolution (with batch normalization and $\tanh$) updates the hidden state $h^t$.
- The recurrent state is updated by $h^t = g_2(g_1(y^t) + h^{t-1})$.

This process forms a single, compact “memory” (hidden state) that propagates through every block in a given stage. At input, $h^0 = 0$; at the stage boundary, $h^t$ is spatially downsampled via average pooling to match the changed resolution (e.g., after strided convolutions).

At the network’s terminus for classification, the last hidden state $h^t$ is concatenated with the last feature map $x^t$ before the final fully-connected classifier. For detection or segmentation with FPN, $h^t$ is discarded after the backbone (Zhao et al., 2021).
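The two-path structure can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' reference code: the class names `SharedAggregator`, `RLABlock`, and `RLAStage` are hypothetical, the residual unit is simplified to two 3x3 convolutions, the state update is assumed to have the add-then-convolve form $h^t = g_2(g_1(y^t) + h^{t-1})$ with the hidden state concatenated to the block input, and sharing the batch-norm layer inside $g_2$ across blocks is a simplification.

```python
import torch
import torch.nn as nn


class SharedAggregator(nn.Module):
    """g1 (1x1 compression) and g2 (BN -> tanh -> 3x3 conv), shared per stage."""

    def __init__(self, block_channels, hidden_channels):
        super().__init__()
        self.g1 = nn.Conv2d(block_channels, hidden_channels, kernel_size=1, bias=False)
        self.g2 = nn.Sequential(
            nn.BatchNorm2d(hidden_channels),
            nn.Tanh(),
            nn.Conv2d(hidden_channels, hidden_channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, y, h):
        # h^t = g2(g1(y^t) + h^{t-1})
        return self.g2(self.g1(y) + h)


class RLABlock(nn.Module):
    """Residual block whose residual unit also sees the hidden state."""

    def __init__(self, channels, hidden_channels, aggregator):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels + hidden_channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.aggregator = aggregator  # the same object in every block of the stage

    def forward(self, x, h):
        y = self.residual(torch.cat([x, h], dim=1))  # residual features y^t
        h = self.aggregator(y, h)                    # recurrent path
        return x + y, h                              # x^t = x^{t-1} + y^t


class RLAStage(nn.Module):
    """A stage of constant resolution with one aggregator tied across blocks."""

    def __init__(self, channels, hidden_channels, num_blocks):
        super().__init__()
        aggregator = SharedAggregator(channels, hidden_channels)
        self.blocks = nn.ModuleList(
            [RLABlock(channels, hidden_channels, aggregator) for _ in range(num_blocks)]
        )
        self.hidden_channels = hidden_channels

    def forward(self, x, h=None):
        if h is None:  # zero-initialized hidden state at the network input
            h = x.new_zeros(x.size(0), self.hidden_channels, x.size(2), x.size(3))
        for block in self.blocks:
            x, h = block(x, h)
        return x, h


stage = RLAStage(channels=16, hidden_channels=4, num_blocks=3)
x, h = stage(torch.randn(2, 16, 8, 8))
print(x.shape, h.shape)
```

Because every block holds a reference to the same `SharedAggregator` instance, `stage.parameters()` yields its weights only once, which is what keeps the added parameter count nearly independent of depth in this sketch.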

3. Mathematical Formulation

The core recursive equations for RLA are:

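Recurrent state update:

$h^t = g_2\big(g_1(y^t) + h^{t-1}\big)$

Residual feature update (with $f_1^t$ denoting the residual unit of block $t$):

$y^t = f_1^t\big(\mathrm{Concat}(h^{t-1}, x^{t-1})\big)$

Residual output:

$x^t = x^{t-1} + y^t$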

By recursive unrolling, $h^t$ can be interpreted as an additive aggregation of all past residual features $y^s$ ($s \le t$), filtered through depth by repeated application of $g_2$, resulting in an exponentially decaying influence from older layers. This compressed summary replaces the explicit concatenation in DenseNet and analogous systems, yielding memory efficiency and computational tractability (Zhao et al., 2021).
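The unrolling can be written out in a simplified linear case. As assumptions made purely for illustration, take the state update to be $h^t = g_2(g_1(y^t) + h^{t-1})$ with $h^0 = 0$, replace $g_2$ by a linear map $W$ (dropping batch normalization and $\tanh$), and write $u^s = g_1(y^s)$:

```latex
\begin{aligned}
h^1 &= W u^1, \\
h^2 &= W\,(u^2 + h^1) = W u^2 + W^2 u^1, \\
h^t &= W\,(u^t + h^{t-1}) = \sum_{s=1}^{t} W^{\,t-s+1} u^s .
\end{aligned}
```

Under this linearization, the contribution of $u^s$ to $h^t$ is bounded in norm by $\lVert W \rVert^{t-s+1} \lVert u^s \rVert$, so when $\lVert W \rVert < 1$ the weight on a layer decays geometrically with its lag $t - s$. The actual module is nonlinear, so this is only a qualitative guide to the decay behavior.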

4. Integration with Common Modern CNN Backbones

RLA is compatible with multiple widely-used backbone architectures, with the following integration strategies:

ResNet-50/101/152: Insert one RLA module per residual block; $g_1$ and $g_2$ are shared (tied) across all blocks within each stage of constant spatial resolution. State $h$ is initialized to zero, downsampled between stages (by average pooling), and concatenated at output before the classification FC layer.
Xception: RLA modules are shared per resolution-group of depthwise-separable convolutional blocks, with separable $g_2$.
MobileNetV2: The RLA state is concatenated after the first $1\times1$ expansion in each inverted bottleneck block to avoid channel explosion; $g_2$ uses depthwise-separable convolutions to control cost.

For all backbones, RLA maintains stage-wise weight sharing, hidden state spatial downsampling, and final concatenation for classification tasks (Zhao et al., 2021).
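The two backbone-independent steps, downsampling the hidden state at a stage boundary and concatenating it with the backbone features before the classifier, can be sketched in plain Python. The function names are hypothetical, a 2x2 window with stride 2 is assumed for the pooling, and feature maps are represented as nested lists (channels, rows, columns) only to keep the example dependency-free.

```python
def avg_pool_2x2(channel):
    """2x2 average pooling with stride 2 on one channel (list of rows).

    Assumes the height and width are both even.
    """
    pooled = []
    for i in range(0, len(channel), 2):
        row = []
        for j in range(0, len(channel[0]), 2):
            window_sum = (channel[i][j] + channel[i][j + 1]
                          + channel[i + 1][j] + channel[i + 1][j + 1])
            row.append(window_sum / 4.0)
        pooled.append(row)
    return pooled


def downsample_state(hidden):
    """Halve the spatial resolution of every hidden-state channel."""
    return [avg_pool_2x2(channel) for channel in hidden]


def global_avg_pool(feature_map):
    """Reduce each channel to its spatial mean."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def classifier_input(x_final, h_final):
    """Concatenate pooled backbone features with pooled hidden-state features."""
    return global_avg_pool(x_final) + global_avg_pool(h_final)


# One hidden-state channel of size 4x4, holding the values 0..15.
hidden = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]]
pooled_hidden = downsample_state(hidden)
print(pooled_hidden)

# Two backbone channels plus one hidden channel give a 3-dimensional classifier input.
backbone = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
print(classifier_input(backbone, pooled_hidden))
```

The classifier's input width therefore grows by the number of hidden-state channels, which is the only change needed at the head in this sketch.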

5. Complexity and Resource Overhead

Compared to standard backbones, RLA introduces minimal parameter and compute overhead, summarized as follows:

Model	Params (M)	FLOPs	Error (%)	Change
ResNet-50	24.37	3.83G	24.70 (ImageNet top-1)	-
+ RLA	24.67	4.17G	22.83 (ImageNet top-1)	-1.87 pp err., +1.2% params, +9% FLOPs
ResNet-164	1.72	8.55M	5.72 (C-10)	-
+ RLA	1.74	8.74M	4.95 (C-10)	-0.77 pp err., +1.2% params, +2.2% FLOPs

Training time increases (15–19% on ResNet-101/ImageNet), and inference speed is reduced by 2–3%. This resource increase is offset by marked improvements in accuracy and task performance across datasets (Zhao et al., 2021).
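The relative overheads quoted in the table follow directly from its raw numbers. The short check below recomputes them; the helper name `relative_change` and the dictionary layout are illustrative choices, and the values are copied from the table above.

```python
def relative_change(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


# (baseline, with RLA) pairs taken from the overhead table.
rows = {
    "ResNet-50 (ImageNet)": {
        "params": (24.37, 24.67), "flops": (3.83, 4.17), "error": (24.70, 22.83),
    },
    "ResNet-164 (CIFAR-10)": {
        "params": (1.72, 1.74), "flops": (8.55, 8.74), "error": (5.72, 4.95),
    },
}

for name, row in rows.items():
    d_params = relative_change(*row["params"])
    d_flops = relative_change(*row["flops"])
    d_error = row["error"][1] - row["error"][0]  # percentage points
    print(f"{name}: params {d_params:+.1f}%, FLOPs {d_flops:+.1f}%, error {d_error:+.2f} pp")
```

Parameters grow by roughly 1.2% in both cases, while the FLOPs overhead is close to 9% for ResNet-50 and about 2.2% for ResNet-164.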

6. Empirical Results on Standard Benchmarks

RLA has been systematically evaluated on CIFAR-10/100, ImageNet, and MS COCO. Accuracy improves consistently across backbones and tasks:

CIFAR-10/100 Test Error (%):

Model	Params	FLOPs	C-10	C-100
ResNet-110	1.73M	8.67M	6.35	28.51
+ RLA	1.80M	9.04M	5.88	27.44
ResNet-164	1.72M	8.55M	5.72	25.22
+ RLA	1.74M	8.74M	4.95	23.78

ImageNet (single-crop) Top-1 / Top-5 Error:

Model	Params	FLOPs	Top-1	Top-5
ResNet-50	24.37M	3.83G	24.70	7.80
+ RLA	24.67M	4.17G	22.83	6.58
+ ECA+RLA	24.67M	4.18G	22.15	6.11
RLA-ResNet50†	24.67M	4.17G	20.25	5.12

MS COCO Object Detection / Segmentation:

Faster R-CNN @R-50: AP improves from 36.4 → 38.8 (+2.4)
Faster R-CNN @R-101: AP improves from 38.7 → 41.2 (+2.5)
RetinaNet @R-50: AP improves from 35.6 → 37.9 (+2.3)
Mask R-CNN @R-50: bbox AP from 37.2 → 39.5 (+2.3), mask AP from 34.1 → 35.6 (+1.5)

These consistent gains, with marginal computational and parameter penalty, demonstrate the practical advantages of RLA as a module for deep feature aggregation (Zhao et al., 2021).

7. Ablation Studies and Implementation Findings

Extensive ablation on CIFAR and ImageNet explores RLA's design:

Weight sharing across depth within stage is critical: shared RLA achieves lower error and parameter count than unshared.
Feature exchange: Disabling the two-way exchange between the $h$ and $x$ paths degrades performance.
ConvLSTM in place of the simple ConvRNN does not yield additional gains and increases resource usage.
Pre-activation (BN→tanh→Conv) for $g_2$ improves results vs. post-activation.
Connectivity: Among six variants tested, add-then-ConvRNN (RLA's choice) is optimal.
RLA channel size $k$: On CIFAR-10/ResNet-164, the ablation identifies an optimal value of $k$ (Zhao et al., 2021).
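The pre- versus post-activation distinction for the state update can be illustrated with two layer orderings in PyTorch. This is a sketch under assumptions: the helper name `make_g2` is hypothetical, and the post-activation variant is taken to be Conv→BN→tanh, which the text above does not spell out.

```python
import torch
import torch.nn as nn


def make_g2(channels, pre_activation=True):
    """Build the 3x3 state-update convolution in one of two orderings."""
    conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
    if pre_activation:
        # BN -> tanh -> Conv: normalization and nonlinearity act on the input.
        return nn.Sequential(nn.BatchNorm2d(channels), nn.Tanh(), conv)
    # Conv -> BN -> tanh (assumed post-activation ordering).
    return nn.Sequential(conv, nn.BatchNorm2d(channels), nn.Tanh())


g2_pre = make_g2(4, pre_activation=True)
g2_post = make_g2(4, pre_activation=False)

state = torch.randn(2, 4, 8, 8)
out_pre, out_post = g2_pre(state), g2_post(state)
print(out_pre.shape, out_post.shape)
```

One visible difference is that the post-activation output is squashed into $[-1, 1]$ by the final $\tanh$, whereas the pre-activation output is an unbounded linear combination of bounded inputs.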

A concise pseudocode reference is provided in the original work, exemplifying stage-wise module structure, weight sharing, and recurrent updates within a PyTorch framework (Zhao et al., 2021).


References

Zhao et al. (2021). Recurrence along Depth: Deep Convolutional Neural Networks with Recurrent Layer Aggregation.
