Recurrent Layer Aggregation in CNNs
- Recurrent Layer Aggregation is a feature reuse mechanism that integrates outputs from previous CNN layers using a compact recurrent state.
- It achieves linear parameter growth and controlled lag by sharing weights and summarizing past features, enhancing efficiency.
- Empirical evaluations demonstrate improved performance in classification, detection, and segmentation with only marginal computational overhead.
Recurrent Layer Aggregation (RLA) is a mechanism for feature reuse in deep convolutional neural networks (CNNs) that introduces a parameter-efficient, recurrent aggregation path alongside existing feedforward architectures. By incorporating a compact recurrent state that summarizes information across all previous layers within each resolution stage, RLA achieves effective feature aggregation with linear parameter growth and controlled lag, addressing critical inefficiencies in prior approaches such as DenseNet. RLA modules are compatible with mainstream CNN backbones (ResNet, Xception, MobileNetV2) and have demonstrated empirical improvements on standard benchmarks in image classification, object detection, and instance segmentation (Zhao et al., 2021).
1. Motivation and Background
Layer aggregation refers to the reuse of activations from earlier layers to inform computation at the current layer, formalized as producing an aggregate $A^{t} = g^{t}(x^{t-1}, x^{t-2}, \dots, x^{0})$ and new activations $x^{t} = f^{t}(A^{t-1}, x^{t-1})$. DenseNet exemplifies this mechanism via concatenation: each layer receives the features of all preceding layers and processes them through learned convolutions. However, DenseNet's approach incurs $O(L^2)$ parameter growth per $L$-layer stage and leads to substantial redundancy: empirically, low-lag connections dominate, and contributions from more distant layers diminish.
RLA was developed to resolve this by:
- Replacing dense skip-connections with a single compact hidden state (the "recurrent aggregator") that summarizes all prior layer outputs,
- Employing weight sharing (parameter tying) across depth, and
- Achieving $O(L)$ parameter and computational complexity per $L$-layer stage.
This design yields an aggregation effect mathematically analogous to an ARMA(1,1) process along the network depth axis, giving the RLA module better control over historical information decay while maintaining efficiency (Zhao et al., 2021).
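To make the two growth rates concrete, the sketch below counts aggregation weights for a stage of `num_layers` layers under two simplified schemes. It assumes a fixed width per layer, 1x1 aggregation convolutions for the dense case, and a hidden state read by every layer in the recurrent case; the function names and the accounting are illustrative assumptions, not the exact parameter counts of DenseNet or of Zhao et al. (2021).

```python
def dense_aggregation_params(num_layers: int, width: int) -> int:
    """Weights when layer t applies a 1x1 conv to the concatenation of all
    t previous outputs (DenseNet-style, simplified, no biases)."""
    # Layer t sees t * width input channels and emits `width` channels.
    return sum((t * width) * width for t in range(1, num_layers + 1))


def recurrent_aggregation_params(num_layers: int, width: int, hidden: int) -> int:
    """Weights when one shared aggregator is reused at every layer and each
    layer additionally reads the hidden state (simplified, no biases)."""
    # Shared 1x1 compression plus shared 3x3 state update, counted once per stage.
    shared = width * hidden + 9 * hidden * hidden
    # Each layer's first 1x1 conv gains `hidden` extra input channels.
    per_layer = hidden * width
    return shared + num_layers * per_layer


for num_layers in (4, 8, 16):
    print(
        num_layers,
        dense_aggregation_params(num_layers, width=64),
        recurrent_aggregation_params(num_layers, width=64, hidden=32),
    )
```

Under these assumptions the dense count grows with the square of the stage depth, while the recurrent count grows by a constant amount per added layer.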
2. Structural Design and Layerwise Operation
Within a typical residual block augmented with RLA, two parallel computational paths operate:
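- Residual path: A standard two- or three-convolution residual unit produces $y^{t} = f_1^{t}[\mathrm{Concat}(h^{t-1}, x^{t-1})]$, yielding $x^{t} = y^{t} + x^{t-1}$.
- Recurrent aggregator path:
  - $g_1$: A shared $1 \times 1$ convolution compresses $y^{t}$ to $k$ channels.
  - $g_2$: A shared $3 \times 3$ convolution (with batch normalization and $\tanh$) updates the hidden state $h^{t}$.
  - The recurrent state is updated by $h^{t} = g_2[g_1(y^{t}) + h^{t-1}]$.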
This process forms a single, compact “memory” (hidden state) that propagates through every block in a given stage. At the input, $h^{0} = 0$; at each stage boundary, $h$ is spatially downsampled via average pooling to match the new resolution (e.g., after strided convolutions).
At the network's terminus for classification, the final hidden state $h$ is concatenated with the final feature map $x$ before the fully-connected classifier. For detection or segmentation with FPN, $h$ is discarded after the backbone (Zhao et al., 2021).
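The two-path structure described above can be sketched in PyTorch roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `RLAStage`, the two-convolution residual unit, sharing the batch-normalization layer across blocks, and omitting the post-addition ReLU are all simplifications chosen here. The symbols `x`, `y`, and `h` denote the main features, the residual unit's output, and the hidden state; `g1` and `g2_*` are the shared aggregator layers.

```python
import torch
import torch.nn as nn


class RLAStage(nn.Module):
    """One resolution stage of residual blocks with a shared recurrent aggregator."""

    def __init__(self, channels: int, hidden: int, num_blocks: int):
        super().__init__()
        self.hidden = hidden
        # Per-block residual units; each one reads Concat(h, x).
        self.residual_units = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels + hidden, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            for _ in range(num_blocks)
        ])
        # Aggregator layers are created once and reused by every block (weight tying).
        self.g1 = nn.Conv2d(channels, hidden, 1, bias=False)  # compress y to `hidden` channels
        self.g2_bn = nn.BatchNorm2d(hidden)                   # shared here for brevity (assumption)
        self.g2_conv = nn.Conv2d(hidden, hidden, 3, padding=1, bias=False)

    def forward(self, x, h=None):
        if h is None:
            # The hidden state starts at zero at the network input.
            h = x.new_zeros((x.size(0), self.hidden, x.size(2), x.size(3)))
        for unit in self.residual_units:
            y = unit(torch.cat([h, x], dim=1))                        # residual unit on Concat(h, x)
            h = self.g2_conv(torch.tanh(self.g2_bn(self.g1(y) + h)))  # add, then BN -> tanh -> conv
            x = x + y                                                 # residual update (no ReLU here)
        return x, h


stage = RLAStage(channels=16, hidden=8, num_blocks=3)
features, state = stage(torch.randn(2, 16, 32, 32))
# At a stage boundary the hidden state is average-pooled to the next resolution.
state_next = nn.functional.avg_pool2d(state, kernel_size=2)
print(features.shape, state.shape, state_next.shape)
```

Because `g1` and `g2_conv` are instantiated once per stage, adding blocks increases only the residual-unit parameters, which is the weight-tying property the design relies on.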
3. Mathematical Formulation
The core recursive equations for RLA are:
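- Recurrent state update:
$$h^{t} = g_2\bigl[g_1(y^{t}) + h^{t-1}\bigr]$$
- Residual feature update:
$$y^{t} = f_1^{t}\bigl[\mathrm{Concat}(h^{t-1}, x^{t-1})\bigr]$$
- Residual output:
$$x^{t} = y^{t} + x^{t-1}$$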
By recursive unrolling, $h^{t}$ can be interpreted as an additive aggregation of all past $y^{s}$'s, filtered through depth by repeated application of $g_2$, resulting in an exponentially decaying influence from older layers. This compressed summary replaces the explicit concatenation in DenseNet and analogous systems, yielding memory efficiency and computational tractability (Zhao et al., 2021).
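The unrolling argument can be made explicit in a simplified linear setting. Write $y^{t}$ for the residual unit's output at block $t$, $g_1$ for the channel-compressing convolution, and $h^{t}$ for the hidden state, and assume (for illustration only) that the state update acts as a linear operator $G$, i.e. batch normalization and the nonlinearity are dropped. With $h^{0} = 0$:

```latex
\begin{aligned}
h^{t} &= G\bigl[g_1(y^{t}) + h^{t-1}\bigr] \\
      &= G\,g_1(y^{t}) + G^{2}\,g_1(y^{t-1}) + G^{2}\,h^{t-2} \\
      &= \sum_{s=1}^{t} G^{\,t-s+1}\, g_1(y^{s}).
\end{aligned}
```

Each past output $y^{s}$ therefore enters $h^{t}$ through $t-s+1$ applications of $G$. If $\lVert G \rVert < 1$, its contribution shrinks geometrically with the lag $t-s$, which is the decaying-memory behaviour described above. The actual update is nonlinear, so this is a heuristic picture rather than an exact identity.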
4. Integration with Common Modern CNN Backbones
RLA is compatible with multiple widely-used backbone architectures, with the following integration strategies:
- ResNet-50/101/152: Insert one RLA module per residual block; $g_1$ and $g_2$ are shared (tied) across all blocks within each stage of constant spatial resolution. The state $h$ is initialized to zero, downsampled between stages (by average pooling), and concatenated with the output features before the classification FC layer.
- Xception: RLA modules are shared per resolution group of depthwise-separable convolutional blocks, with a separable $g_2$.
- MobileNetV2: The RLA state is concatenated after the first $1 \times 1$ expansion in each inverted bottleneck block to avoid channel explosion; $g_2$ uses depthwise-separable convolutions to control cost.
For all backbones, RLA maintains stage-wise weight sharing, hidden state spatial downsampling, and final concatenation for classification tasks (Zhao et al., 2021).
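The cost argument for using a depthwise-separable state update in the lightweight backbones can be checked with a quick weight count. The sketch below compares a standard 3x3 convolution with a depthwise 3x3 followed by a pointwise 1x1, both with equal input and output channels and no biases; the hidden-state width of 32 is a hypothetical value chosen only for illustration.

```python
def conv3x3_params(channels: int) -> int:
    """Weights in a standard 3x3 convolution with equal in/out channels (no bias)."""
    return 9 * channels * channels


def separable3x3_params(channels: int) -> int:
    """Weights in a depthwise 3x3 conv followed by a pointwise 1x1 conv (no bias)."""
    return 9 * channels + channels * channels


hidden_width = 32  # hypothetical hidden-state width, not taken from the source
standard = conv3x3_params(hidden_width)
separable = separable3x3_params(hidden_width)
print(f"standard: {standard}, separable: {separable}, ratio: {standard / separable:.1f}x")
```

For this width the separable variant needs roughly one seventh of the weights, which is why it suits backbones that are already built from depthwise-separable blocks.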
5. Complexity and Resource Overhead
Compared to standard backbones, RLA introduces minimal parameter and compute overhead, summarized as follows:
| Model | Dataset | Params (M) | FLOPs | Top-1 Err. (%) | Change vs. baseline |
|---|---|---|---|---|---|
| ResNet-50 | ImageNet | 24.37 | 3.83G | 24.70 | baseline |
| + RLA | ImageNet | 24.67 | 4.17G | 22.83 | -1.87 pp err., +1.2% params, +9% FLOPs |
| ResNet-164 | CIFAR-10 | 1.72 | 8.55M | 5.72 | baseline |
| + RLA | CIFAR-10 | 1.74 | 8.74M | 4.95 | -0.77 pp err., +1.2% params, +2.2% FLOPs |
Training time increases (15–19% on ResNet-101/ImageNet), and inference speed is reduced by 2–3%. This resource increase is offset by marked improvements in accuracy and task performance across datasets (Zhao et al., 2021).
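The relative overheads quoted for ResNet-50 follow directly from the reported parameter, FLOP, and error figures. A short check, using a hypothetical helper `relative_overhead`:

```python
def relative_overhead(baseline: float, augmented: float) -> float:
    """Percentage increase of `augmented` over `baseline`."""
    return 100.0 * (augmented - baseline) / baseline


# ResNet-50 vs. ResNet-50 + RLA on ImageNet, figures as reported in this section.
params_pct = relative_overhead(24.37, 24.67)  # parameters, in millions
flops_pct = relative_overhead(3.83, 4.17)     # FLOPs, in billions
error_drop = 24.70 - 22.83                    # top-1 error, percentage points

print(f"params +{params_pct:.1f}%, FLOPs +{flops_pct:.1f}%, top-1 error -{error_drop:.2f} pp")
```

The FLOP overhead comes out at about 8.9%, which rounds to the 9% stated for this model.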
6. Empirical Results on Standard Benchmarks
RLA has been systematically evaluated on CIFAR-10/100, ImageNet, and MS COCO. Accuracy improves consistently across the backbones and tasks reported below:
CIFAR-10/100 Test Error (%):
| Model | Params | FLOPs | C-10 | C-100 |
|---|---|---|---|---|
| ResNet-110 | 1.73M | 8.67M | 6.35 | 28.51 |
| + RLA | 1.80M | 9.04M | 5.88 | 27.44 |
| ResNet-164 | 1.72M | 8.55M | 5.72 | 25.22 |
| + RLA | 1.74M | 8.74M | 4.95 | 23.78 |
ImageNet (single-crop) Top-1 / Top-5 Error:
| Model | Params | FLOPs | Top-1 | Top-5 |
|---|---|---|---|---|
| ResNet-50 | 24.37M | 3.83G | 24.70 | 7.80 |
| + RLA | 24.67M | 4.17G | 22.83 | 6.58 |
| + ECA+RLA | 24.67M | 4.18G | 22.15 | 6.11 |
| RLA-ResNet50† | 24.67M | 4.17G | 20.25 | 5.12 |
MS COCO Object Detection / Segmentation:
- Faster R-CNN @R-50: AP improves from 36.4 → 38.8 (+2.4)
- Faster R-CNN @R-101: AP improves from 38.7 → 41.2 (+2.5)
- RetinaNet @R-50: AP improves from 35.6 → 37.9 (+2.3)
- Mask R-CNN @R-50: bbox AP from 37.2 → 39.5 (+2.3), mask AP from 34.1 → 35.6 (+1.5)
These consistent gains, with marginal computational and parameter penalty, demonstrate the practical advantages of RLA as a module for deep feature aggregation (Zhao et al., 2021).
7. Ablation Studies and Implementation Findings
Extensive ablation on CIFAR and ImageNet explores RLA's design:
- Weight sharing across depth within stage is critical: shared RLA achieves lower error and parameter count than unshared.
- Feature exchange: Disabling the two-way exchange between the main ($x$) and recurrent ($h$) paths degrades performance.
- ConvLSTM in place of the simple ConvRNN does not yield additional gains and increases resource usage.
- Pre-activation (BN→tanh→Conv) for $g_2$ improves results vs. post-activation.
- Connectivity: Among six variants tested, add-then-ConvRNN (RLA's choice) is optimal.
- RLA channel size $k$: On CIFAR-10/ResNet-164, the ablation identifies a single best-performing value of $k$ (reported in Zhao et al., 2021).
A concise pseudocode reference is provided in the original work, exemplifying stage-wise module structure, weight sharing, and recurrent updates within a PyTorch framework (Zhao et al., 2021).