
Gated Residual Memory in Deep CNNs

Updated 3 March 2026
  • GRM is a novel architecture that integrates an LSTM memory pathway with residual networks to improve gradient flow and enable efficient feature reuse.
  • It operates in parallel to standard ResNet computation by pooling block outputs and using recurrent LSTM updates to capture sequential feature dynamics.
  • Empirical results on benchmarks like CIFAR-10, CIFAR-100, and SVHN show that GRM-ResNets achieve higher accuracy with a moderate increase in parameters.

Gated Residual Memory (GRM) is an architectural mechanism for convolutional neural networks (CNNs) that augments deep residual networks (ResNets) with an explicit, recurrent long short-term memory (LSTM) pathway. The GRM module operates in parallel to the main ResNet stack, accumulating and selectively gating information from each residual block’s output without interfering with the ResNet’s original forward path. Its core objective is to address optimization pathologies in very deep networks, particularly vanishing gradients, while facilitating efficient cross-layer information flow and feature reuse. By integrating an LSTM-based memory interface that simultaneously tracks sequential feature transformations, GRM supports classification via a final fusion of CNN feature hierarchies and memory states (Moniz et al., 2016).

1. Architectural Structure

The GRM-enhanced ResNet consists of a standard stack of residual blocks, each computing feature transforms and skip connections as in the canonical ResNet paradigm. After each block’s post-addition and ReLU activation, the resulting feature map undergoes spatial mean-pooling to reduce dimensionality. This pooled feature vector serves as the GRM input for a recurrent step in an LSTM module. The LSTM state (hidden and cell vectors) propagates through each residual block stage, evolving as a gated function of the pooled feature representations.

The GRM mechanism remains strictly parallel to the ResNet forward computation: residual blocks receive standard inputs and emit standard outputs, unaffected by the presence of the memory path. Upon completion of the final block, the global-pooled ResNet feature vector and the LSTM's final hidden state $h_T$ are combined, typically via concatenation, as input to the classifier layer. This dual-state concatenation enables the classifier to leverage both aggregated CNN features and temporally integrated memory.

Data flow summary:

  • Input image → initial convolution → residual blocks (each block's output mean-pooled, fed to GRM) → global pool (ResNet) and final LSTM hidden state → classifier on concatenated features.
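The data flow above can be sketched end to end in plain Python. Everything here (the residual block, the recurrent update, the tiny dimensions) is an illustrative stand-in, not the paper's actual layers; the point is only to show the memory path running in parallel to the untouched ResNet path and the final concatenation.

```python
def residual_block(fmap):
    """Stand-in residual block: y = relu(x + F(x)), with F a toy scaling."""
    return [[[max(0.0, v + 0.1 * v) for v in row] for row in ch] for ch in fmap]

def mean_pool(fmap):
    """Spatial mean-pooling: average each channel's H x W grid to a scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def lstm_step(x, h, c):
    """Placeholder recurrent update (a leaky average, not a real LSTM)."""
    c = [0.5 * ci + 0.5 * xi for ci, xi in zip(c, x)]
    return list(c), c

F, T = 2, 3                                          # toy channel count and block count
fmap = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(F)]  # toy F x 2 x 2 input feature map
h = [0.0] * F
c = [0.0] * F

for _ in range(T):
    fmap = residual_block(fmap)  # main ResNet path, unaffected by the memory
    x = mean_pool(fmap)          # each block's output is pooled and fed to GRM
    h, c = lstm_step(x, h, c)    # recurrent GRM update in parallel

g = mean_pool(fmap)              # global-pooled ResNet features
z = g + h                        # concatenation [g; h_T] fed to the classifier
print(len(z))                    # → 4 (ResNet features plus memory state)
```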

2. Mathematical Formulation

For a residual network of $T$ blocks, index each block by $t$. Let $y_t \in \mathbb{R}^{F \times H \times W}$ denote the feature map post-addition and ReLU at block $t$. After mean-pooling, $x_t \in \mathbb{R}^{d}$ is the reduced-dimension GRM input. The LSTM update equations at step $t$ are:

  • Input gate: $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
  • Forget gate: $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
  • Output gate: $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
  • Cell proposal: $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
  • Cell update (with optional direct input): $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t + \alpha x_t$
  • Hidden state: $h_t = o_t \odot \tanh(c_t)$

Here, $\sigma$ denotes the sigmoid function, $\odot$ is element-wise multiplication, and $\alpha$ is typically set to 1 to directly incorporate the current feature into the memory cell. At the conclusion of the stack, the classifier input is formed by concatenating the ResNet's global-pooled vector $g$ and the final memory state $h_T$:

$z = [g; h_T]$

The prediction is computed as:

$\hat{y} = \text{softmax}(Vz + b)$

where $V$ projects the concatenated features to the output dimension.
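The update equations can be transcribed directly in plain Python. The tiny dimensions and fixed weights below are illustrative, and the sketch assumes $d = h$ so that the direct $\alpha x_t$ term in the cell update is well-defined; this is a transcription of the math, not the trained model.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(M, v):
    """Dense matrix-vector product over nested lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def grm_lstm_step(x, h_prev, c_prev, W, U, b, alpha=1.0):
    """One GRM/LSTM update: gates i, f, o, cell proposal, cell, hidden state."""
    gates = {}
    for k in ("i", "f", "o", "c"):
        pre = [wx + uh + bk for wx, uh, bk in
               zip(matvec(W[k], x), matvec(U[k], h_prev), b[k])]
        # the cell proposal uses tanh, the three gates use the sigmoid
        gates[k] = [math.tanh(v) for v in pre] if k == "c" else [sigmoid(v) for v in pre]
    # c_t = f ⊙ c_{t-1} + i ⊙ c̃_t + α x_t   (α = 1 feeds x_t straight into the cell)
    c = [f * cp + i * cc + alpha * xi
         for f, cp, i, cc, xi in zip(gates["f"], c_prev, gates["i"], gates["c"], x)]
    h = [o * math.tanh(ci) for o, ci in zip(gates["o"], c)]
    return h, c

# toy 2-dimensional weights (illustrative values, not learned parameters)
W = {k: [[0.1, 0.2], [0.3, 0.4]] for k in "ifoc"}
U = {k: [[0.05, 0.0], [0.0, 0.05]] for k in "ifoc"}
b = {k: [0.0, 0.0] for k in "ifoc"}

h1, c1 = grm_lstm_step([1.0, -1.0], [0.0, 0.0], [0.0, 0.0], W, U, b)
print(h1, c1)  # hidden state stays in (-1, 1); the cell carries the direct x term
```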

3. Implementation Details

The implementation adheres to standard ResNet configurations, varying depth and breadth for experimental comparison:

  • Network depth: Configurations follow the $6n+2$-layer paradigm, with $n \in \{5, 10, 15, 17, 21, 27\}$, yielding depths from 32 to 164 layers.
  • Breadth: Initial residual feature-map counts $F_1$ are chosen from $\{16, 24, 32, 64, 96, 128, 160, 192\}$, doubling with each halving of spatial dimension.
  • Memory parameters: LSTM hidden and cell sizes are both set to $h = 100$.
  • Gate weights: Parameterized as $W_* \in \mathbb{R}^{h \times d}$ and $U_* \in \mathbb{R}^{h \times h}$, with $d$ determined by the mean-pooled feature dimension at each stage.
  • Optimization: Trained with SGD using Nesterov momentum ($0.9$), weight decay $10^{-4}$, and staged learning-rate reductions on validation plateau. Initialization employs He for convolutions and orthonormal/identity-scaled for LSTM gates; forget-gate biases are set to $+1$, input-gate biases slightly negative.
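The depth formula in the configuration list can be checked directly:

```python
# Depths produced by the 6n+2 formula for the n values listed above.
ns = [5, 10, 15, 17, 21, 27]
depths = [6 * n + 2 for n in ns]
print(depths)  # [32, 62, 92, 104, 128, 164]
```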

Regularization includes standard data augmentations—zero-padding, random cropping, and horizontal flips—with batch normalization optionally applied to ResNet pre-activations. Dropout is not employed on the main path.
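A minimal sketch of the named augmentations (zero-padding, random crop back to the original size, horizontal flip) on a toy 4x4 single-channel "image"; the pad width of 1 and the flip probability of 1/2 are illustrative assumptions, not the paper's exact settings.

```python
import random

def zero_pad(img, p):
    """Pad a 2D grid with a border of p zeros on every side."""
    w = len(img[0]) + 2 * p
    out = [[0] * w for _ in range(p)]
    out += [[0] * p + row + [0] * p for row in img]
    out += [[0] * w for _ in range(p)]
    return out

def random_crop(img, size, rng):
    """Crop a random size x size window out of the padded grid."""
    top = rng.randrange(len(img) - size + 1)
    left = rng.randrange(len(img[0]) - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def hflip(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]

rng = random.Random(0)
img = [[r * 4 + c for c in range(4)] for r in range(4)]
aug = random_crop(zero_pad(img, 1), 4, rng)  # pad by 1, crop back to 4x4
if rng.random() < 0.5:                       # flip with probability 1/2
    aug = hflip(aug)
print(len(aug), len(aug[0]))                 # → 4 4 (shape is preserved)
```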

4. Comparative Performance

Empirical evaluations demonstrate that GRM-augmented ResNets (GRM-ResNet) can achieve improved or competitive accuracy relative to both baseline ResNets and wide ResNets across standard benchmarks. Tables below summarize key results:

CIFAR-100 Test Accuracy

| Model       | Depth | Init fm | Params (M) | Accuracy (%) |
|-------------|-------|---------|------------|--------------|
| ResNet      | 32    | 16      | 0.47       | 69.53        |
| ResNet      | 32    | 64      | 7.41       | 75.73        |
| GRM-ResNet  | 32    | 16      | 2.16       | 70.84        |
| GRM-ResNet  | 32    | 64      | 14.0       | 76.39        |
| Wide ResNet | 28    | 160     | 36.5       | 79.50        |
| GRM-ResNet  | 32    | 192     | 55.3       | 80.21        |

CIFAR-10 and SVHN Show Comparable Patterns

  • CIFAR-10: GRM-ResNet (32, 192 fm) achieves 95.84%
  • SVHN: GRM-ResNet (56, 48 fm) achieves 98.32%

Ablation on CIFAR-100 (32-layer, 64 fm): Adding GRM memory yields a +0.66% absolute test accuracy gain over the same-depth ResNet (Moniz et al., 2016).

5. Gradient Dynamics and Information Flow

The GRM architecture provides a parallel recurrent path that augments the skip connections of the base ResNet. Gradients propagate through both the explicit memory cell and the residual connections, ameliorating vanishing-gradient problems in deep hierarchies. The recurrent LSTM pathway endows the model with long-term storage across block outputs; this facilitates retention of low-level features that can inform late-stage predictions. The structure is designed so that both high-level (CNN) and temporally aggregated (memory) representations participate in downstream classification.

The gating mechanism in the LSTM ensures that only salient temporal features are retained, allowing cross-layer feature reuse and filtering irrelevant signal through time. A plausible implication is that this design enables integrative reasoning across multiple representation levels that standard feed-forward architectures may underutilize.

6. Depth-Breadth Trade-offs and Scalability

Empirical results indicate that GRM is especially effective when applied to shallow, wide ResNets, often outperforming much deeper but narrower backbones at comparable or lower computational cost. The observed diminishing returns in deeper, narrow architectures suggest that the memory pathway’s utility depends on sufficient feature dimensionality to provide the recurrent module with rich information. With excessive depth and underprovisioned feature dimensionality, marginal gains from GRM integration decrease. Therefore, optimizing the balance of network breadth and memory capacity is critical for maximal benefit.

7. Integrative Memory in Residual Architectures

The Gated Residual Memory mechanism constitutes a lightweight, modular interface for incorporating temporal and recurrent modeling into residual CNNs. It introduces moderate increases in the parameter count and computational footprint, with consistent improvements on image recognition tasks. The approach retains compatibility with standard ResNet initialization, training, and architectural design, facilitating direct comparative evaluation and practical integration into existing pipelines (Moniz et al., 2016).

References

  • Moniz, J. R. A., & Pal, C. (2016). Convolutional Residual Memory Networks.
