
ResNet in ResNet (RiR): Dual-Stream Design

Updated 7 April 2026
  • The paper introduces a dual-stream residual block that combines an identity-preserving residual path with a transient convolution branch to enhance expressivity and gradient flow.
  • The architecture seamlessly interpolates between CNNs and ResNets by selectively enabling or disabling cross-stream connections without extra computational or parameter overhead.
  • Empirical results on CIFAR benchmarks show that RiR outperforms comparable ResNets on CIFAR-10 and CIFAR-100, and that the wide 18-layer variant reports state-of-the-art accuracy on CIFAR-100.

ResNet in ResNet (RiR) is a deep dual-stream neural architecture that generalizes standard convolutional neural networks (CNNs) and Residual Networks (ResNets) by introducing a dual-path mechanism at each layer. RiR combines a residual stream, which preserves feature identity through shortcut connections, with a transient stream, which processes features using conventional convolution; together they support expressivity and efficient gradient propagation. The architecture is designed to interpolate between pure CNNs and conventional ResNets and is implemented with no computational or parameter overhead relative to the corresponding baseline (Targ et al., 2016).

1. Dual-Stream Residual Block Formulation

The fundamental unit of RiR is the dual-stream residual block. Each layer processes feature vectors through two parallel pathways:

  • Residual stream ($r_l$): Receives an identity shortcut at every layer, analogous to classic ResNet design.
  • Transient stream ($t_l$): Processes features via standard convolution without shortcuts, enabling deeper nonlinear transformations.

The updates at layer $l$ are:

$$
\begin{aligned}
r_{l+1} &= \sigma\bigl(\mathrm{conv}(r_l,\,W_{l,r\to r}) + \mathrm{conv}(t_l,\,W_{l,t\to r}) + \mathrm{shortcut}(r_l)\bigr) \\
t_{l+1} &= \sigma\bigl(\mathrm{conv}(r_l,\,W_{l,r\to t}) + \mathrm{conv}(t_l,\,W_{l,t\to t})\bigr)
\end{aligned}
\tag{1}
$$

where $\sigma(\cdot)$ denotes Batch Normalization followed by ReLU activation, and $\mathrm{shortcut}(r_l)$ is the identity (with channel padding, if needed) or a $3\times3$ learned projection when feature dimensions change.

  • The weights $W_{l,r\to r}$ and $W_{l,t\to t}$ perform self-processing in their respective streams.
  • The cross-term weights $W_{l,r\to t}$ and $W_{l,t\to r}$ enable bidirectional information flow between streams.
  • By nullifying either all residual-to-residual or transient-to-transient paths, the block reduces to a conventional CNN layer or a single-layer ResNet block, respectively.
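
As a rough illustration of the update in Eq. (1), the sketch below implements one dual-stream step in NumPy. It is a simplification under stated assumptions: matrix products stand in for convolutions (the fully connected case), the shortcut is the identity, and $\sigma$ is reduced to ReLU with Batch Normalization omitted. The function name `rir_step` and the stream widths are hypothetical and do not come from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def rir_step(r, t, W_rr, W_tr, W_rt, W_tt):
    """One dual-stream update, fully connected stand-in for the conv case.

    Assumptions: identity shortcut, sigma = ReLU only (no BatchNorm).
    """
    r_next = relu(W_rr @ r + W_tr @ t + r)  # residual stream keeps the shortcut
    t_next = relu(W_rt @ r + W_tt @ t)      # transient stream has no shortcut
    return r_next, t_next

rng = np.random.default_rng(0)
n_r, n_t = 4, 3  # illustrative stream widths
r, t = rng.normal(size=n_r), rng.normal(size=n_t)
W_rr = rng.normal(size=(n_r, n_r))
W_tr = rng.normal(size=(n_r, n_t))
W_rt = rng.normal(size=(n_t, n_r))
W_tt = rng.normal(size=(n_t, n_t))

r_next, t_next = rir_step(r, t, W_rr, W_tr, W_rt, W_tt)
print(r_next.shape, t_next.shape)
```

Each stream receives input from both streams, but only the residual stream adds its own previous state back in.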

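In practice, the four convolutions and the shortcut addition are consolidated into a single layer via a block matrix construction:

$$
\begin{bmatrix} r_{l+1} \\ t_{l+1} \end{bmatrix} = \sigma\!\left(W_l \begin{bmatrix} r_l \\ t_l \end{bmatrix}\right) \quad\text{with}\quad W_l = \begin{bmatrix} W_{l,r\to r} + I & W_{l,t\to r} \\ W_{l,r\to t} & W_{l,t\to t} \end{bmatrix} \tag{2}
$$

where $I$ is the partial identity that implements the identity shortcut on the residual stream.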
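
The block matrix consolidation can be checked numerically. The sketch below compares the two-stream form against a single matrix product whose residual-to-residual block has the identity added to it. It assumes the fully connected case, an identity shortcut, and ReLU in place of BatchNorm plus ReLU; all variable names are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
n_r, n_t = 4, 3  # illustrative stream widths
r, t = rng.normal(size=n_r), rng.normal(size=n_t)
W_rr, W_tr = rng.normal(size=(n_r, n_r)), rng.normal(size=(n_r, n_t))
W_rt, W_tt = rng.normal(size=(n_t, n_r)), rng.normal(size=(n_t, n_t))

# Two-stream form: separate products plus an identity shortcut on r.
r_two = relu(W_rr @ r + W_tr @ t + r)
t_two = relu(W_rt @ r + W_tt @ t)

# Single-layer form: fold the shortcut into the r->r block as a partial identity.
W = np.block([[W_rr + np.eye(n_r), W_tr],
              [W_rt,               W_tt]])
x_next = relu(W @ np.concatenate([r, t]))

same_r = np.allclose(x_next[:n_r], r_two)
same_t = np.allclose(x_next[n_r:], t_two)
print(same_r, same_t)
```

Because the shortcut is linear in $r_l$, adding the identity to one block of the weight matrix reproduces it exactly, which is why the combined layer needs no extra parameters or operations beyond an ordinary layer of the same width.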

2. Network Architectures and Design Space

RiR blocks can be substituted for any convolution in a CNN or for any layer in a ResNet block. This generalizes the architectural spectrum:

  • 32-layer RiR (0.49 M parameters):
    • Input: $32\times32$ RGB image
    • [RiR-blocks]
    • [RiR-blocks], stride=2 on first block
    • [RiR-blocks], stride=2 on first block
    • Global average pool → FC (10/100) → softmax
  • Wide 18-layer RiR (≈10.3 M parameters):
    • Input: $32\times32$ RGB
    • [RiR-blocks]
    • [RiR-blocks], stride=2 on first block
    • [RiR-blocks], stride=2 on first block
    • Global average pool → softmax

At each RiR-block ($C\to C'$), the residual and transient streams match the output channel count $C'$. All designs retain the parameter and computational efficiency of their respective ResNet/CNN baselines (Targ et al., 2016).

RiR precisely recovers a standard CNN if all cross-stream and residual paths are zeroed, and a ResNet if only the transient stream is suppressed. Thus, the architecture can be continuously tuned between explicit feature preservation and nonlinear transformation.
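
The two limiting cases can be illustrated with a small NumPy sketch. It assumes the fully connected case, an identity shortcut, and ReLU standing in for BatchNorm plus ReLU; `rir_step` is a hypothetical helper name, not something defined in the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def rir_step(r, t, W_rr, W_tr, W_rt, W_tt):
    """Dual-stream update with an identity shortcut on the residual stream."""
    r_next = relu(W_rr @ r + W_tr @ t + r)
    t_next = relu(W_rt @ r + W_tt @ t)
    return r_next, t_next

rng = np.random.default_rng(2)
n = 4  # illustrative width, shared by both streams
r, t = rng.normal(size=n), rng.normal(size=n)
W_rr, W_tt = rng.normal(size=(n, n)), rng.normal(size=(n, n))
Z = np.zeros((n, n))

# Plain-layer limit: only t->t weights remain, so t follows a shortcut-free layer.
_, t_plain = rir_step(r, t, Z, Z, Z, W_tt)

# ResNet limit: only r->r weights remain, so r follows a one-layer residual block.
r_res, t_res = rir_step(r, t, W_rr, Z, Z, Z)

print(np.allclose(t_plain, relu(W_tt @ t)))
print(np.allclose(r_res, relu(W_rr @ r + r)), np.allclose(t_res, 0.0))
```

Intermediate settings, where the cross-stream weights are nonzero, sit between these two extremes.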

3. Optimization and Training Protocol

The RiR training methodology replicates ResNet baselines to isolate architectural contributions:

  • Optimizer: SGD with momentum 0.9
  • Batch size: 500
  • Weight decay: applied after removing the partial identity, to maintain shortcut exactness under "ResNet Init"
  • Initialization: MSR (He) for weights, plus partial identity for the residual pathway
  • Learning rate schedule: 0.1 initially, reduced by a factor of 10 at epochs 42 and 62, total of 82 epochs
  • Shortcut implementation: $3\times3$ projection when changing dimensions; identity otherwise
  • BatchNorm: standard during training; test statistics via exponential moving average of training batch moments
  • Data augmentation: standard CIFAR protocol, with random crops from padded images and horizontal flips

These procedures ensure any performance improvement stems from the dual-stream RiR structure.
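
The learning rate schedule listed above can be written as a simple step function. This is a minimal sketch: it assumes epochs are counted from 0 and that each reduction takes effect at the start of the listed epoch, neither of which the source specifies. The function name `learning_rate` is illustrative.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(42, 62), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone epoch."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * factor ** drops

TOTAL_EPOCHS = 82
schedule = [learning_rate(e) for e in range(TOTAL_EPOCHS)]

# Rates in the three phases: before epoch 42, from 42 to 61, and from 62 on.
print(schedule[0], schedule[42], schedule[62])
```

Most deep learning frameworks ship an equivalent multi-step scheduler, so a hand-written function like this is mainly useful for checking the phase boundaries.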

4. Empirical Performance on CIFAR Benchmarks

RiR reports state-of-the-art results on CIFAR-100 and competitive results on CIFAR-10, outperforming comparably sized CNN and ResNet architectures under identical data augmentation. The following tables summarize the published top-line results:

Table 1: CIFAR-10 Accuracy Comparison

| Model | Accuracy (%) |
|---|---|
| Highway Network | 92.40 |
| ResNet (32-layer) | 92.49 |
| ResNet (110-layer) | 93.57 |
| Large ALL-CNN | 95.59 |
| Fractional Max-Pooling | 96.53 |
| 18-layer + wide CNN | 93.64 |
| 18-layer + wide ResNet | 93.95 |
| 18-layer + wide ResNet Init | 94.28 |
| 18-layer + wide RiR | 94.99 |

Table 2: CIFAR-100 Accuracy Comparison

| Model | Accuracy (%) |
|---|---|
| Highway Network | 67.76 |
| ELU-Network | 75.72 |
| 18-layer + wide CNN | 75.17 |
| 18-layer + wide ResNet | 76.58 |
| 18-layer + wide ResNet Init | 75.99 |
| 18-layer + wide RiR | 77.10 |

Notably, the 32-layer RiR (0.49 M parameters) marginally surpasses the original 32-layer ResNet (92.97% vs. 92.49% on CIFAR-10). The wide 18-layer RiR set a new state of the art on CIFAR-100 at the time of publication, at 77.10% (Targ et al., 2016).
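
To put these accuracy gaps in perspective, the short calculation below restates the wide 18-layer results as error rates and computes the relative error reduction of RiR over the matching ResNet. The accuracies are copied from the figures reported above; the helper name `relative_error_reduction` is illustrative.

```python
# Accuracies (%) for the wide 18-layer models, as reported above.
cifar10 = {"ResNet": 93.95, "RiR": 94.99}
cifar100 = {"ResNet": 76.58, "RiR": 77.10}

def relative_error_reduction(base_acc, new_acc):
    """Fraction of the baseline error removed by the new model."""
    base_err = 100.0 - base_acc
    new_err = 100.0 - new_acc
    return (base_err - new_err) / base_err

red10 = relative_error_reduction(cifar10["ResNet"], cifar10["RiR"])
red100 = relative_error_reduction(cifar100["ResNet"], cifar100["RiR"])

print(f"CIFAR-10:  {red10:.1%} of the ResNet error removed")
print(f"CIFAR-100: {red100:.1%} of the ResNet error removed")
```

The absolute gain is about one percentage point on CIFAR-10 and about half a point on CIFAR-100, but relative to the remaining error the CIFAR-10 improvement is considerably larger.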

5. Ablation Insights and Design Rationale

The RiR study reports several findings on the function and necessity of the two streams:

  • Complementary roles: The residual stream ensures stable gradient flow and preserves untransformed features, while the transient stream enables complex nonlinear transformations and selective feature discarding/remapping prior to re-entering the residual path.
  • Ablation study: Nullifying all residual-to-residual or transient-to-transient links in a trained "ResNet Init" network considerably reduces performance, demonstrating both streams' critical contributions. The importance of each stream varies with depth.
  • Depth stability: With increasing block depth, RiR continues to train stably and improves in accuracy, whereas classic ResNet blocks degrade. The dual-stream construct regularizes effective residual depth.

A plausible implication is that dual-stream designs grant models the flexibility to modulate the balance between identity mapping and deep transformation based on task requirements.

6. Implementation Considerations and Applicability

Implementation of RiR requires only a single convolution or fully connected layer per dual-stream block, initialized according to the matrix structure in Eq. (2). For L2 regularization, the partial identity must be subtracted before applying weight decay to maintain the residual's precise influence.
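
One way to read the weight decay remark: if the stored weight matrix already contains the partial identity, the L2 penalty should be computed on the stored matrix minus that identity, so that decay shrinks only the learned part and leaves the shortcut intact. The sketch below illustrates this for the fully connected case. The helper names and the decay coefficient are placeholders chosen for illustration, not values from the source.

```python
import numpy as np

def partial_identity(n_r, n_t):
    """Identity on the residual-to-residual block, zeros elsewhere."""
    P = np.zeros((n_r + n_t, n_r + n_t))
    P[:n_r, :n_r] = np.eye(n_r)
    return P

def l2_penalty(W, n_r, n_t, coeff):
    """L2 penalty on the learned part only (stored weights minus partial identity)."""
    learned = W - partial_identity(n_r, n_t)
    return 0.5 * coeff * np.sum(learned ** 2)

def l2_grad(W, n_r, n_t, coeff):
    """Gradient of l2_penalty with respect to the stored weights W."""
    return coeff * (W - partial_identity(n_r, n_t))

n_r, n_t = 3, 2                 # illustrative stream widths
coeff = 1e-4                    # placeholder coefficient, not from the source
W = partial_identity(n_r, n_t)  # stored weights equal to a pure shortcut

corrected = l2_penalty(W, n_r, n_t, coeff)  # shortcut alone incurs no penalty
naive = 0.5 * coeff * np.sum(W ** 2)        # naive decay would penalise the shortcut
print(corrected, naive)
```

With the naive penalty, every update would pull the diagonal of the residual-to-residual block away from 1, so the layer would gradually stop behaving like an exact shortcut.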

This construct can be applied to any CNN or FCN layer without introducing new parameters or computational costs. This suggests broad applicability to diverse architectures (e.g., DenseNet, Inception) and vision tasks (e.g., detection, segmentation), providing a tunable mechanism for balancing feature preservation (residual pathway) and transformation (transient pathway) (Targ et al., 2016).
