ResNet in ResNet (RiR): Dual-Stream Design

Updated 7 April 2026

The paper introduces a dual-stream residual block that combines an identity-preserving residual path with a transient convolution branch to enhance expressivity and gradient flow.
The architecture seamlessly interpolates between CNNs and ResNets by selectively enabling or disabling cross-stream connections without extra computational or parameter overhead.
Empirical results on CIFAR benchmarks show that RiR outperforms comparable ResNets on CIFAR-10 and CIFAR-100, and reports state-of-the-art accuracy on CIFAR-100.

ResNet in ResNet (RiR) is a deep dual-stream neural architecture that generalizes standard convolutional neural networks (CNNs) and Residual Networks (ResNets) by running two parallel streams through each layer. A residual stream preserves features through shortcut connections, while a transient stream processes features with conventional convolution and no shortcut; together they are intended to improve expressivity and ease gradient propagation. The architecture can interpolate between a pure CNN and a conventional ResNet, and is implemented with no computational or parameter overhead relative to the corresponding baseline (Targ et al., 2016).

1. Dual-Stream Residual Block Formulation

The fundamental unit of RiR is the dual-stream residual block. Each layer processes feature vectors through two parallel pathways:

Residual stream ( $r_l$ ): Receives an identity shortcut at every layer, analogous to classic ResNet design.
Transient stream ( $t_l$ ): Processes features via standard convolution without shortcuts, enabling deeper nonlinear transformations.

The updates at layer $l$ are:

$\begin{aligned} r_{l+1} &= \sigma\bigl(\mathrm{conv}(r_l,\,W_{l,r\to r}) + \mathrm{conv}(t_l,\,W_{l,t\to r}) + \mathrm{shortcut}(r_l)\bigr) \\ t_{l+1} &= \sigma\bigl(\mathrm{conv}(r_l,\,W_{l,r\to t}) + \mathrm{conv}(t_l,\,W_{l,t\to t})\bigr) \tag{1} \end{aligned}$

where $\sigma(\cdot)$ denotes Batch Normalization followed by ReLU activation, and $\mathrm{shortcut}(r_l)$ is the identity (with channel padding, if needed) or a $3\times3$ learned projection when feature dimensions change.

The weights $W_{l,r\to r}$ and $W_{l,t\to t}$ perform self-processing in their respective streams.
The cross-term weights $W_{l,r\to t}$ and $W_{l,t\to r}$ enable bidirectional information flow between streams.
Zeroing every weight that touches the residual stream ($W_{l,r\to r}$, $W_{l,t\to r}$, $W_{l,r\to t}$) leaves the transient stream acting as a conventional CNN layer; zeroing every weight that touches the transient stream ($W_{l,t\to r}$, $W_{l,r\to t}$, $W_{l,t\to t}$) leaves the residual stream acting as a single-layer ResNet block.

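In practice, the four convolutions and the shortcut addition are consolidated into a single layer via a block-matrix construction:

$\begin{aligned} x_{l+1} &= \sigma\bigl(\mathrm{conv}(x_l,\,\overline{W}_l)\bigr) \tag{2} \end{aligned}$

with $x_l = \begin{bmatrix} r_l \\ t_l \end{bmatrix}$ and $\overline{W}_l = \begin{bmatrix} W_{l,r\to r} + I & W_{l,t\to r} \\ W_{l,r\to t} & W_{l,t\to t} \end{bmatrix}$, where $I$ is the identity shortcut on the residual stream.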
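The consolidation can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: it treats each convolution as a 1x1 convolution at a single spatial position (a matrix-vector product), takes $\sigma$ to be ReLU with Batch Normalization omitted, uses an identity shortcut, and assumes the block layout $[[W_{r\to r} + I,\ W_{t\to r}],\ [W_{r\to t},\ W_{t\to t}]]$ obtained by stacking Eq. (1). The function names and weight values are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def two_stream(r, t, W_rr, W_tr, W_rt, W_tt):
    """Eq. (1) with an identity shortcut on the residual stream."""
    a, b = matvec(W_rr, r), matvec(W_tr, t)
    c, d = matvec(W_rt, r), matvec(W_tt, t)
    r_next = relu([ai + bi + ri for ai, bi, ri in zip(a, b, r)])
    t_next = relu([ci + di for ci, di in zip(c, d)])
    return r_next, t_next

def stacked_weight(W_rr, W_tr, W_rt, W_tt):
    """Build the single matrix [[W_rr + I, W_tr], [W_rt, W_tt]]."""
    top = [[w + (1.0 if i == j else 0.0) for j, w in enumerate(row)] + list(W_tr[i])
           for i, row in enumerate(W_rr)]
    bottom = [list(W_rt[i]) + list(W_tt[i]) for i in range(len(W_tt))]
    return top + bottom

# Hypothetical 2-channel streams with hand-picked weights.
r, t = [1.0, 2.0], [0.5, -1.0]
W_rr = [[0.1, 0.0], [0.0, 0.1]]
W_tr = [[0.0, 1.0], [1.0, 0.0]]
W_rt = [[1.0, 0.0], [0.0, -1.0]]
W_tt = [[0.5, 0.0], [0.0, 0.5]]

r_next, t_next = two_stream(r, t, W_rr, W_tr, W_rt, W_tt)

# One layer applied to the concatenated vector [r; t].
single = relu(matvec(stacked_weight(W_rr, W_tr, W_rt, W_tt), r + t))

same = all(math.isclose(x, y, abs_tol=1e-12)
           for x, y in zip(single, r_next + t_next))
print(same)
```

The comparison uses a tolerance rather than exact equality because the two forms add the same terms in a different order.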

2. Network Architectures and Design Space

RiR blocks can be substituted for any convolution in a CNN or for any layer in a ResNet block. This generalizes the architectural spectrum:

32-layer RiR (0.49 M parameters):
- Input: $32\times32$ RGB image
- Stage 1: [RiR blocks]
- Stage 2: [RiR blocks], stride=2 on first block
- Stage 3: [RiR blocks], stride=2 on first block
- Global average pool $\to$ FC (10/100) $\to$ softmax

Wide 18-layer RiR (approximately 10.3 M parameters):
- Input: $32\times32$ RGB image
- Stage 1: [RiR blocks]
- Stage 2: [RiR blocks], stride=2 on first block
- Stage 3: [RiR blocks], stride=2 on first block
- Global average pool $\to$ softmax

At each RiR block ($C \to C'$), the residual and transient streams match the output channel count $C'$. All designs retain the parameter and computational efficiency of their respective ResNet/CNN baselines (Targ et al., 2016).

RiR reduces exactly to a standard CNN when all cross-stream and residual-stream weights are zero, and to a ResNet when the transient-stream weights are zero. Because these weights are learned, the network can settle anywhere between explicit feature preservation and nonlinear transformation.
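Both limiting cases can be illustrated with a toy block. This is a minimal sketch, assuming 1x1 convolutions at a single spatial position (matrix-vector products), ReLU without Batch Normalization, and an identity shortcut; `rir_block` and the weight values are hypothetical names and numbers chosen for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def rir_block(r, t, W_rr, W_tr, W_rt, W_tt):
    """Dual-stream update with an identity shortcut on r (BatchNorm omitted)."""
    r_pre = [a + b + s for a, b, s in zip(matvec(W_rr, r), matvec(W_tr, t), r)]
    t_pre = [c + d for c, d in zip(matvec(W_rt, r), matvec(W_tt, t))]
    return relu(r_pre), relu(t_pre)

ZERO = [[0.0, 0.0], [0.0, 0.0]]
W = [[0.5, -1.0], [2.0, 0.25]]  # hypothetical weights for the active stream
r, t = [1.0, 2.0], [3.0, -1.0]

# CNN limit: every weight touching the residual stream is zero,
# so the transient output depends on t alone.
_, t_cnn = rir_block(r, t, ZERO, ZERO, ZERO, W)
plain_layer = relu(matvec(W, t))

# ResNet limit: every weight touching the transient stream is zero,
# so the residual output is relu(W r + r) and the transient output is zero.
r_res, t_res = rir_block(r, t, W, ZERO, ZERO, ZERO)
resnet_layer = relu([a + s for a, s in zip(matvec(W, r), r)])

print(t_cnn == plain_layer, r_res == resnet_layer)
```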

3. Optimization and Training Protocol

The RiR training methodology replicates ResNet baselines to isolate architectural contributions:

Optimizer: SGD with momentum 0.9
Batch size: 500
Weight decay: applied after removing the partial identity from the weights, to maintain shortcut exactness under "ResNet Init"
Initialization: MSR (He) for weights, plus partial identity for the residual pathway
Learning rate schedule: 0.1 initially, reduced by a factor of 10 at epochs 42 and 62, total of 82 epochs
Shortcut implementation: $3\times3$ projection when changing dimensions; identity otherwise
BatchNorm: standard during training; test statistics via exponential moving average of training batch moments
Data augmentation: standard CIFAR protocol (random crops with 4-pixel padding and horizontal flips)

These procedures ensure any performance improvement stems from the dual-stream RiR structure.
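The step schedule listed above can be sketched as a small function. The function name, the convention that epochs are counted from 0, and the choice that each drop takes effect at the start of the milestone epoch are assumptions made for illustration; the source only gives the initial rate, the two milestones, and the total epoch count.

```python
TOTAL_EPOCHS = 82

def learning_rate(epoch, base_lr=0.1, milestones=(42, 62), factor=0.1):
    """Step schedule: multiply base_lr by `factor` once per milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

# Learning rate for every epoch of the run (zero-indexed, an assumption).
schedule = [learning_rate(e) for e in range(TOTAL_EPOCHS)]

for e in (0, 41, 42, 61, 62, 81):
    print(e, schedule[e])
```

In a framework such as PyTorch, a built-in multi-step scheduler would normally play this role; the plain function is used here only to keep the sketch dependency-free.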

4. Empirical Performance on CIFAR Benchmarks

RiR reports competitive accuracy on CIFAR-10 and state-of-the-art accuracy on CIFAR-100, outperforming comparably sized CNN and ResNet architectures under identical data augmentation. The following tables summarize the published top-line results:

Table 1: CIFAR-10 Accuracy Comparison

Model	Accuracy (%)
Highway Network	92.40
ResNet (32-layer)	92.49
ResNet (110-layer)	93.57
Large ALL-CNN	95.59
Fractional Max-Pooling	96.53
18-layer + wide CNN	93.64
18-layer + wide ResNet	93.95
18-layer + wide ResNet Init	94.28
18-layer + wide RiR	94.99

Table 2: CIFAR-100 Accuracy Comparison

Model	Accuracy (%)
Highway Network	67.76
ELU-Network	75.72
18-layer + wide CNN	75.17
18-layer + wide ResNet	76.58
18-layer + wide ResNet Init	75.99
18-layer + wide RiR	77.10

Notably, the 32-layer RiR (0.49M params) marginally surpasses the original 32-layer ResNet (92.97% vs. 92.49% on CIFAR-10). The wide 18-layer RiR establishes a new state-of-the-art on CIFAR-100 at 77.10% (Targ et al., 2016).
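To put the wide 18-layer comparison in perspective, the gaps in Tables 1 and 2 can be expressed both as absolute accuracy gains and as relative reductions in error rate. The short calculation below uses only the accuracies listed in the tables; the helper name `margins` is hypothetical.

```python
# Accuracies (%) copied from Tables 1 and 2 for the wide 18-layer models.
cifar10 = {"wide ResNet": 93.95, "wide RiR": 94.99}
cifar100 = {"wide ResNet": 76.58, "wide RiR": 77.10}

def margins(baseline_acc, model_acc):
    """Return (absolute gain in points, relative error reduction in %)."""
    gain = model_acc - baseline_acc
    baseline_error = 100.0 - baseline_acc
    return round(gain, 2), round(100.0 * gain / baseline_error, 1)

c10 = margins(cifar10["wide ResNet"], cifar10["wide RiR"])
c100 = margins(cifar100["wide ResNet"], cifar100["wide RiR"])
print("CIFAR-10:", c10)
print("CIFAR-100:", c100)
```

The relative figure divides the gain by the baseline error rate, so the same absolute gain counts for more when the baseline is already accurate.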

5. Ablation Insights and Design Rationale

RiR establishes several analytical findings regarding the function and necessity of dual streams:

Complementary roles: The residual stream ensures stable gradient flow and preserves untransformed features, while the transient stream enables complex nonlinear transformations and selective feature discarding/remapping prior to re-entering the residual path.
Ablation study: Nullifying all residual-to-residual or transient-to-transient links in a trained "ResNet Init" network considerably reduces performance, demonstrating both streams' critical contributions. The importance of each stream varies with depth.
Depth stability: With increasing block depth, RiR continues to train stably and improves in accuracy, whereas classic ResNet blocks degrade. The dual-stream construct regularizes effective residual depth.

A plausible implication is that dual-stream designs grant models the flexibility to modulate the balance between identity mapping and deep transformation based on task requirements.

6. Implementation Considerations and Applicability

Implementation of RiR requires only a single convolution or fully connected layer per dual-stream block, initialized according to the matrix structure in Eq. (2). For L2 regularization, the partial identity must be subtracted before applying weight decay to maintain the residual's precise influence.
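One way to read the weight-decay requirement is that the L2 penalty acts on the weights minus the partial identity, so the shortcut entries are not pulled toward zero. The sketch below assumes a fully connected layer whose residual-to-residual block occupies the top-left `n_residual` rows and columns, plain SGD without momentum, and hypothetical values for the learning rate and decay coefficient; it is not the authors' implementation.

```python
def sgd_step_excluding_identity(W, grad, lr, weight_decay, n_residual):
    """One SGD step where L2 decay is computed on W minus the partial identity.

    The partial identity is 1 on the diagonal of the top-left
    n_residual x n_residual block and 0 elsewhere (an assumed layout).
    """
    updated = []
    for i, row in enumerate(W):
        new_row = []
        for j, w in enumerate(row):
            identity = 1.0 if (i == j and i < n_residual) else 0.0
            decay_grad = weight_decay * (w - identity)
            new_row.append(w - lr * (grad[i][j] + decay_grad))
        updated.append(new_row)
    return updated

# Toy layer: one residual unit (row/column 0) and one transient unit.
W = [[1.0, 0.5],
     [0.2, 0.3]]
zero_grad = [[0.0, 0.0], [0.0, 0.0]]

# With a zero data gradient, only weight decay changes the weights.
W_new = sgd_step_excluding_identity(W, zero_grad, lr=1.0,
                                    weight_decay=0.1, n_residual=1)
print(W_new)
```

In the toy run, the shortcut entry `W[0][0]` equals the identity exactly, so it receives no decay, while the remaining weights shrink toward zero as usual.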

This construct can be applied to any CNN or FCN layer without introducing new parameters or computational costs. This suggests broad applicability to diverse architectures (e.g., DenseNet, Inception) and vision tasks (e.g., detection, segmentation), providing a tunable mechanism for balancing feature preservation (residual pathway) and transformation (transient pathway) (Targ et al., 2016).

References (1)

Resnet in Resnet: Generalizing Residual Architectures (2016)
