ResNet in ResNet (RiR): Dual-Stream Design
- The paper introduces a dual-stream residual block that combines an identity-preserving residual path with a transient convolution branch to enhance expressivity and gradient flow.
- The architecture seamlessly interpolates between CNNs and ResNets by selectively enabling or disabling cross-stream connections without extra computational or parameter overhead.
- Empirical results on CIFAR benchmarks show that RiR is competitive on CIFAR-10 and reached state-of-the-art accuracy on CIFAR-100 at the time of publication, outperforming comparable ResNet baselines on both datasets.
ResNet in ResNet (RiR) is a deep dual-stream neural architecture that generalizes standard convolutional neural networks (CNNs) and Residual Networks (ResNets) by introducing a dual-path mechanism at each layer. RiR combines a residual stream, which preserves feature identity through shortcut connections, with a transient stream, which processes features using conventional convolution; together they support expressivity and efficient gradient propagation. The architecture is designed to interpolate between pure CNNs and conventional ResNets and is implemented with no computational or parameter overhead relative to the corresponding baseline (Targ et al., 2016).
1. Dual-Stream Residual Block Formulation
The fundamental unit of RiR is the dual-stream residual block. Each layer processes feature vectors through two parallel pathways:
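- Residual stream ($r$): Receives an identity shortcut at every layer, analogous to the classic ResNet design.
- Transient stream ($t$): Processes features via standard convolution without shortcuts, enabling deeper nonlinear transformations.
The updates at layer $l$ are:
$$
\begin{aligned}
r_{l+1} &= \sigma\big(\mathrm{conv}(r_l, W_{l,r\to r}) + \mathrm{conv}(t_l, W_{l,t\to r}) + \mathrm{shortcut}(r_l)\big), \\
t_{l+1} &= \sigma\big(\mathrm{conv}(r_l, W_{l,r\to t}) + \mathrm{conv}(t_l, W_{l,t\to t})\big),
\end{aligned} \tag{1}
$$
where $\sigma$ denotes Batch Normalization followed by ReLU activation, and $\mathrm{shortcut}(\cdot)$ is the identity (with channel padding, if needed) or a learned projection when feature dimensions change.
- The weights $W_{l,r\to r}$, $W_{l,t\to t}$ perform self-processing in their respective streams.
- The cross-term weights $W_{l,r\to t}$, $W_{l,t\to r}$ enable bidirectional information flow between streams.
- Zeroing the residual stream leaves only the transient path, so the block reduces to a conventional CNN layer; zeroing the transient stream leaves only the residual path with its shortcut, so the block reduces to a single-layer ResNet block.
In practice, the four convolutions and the shortcut addition are consolidated into a single layer via a block matrix construction. When the shortcut is the identity, Eq. (1) can be written as
$$
x_{l+1} = \sigma\big(\mathrm{conv}(x_l, W_l + \tilde{I})\big) \tag{2}
$$
with
$$
x_l = \begin{pmatrix} r_l \\ t_l \end{pmatrix}, \qquad
W_l = \begin{pmatrix} W_{l,r\to r} & W_{l,t\to r} \\ W_{l,r\to t} & W_{l,t\to t} \end{pmatrix}, \qquad
\tilde{I} = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix},
$$
where $\tilde{I}$ is the partial identity that carries the shortcut on the residual stream only.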
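The consolidation into a single layer can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes fully connected layers instead of convolutions, omits Batch Normalization (so the nonlinearity is plain ReLU), and uses an identity shortcut. The names `two_stream_step`, `fused_step`, `W_rr`, `W_tr`, `W_rt`, and `W_tt` are illustrative.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def two_stream_step(r, t, W_rr, W_tr, W_rt, W_tt):
    """Explicit form: four products plus the identity shortcut on r."""
    r_next = relu(add(matvec(W_rr, r), matvec(W_tr, t), r))
    t_next = relu(add(matvec(W_rt, r), matvec(W_tt, t)))
    return r_next, t_next

def fused_step(r, t, W_rr, W_tr, W_rt, W_tt):
    """Fused form: one product with the block matrix plus a partial identity."""
    n_r = len(r)
    W = [rr + tr for rr, tr in zip(W_rr, W_tr)]    # rows that produce r_next
    W += [rt + tt for rt, tt in zip(W_rt, W_tt)]   # rows that produce t_next
    for i in range(n_r):
        W[i][i] += 1.0  # ones only on the residual-to-residual diagonal
    out = relu(matvec(W, r + t))  # r + t concatenates the two streams
    return out[:n_r], out[n_r:]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_r, n_t = 3, 2  # arbitrary stream widths for the demonstration
r = [rng.uniform(-1.0, 1.0) for _ in range(n_r)]
t = [rng.uniform(-1.0, 1.0) for _ in range(n_t)]
W_rr, W_tr = rand_matrix(n_r, n_r, rng), rand_matrix(n_r, n_t, rng)
W_rt, W_tt = rand_matrix(n_t, n_r, rng), rand_matrix(n_t, n_t, rng)

explicit = two_stream_step(r, t, W_rr, W_tr, W_rt, W_tt)
fused = fused_step(r, t, W_rr, W_tr, W_rt, W_tt)
max_diff = max(abs(a - b) for a, b in
               zip(explicit[0] + explicit[1], fused[0] + fused[1]))
print("max difference between the two forms:", max_diff)
```

The two forms agree up to floating-point rounding, which is why the dual-stream block costs no more than one ordinary layer of the combined width.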
2. Network Architectures and Design Space
RiR blocks can be substituted for any convolution in a CNN or for any layer in a ResNet block. This generalizes the architectural spectrum:
- 32-layer RiR (0.49 M parameters):
  - Input: $32\times32$ RGB image
  - First stack of RiR-blocks
  - Second stack of RiR-blocks, stride=2 on first block
  - Third stack of RiR-blocks, stride=2 on first block
  - Global average pool → FC (10/100) → softmax
- Wide 18-layer RiR (10.3 M parameters):
  - Input: $32\times32$ RGB image
  - First stack of RiR-blocks
  - Second stack of RiR-blocks, stride=2 on first block
  - Third stack of RiR-blocks, stride=2 on first block
  - Global average pool → softmax
At each RiR-block mapping C input channels to C′ output channels (C → C′), the residual and transient streams match the output channel count C′. All designs retain the parameter and computational efficiency of their respective ResNet/CNN baselines (Targ et al., 2016).
RiR precisely recovers a standard CNN if all cross-stream and residual paths are zeroed, and a ResNet if only the transient stream is suppressed. Thus, the architecture can be continuously tuned between explicit feature preservation and nonlinear transformation.
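The two limiting cases can be illustrated with a small worked example. This is a sketch under simplifying assumptions (fully connected layers, no Batch Normalization, identity shortcut); `rir_step` and the toy weights are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def rir_step(r, t, W_rr, W_tr, W_rt, W_tt):
    """One dual-stream update with an identity shortcut on the residual stream."""
    r_next = relu([a + b + c for a, b, c in
                   zip(matvec(W_rr, r), matvec(W_tr, t), r)])
    t_next = relu([a + b for a, b in zip(matvec(W_rt, r), matvec(W_tt, t))])
    return r_next, t_next

W = [[0.5, -1.0], [2.0, 0.25]]  # toy weights
x = [1.0, 2.0]                  # toy input features
Z = zeros(2, 2)

# CNN limit: the residual stream is zero and only transient-to-transient
# weights are active, so the transient stream is a plain layer.
r_cnn, t_cnn = rir_step([0.0, 0.0], x, Z, Z, Z, W)

# ResNet limit: the transient stream is zero and only residual-to-residual
# weights are active, so the residual stream is a single-layer ResNet block.
r_res, t_res = rir_step(x, [0.0, 0.0], W, Z, Z, Z)

plain_layer = relu(matvec(W, x))                                  # [0.0, 2.5]
residual_layer = relu([a + b for a, b in zip(matvec(W, x), x)])   # [0.0, 4.5]

print(t_cnn == plain_layer, r_res == residual_layer)  # prints: True True
```

In a trained network the weights are not exactly zero, so a real RiR layer sits somewhere between these two extremes.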
3. Optimization and Training Protocol
The RiR training methodology replicates ResNet baselines to isolate architectural contributions:
- Optimizer: SGD with momentum 0.9
- Batch size: 500
- Weight decay: L2 penalty on the weights, applied after removing the partial identity to maintain shortcut exactness under "ResNet Init"
- Initialization: MSR (He) for weights, plus partial identity for the residual pathway
- Learning rate schedule: 0.1 initially, reduced by a factor of 10 at epochs 42 and 62, total of 82 epochs
- Shortcut implementation: learned projection when changing dimensions; identity otherwise
- BatchNorm: standard during training; test statistics via exponential moving average of training batch moments
- Data augmentation: standard CIFAR protocol, i.e. random crops from padded images and horizontal flips
These procedures ensure any performance improvement stems from the dual-stream RiR structure.
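The step learning-rate schedule listed above can be written as a short function. This is a sketch that assumes epochs are counted from 0 and that each reduction takes effect at the start of the listed epoch; the source does not state either convention, and the function name `learning_rate` is illustrative.

```python
def learning_rate(epoch, base_lr=0.1, drops=(42, 62), factor=10.0):
    """Step schedule: divide base_lr by `factor` once per drop epoch reached."""
    n_drops = sum(1 for d in drops if epoch >= d)
    return base_lr / (factor ** n_drops)

TOTAL_EPOCHS = 82
schedule = [learning_rate(epoch) for epoch in range(TOTAL_EPOCHS)]

# Print the rate at the first epoch of each phase.
for epoch in (0, 42, 62):
    print(epoch, schedule[epoch])
```

Under these assumptions the rate is 0.1 for epochs 0 to 41, 0.01 for epochs 42 to 61, and 0.001 for epochs 62 to 81.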
4. Empirical Performance on CIFAR Benchmarks
RiR achieves competitive results on CIFAR-10 and state-of-the-art results on CIFAR-100 at the time of publication, outperforming comparably sized CNN and ResNet architectures under identical data augmentation. The following tables summarize the published top-line results:
Table 1: CIFAR-10 Accuracy Comparison
| Model | Accuracy (%) |
|---|---|
| Highway Network | 92.40 |
| ResNet (32-layer) | 92.49 |
| ResNet (110-layer) | 93.57 |
| Large ALL-CNN | 95.59 |
| Fractional Max-Pooling | 96.53 |
| 18-layer + wide CNN | 93.64 |
| 18-layer + wide ResNet | 93.95 |
| 18-layer + wide ResNet Init | 94.28 |
| 18-layer + wide RiR | 94.99 |
Table 2: CIFAR-100 Accuracy Comparison
| Model | Accuracy (%) |
|---|---|
| Highway Network | 67.76 |
| ELU-Network | 75.72 |
| 18-layer + wide CNN | 75.17 |
| 18-layer + wide ResNet | 76.58 |
| 18-layer + wide ResNet Init | 75.99 |
| 18-layer + wide RiR | 77.10 |
Notably, the 32-layer RiR (0.49M params) marginally surpasses the original 32-layer ResNet (92.97% vs. 92.49% on CIFAR-10). The wide 18-layer RiR set a new state of the art on CIFAR-100 at the time of publication, at 77.10% (Targ et al., 2016).
5. Ablation Insights and Design Rationale
RiR establishes several analytical findings regarding the function and necessity of dual streams:
- Complementary roles: The residual stream ensures stable gradient flow and preserves untransformed features, while the transient stream enables complex nonlinear transformations and selective feature discarding/remapping prior to re-entering the residual path.
- Ablation study: Nullifying all residual-to-residual or transient-to-transient links in a trained "ResNet Init" network considerably reduces performance, demonstrating both streams' critical contributions. The importance of each stream varies with depth.
- Depth stability: With increasing block depth, RiR continues to train stably and improves in accuracy, whereas classic ResNet blocks degrade. The dual-stream construct regularizes effective residual depth.
A plausible implication is that dual-stream designs grant models the flexibility to modulate the balance between identity mapping and deep transformation based on task requirements.
6. Implementation Considerations and Applicability
Implementation of RiR requires only a single convolution or fully connected layer per dual-stream block, initialized according to the matrix structure in Eq. (2). For L2 regularization, the partial identity must be subtracted before applying weight decay to maintain the residual's precise influence.
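A minimal sketch of this initialization and of weight decay that spares the shortcut is given below. It assumes a fully connected layer acting on the concatenated streams, a He-style Gaussian with standard deviation sqrt(2 / fan_in), and an illustrative decay coefficient; the coefficient and the names `partial_identity`, `init_weights`, and `weight_decay_grad` are hypothetical and not taken from the paper.

```python
import math
import random

def partial_identity(n_r, n_t):
    """Square matrix with ones on the residual-to-residual diagonal only."""
    n = n_r + n_t
    return [[1.0 if (i == j and i < n_r) else 0.0 for j in range(n)]
            for i in range(n)]

def init_weights(n_r, n_t, rng):
    """He-style Gaussian weights plus the partial identity (assumed form)."""
    n = n_r + n_t
    std = math.sqrt(2.0 / n)  # fan_in is the width of the concatenated input
    eye = partial_identity(n_r, n_t)
    return [[rng.gauss(0.0, std) + eye[i][j] for j in range(n)]
            for i in range(n)]

def weight_decay_grad(W, n_r, n_t, coeff):
    """Gradient of (coeff / 2) * ||W - partial_identity||^2 with respect to W."""
    eye = partial_identity(n_r, n_t)
    n = len(W)
    return [[coeff * (W[i][j] - eye[i][j]) for j in range(n)] for i in range(n)]

rng = random.Random(0)
n_r, n_t = 2, 1
coeff = 1e-4  # hypothetical decay coefficient, chosen for illustration only

W = init_weights(n_r, n_t, rng)

# If the learned part of the weights is exactly zero, the stored matrix equals
# the partial identity and the corrected decay leaves the shortcut untouched.
shortcut_only = partial_identity(n_r, n_t)
grad = weight_decay_grad(shortcut_only, n_r, n_t, coeff)
naive_grad = [[coeff * w for w in row] for row in shortcut_only]
print(grad[0][0], naive_grad[0][0])
```

Decaying the stored matrix directly (the naive gradient) would shrink the diagonal ones toward zero and so erode the identity shortcut, which is the effect the subtraction avoids.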
This construct can be applied to any CNN or FCN layer without introducing new parameters or computational costs. This suggests broad applicability to diverse architectures (e.g., DenseNet, Inception) and vision tasks (e.g., detection, segmentation), providing a tunable mechanism for balancing feature preservation (residual pathway) and transformation (transient pathway) (Targ et al., 2016).