ResNet Init: Theory & Practice

Updated 7 April 2026

ResNet Init is a set of theoretical and algorithmic strategies designed to initialize deep residual networks, ensuring stable signal propagation and effective optimization.
It relies on shortcut-2 architectures (residual branches of two layers) and tailored weight scaling to keep training error and Hessian conditioning independent of network depth.
Empirical comparisons reveal that depth-aware initialization schemes, including deterministic methods like ZerO, enhance convergence stability and performance even in normalization-free settings.

Residual Network initialization (“ResNet Init”) refers to both the theoretical and algorithmic strategies used to initialize weights in deep residual architectures, ensuring stable signal propagation and effective optimization. The convergence and generalization performance of very deep ResNets are tightly governed by the interaction between the structure of shortcut connections, the spectral properties of the loss Hessian at initialization, variance-preserving properties of the forward and backward passes, and practical adaptations such as batch normalization, specialized scaling, or deterministic schemes.

1. Theoretical Foundations of Residual Initialization

The central insight of ResNet initialization is that the depth of the shortcut (the number of layers it skips) controls both the stationarity order and the Hessian conditioning of the loss at initialization. In classical feedforward networks or residual nets with shortcut length $n=1$, the loss landscape at the zero-initialization point is characterized by a Toeplitz-block Hessian whose condition number $\kappa(H(0))$ grows as $O(L)$ with the network depth $L$, which makes gradient-based optimization extremely slow or unstable as $L \rightarrow \infty$. For shortcuts of length $n\geq 3$, as appear in bottleneck blocks, the Hessian at zero is exactly zero (the loss has a high-order saddle there), which traps first-order optimizers (Li et al., 2016).

In contrast, when the residual branch contains precisely $n=2$ layers, the Hessian adopts a block off-diagonal structure with spectrum $\{\pm \sqrt{\lambda_j(A^\intercal A)}\}$ independent of depth. Thus, the condition number at initialization is determined solely by the data covariance and activation derivatives, not by network depth. This property underpins successful training of arbitrarily deep ResNets with two-layer residual blocks. Empirically, only shortcut-2 designs exhibit depth-invariant training error as $L$ increases; other shortcut lengths suffer from one of two failure modes: exploding curvature (shortcut-1 or plain networks) or flat saddles (shortcut length $\geq 3$) (Li et al., 2016).
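The depth-independent spectrum follows from a short eigenvalue calculation. As a sketch, assume the Hessian at zero has the block layout written below with a square matrix $A$; the layout is inferred from the block off-diagonal description above, and $u$, $v$, and $\mu$ are notation introduced here for illustration:

$$
H = \begin{pmatrix} 0 & A^\intercal \\ A & 0 \end{pmatrix}, \qquad
H \begin{pmatrix} u \\ v \end{pmatrix} = \mu \begin{pmatrix} u \\ v \end{pmatrix}
\;\Longrightarrow\; A^\intercal v = \mu u, \quad A u = \mu v
\;\Longrightarrow\; A^\intercal A\, u = \mu^2 u .
$$

Hence $\mu^2 = \lambda_j(A^\intercal A)$ and $\mu = \pm\sqrt{\lambda_j(A^\intercal A)}$. The eigenvalues come in symmetric pairs, and the depth $L$ does not enter the calculation.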

2. Initialization Schemes and Signal Propagation

Weight initialization in ResNets is tailored to the unique architecture:

For classic He (MSRA) or Xavier (Glorot) Gaussian schemes, forward and backward signal variances are preserved only approximately, and the Hessian condition number deteriorates rapidly with depth, stalling optimization in very deep stacks (Li et al., 2016, Taki, 2017).
Analyses based on mean-field theory prescribe scaling the residual-branch weight variances inversely with both the fan-in and the number of residual blocks: $\mathrm{Var}(W) \propto 1/(d_{\mathrm{in}} L)$, where $d_{\mathrm{in}}$ is the fan-in, which ensures that both forward activations and gradients remain $O(1)$ even as $L \rightarrow \infty$ (Taki, 2017).
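The effect of dividing the weight variance by the number of blocks can be checked with a small calculation. The sketch below assumes purely linear residual blocks, x <- x + W x, with i.i.d. zero-mean weights; under that assumption the expected squared norm grows by a factor (1 + fan_in * variance) per block. The function name and the chosen widths and depths are illustrative, not taken from the cited papers.

```python
import math


def expected_norm_growth(depth: int, fan_in: int, weight_var: float) -> float:
    """Expected growth of ||x||^2 through `depth` linear residual blocks x <- x + W x.

    Assumes W has i.i.d. zero-mean entries with variance `weight_var`, so that
    E||x + W x||^2 = (1 + fan_in * weight_var) * ||x||^2 for every block.
    """
    return (1.0 + fan_in * weight_var) ** depth


fan_in = 256  # illustrative width
for depth in (10, 100, 1000):
    # Variance scaled by fan-in only: the norm doubles at every block.
    fan_in_only = expected_norm_growth(depth, fan_in, 1.0 / fan_in)
    # Variance scaled by fan-in and depth: growth is (1 + 1/L)^L, bounded by e.
    depth_aware = expected_norm_growth(depth, fan_in, 1.0 / (fan_in * depth))
    print(f"L={depth:5d}  fan-in only: {fan_in_only:.3e}  depth-aware: {depth_aware:.4f}")
```

With fan-in-only scaling the growth factor is 2^L, while the depth-aware scaling stays below e for every depth, which is the O(1) behaviour described above.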

Batch normalization (BN), when present, renders the network almost entirely insensitive to weight scale at initialization and turns the exponential growth of gradient variance with depth into linear growth, $O(L)$. However, in normalization-free settings, careful initialization and skip-scaling are essential (Taki, 2017, Civitelli et al., 2021).

3. Empirical Comparison of Initialization Approaches

Empirical studies consistently demonstrate the superiority of depth-aware and architecture-aware initialization schemes in residual networks:

Initialization Scheme	Depth Stability (no BN)	Hessian Condition at Init	Test Error (CIFAR-10, ResNet-18)
Xavier / He Gaussian	Fails as depth grows	Deteriorates with depth	5.23%
Orthogonal	Slightly improved	Deteriorates more slowly, still diverges	Not directly reported
Shortcut-2 zero init	Stable as depth grows	Constant in depth	5.13%
ZerO (deterministic)	Stable and reproducible	Preserves dynamical isometry	5.13%
Weight-norm (WN) / mean-field init	Stable up to 200 layers	Lowest curvature at init	4.7–7.7% (depends on BN and warmup)
Normalization-free scaling	Stable in deep networks	Forward/backward variances preserved	Closes gap to BN baseline

Shortcut-2 zero-centered initialization (optionally with a tiny random perturbation) tolerates larger learning rates than Gaussian schemes and achieves depth-independent convergence and loss (Li et al., 2016, Zhao et al., 2021, Arpit et al., 2019, Civitelli et al., 2021).

4. Concrete Initialization Algorithms

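Typical initialization prescriptions for deep ResNets specify both weight scaling and architectural details. Two canonical procedures are the “zero-centered, shortcut-2” initialization (Li et al., 2016) and normalization-free block scaling (Civitelli et al., 2021). The former centers the weights of each two-layer residual branch at zero, optionally with a tiny random perturbation; the latter rescales the sum of the skip and residual paths so that signal variance does not grow with depth.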
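A minimal sketch of the zero-centered, shortcut-2 idea, assuming a fully connected block y = x + W2 relu(W1 x) in place of convolutions. The helper names, the width, and the perturbation scale `eps` are hypothetical choices for illustration and are not taken from Li et al. (2016).

```python
import random


def init_shortcut2_block(width, eps=0.0, seed=0):
    """Return (W1, W2) for a two-layer residual branch, centered at zero.

    `eps` is the standard deviation of an optional tiny Gaussian
    perturbation (an assumed value, chosen only for illustration).
    """
    rng = random.Random(seed)

    def make():
        return [[rng.gauss(0.0, eps) if eps > 0 else 0.0 for _ in range(width)]
                for _ in range(width)]

    return make(), make()


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def block_forward(x, W1, W2):
    """y = x + W2 relu(W1 x): identity shortcut around a two-layer branch."""
    branch = matvec(W2, relu(matvec(W1, x)))
    return [xi + bi for xi, bi in zip(x, branch)]


x = [1.0, -2.0, 0.5, 3.0]

W1, W2 = init_shortcut2_block(width=4)             # exact zero init
identity_output = block_forward(x, W1, W2)         # equals x: the block is the identity

W1, W2 = init_shortcut2_block(width=4, eps=1e-3)   # tiny random perturbation
perturbed_output = block_forward(x, W1, W2)
deviation = max(abs(a - b) for a, b in zip(x, perturbed_output))
print(identity_output, deviation)
```

At exact zero the block passes its input through unchanged, so a stack of any depth starts as the identity map; the perturbed variant stays close to that point while breaking the symmetry between units.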

Schemes such as ZerO initialize weights deterministically with identity and Hadamard transforms, possibly zeroing the last convolution in each residual path to maintain dynamical isometry; this provides robust and reproducible training even without BN (Zhao et al., 2021).
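The two deterministic building blocks named here, identity and Hadamard matrices, are easy to construct. The sketch below uses the standard Sylvester construction and a partial identity; it is not the ZerO algorithm itself, and how these matrices are assigned to specific layers is left out. The function names are hypothetical.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = [[1]]
    while len(H) < n:
        # H_{2k} = [[H_k, H_k], [H_k, -H_k]]
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H


def partial_identity(rows, cols):
    """Ones on the main diagonal, zeros elsewhere (rectangular shapes allowed)."""
    return [[1.0 if i == j else 0.0 for j in range(cols)] for i in range(rows)]


H = hadamard(4)
# Rows are mutually orthogonal: H H^T = n I, so H / sqrt(n) is an orthogonal matrix.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in H] for r1 in H]
print(gram)
print(partial_identity(2, 4))
```

Because the rescaled Hadamard matrix is orthogonal and has no zero entries, it mixes all input coordinates without changing norms, whereas a rectangular partial identity only copies the leading coordinates and pads the rest with zeros.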

5. Special Cases and Extensions

The presence or absence of normalization, the shortcut depth, and the residual block composition (bottleneck vs. basic) each change the appropriate initialization:

Bottleneck blocks (shortcut length $n \geq 3$) have a vanishing Hessian at zero; thus, zero-initialized residual convolutions must be combined with careful placement of BN layers or alternative scaling to avoid saddle behavior (Li et al., 2016, Zhao et al., 2021).
Weight-normalized ResNets require per-block gains on the last convolution of each stage, scaled so that signal norms are preserved as they propagate through the network (Arpit et al., 2019).
Deterministic initializations (e.g., ZerO) have been shown to enhance reproducibility, enable ultra-deep training without BN, and promote low-rank, sparse solution trajectories (Zhao et al., 2021).
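The vanishing Hessian of deeper residual branches can be seen in a toy case. The sketch below assumes a single scalar, linear residual block f(x) = x + (w_1 ... w_n) x with squared-error loss, and estimates the Hessian at w = 0 by central finite differences. The data point (x = 1, y = 2) and the step size are arbitrary illustrative choices, and the toy shows only the local structure at zero, not the depth dependence of the condition number.

```python
def loss(w, x=1.0, y=2.0):
    """Squared error of a scalar linear residual block f(x) = x + prod(w) * x."""
    prod = 1.0
    for wi in w:
        prod *= wi
    return 0.5 * (x + prod * x - y) ** 2


def hessian_at_zero(n, h=1e-2):
    """Central finite-difference Hessian of `loss` at w = 0 for an n-layer branch."""

    def f(i, si, j, sj):
        w = [0.0] * n
        w[i] += si * h
        w[j] += sj * h
        return loss(w)

    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            H[i][j] = (f(i, 1, j, 1) - f(i, 1, j, -1)
                       - f(i, -1, j, 1) + f(i, -1, j, -1)) / (4 * h * h)
    return H


for n in (1, 2, 3):
    print(n, hessian_at_zero(n))
```

In this toy the one-layer branch has a nonzero diagonal entry, the two-layer branch has zero diagonal and nonzero off-diagonal entries (an indefinite, block off-diagonal form), and the three-layer branch has an all-zero Hessian, matching the flat saddle described for bottleneck blocks.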

6. Practical Considerations, Limitations, and Best Practices

Essential recommendations arising from multiple works include:

Small initial weight norms keep the network in the convex-like region around zero, which benefits optimization (Li et al., 2016).
Learning rate warmup (linear or cosine over 5–10 epochs) is especially important for non-random or low-rank initializations, where it prevents gradient explosion at startup (Zhao et al., 2021, Arpit et al., 2019).
With batch normalization, precise control over variance at initialization is less crucial, but normalization-free and deterministic approaches require an explicit scaling factor where the skip and residual paths are summed, to avoid exponential variance growth (Civitelli et al., 2021).
Tiny random perturbations are helpful (but not mandatory) to aid escape from strict saddle points present with zero-mean initialization, especially in shortcut-2 architectures (Li et al., 2016).
For width-increasing layers, attention to the expressivity loss of pure identity initialization is needed; insertion of Hadamard or other orthogonal transforms can restore full-rank propagation (Zhao et al., 2021).
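The warmup recommendation in the list above can be sketched as a simple linear ramp. The function below is a hypothetical helper rather than an API from any library, and the base learning rate and steps-per-epoch value (CIFAR-10 with batch size 128) are assumptions chosen only to make the example concrete.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate up to `base_lr` over `warmup_steps` steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


steps_per_epoch = 391               # assumed: 50,000 images at batch size 128
warmup_steps = 5 * steps_per_epoch  # 5 epochs, the low end of the 5-10 epoch range
for step in (0, warmup_steps // 2, warmup_steps - 1, warmup_steps):
    print(step, warmup_lr(step, base_lr=0.1, warmup_steps=warmup_steps))
```

After the ramp the schedule returns the base rate unchanged, so it can be composed with any decay schedule applied afterwards.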

A plausible implication is that, while ResNet architectures are more robust than plain networks to the details of initialization, optimal ResNet Init strategies require matching the scheme to the architecture, normalization protocol, and learning rate schedule.

7. Impact and Ongoing Developments

ResNet Init, which spans architecture-aware randomization, deterministic identity/Hadamard assignment, and normalization-free block scaling, remains foundational for deep learning. As practitioners extend depth, forgo batch normalization, or seek deterministic and scalable initialization, the convergence, generalization, and reproducibility properties made possible by these schemes continue to define state-of-the-art practice (Li et al., 2016, Taki, 2017, Zhao et al., 2021, Arpit et al., 2019, Civitelli et al., 2021). The key insight is the deep connection between shortcut depth, Hessian spectrum at zero, skip scaling, and the resulting optimization landscape, which together allow deep residual networks to train effectively across a range of modalities and depths.