Universal Initialization Procedure

Updated 29 December 2025

Universal Initialization is a mathematically principled method that defines initial parameter states to maintain signal and gradient flow in deep learning and sensor calibration applications.
It generalizes classical methods like Xavier and Kaiming by analytically deriving moment-preserving and symmetry-based scaling rules for complex architectures.
The approach is applicable across various domains, offering efficient training, robust convergence, and theoretical guarantees even in heavy-tailed or tensorized network scenarios.

A universal initialization procedure is a mathematically principled, architecture-agnostic methodology for constructing initial states of parameters such that training algorithms—typically optimization or inference schemes for deep learning or calibration pipelines in robotics—exhibit stable, efficient convergence and robust generalization. This concept encompasses analytically derived schemes that generalize or unify classical initialization approaches (e.g., Xavier or Kaiming), extend them to complex modern architectures (e.g., tensorized CNNs, very deep cascades, weight-normalized and residual networks), and may also apply to sensor calibration and physical simulations. Universal initialization procedures are characterized by their theoretical property to guarantee signal and gradient propagation across arbitrary depths, widths, topologies, or modalities, often by preserving high-order moments, structural symmetries, or explicit optimization objectives.

1. Mathematical Foundations and Motivations

Multiple universal initialization procedures arise from rigorous analysis of signal propagation and optimization landscapes. Key mathematical motivations include:

Variance and Moment Preservation: Classical variance-preserving rules (e.g., $\mathrm{Var}(W_{ij})=\frac{2}{n}$ for ReLU) seek to maintain the scale of activations and gradients. Fractional-moment–preserving schemes extend this to arbitrary order $s\in(0,2]$ , targeting heavy-tailed scenarios relevant in small-batch or non-Gaussian regimes. This yields generalized scaling rules:

$\mathbb{E}|W_{ij}|^s = \frac{C_s}{n}$

with $C_s$ derived analytically for each activation (Gurbuzbalaban et al., 2020).

Structural Symmetry & Isotropy: Indifference principles motivate symmetric placements in architectures with complex operations, such as the hyperoctahedral constellation for polyharmonic cascades. This guarantees that all spatial directions are treated without bias, preventing both vanishing and exploding signals or derivatives, even in extremely deep architectures (Bakhvalov, 22 Dec 2025).
Architectural Invariance and Graph-Theoretic Variance Control: For tensorial convolutional neural networks (TCNNs), backbone graph representations and hypergraph contractions enable analytical derivation of the multiplicative effects of tensor contractions, yielding closed-form variance scaling rules that generalize across all forms of tensor decomposition (CP, Tucker, Tensor Ring, etc.) (Pan et al., 2022).

Across these procedural classes, universal initialization is characterized by analytical tractability and the ability to recover classical schemes as special cases.

2. Universal Initialization in Deep Neural Architectures

Universal initialization procedures are defined for a range of deep network topologies:

Fully Connected and Convolutional Networks: Fractional-moment (Gurbuzbalaban et al., 2020) and randomized asymmetric (RA) (Lu et al., 2019) initialization rules are implemented by sampling weights (from Gaussian, Pareto, or α-stable distributions) with scaling that exactly preserves the chosen moment or that introduces asymmetry to prevent dying neurons. For instance, RA-initialization with $p > \tfrac{1}{2}$ and non-negative biases guarantees zero probability of entire-layer death, eliminating the "dying ReLU" problem even in the infinite-depth limit.
Residual Connections and Identity Control: In residual networks, procedures such as IDInit (Pan et al., 6 Mar 2025) enforce exact or approximate identity mappings at initialization by constructing padded or repeated identity matrices (IDI) for each weight tensor, including non-square or convolutional layers. For the last sub-stem, tiny perturbations ( $\pm\varepsilon$ ) prevent permanently inactive units. This ensures singular values are close to unity, and, via stochastic symmetry-breaking under SGD, the rank of all layers can grow appropriately.
Tensorial and Factorized Networks: The backbone-graph approach analytically derives the total variance contributed by all contraction edges, resulting in a universal formula for each tensor factor's variance, typically

$\mathrm{Var}[W^{(k)}] = \left(\frac{1}{p_a\prod_{i<j} v_{ij}}\right)^{1/n}$

where $p_a$ is the activation's variance amplification factor; this generalizes both Xavier and Kaiming initialization (Pan et al., 2022).

Weight-Normalized and Orthogonal Structures: Universal scaling in weight-normalized networks ensures preservation of both forward and backward norm—the gain parameter per layer is set as $g_l = \sqrt{2n_{l-1}/n_l}$ and all weights are orthogonally initialized. For ResNets, additional scaling of the last convolution in a block by $1/B_k$ (where $B_k$ is the number of blocks in a stage) prevents cumulative growth or decay of signal magnitude (Arpit et al., 2019).
Polyharmonic and Spline-Based Cascades: The universal initialization of polyharmonic cascades involves constructing constellation matrices as regular hyperoctahedra plus a central point, ensuring isotropy, analytical invertibility, and per-layer cost equivalent to a classical dense MLP. This approach achieves robust, skipless signal propagation across hundreds of layers (Bakhvalov, 22 Dec 2025).

3. Extensions Beyond Neural Networks: Sensor and Physical System Calibration

Universal initialization extends to non-machine learning contexts:

LiDAR–LiDAR Calibration: The TERRA procedure computes a spherical minimum-range descriptor on a uniform Fibonacci lattice, constructs a robust masked $L_1$ distance, and searches $SO(3)$ via a hierarchical grid, yielding sensor-agnostic, targetless rotation initialization effective for all types of LiDARs without any prior or specialized scenes (Duan et al., 2024).
LiDAR–IMU Initialization: A CV motion–based ES-IKF is used as a LiDAR-only odometry front-end, followed by two-stage least-squares to jointly recover temporal offset, extrinsic $SE(3)$ , IMU biases, and gravity. Real-time observability checks and cross-modal optimization ensure convergence and metric-scale accuracy (sub-degree, sub-centimeter, sub-millisecond) independent of sensor model or prior information (Zhu et al., 2022).

4. Algorithmic Workflows and Implementation

Universal initialization procedures are explicitly algorithmic, often admitting closed-form or pseudocode-level descriptions. The following principles are universal:

Architectural Generalization: Procedures automatically adapt to the layer type, e.g., fully connected, convolutional, residual, and multi-branch/attention structures. For example, IDInit prescribes a single identity-like template for all linear or convolutional layers, with sub-stem specialization only for the final block (Pan et al., 6 Mar 2025).
Algorithmic Steps: Across domains, initialization procedures consist of parameter calculation (e.g., kernel-based coefficients, variance scales), construction (e.g., sampling, analytical or symmetric templates), and optional final fine-tuning or downstream distribution (e.g., for sensor extrinsics).
Complexity and Efficiency: Most universal procedures have computational cost no higher than classical initializations (a single forward pass over the same parameter sizes), leverage vectorized or block-symmetric operations for massive speedup (e.g., in polyharmonic cascades, all linear algebra reduces to 2D operations), and tolerate or exploit GPU acceleration natively (Bakhvalov, 22 Dec 2025, Pan et al., 6 Mar 2025).

Pseudocode Example (IDInit, (Pan et al., 6 Mar 2025)):

for layer in network:
    if layer is linear or convolution:
        if layer not last residual sublayer:
            layer.weight = identity_like_matrix(scale)
        else:
            layer.weight = epsilon_perturbed_matrix()

5. Theoretical Guarantees and Empirical Performance

Universal initialization procedures are distinguished by theoretical analysis:

Dynamical Isometry: Rigorous preservation of singular value distributions ensures propagation of both signal and gradient across arbitrary depth (Pan et al., 6 Mar 2025).
Elimination of Pathologies: Procedures such as RA-initialization provably eliminate the dying ReLU phenomenon in fully-connected nets, with $P(\text{any neuron dead at init})=0$ regardless of depth or width (Lu et al., 2019).
Statistical Tail Control: Fractional-moment approaches enable networks to survive in heavy-tailed or small-batch regimes, with provable almost-sure convergence and explicit characterization of output distributions at infinite depth (Gurbuzbalaban et al., 2020).
Empirical Results: Universal procedures demonstrate accelerated early training (e.g., 5–10% top-1 gains and 20–30% reduction in epochs to convergence), stability over hyperparameter seeds, and performance parity—or improvement—over standard batch-normalization or custom schemes, especially for extremely deep or non-standard models (Pan et al., 6 Mar 2025, Arpit et al., 2019, Pan et al., 2022, Bakhvalov, 22 Dec 2025).

6. Generality, Applicability, and Design Considerations

Universal initialization procedures exhibit broad applicability:

Model-Agnosticism: Applicability spans classical MLPs, CNNs, modern attention architectures (ViT, BERT, MoE), factorized/tensorized convolutional networks, and spline/cascade models.
Hyperparameter Simplicity: Most require tuning only a single scalar (e.g., scaling factor or moment order $s$ ) and are robust to activation choices, layer widths, and data modalities.
Unification of Special Cases: Classical Xavier/He initializations are recovered at $s=2$ or in the trivial backbone-graph scenario; analogous specializations emerge through appropriate application of the unified formulas.
Guidelines:
- Use fractional-moment (heavy-tailed) initialization for small-batch or non-Gaussian learning.
- For ReLU nets, favor RA-initialization to prevent dying neurons for $L \gg 20$ .
- For ultra-deep architectures or new tensorized blocks, compute the variance via the backbone graph.
- For cascaded spline architectures or meshless simulation, ensure isotropy by symmetric placement.
- For sensor calibration or robotics, utilize masked descriptors and hierarchical search for initialization.

Universal initialization procedures form a core enabling methodology for modern high-performance learning, optimization, and calibration pipelines, allowing practitioners to unlock the scaling properties of new architectures and sensor configurations while providing rigorous guarantees on trainability, stability, and convergence (Pan et al., 6 Mar 2025, Bakhvalov, 22 Dec 2025, Gurbuzbalaban et al., 2020, Arpit et al., 2019, Pan et al., 2022, Lu et al., 2019, Duan et al., 2024, Zhu et al., 2022).