Critical-gradient initialization via vanishing gradient Lyapunov exponent (λg = 0)
Determine whether setting the initialization hyperparameters of the deep transformer architecture defined in Section 2 so that the gradient Lyapunov exponent λg equals zero—i.e., exactly at the boundary between exponentially exploding and vanishing end-to-end input–output Jacobian norms—constitutes a good initialization for training deep randomly initialized transformers.
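One way to make the criterion concrete is to estimate λg empirically: backpropagate a random readout through a stack of randomly initialized blocks and measure the average per-layer log growth of the gradient norm, then tune the initialization hyperparameters until that estimate is approximately zero. The sketch below is only illustrative and assumes a generic pre-LN attention + MLP block with a single weight-scale hyperparameter `sigma_w`; it is not the architecture or parameterization of Section 2 of the paper.

```python
# Sketch: Monte Carlo estimate of the gradient Lyapunov exponent lambda_g for a
# stack of randomly initialized transformer blocks. The block, depth, width, and
# the hyperparameter sigma_w are placeholders, not the model of Section 2.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)


class Block(nn.Module):
    """Pre-LN self-attention + MLP block with an explicit weight scale sigma_w."""

    def __init__(self, d_model: int, n_heads: int, sigma_w: float):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff1 = nn.Linear(d_model, 4 * d_model)
        self.ff2 = nn.Linear(4 * d_model, d_model)
        # Re-scale weight matrices to std sigma_w / sqrt(fan_in); sigma_w stands
        # in for the initialization hyperparameter tuned toward lambda_g = 0.
        for p in self.parameters():
            if p.dim() >= 2:
                nn.init.normal_(p, std=sigma_w / math.sqrt(p.shape[-1]))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ff2(torch.relu(self.ff1(self.ln2(x))))


def estimate_lambda_g(blocks, seq_len=16, d_model=128, n_samples=8):
    """Average per-layer log growth of a backpropagated gradient norm.

    lambda_g > 0: gradient norms grow exponentially toward the input (exploding);
    lambda_g < 0: they shrink (vanishing); the conjectured critical
    initialization sits at lambda_g = 0.
    """
    rates = []
    for _ in range(n_samples):
        x = torch.randn(1, seq_len, d_model, requires_grad=True)
        h = x
        for blk in blocks:
            h = blk(h)
        v = torch.randn_like(h)          # random linear readout plays the "loss"
        (h * v).sum().backward()
        log_growth = torch.log(x.grad.norm()) - torch.log(v.norm())
        rates.append(log_growth.item() / len(blocks))
    return sum(rates) / len(rates)


if __name__ == "__main__":
    for sigma_w in (0.5, 1.0, 2.0):      # scan the init scale; watch lambda_g respond
        blocks = [Block(128, 4, sigma_w).eval() for _ in range(32)]
        print(f"sigma_w={sigma_w:>3}: lambda_g ~ {estimate_lambda_g(blocks):+.4f}")
```

Under this reading, the open question is whether driving such an estimate to zero at initialization is in fact sufficient (or necessary) for deep randomly initialized transformers to train well.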
References
Since neither property is conducive to training, a natural conjecture for a good initialization is at the edge between these two phases in which λg = 0.
— Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
(Cowsik et al., arXiv:2403.02579, 5 Mar 2024), Subsection 3.4 (Propagation of gradients)