
Critical-gradient initialization via a vanishing gradient Lyapunov exponent (λg = 0)

Determine whether setting the initialization hyperparameters of the deep transformer architecture defined in Section 2 so that the gradient Lyapunov exponent λg equals zero—i.e., exactly at the boundary between exponentially exploding and vanishing end-to-end input–output Jacobian norms—constitutes a good initialization for training deep randomly initialized transformers.


Background

The authors study the backpropagation of gradients through deep transformers by analyzing the expected squared Frobenius norm of the end-to-end input–output Jacobian, which typically scales exponentially with depth. This scaling is characterized by a gradient Lyapunov exponent λg, with E[‖J‖_F²] growing roughly as e^{λg·L} at depth L: λg < 0 corresponds to vanishing gradients and λg > 0 to exploding gradients.
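To make the criterion concrete, the following minimal Python sketch estimates λg by Monte-Carlo averaging the squared Frobenius norm of the end-to-end Jacobian of a deep stack of random layers. It deliberately uses a simplified random linear + tanh stack rather than the paper's Section 2 transformer, and the initialization scale sigma_w is an assumed stand-in for the architecture's initialization hyperparameters.

```python
import numpy as np

# Minimal sketch (NOT the paper's Section-2 transformer): estimate the gradient
# Lyapunov exponent lambda_g for a deep stack of random tanh layers by averaging
# the squared Frobenius norm of the end-to-end input-output Jacobian.
# sigma_w is an assumed initialization scale playing the role of the
# architecture's initialization hyperparameters.

def end_to_end_jacobian_sq_norm(depth, width, sigma_w, rng):
    x = rng.standard_normal(width)
    J = np.eye(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        pre = W @ x
        x = np.tanh(pre)
        # Layer Jacobian: diag(tanh'(pre)) @ W, accumulated into the product.
        J = (1.0 - np.tanh(pre) ** 2)[:, None] * W @ J
    return np.sum(J ** 2)

def gradient_lyapunov_exponent(depth=40, width=128, sigma_w=1.0, n_samples=20, seed=0):
    rng = np.random.default_rng(seed)
    norms = [end_to_end_jacobian_sq_norm(depth, width, sigma_w, rng)
             for _ in range(n_samples)]
    # lambda_g ~ (1/L) * log E[ ||J||_F^2 ]  (up to depth-independent constants)
    return np.log(np.mean(norms)) / depth

if __name__ == "__main__":
    for sigma_w in (0.8, 1.0, 1.5):
        lam = gradient_lyapunov_exponent(sigma_w=sigma_w)
        # lambda_g < 0 -> vanishing gradients, > 0 -> exploding, ~ 0 -> critical.
        print(f"sigma_w={sigma_w}: estimated lambda_g = {lam:+.3f}")
```

Sweeping the initialization scale and locating where the estimated λg crosses zero mimics the critical-initialization criterion considered here; scales on either side of the crossing give exponentially vanishing or exploding Jacobian norms.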

Motivated by the need to avoid both extremes for trainability, the authors conjecture that initializing at the boundary (λg = 0), where gradients neither explode nor vanish exponentially, provides a good initialization for training.

References

Since neither property is conducive to training, a natural conjecture for a good initialization is at the edge between these two phases in which λg = 0.

Geometric Dynamics of Signal Propagation Predict Trainability of Transformers (2403.02579 - Cowsik et al., 5 Mar 2024) in Subsection 3.4 (Propagation of gradients)