Edge-of-chaos initialization via vanishing angle Lyapunov exponent (λ_a = 0)

Determine whether setting the initialization hyperparameters of the deep transformer architecture defined in Section 2 (single-head self-attention followed by a tokenwise multilayer perceptron with residual connections and layer normalization) so that the angle Lyapunov exponent λ_a equals zero at the collapsed fixed point of the token-geometry update map constitutes a good initialization for training, in the sense of enabling stable, non-collapsing forward signal propagation in deep randomly initialized transformers.

Background

The paper analyzes forward signal propagation in deep, randomly initialized transformers by tracking the evolution of the token dot-product matrix under attention and MLP blocks with residual connections. Under a permutation-symmetric ansatz, the authors reduce the geometry to a 2D update map in terms of diagonal and off-diagonal entries (q, p), revealing a collapsed fixed point (p = q) and, depending on hyperparameters, a non-collapsed regular simplex fixed point.
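
To make the reduction concrete, here is a minimal numerical sketch of iterating a 2D (q, p) update map and tracking the inter-token cosine p/q. The `toy_map` below is a hypothetical stand-in chosen only so that a collapsed fixed point (p = q) exists and the diagonal is renormalized to 1 each layer; it is not the actual map derived from the attention + MLP + residual + LayerNorm architecture in Section 2.

```python
def toy_map(q, p, a=0.5, b=0.3):
    """One stand-in residual layer followed by a LayerNorm-like rescaling
    that pins the diagonal entry back to 1 (hypothetical, for illustration)."""
    q_new = (1.0 + a + b) * q               # growth of the per-token squared norm
    p_new = (1.0 + a) * p + b * p**2 / q    # growth of the inter-token overlap
    return 1.0, p_new / q_new               # renormalize so the diagonal stays at 1

def iterate(p0=0.99, depth=50):
    """Iterate the (q, p) map through `depth` layers, recording the cosine p/q."""
    q, p = 1.0, p0
    cosines = []
    for _ in range(depth):
        q, p = toy_map(q, p)
        cosines.append(p / q)               # cosine between two distinct tokens
    return cosines

cosines = iterate()
print("final cosine:", cosines[-1])
# A final cosine near 1 would mean collapse onto the p = q fixed point;
# a value bounded away from 1 means the tokens settle into a
# non-collapsed (simplex-like) configuration.
```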

Linearization around the collapsed fixed point defines an angle Lyapunov exponent λ_a. When λ_a < 0, tokens collapse exponentially (ordered phase); when λ_a > 0, nearby tokens diverge exponentially and approach a simplex (chaotic phase). Based on this dichotomy, the authors conjecture that an edge-of-chaos initialization (λ_a = 0) is good for training.
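
The exponent itself can be read off numerically as λ_a = log |dc'/dc| evaluated at the collapsed fixed point c = 1 of the cosine update c → c'. The sketch below does this by a finite difference for the same hypothetical stand-in map as above (written for q = 1); the gains a and b are illustrative parameters chosen only to realize the three signs of λ_a, not the paper's actual initialization hyperparameters, and the λ_a = 0 case of this toy map is degenerate (the map becomes linear with unit slope), serving only to illustrate marginal dynamics.

```python
import math

def cosine_map(c, a=0.5, b=0.3):
    # Hypothetical cosine update c -> c' with a collapsed fixed point at c = 1
    # (same stand-in form as the toy_map sketch above, written for q = 1).
    return ((1.0 + a) * c + b * c**2) / (1.0 + a + b)

def angle_lyapunov(a=0.5, b=0.3, eps=1e-6):
    # lambda_a = log |dc'/dc| at the collapsed fixed point c = 1,
    # estimated here by a one-sided finite difference.
    slope = (cosine_map(1.0, a, b) - cosine_map(1.0 - eps, a, b)) / eps
    return math.log(abs(slope))

for a, b in [(0.5, -0.3), (0.5, 0.3), (0.5, 0.0)]:
    lam = angle_lyapunov(a, b)
    phase = "ordered" if lam < -1e-6 else ("chaotic" if lam > 1e-6 else "edge of chaos")
    print(f"a={a}, b={b}: lambda_a = {lam:+.4f}  ({phase})")
```

The sign of λ_a classifies the phase exactly as in the dichotomy above: negative means exponential token collapse, positive means exponential escape toward the simplex, and zero is the conjectured edge-of-chaos initialization.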

References

Since neither property seems conducive to stable training, a natural conjecture for a good initialization is at the edge of chaos where λ_a = 0.

Geometric Dynamics of Signal Propagation Predict Trainability of Transformers (arXiv:2403.02579, Cowsik et al., 5 Mar 2024), Subsection 3.3 (Fixed Points of the Update Map)