Timescale separation in ReLU networks from small initialization

Establish whether two-layer ReLU networks trained with gradient flow from small initialization exhibit a timescale separation between weight directions analogous to the one in linear networks, and characterize the mechanism and the conditions under which this directional separation arises.

Background

The main text shows that linear networks exhibit a timescale separation between directions, set by the singular values of the data, whereas networks with quadratic activations exhibit a separation between units, set by the initialization. Because the ReLU activation is piecewise linear, the authors conjecture that ReLU networks trained from small initialization behave like the linear case, with separation between directions rather than between units.
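For reference, the classical analysis of two-layer linear networks makes the directional timescales explicit (assuming whitened inputs and a balanced, small initialization; this standard setting is used here for illustration and is not necessarily the paper's exact setup). Writing the input-output correlation as $\Sigma^{yx} = \sum_i s_i\, u_i v_i^\top$, the strength $a_i(t)$ of the learned map along mode $i$ evolves under gradient flow as

$$\dot a_i = 2\, a_i \,(s_i - a_i), \qquad a_i(0) = a_0 \ll s_i,$$

so each mode grows sigmoidally and leaves the vicinity of zero at time

$$t_i \approx \frac{1}{2 s_i} \log \frac{s_i}{a_0}.$$

Directions with larger singular values are therefore learned first, and shrinking the initialization $a_0$ stretches the gaps between stages; the conjecture is that ReLU networks from small initialization inherit this directional ordering.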

Verifying this conjecture would complete the picture of how the timescale-separation mechanism differs across common nonlinear activations and help predict when ReLU networks display stage-like learning driven by directional alignment.
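One way to probe the conjecture numerically is sketched below: a small two-layer ReLU network is trained with full-batch gradient descent (a crude discretization of gradient flow) from small initialization on a rank-3 teacher with well-separated singular values, and the residual correlation along each teacher direction is tracked to see whether directions are fitted in stages ordered by singular value. All sizes, the teacher construction, the learning rate, and the logging schedule are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-3 linear teacher with well-separated singular values, so the
# input-output correlation has three distinct directions/timescales.
# (All of these choices are hypothetical, for illustration only.)
d_in, d_out, n, width = 20, 10, 2000, 200
sv = np.array([4.0, 2.0, 1.0])
U = np.linalg.qr(rng.standard_normal((d_out, 3)))[0]   # output directions
V = np.linalg.qr(rng.standard_normal((d_in, 3)))[0]    # input directions
X = rng.standard_normal((n, d_in))
Y = X @ V @ np.diag(sv) @ U.T

# Two-layer ReLU student from small initialization (scale alpha).
alpha = 1e-4
W1 = alpha * rng.standard_normal((width, d_in))
W2 = alpha * rng.standard_normal((d_out, width))

lr, steps = 0.01, 6000
for t in range(steps + 1):
    H = np.maximum(X @ W1.T, 0.0)        # hidden activations, (n, width)
    R = H @ W2.T - Y                     # residual, (n, d_out)
    if t % 500 == 0:
        # Residual correlation projected onto each teacher mode; a mode
        # counts as "learned" once its projection has decayed toward zero.
        M = X.T @ R / n                  # (d_in, d_out)
        proj = [float(abs(V[:, i] @ M @ U[:, i])) for i in range(3)]
        print(t, [f"{p:.3f}" for p in proj])
    # Gradients of the mean-squared loss 0.5/n * ||residual||_F^2.
    gW2 = R.T @ H / n
    gW1 = ((R @ W2) * (H > 0)).T @ X / n
    W2 -= lr * gW2
    W1 -= lr * gW1
```

If the conjecture holds, the printed projections should collapse one after another in order of decreasing singular value, with the gaps between collapses growing as alpha is reduced.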

References

Because the ReLU activation function is piece-wise linear, we conjecture that ReLU networks trained from small initialization have a timescale separation between different directions, similar to the mechanism in linear networks.

Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures (2512.20607 - Zhang et al., 23 Dec 2025) in Appendix C — Additional Discussion (ReLU activation function)