Timescale separation in ReLU networks from small initialization
Establish whether two-layer ReLU networks trained with gradient flow from small initialization exhibit a timescale separation between weight directions analogous to the separation observed in linear networks, and characterize the mechanism and conditions under which this directional separation arises.
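To make the object of study concrete, the following is a minimal NumPy sketch (not from the paper) of the experiment one would run to probe the conjecture: a two-layer ReLU student is trained by small-step full-batch gradient descent, used here as a proxy for gradient flow, from initialization scale δ ≪ 1, while per-neuron norms and directions are logged. The single-neuron teacher, network width, learning rate, and step count are illustrative assumptions chosen only so the phases are visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative task (assumption, not from the paper): targets from a
# single-neuron ReLU teacher on random inputs in R^d.
d, n, width = 5, 20, 50
X = rng.standard_normal((n, d))
w_teacher = rng.standard_normal(d)
w_teacher /= np.linalg.norm(w_teacher)
y = np.maximum(X @ w_teacher, 0.0)

# Two-layer ReLU student f(x) = sum_j a_j * relu(w_j . x),
# initialized at scale delta << 1 (the "small initialization" regime).
delta = 1e-4
W = delta * rng.standard_normal((width, d))   # first-layer weights w_j
a = delta * rng.standard_normal(width)        # second-layer weights a_j

lr, steps = 1e-2, 200_000   # tiny-step GD as a proxy for gradient flow
norm_hist, angle_hist = [], []

for t in range(steps):
    pre = X @ W.T                  # (n, width) pre-activations
    h = np.maximum(pre, 0.0)       # ReLU features
    resid = h @ a - y              # (n,) residuals f(x_i) - y_i
    # Gradients of the mean-squared loss (1 / 2n) * ||resid||^2
    grad_a = h.T @ resid / n
    grad_W = ((resid[:, None] * (pre > 0)) * a[None, :]).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W
    if t % 1000 == 0:
        norms = np.linalg.norm(W, axis=1)
        norm_hist.append(norms.copy())
        # Angle of each neuron direction w_j / ||w_j|| to the teacher
        cos = (W @ w_teacher) / np.maximum(norms, 1e-12)
        angle_hist.append(np.arccos(np.clip(cos, -1.0, 1.0)))

norm_hist = np.array(norm_hist)
# If the conjectured separation holds, directions align while all norms
# stay O(delta), after which norms grow group by group; plateaus in
# norm_hist (and early collapse of angle_hist) are the signature to check.
print("final neuron norms (top 5):", np.sort(norm_hist[-1])[::-1][:5])
```

The analogous linear-network experiment replaces the ReLU with the identity, where the modes are learned sequentially; the open question is whether the logged histories above show the same staged, direction-by-direction escape from the small-initialization saddle.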
References
Because the ReLU activation function is piece-wise linear, we conjecture that ReLU networks trained from small initialization have a timescale separation between different directions, similar to the mechanism in linear networks.
— Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures (arXiv:2512.20607, Zhang et al., 23 Dec 2025), Appendix C: Additional Discussion (ReLU activation function)