Activation-order predictor for timescale separation and scaling effects in deep layers

Establish whether, in deep networks with layers of the form $f(x; \theta_{1:H}) = g_{\mathrm{out}}\!\left(\sum_i \phi(g_{\mathrm{in}}(x); u_i)\, v_i\right)$, the order of the layer's dependence on the unit-specific weights $u_i$ in $\phi(g_{\mathrm{in}}(x); u_i)$ (linear versus quadratic) predicts the type of timescale separation (between directions versus between units) in that layer's gradient-flow dynamics and determines how layer width and data distribution affect learning plateaus.
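A minimal sketch of the layer family in question may help fix notation; this is not code from the paper, and the two choices of $\phi$ (one degree-1 and one degree-2 in $u_i$), the helper names, and the toy dimensions are assumptions for illustration only.

```python
import numpy as np

def layer(x, U, V, phi, g_in=lambda z: z, g_out=lambda z: z):
    """One layer f(x) = g_out(sum_i phi(g_in(x); u_i) * v_i).
    Rows of U are the unit weights u_i; rows of V are the output weights v_i."""
    z = g_in(x)
    acts = np.array([phi(z, u) for u in U])   # one scalar activation per unit i
    return g_out(acts @ V)                    # sum_i phi(z; u_i) * v_i

phi_linear    = lambda z, u: u @ z            # hypothetical unit, degree 1 in u_i
phi_quadratic = lambda z, u: (u @ z) ** 2     # hypothetical unit, degree 2 in u_i

rng = np.random.default_rng(0)
d_in, d_out, width = 5, 3, 8
x = rng.normal(size=d_in)
U = 0.1 * rng.normal(size=(width, d_in))
V = 0.1 * rng.normal(size=(width, d_out))
print(layer(x, U, V, phi_linear))     # conjectured separation: between directions
print(layer(x, U, V, phi_quadratic))  # conjectured separation: between units
```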

Background

The paper’s dynamics analysis focuses on two-layer networks but documents saddle-to-saddle behavior in deeper models. The authors posit a rule of thumb: the order of dependence on $u_i$ within a layer should determine whether timescale separation occurs across directions (linear) or across units (quadratic), and how width and data spectra modulate plateau durations.
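The "between directions" case can be illustrated with a standard toy setting (our own illustrative assumption, not an experiment from the paper): a two-layer linear network trained from small initialization on a teacher map with well-separated singular values, where the loss is known to drop in stages, roughly one plateau per direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 16                                   # input/output dim, hidden width
svals = np.array([4.0, 1.0, 0.25])             # well-separated teacher spectrum (assumed)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
W_star = U @ np.diag(svals) @ V.T              # teacher map; inputs assumed whitened

W1 = 1e-3 * rng.normal(size=(h, d))            # small init triggers stage-like learning
W2 = 1e-3 * rng.normal(size=(d, h))
lr, steps = 0.02, 4000

for t in range(steps + 1):
    E = W2 @ W1 - W_star                       # residual of the end-to-end map
    if t % 250 == 0:
        learned = np.linalg.svd(W2 @ W1, compute_uv=False)
        print(f"step {t:4d}  loss {0.5 * np.sum(E**2):8.4f}  svals {np.round(learned, 3)}")
    # gradient descent on 0.5 * ||W2 W1 - W_star||_F^2
    W2, W1 = W2 - lr * (E @ W1.T), W1 - lr * (W2.T @ E)
```

The printed singular values of the end-to-end map rise one at a time, with plateau lengths set by the teacher spectrum, mirroring the "between directions" separation attributed to layers that are linear in $u_i$.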

A proof would provide a principled predictor for learning behavior in deep architectures, including transformers and convolutional networks, linking layer nonlinearity to stage-like dynamics and scaling trends.

References

Although a general treatment of deep network dynamics is beyond the scope of this paper, we propose a conjecture for predicting which type of timescale separation (between directions or units) arises within a layer of a deep network. We conjecture that the order of the activation function $\phi(g_{\mathrm{in}}(x); u_i)$, whether it is linear or quadratic in $u_i$, continues to predict learning behaviors, including the type of timescale separation and the effects of width and data distribution.

Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures (2512.20607 - Zhang et al., 23 Dec 2025) in Section 7 — Deep networks