Timescale separation in higher-degree homogeneous activations

Determine whether two-layer networks whose unit activation is a homogeneous polynomial of degree p > 2 in the unit weights exhibit a timescale separation between units under gradient flow from small initialization, and whether this separation is stronger than in the quadratic (p = 2) case.

Background

Section 5.2 of Zhang et al. (2025) analyzes two-layer networks whose activation is quadratic in the unit weights and proves that differences in initialization produce a timescale separation between units. The authors hypothesize that a similar, possibly stronger, effect holds for homogeneous polynomials of higher degree, but they do not provide a proof.

Confirming this conjecture would generalize the proposed mechanism for saddle-to-saddle dynamics beyond quadratic activations and clarify how activation degree governs unit growth rates and plateau formation.
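
As a rough intuition for why the degree might matter, the sketch below uses a toy scalar reduction that is not taken from the paper: it assumes each unit's norm r follows dr/dt = c * r^(p-1) near the origin, with the same hypothetical rate c for every unit, and it ignores directional dynamics and coupling between units. The names escape_time and escape_ratio, the target scale r_target, and the initial scales are illustrative assumptions. Under these assumptions the escape time has a closed form, so no numerical integration is needed.

```python
import math

def escape_time(r0, p, c=1.0, r_target=1.0):
    """Time for dr/dt = c * r**(p - 1) to grow from r0 to r_target.

    Toy model only (an assumption, not the paper's dynamics).
    p == 2 gives exponential growth; p > 2 gives finite-time blow-up.
    """
    if p == 2:
        # r(t) = r0 * exp(c * t)
        return math.log(r_target / r0) / c
    k = p - 2
    # r(t)**(-k) = r0**(-k) - k * c * t
    return (r0 ** (-k) - r_target ** (-k)) / (k * c)

def escape_ratio(p, r_fast=2e-3, r_slow=1e-3):
    """Ratio of escape times for two hypothetical units whose
    initial scales differ by a factor of two."""
    return escape_time(r_slow, p) / escape_time(r_fast, p)

if __name__ == "__main__":
    for p in (2, 3, 4):
        print(f"p={p}: slow/fast escape-time ratio = {escape_ratio(p):.4f}")
```

With these illustrative values the ratio is about 1.11 for p = 2, about 2.00 for p = 3, and about 4.00 for p = 4. In this toy model the p = 2 gap is only logarithmic in the initial scales, whereas for p > 2 the ratio scales like (r_fast / r_slow)^(p-2), which is consistent with the conjectured stronger separation but does not prove it for the full network dynamics.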

References

If φ(x; u) is a homogeneous polynomial of degree p > 2 in the weights u, we conjecture that there is still a timescale separation between units, possibly even stronger than the quadratic case.

Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures (2512.20607 - Zhang et al., 23 Dec 2025) in Section 5.2 (Quadratic case: timescale separation between units) — Higher-order polynomial activation