Activation-order predictor for timescale separation and scaling effects in deep layers
Establish whether, in deep networks with layers of the form $f(x; \theta_{1:H}) = g_{\mathrm{out}}\!\left(\sum_i \phi(g_{\mathrm{in}}(x); u_i)\, v_i\right)$, the order of the layer's dependence on the unit-specific weights $u_i$ in $\phi(g_{\mathrm{in}}(x); u_i)$ (linear versus quadratic) predicts the type of timescale separation (between directions versus between units) in that layer's gradient-flow dynamics and determines how layer width and data distribution affect learning plateaus.
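To make the linear-versus-quadratic distinction concrete, one illustrative way to instantiate the two cases (these particular forms of $\phi$ are assumptions for exposition, not taken from the source) is, with $z = g_{\mathrm{in}}(x)$:

$$\phi_{\mathrm{lin}}(z; u_i) = u_i^{\top} z \quad \text{(linear in } u_i\text{)}, \qquad \phi_{\mathrm{quad}}(z; u_i) = \big(u_i^{\top} z\big)^{2} \quad \text{(quadratic in } u_i\text{)}.$$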
Sponsor
References
Although a general treatment of deep network dynamics is beyond the scope of this paper, we propose a conjecture for predicting which type of timescale separation (between directions or between units) arises within a layer of a deep network. We conjecture that the order of the activation function $\phi(g_{\mathrm{in}}(x); u_i)$, i.e., whether it is linear or quadratic in $u_i$, continues to predict learning behaviors, including the type of timescale separation and the effects of width and data distribution.
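A minimal numerical sketch of how one might probe this conjecture follows; everything in it is an illustrative assumption rather than the paper's setup: $g_{\mathrm{in}}$ and $g_{\mathrm{out}}$ are taken as the identity, $\phi$ is taken as $u_i^{\top}x$ (linear case) or $(u_i^{\top}x)^2$ (quadratic case), and the teacher-student data, width, and learning rate are arbitrary. Tracking the loss under a discretized gradient flow lets one look for plateaus in each case.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grads(U, v, X, y, order):
    """Squared loss and gradients for f(x) = sum_i phi(u_i . x) * v_i,
    with phi(p) = p (order 1, linear in u_i) or phi(p) = p**2 (order 2, quadratic in u_i)."""
    pre = X @ U.T                                   # (n, width): u_i . x for every unit i
    phi = pre if order == 1 else pre ** 2
    resid = phi @ v - y                             # g_out taken as the identity (assumption)
    n = len(y)
    grad_v = phi.T @ resid / n
    dphi = np.ones_like(pre) if order == 1 else 2.0 * pre
    grad_U = ((resid[:, None] * v[None, :] * dphi).T @ X) / n
    return 0.5 * np.mean(resid ** 2), grad_U, grad_v

# Toy teacher-student data (illustrative data distribution, not the paper's).
dim, width, n = 10, 50, 500
X = rng.standard_normal((n, dim))
teacher = rng.standard_normal(dim)
y = X @ teacher + 0.5 * (X @ teacher) ** 2          # target with linear and quadratic parts

for order in (1, 2):
    U = 0.01 * rng.standard_normal((width, dim))    # unit-specific weights u_i (rows)
    v = 0.01 * rng.standard_normal(width)           # readout weights v_i
    lr, losses = 0.01, []
    for step in range(10_000):                      # Euler discretization of gradient flow
        loss, gU, gv = loss_and_grads(U, v, X, y, order)
        U -= lr * gU
        v -= lr * gv
        losses.append(loss)
    print(f"order {order}: loss at steps 0 / 2000 / final = "
          f"{losses[0]:.3f} / {losses[2000]:.3f} / {losses[-1]:.3f}")
```

Rerunning the loop with different `width` values or a non-isotropic input covariance would be the corresponding check of the conjectured width and data-distribution effects.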