- The paper shows that SGD training often keeps weights near their initial random configuration, challenging conventional assumptions.
- It finds that larger networks require smaller weight deviations than smaller networks to reach comparable loss values.
- The study identifies a critical network width that separates trainable from untrainable regimes, mapped out in a phase-diagram-like analysis of width versus training time.
Effect of Initial Weight Configuration on Neural Network Training
The paper "Effect of the initial configuration of weights on the training and function of artificial neural networks" addresses the impact of initial weight configurations on the training dynamics and eventual performance of neural networks. This paper focuses on two-hidden-layer ReLU networks trained via Stochastic Gradient Descent (SGD) and provides a detailed statistical characterization of how these weights evolve from their initial random configuration.
Key Findings
The research reveals that successful SGD training often leaves networks close to their initial weight configuration. This observation challenges the common assumption that large weight adjustments are necessary to achieve good performance. The paper systematically examines how the distribution of weight deviations evolves during training and observes a pronounced increase in deviations in the overfitting regime, coinciding with abrupt changes in the loss.
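As a concrete illustration, the sketch below trains a two-hidden-layer ReLU network with SGD on toy data while tracking how far the weights drift from their random initialization. This is a minimal, assumed setup, not the paper's exact protocol: the layer sizes, learning rate, synthetic data, and the relative L2 deviation metric are all illustrative choices.

```python
# Minimal sketch (assumed setup, not the paper's protocol): a two-hidden-layer
# ReLU network trained with SGD while tracking the relative distance of the
# weights from their initialization.
import torch
import torch.nn as nn

def make_net(width, in_dim=20, out_dim=2):
    # Two hidden ReLU layers of equal width; the specific dimensions are illustrative.
    return nn.Sequential(
        nn.Linear(in_dim, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, out_dim),
    )

def relative_deviation(net, init_params):
    # Pooled relative L2 distance ||w_t - w_0|| / ||w_0|| over all parameters
    # (an assumed summary metric; the paper characterizes deviations statistically).
    num = sum((p.detach() - p0).pow(2).sum().item()
              for p, p0 in zip(net.parameters(), init_params))
    den = sum(p0.pow(2).sum().item() for p0 in init_params)
    return (num / den) ** 0.5

torch.manual_seed(0)
net = make_net(width=128)
init_params = [p.detach().clone() for p in net.parameters()]
opt = torch.optim.SGD(net.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

# Toy, linearly separable classification data; a stand-in for a real benchmark.
x = torch.randn(1024, 20)
y = (x[:, 0] > 0).long()

for epoch in range(201):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
    if epoch % 50 == 0:
        print(f"epoch {epoch:3d}  loss {loss.item():.4f}  "
              f"deviation from init {relative_deviation(net, init_params):.4f}")
```

Logging the deviation alongside the loss is enough to reproduce, in miniature, the kind of trajectory the paper analyzes: whether training converges while the weights stay near their starting values, or whether loss and deviation jump together.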
Key insights include:
- Proximity to Initial Configuration: The ability of SGD to efficiently locate local minima appears restricted to regions near the initial random configuration.
- Influence of Network Size: Larger networks require smaller deviations from their initial configuration than smaller networks do to reach a comparable loss value, suggesting that larger networks naturally find minima closer to their starting point.
- Trainability and Untrainability Regimes: The research identifies a critical network width that separates trainable from untrainable networks. Networks below this threshold became untrainable during training, evidenced by substantial weight deviations and increasing loss.
- Phase Transition Mimicry: The paper constructs a diagram of training regimes across network widths and training times that resembles a phase diagram, identifying regions of trainability and untrainability; sufficiently wide networks remain in the trainable regime indefinitely. A rough numerical sketch of such a width sweep is given after this list.
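Continuing the sketch above (and reusing its `make_net`, `relative_deviation`, `loss_fn`, `x`, and `y`), the loop below repeats training for several hidden-layer widths and records the final loss and deviation. Comparing these across widths, and across different training durations, gives a rough, assumption-laden proxy for the trainability map the paper draws; it is not a reproduction of its phase diagram.

```python
# Width sweep: same toy task and training loop as above, repeated for
# several hidden-layer widths (values chosen arbitrarily for illustration).
results = []
for width in [4, 16, 64, 256]:
    torch.manual_seed(0)
    net = make_net(width)
    init_params = [p.detach().clone() for p in net.parameters()]
    opt = torch.optim.SGD(net.parameters(), lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
    results.append((width, loss.item(), relative_deviation(net, init_params)))

# In line with the paper's qualitative finding, narrow networks are expected to
# end training with larger deviations and higher loss, while wide networks stay
# close to initialization at low loss; the exact behaviour depends on the task
# and hyperparameters.
for width, final_loss, dev in results:
    print(f"width {width:4d}  final loss {final_loss:.4f}  deviation {dev:.4f}")
```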
Implications
The findings imply that initialization plays a more pivotal role in SGD training than previously assumed. The observation that trained networks remain near their initial weight values suggests that the initial configuration itself encodes significant information and can strongly shape the learning dynamics.
From a theoretical standpoint, the identified proximity restriction could inform efforts to improve initialization strategies or develop new optimization techniques that leverage this insight. Practically, understanding these dynamics might lead to more efficient neural network designs and training methodologies, particularly for resource-constrained environments.
Future Directions
Future research could extend these findings by exploring how these dynamics manifest in deeper architectures or different activation functions. It would also be beneficial to investigate the interplay between initial configurations and other aspects of network training, such as regularization or learning rate adaptation.
In conclusion, this paper deepens our understanding of the subtle yet significant influence of initial weight configurations on neural network training, shedding light on the intertwined relationship between initialization, trainability, and optimization dynamics.