Understanding Gradient Descent through the Training Jacobian (2412.07003v2)

Published 9 Dec 2024 in cs.LG

Abstract: We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values. Our analysis reveals low-dimensional structure in the training process which is dependent on the input data but largely independent of the labels. We find that the singular value spectrum of the Jacobian matrix consists of three distinctive regions: a "chaotic" region of values orders of magnitude greater than one, a large "bulk" region of values extremely close to one, and a "stable" region of values less than one. Along each bulk direction, the left and right singular vectors are nearly identical, indicating that perturbations to the initialization are carried through training almost unchanged. These perturbations have virtually no effect on the network's output in-distribution, yet do have an effect far out-of-distribution. While the Jacobian applies only locally around a single initialization, we find substantial overlap in bulk subspaces for different random seeds. Our code is available at https://github.com/EleutherAI/training-jacobian

Summary

  • The paper analyzes neural network training geometry by examining the singular value spectrum of the training Jacobian, which links initial and final parameters under gradient descent.
  • It identifies three spectral regions with distinct effects on parameter perturbations during training: chaotic (singular values > 1), bulk (singular values ≈ 1), and stable (singular values < 1).
  • Experiments show a prevalent and robust bulk region, suggesting that SGD operates within a low-dimensional subspace largely determined by data structure, not initialization or labels.

Understanding Gradient Descent Through the Training Jacobian

The paper offers an analytical perspective on the geometry of neural network training, focusing on the singular value spectrum of what the authors term the training Jacobian: the Jacobian of the final, trained parameters with respect to the initial parameters under gradient descent.
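A compact way to state this (the symbols theta_0, theta_T, J, and delta are introduced here for clarity rather than taken verbatim from the paper) is:

```latex
J = \frac{\partial \theta_T}{\partial \theta_0},
\qquad
\theta_T(\theta_0 + \delta) \;\approx\; \theta_T(\theta_0) + J\,\delta
\quad \text{for a small perturbation } \delta .
```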

The research identifies three distinct regions within the singular value spectrum of the training Jacobian: chaotic, bulk, and stable. The chaotic region consists of singular values significantly greater than one, corresponding to directions along which initialization perturbations are amplified, a signature of non-convexity in the optimization landscape. In the bulk region, singular values are extremely close to one and the left and right singular vectors nearly coincide, so perturbations to the initial parameters are carried through training almost unchanged; this region reflects an intrinsic geometry that is largely invariant to labels and initialization but sensitive to the data distribution. Finally, the stable region contains singular values less than one, consistent with conventional convex optimization, where the optimizer dampens parameter perturbations.
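As a rough illustration of how these regions might be identified from a computed Jacobian (the function name `partition_spectrum` and the tolerance `bulk_tol` are illustrative choices, not the paper's exact criterion), one could partition the spectrum and check the left/right singular-vector alignment that characterizes the bulk:

```python
import numpy as np

def partition_spectrum(J, bulk_tol=1e-3):
    """Split the singular spectrum of a training Jacobian J into the
    chaotic (>1), bulk (~1), and stable (<1) regions described above.
    The tolerance bulk_tol is illustrative, not the paper's criterion."""
    U, s, Vt = np.linalg.svd(J)
    chaotic = s > 1.0 + bulk_tol
    bulk = np.abs(s - 1.0) <= bulk_tol
    stable = s < 1.0 - bulk_tol

    # For bulk directions the left and right singular vectors nearly
    # coincide, so |u_i . v_i| should be close to 1 there.
    alignment = np.abs(np.einsum("ij,ij->j", U, Vt.T))
    return {
        "n_chaotic": int(chaotic.sum()),
        "n_bulk": int(bulk.sum()),
        "n_stable": int(stable.sum()),
        "bulk_alignment": alignment[bulk],
    }
```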

Experimentally, the authors work with small networks whose training Jacobian can be computed directly using JAX's forward-mode automatic differentiation. Their analysis of an MLP trained on the UCI digits dataset shows that roughly two-thirds of the parameter directions fall into the bulk. Perturbations along these bulk directions have virtually no effect on the network's outputs for in-distribution inputs, yet they do change its behavior far out of distribution, for example on inputs of pure random noise.
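A minimal sketch of that kind of computation follows, under simplifying assumptions: the one-hidden-layer MLP, hyperparameters, and random stand-in data below are placeholders matching only the UCI digits shapes, not the paper's exact configuration.

```python
import jax
import jax.numpy as jnp
import numpy as np

# Toy stand-in: 64-dimensional inputs, 10 classes (the UCI digits shape).
# The data here is random noise purely so the sketch runs end to end;
# real features/labels would be loaded instead.
D_IN, D_H, D_OUT = 64, 16, 10
SIZES = [D_IN * D_H, D_H, D_H * D_OUT, D_OUT]
SPLITS = np.cumsum(SIZES)[:-1]
X = jax.random.normal(jax.random.PRNGKey(1), (200, D_IN))
y = jax.random.randint(jax.random.PRNGKey(2), (200,), 0, D_OUT)

def unflatten(theta):
    w1, b1, w2, b2 = jnp.split(theta, SPLITS)
    return w1.reshape(D_IN, D_H), b1, w2.reshape(D_H, D_OUT), b2

def train(theta0, lr=0.1, steps=100):
    """Map a flat initial parameter vector theta0 to the trained
    parameters theta_T via full-batch gradient descent on cross-entropy."""
    def loss(theta):
        w1, b1, w2, b2 = unflatten(theta)
        logits = jnp.tanh(X @ w1 + b1) @ w2 + b2
        return -jnp.mean(jax.nn.log_softmax(logits)[jnp.arange(len(y)), y])

    def step(theta, _):
        return theta - lr * jax.grad(loss)(theta), None

    theta_T, _ = jax.lax.scan(step, theta0, None, length=steps)
    return theta_T

# Training Jacobian J = d(theta_T)/d(theta_0) via forward-mode autodiff,
# then its singular value spectrum. Feasible only for small parameter counts.
theta0 = 0.1 * jax.random.normal(jax.random.PRNGKey(0), (sum(SIZES),))
J = jax.jacfwd(train)(theta0)
s = jnp.linalg.svd(J, compute_uv=False)
```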

Results indicate that the bulk subspace overlaps substantially across different random initializations and remains largely unchanged even when training with shuffled labels. This is consistent with the hypothesis that the bulk is determined by the structure of the input data, and that gradient descent effectively operates within a comparatively low-dimensional subspace. Moreover, training constrained to the chaotic and stable subspaces still optimizes successfully, supporting the view that these regions, rather than the bulk, carry the directions that matter most for the training dynamics.
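One standard way to quantify such subspace agreement is via principal angles; the sketch below (the name `subspace_overlap` and the averaging choice are ours, not necessarily the paper's exact metric) compares bulk bases obtained from two different seeds:

```python
import numpy as np

def subspace_overlap(V1, V2):
    """Overlap between two subspaces given by orthonormal column bases
    V1 and V2.  The singular values of V1^T V2 are the cosines of the
    principal angles: values near 1 mean the subspaces nearly coincide,
    values near 0 mean they are close to orthogonal."""
    cosines = np.linalg.svd(V1.T @ V2, compute_uv=False)
    return cosines.mean()

# Hypothetical usage: V_bulk_seed0 and V_bulk_seed1 hold the right singular
# vectors (as columns) whose singular values lie in the bulk, computed from
# the training Jacobians of two runs with different random seeds.
# overlap = subspace_overlap(V_bulk_seed0, V_bulk_seed1)
```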

The authors also extend the spectral analysis to a larger model, LeNet-5 trained on MNIST. Although computing the singular value decomposition at this scale highlights the computational cost of the approach, a similar bulk region emerges, suggesting that the observations are not an artifact of very small models.
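At larger scales one would typically avoid materializing J at all. The sketch below is a generic matrix-free approach, not taken from the paper's implementation: it exposes J through Jacobian-vector and vector-Jacobian products of a training function (such as `train` above) and hands them to SciPy's iterative SVD to estimate the largest singular values.

```python
import jax
import jax.numpy as jnp
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def top_singular_values(train_fn, theta0, k=20):
    """Estimate the top-k singular values of J = d(train_fn)/d(theta0)
    without forming J.  Each product re-runs training through autodiff,
    so this trades memory for (substantial) compute."""
    n = theta0.size
    _, vjp_fn = jax.vjp(train_fn, theta0)   # reverse-mode closure for J^T u

    def matvec(v):                          # J v via forward mode
        tangent = jnp.asarray(v, theta0.dtype)
        return np.asarray(jax.jvp(train_fn, (theta0,), (tangent,))[1])

    def rmatvec(u):                         # J^T u via reverse mode
        return np.asarray(vjp_fn(jnp.asarray(u, theta0.dtype))[0])

    op = LinearOperator((n, n), matvec=matvec, rmatvec=rmatvec,
                        dtype=np.asarray(theta0).dtype)
    return svds(op, k=k, return_singular_vectors=False)
```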

In conclusion, the paper offers new insight into neural network training by elucidating the roles of the chaotic, bulk, and stable regions of the training Jacobian, and it opens a path toward studying the structure of parameter space that shapes gradient descent. Future work could focus on computing or approximating the training Jacobian for larger models, for instance with randomized linear algebra techniques, to further probe the intrinsic dimensionality that governs the trainability and generalization of neural networks.