
Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity (2106.15933v2)

Published 30 Jun 2021 in stat.ML and cs.LG

Abstract: The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $\sigma^2$ of the parameters at initialization $\theta_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $\gamma$ of the variance $\sigma^2=w^{-\gamma}$ as $w\to\infty$: for large variance ($\gamma<1$), $\theta_0$ is very close to a global minimum but far from any saddle point, and for small variance ($\gamma>1$), $\theta_0$ is close to a saddle point and far from any global minimum. While the first case corresponds to the well-studied NTK regime, the second case is less understood. This motivates the study of the case $\gamma \to +\infty$, where we conjecture a Saddle-to-Saddle dynamics: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum. We support this conjecture with a theorem for the dynamics between the first two saddles, as well as some numerical experiments.

Citations (46)

Summary

  • The paper shows that deep linear networks transition from a linear-convergence (NTK) regime to a Saddle-to-Saddle dynamic depending on the scaling of the initialization variance.
  • The paper develops a theoretical framework illustrating how training dynamics drive networks towards sparse, low-rank global minima in the large-$\gamma$ regime.
  • The paper supports its findings with numerical experiments and theoretical results, paving the way for more efficient and better-generalizing training methods.

Saddle-to-Saddle Dynamics in Deep Linear Networks

Deep Linear Networks (DLNs) have long been considered a valuable model for understanding the theoretical foundations of Deep Neural Networks (DNNs). This paper explores the dynamics of DLNs during training, particularly focusing on the influence of parameter initialization variance on their convergence behavior. Given the pervasive usage of DNNs in machine learning applications, a deeper understanding of their training dynamics could offer significant insights.

Phase Transitions in Initialization

The authors identify a crucial phase transition in DLNs when examining the scaling factor $\gamma$ of the variance $\sigma^2 = w^{-\gamma}$, where $w$ is the network's width. As $w \to \infty$:

  • For $\gamma < 1$, the initialization places the parameters close to a global minimum and far from any saddle point; this corresponds to the well-known Neural Tangent Kernel (NTK) regime, where convergence is linear.
  • Conversely, for $\gamma > 1$, the initial parameters lie near a saddle point and far from any global minimum. This regime is less explored and gives rise to a Saddle-to-Saddle dynamic during training that drives the network towards sparse global solutions (illustrated in the sketch after this list).
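To make the scaling concrete, here is a minimal NumPy sketch (not taken from the paper; the depth, widths, and $\gamma$ values are arbitrary illustrative choices) that uses the Frobenius norm of the end-to-end matrix $W_L \cdots W_1$ at initialization as a rough proxy for how far the initial function sits from the rank-0 saddle:

```python
# Illustrative sketch (not from the paper's code): measure how far the
# end-to-end map A = W_L ... W_1 of a deep linear network sits from the
# zero map (the rank-0 saddle) at initialization, for different gamma.
import numpy as np

def end_to_end_norm(width, depth, gamma, rng):
    """Frobenius norm of the product of `depth` width x width Gaussian
    matrices with entrywise variance width**(-gamma)."""
    sigma = width ** (-gamma / 2.0)
    A = np.eye(width)
    for _ in range(depth):
        A = rng.normal(0.0, sigma, size=(width, width)) @ A
    return np.linalg.norm(A)

rng = np.random.default_rng(0)
depth = 3
for gamma in (0.5, 1.0, 2.0):          # large, critical, and small variance
    norms = [end_to_end_norm(w, depth, gamma, rng) for w in (50, 100, 200, 400)]
    print(f"gamma={gamma}: ||A|| over widths 50..400 ->",
          [f"{n:.3g}" for n in norms])
# For this depth, the norm grows with width when gamma < 1 and collapses
# towards zero when gamma is large -- a rough proxy for starting far from,
# or close to, the saddle at the origin.
```

For this depth, the norm grows with width when $\gamma < 1$ and collapses towards zero when $\gamma$ is large, matching the picture of initializations that start far from, or close to, the saddle at the origin.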

Saddle-to-Saddle Dynamics

Central to the paper is the regime $\gamma \to +\infty$, for which the authors conjecture a novel Saddle-to-Saddle dynamic, supported by a theorem describing the dynamics between the first two saddles. The DLN training trajectory passes through the neighborhoods of saddles corresponding to linear maps of increasing rank, until it reaches a sparse global minimum. This offers a fresh perspective on DLN training, relating it to a greedy low-rank learning procedure that inherently biases solutions towards low-rank configurations; a small numerical sketch of this incremental-rank behavior follows.
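The following sketch (again, not the authors' code) illustrates the qualitative behavior in the simplest setting: a two-layer linear network trained by plain gradient descent from a very small random initialization on a rank-3 linear regression task. The widths, learning rate, and target spectrum are arbitrary choices for illustration; the expected signature is a staircase-shaped loss curve, with the singular values of the end-to-end map $W_2 W_1$ switching on one at a time.

```python
# Illustrative sketch: two-layer linear network W2 @ W1 trained with
# gradient descent from a tiny initialization on a rank-3 target.
# Expected behavior: the loss drops in steps as the singular values of
# the end-to-end map appear one by one (largest first).
import numpy as np

rng = np.random.default_rng(0)
d, n, hidden = 6, 500, 32
target_svals = np.array([3.0, 2.0, 1.0, 0.0, 0.0, 0.0])   # rank-3 target map
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
A_star = U @ np.diag(target_svals) @ V.T

X = rng.normal(size=(d, n))
Y = A_star @ X

sigma = 1e-4                                   # "small initialization" regime
W1 = sigma * rng.normal(size=(hidden, d))
W2 = sigma * rng.normal(size=(d, hidden))

lr = 0.02
for step in range(1201):
    E = W2 @ W1 @ X - Y                        # residual, shape (d, n)
    gW2 = (E @ X.T @ W1.T) / n                 # d/dW2 of 0.5*mean ||E||^2
    gW1 = (W2.T @ E @ X.T) / n                 # d/dW1 of the same loss
    W2 -= lr * gW2
    W1 -= lr * gW1
    if step % 60 == 0:
        loss = 0.5 * np.mean(np.sum(E**2, axis=0))
        svals = np.linalg.svd(W2 @ W1, compute_uv=False)[:4]
        print(f"step {step:5d}  loss {loss:8.4f}  top svals {np.round(svals, 3)}")
```

The plateaus between the drops correspond to the neighborhoods of the intermediate saddles: the network first fits the largest singular mode of the target, then the second, and so on, which is the greedy low-rank picture described above.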

Numerical Analysis and Theoretical Backing

Numerical experiments substantiate the Saddle-to-Saddle conjecture, and a theorem characterizes the path between the first two saddles. For shallow networks, the saddle at the origin is strict, which facilitates the analysis; for deeper networks, the high degeneracy of this saddle complicates the training dynamics. The authors develop theoretical tools in the spirit of the Hartman-Grobman theorem to describe the escape paths, elucidating why DLNs naturally prefer sparse, low-rank solutions.

Implications and Future Directions

This exploration of DLNs provides a theoretical underpinning for the implicit bias towards sparsity observed in networks trained by gradient descent, aligning with the literature on incremental learning in neural networks. Practically, understanding these dynamics could guide more efficient training regimens, reduce the risk of overfitting, and contribute to models that generalize well even when the initialization scale is very small.

Looking forward, these insights could be used to tailor architectures for tasks that benefit from sparsity, saving computational resources and improving generalization in large-scale DLNs. Extending the analysis to non-linear networks is a natural next step and could clarify whether similar incremental, low-rank dynamics govern the training of practical models.
