Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks (2501.09137v2)

Published 15 Jan 2025 in cs.LG, math.OC, and stat.ML

Abstract: We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize -- about $2/\textrm{sharpness}$. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the "Edge of Stability", which induces additional regularization by delaying convergence and may have implications for training more complex models.

Summary

  • The paper demonstrates that gradient descent converges linearly to flatter, lower-norm minima compared to gradient flow in shallow linear networks.
  • GD achieves flatter minima through an implicit regularization that shrinks the parameter imbalance and norms, in contrast with GF, which conserves the imbalance exactly.
  • A key trade-off is identified: larger stepsizes strengthen the implicit regularization and yield flatter minima, but slow the overall convergence.

The paper "Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks" investigates the dynamics of gradient descent (GD) in a depth-2 linear neural network, characterized by a single input and output node. Notably, the authors analyze how GD converges to minimized training loss solutions more effectively than gradient flow (GF) in terms of flatness of minima. Here is an in-depth summary of the key findings and implications:

Key Contributions:

  1. Convergence Rate of Gradient Descent:
    • The authors demonstrate that GD converges to a global minimum at an explicit linear rate, even when the stepsize is large, provided it stays below approximately $2/\text{sharpness}$, where sharpness refers to the largest eigenvalue of the Hessian. The rate depends explicitly on the stepsize $\eta$, the initialization, and the target value $\Phi$. If $\eta$ exceeds the threshold $2/\text{sharpness}$, GD can still converge, but much more slowly.
  2. Location of Convergence:
    • GD converges to minima with lower norm and sharpness than GF. This is attributed to implicit regularization that shrinks the imbalance $Q := \sum_i |a_i^2 - b_i^2|$, a measure of how unevenly weight magnitude is split between the two layers. GF, by contrast, conserves the imbalance $Q$ exactly (a numerical sketch of this contrast follows the list).
  3. Trade-off between Convergence Speed and Regularization:
    • The analysis unveils a critical trade-off: stronger implicit regularization, achieved through larger stepsizes, comes with slower convergence, while faster convergence entails weaker regularization. Larger stepsizes thus encourage flatter minima, which are often associated with better generalization in practice.
  4. Implications for Large Step-size Training:
    • The paper connects the behavior of GD at larger stepsizes to the "Edge of Stability" phenomenon, where convergence can still be maintained even if the largest eigenvalue of the Hessian surpasses $2/\eta$.
  5. Empirical Evidence:
    • Through a series of illustrative experiments, the authors validate their theoretical findings, demonstrating how different initializations and learning rates shape GD's convergence path and the sharpness of the minima it reaches.
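
As a concrete illustration of the first three contributions, the following NumPy sketch (our toy reconstruction, not the authors' code) trains the depth-2 linear model $f(x) = (a \cdot b)\,x$ on the loss $L(a, b) = \tfrac{1}{2}(a \cdot b - \Phi)^2$. The target $\Phi$, the width, the stepsizes, and the seed are arbitrary choices, and gradient flow is approximated by GD with a tiny stepsize:

```python
# Toy sketch (not the authors' code): depth-2 linear model f(x) = (a . b) x,
# loss L(a, b) = 0.5 * (a . b - Phi)^2. Compare large-stepsize GD with a
# gradient-flow proxy (tiny stepsize) and track the imbalance
# Q = sum_i |a_i^2 - b_i^2|. Phi, width, stepsizes, and seed are illustrative.
import numpy as np

rng = np.random.default_rng(0)
Phi = 1.0
a0, b0 = rng.normal(size=4), rng.normal(size=4)

def run(eta, steps):
    a, b = a0.copy(), b0.copy()
    for _ in range(steps):
        r = a @ b - Phi                       # residual
        a, b = a - eta * r * b, b - eta * r * a
    Q = np.sum(np.abs(a**2 - b**2))           # imbalance after training
    # For this loss, the Hessian's top eigenvalue at a global minimum
    # (r = 0) is |a|^2 + |b|^2, so lower norm means lower sharpness.
    sharpness = a @ a + b @ b
    return a @ b - Phi, Q, sharpness

for label, eta, steps in [("GF proxy (eta=1e-3)", 1e-3, 200_000),
                          ("GD       (eta=0.1) ", 0.1, 2_000)]:
    r, Q, s = run(eta, steps)
    print(f"{label}: residual {r:+.1e}, Q {Q:.4f}, sharpness {s:.4f}")
```

Expanding one GD step shows that each coordinate of the imbalance, $a_i^2 - b_i^2$, is multiplied by $(1 - \eta^2 r^2)$: larger stepsizes shrink $Q$ faster, while the GF proxy leaves it essentially unchanged. Since the sharpness of this toy loss at a global minimum equals $\|a\|^2 + \|b\|^2$, the lower-norm solution that GD finds is also the flatter one.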

Technical Insights and Formal Analysis:

  • The dynamics of GD are systematically explored using auxiliary parameters: residuals, imbalances, and parameter norms, offering a more tractable approach than direct parameter tracking.
  • The convergence analysis partitions the trajectory into regions based on the sign and magnitude of the residual, with explicit formulations in each region that predict the dynamics.
  • The mathematical analysis leverages a modified Polyak-Łojasiewicz (PL) condition along the GD trajectory, which yields both the speed of convergence and the characterization of the implicitly regularized solution; the sketch below illustrates the auxiliary-variable approach numerically.
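
To make the auxiliary-variable viewpoint concrete, the sketch below (same toy setup as above; the update rules follow from expanding one GD step and are our illustration, not the paper's exact equations) checks that a GD step closes over the residual $r = a \cdot b - \Phi$, the squared norm $N = \|a\|^2 + \|b\|^2$, and the per-coordinate imbalances $q_i = a_i^2 - b_i^2$:

```python
# Sketch: for L(a, b) = 0.5 * (a . b - Phi)^2, one GD step on (a, b) induces
# closed-form updates of the auxiliary variables (r, N, q), so the dynamics
# can be tracked without the parameters themselves. Illustrative values only.
import numpy as np

rng = np.random.default_rng(1)
Phi, eta = 1.0, 0.1
a, b = rng.normal(size=3), rng.normal(size=3)

r = a @ b - Phi          # residual
N = a @ a + b @ b        # squared parameter norm
q = a**2 - b**2          # per-coordinate imbalance

for _ in range(50):
    a, b = a - eta * r * b, b - eta * r * a   # GD step on the parameters
    # Auxiliary updates (right-hand sides use the pre-step r, N, q).
    r, N, q = ((r + Phi) * (1 + (eta * r)**2) - eta * r * N - Phi,
               N * (1 + (eta * r)**2) - 4 * eta * r * (r + Phi),
               (1 - (eta * r)**2) * q)
    # The auxiliary state matches the state recomputed from the parameters.
    assert np.isclose(r, a @ b - Phi)
    assert np.isclose(N, a @ a + b @ b)
    assert np.allclose(q, a**2 - b**2)

print(f"residual {r:.2e}, norm^2 {N:.4f}, total imbalance {np.abs(q).sum():.4f}")
```

Because $(r, N, q)$ evolves autonomously, questions about convergence speed and implicit regularization reduce to a few scalar recursions instead of the full parameter vector, which is the spirit of the paper's analysis.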

Broader Impact and Future Directions:

  • By distilling the multiplicative parameter dynamics of complex models into a tractable setting, the authors provide critical insights for both theory and practice, suggesting that the choice between GD and GF extends beyond computational efficiency to the quality and generalizability of the solutions found.
  • The exploration of implicit regularization mechanisms serves as a guide for developing more principled strategies in neural network training, particularly in scenarios where minimizing sharpness becomes a priority.
  • The findings also direct future work towards exploring these dynamics in deeper and more intricate neural architectures, potentially incorporating nonlinear elements such as ReLU activations.

Overall, this work makes significant contributions towards understanding the nuances of optimization dynamics in machine learning, offering detailed mathematical and empirical analyses that inform best practices in neural network training.
