- The paper demonstrates that large step sizes in gradient descent yield stable, non-interpolating minima with bounded first-order total variation.
- It shows that gradient descent operating below the edge-of-stability threshold attains near-optimal MSE rates and generalizes robustly under label noise.
- The findings point to the implicit regularization induced by large step sizes as a practical mechanism by which ReLU networks avoid overfitting and can outperform kernel methods.
Essay: Stability and Generalization in Univariate ReLU Networks Trained with Gradient Descent
Neural network generalization under noisy labels is a nuanced area of study, set against the backdrop of high-dimensional, overparameterized models. The paper "Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes" by Dan Qiao et al. seeks to elucidate the generalization behavior of two-layer ReLU networks trained via gradient descent. Specifically, the work addresses the learning dynamics of training with a constant step size in a univariate regression setup with noisy labels.
The central question of the paper is whether gradient descent-trained neural networks can settle into local minima that do not overfit, even in noisy-label scenarios where simpler kernel methods fail. The authors argue that stable minima, in this context, generalize well. These minima arise from training with large learning rates and exhibit bounded first-order total variation, which acts as an implicit bias against overfitting.
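Concretely, the relevant notion of stability here is linear (minima) stability. The following is a sketch of the standard formulation; the paper's precise assumptions may differ in their details:

```latex
% Sketch of the linear stability condition (standard formulation, not a
% verbatim statement from the paper). For a twice-differentiable local
% minimum \theta^* of the empirical loss L, gradient descent with constant
% step size \eta can remain at \theta^* only if
\lambda_{\max}\!\big(\nabla^2 L(\theta^*)\big) \;\le\; \frac{2}{\eta}.
% Hence larger step sizes exclude sharper minima: any minimum whose
% sharpness exceeds 2/\eta is unstable and will be escaped.
```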
Key Findings and Contributions
The paper makes several key contributions:
- Non-Interpolation and Complexity Bound:
- The paper presents a counterexample showing that interpolating noisy labels forces extremely sharp local minima. Consequently, for sufficiently large learning rates, stable minima cannot be interpolating solutions.
- The authors establish a constraint on the first-order total variation (TV1) of the solutions gradient descent stabilizes upon, even without explicit regularization. Specifically, they derive an upper bound on a weighted TV1 as a function of the learning rate η, the noise level σ, and the mean squared error (MSE) against the ground truth (see the TV1 sketch following this list).
- Minima Stability and Generalization:
- The researchers show that gradient descent, when operating below the Edge-of-Stability (BEoS) regime, inherently favors solutions with bounded TV1. This is supported both by theoretical bounds and by empirical evidence.
- By analyzing the convergence properties of gradient descent with large step sizes, they provide a theoretical foundation for why such models generalize well on nonparametric regression tasks in the univariate case.
- Generalization Gap and Optimal MSE Bounds:
- The authors establish that stable minima have a vanishing generalization gap on intervals strictly inside the data support. They show that the learned ReLU networks do not overfit, achieving near-optimal MSE rates and outperforming kernel ridge regression in certain settings.
- They further distinguish their results from existing theories tied to benign overfitting and kernel regimes, emphasizing that gradient descent in non-interpolating scenarios reaches solutions within the "rich" or "feature-learning" regime.
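To make the complexity measure concrete, here is a brief sketch of first-order total variation for a two-layer univariate ReLU network. The notation is ours, and the paper's bound concerns a weighted variant of this quantity:

```latex
% Two-layer univariate ReLU network with m hidden units:
f(x) = \sum_{j=1}^{m} a_j \,\sigma(w_j x + b_j), \qquad \sigma(t) = \max(t, 0).
% The derivative f' is piecewise constant, jumping by a_j w_j at each kink
% x_j = -b_j / w_j, so the first-order total variation of f is
\mathrm{TV}_1(f) := \mathrm{TV}(f') = \sum_{j=1}^{m} |a_j w_j|
% when the kink locations are distinct (coinciding kinks can partially
% cancel, making the sum an upper bound in general).
```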
Implications and Future Directions
The theoretical implications of this paper are significant, particularly in illuminating the conditions under which neural networks generalize effectively in noisy environments. The framework established for analyzing stability and implicit bias applies directly to simple ReLU networks and holds potential for extension to deeper architectures and more complex datasets.
The practical implications are notable as well: for practitioners, the emphasis on large step sizes during training offers guidance on avoiding overfitting in noisy-data scenarios, particularly where traditional remedies such as weight decay or early stopping become cumbersome or insufficient. A minimal experiment along these lines is sketched below.
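The following is an illustrative sketch, not the authors' code: it fits a two-layer univariate ReLU network to noisy labels with full-batch gradient descent at a small versus a large constant step size. All hyperparameters (network width, noise level, step sizes, target function) are our own assumptions. In line with the paper's thesis, one would expect the small step size to interpolate the noise while the large step size plateaus near the noise floor, though exact behavior depends on the setup.

```python
# Illustrative sketch (not the authors' code): full-batch gradient descent
# on a two-layer univariate ReLU network fit to noisy labels, comparing a
# small and a large constant step size.
import numpy as np

def train(eta, steps=40000, n=40, m=100, sigma=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(-1.0, 1.0, n))
    y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n)  # noisy labels
    w = rng.normal(size=m)                 # hidden weights
    b = rng.normal(size=m)                 # hidden biases
    a = rng.normal(size=m) / np.sqrt(m)    # output weights
    for _ in range(steps):
        h = np.maximum(np.outer(x, w) + b, 0.0)   # hidden activations
        r = h @ a - y                             # residuals
        mask = (h > 0).astype(float)              # ReLU gates
        ga = (2 / n) * h.T @ r                    # grad of MSE wrt a
        gw = (2 / n) * (mask * a).T @ (r * x)     # grad of MSE wrt w
        gb = (2 / n) * (mask * a).T @ r           # grad of MSE wrt b
        a, w, b = a - eta * ga, w - eta * gw, b - eta * gb
        if not np.isfinite(a).all():
            return np.inf                         # step size too large: diverged
    return np.mean((np.maximum(np.outer(x, w) + b, 0.0) @ a - y) ** 2)

# A GD-stable minimum must have sharpness <= 2/eta, so a larger eta excludes
# the sharp minima that interpolate the noise (noise variance sigma^2 = 0.09).
# Expectation under the paper's thesis: small eta drives train MSE toward 0,
# while large eta keeps it near sigma^2.
for eta in (0.005, 0.05):
    print(f"eta={eta}: final train MSE = {train(eta):.4f}")
```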
Critiques and Limitations
While the paper provides substantial theoretical backing and numerical validation, certain areas remain open for further investigation. The linear stability and BEoS assumptions rely heavily on twice-differentiability, which may not hold, or may be difficult to guarantee, in more complex models or architectures. Additionally, the results focus on univariate inputs, and extending these findings to higher-dimensional data remains an essential yet challenging task. The authors themselves note computational caveats: without guarantees of efficient convergence to stable minima, the practical applicability to deep learning remains partly theoretical.
Conclusion
The research presented by Qiao et al. delivers substantial insight into the generalization mechanisms of gradient descent-trained neural networks under noisy labels. By demonstrating that stable minima induced by large step sizes achieve both empirical and theoretical generalization guarantees, the paper builds a nuanced understanding of implicit biases in neural network training. These findings not only highlight limitations of kernel-based methods and benign overfitting theory but also lay the groundwork for more robust training methodologies in the presence of label noise. The prospects for future research are promising, with potential explorations into multivariate inputs, deeper networks, and adaptive learning-rate mechanisms.