Benign overfitting in leaky ReLU networks with moderate input dimension (2403.06903v3)

Published 11 Mar 2024 in cs.LG and stat.ML

Abstract: The problem of benign overfitting asks whether it is possible for a model to perfectly fit noisy training data and still generalize well. We study benign overfitting in two-layer leaky ReLU networks trained with the hinge loss on a binary classification task. We consider input data that can be decomposed into the sum of a common signal and a random noise component, which lie on mutually orthogonal subspaces. We characterize conditions on the signal-to-noise ratio (SNR) of the model parameters that give rise to benign versus non-benign (or harmful) overfitting: in particular, if the SNR is high then benign overfitting occurs; conversely, if the SNR is low then harmful overfitting occurs. We attribute both benign and non-benign overfitting to an approximate margin maximization property and show that leaky ReLU networks trained on the hinge loss with gradient descent (GD) satisfy this property. In contrast to prior work, we do not require the training data to be nearly orthogonal. Notably, for input dimension $d$ and training sample size $n$, while results in prior work require $d = \Omega(n^2 \log n)$, here we require only $d = \Omega(n)$.
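The setup in the abstract lends itself to a small simulation. The sketch below is illustrative only, not the paper's experiment or parameter regime: it generates signal-plus-orthogonal-noise data of the kind described, flips a fraction of the training labels, and trains a two-layer leaky ReLU network on the hinge loss with gradient descent. The concrete values of the dimension, sample size, width, signal norm, leaky-ReLU slope, step size, and label-noise rate are assumptions chosen for the demo.

```python
# Minimal sketch of the data model and training setup described in the abstract.
# All concrete values (d, n, width, signal norm, slope, step size, noise rate)
# are illustrative assumptions, not the paper's parameter regime.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n, m = 500, 100, 50                 # input dimension, sample size, hidden width
noise_rate, lr, steps = 0.1, 0.1, 2000

# Signal direction mu spans the first coordinate; noise lives on the orthogonal
# complement, so x_i = y_i * mu + xi_i with <mu, xi_i> = 0.
mu = torch.zeros(d); mu[0] = 3.0
y_clean = torch.randint(0, 2, (n,)).float() * 2 - 1           # labels in {-1, +1}
xi = torch.randn(n, d); xi[:, 0] = 0.0                        # noise orthogonal to mu
X = y_clean[:, None] * mu + xi

# Flip a fraction of training labels so that interpolating them requires overfitting.
flip = torch.rand(n) < noise_rate
y_train = torch.where(flip, -y_clean, y_clean)

# Two-layer leaky ReLU network; second-layer weights are fixed at +-1/m and only
# the first layer is trained (a common simplification in this line of work).
W = (0.01 * torch.randn(m, d)).requires_grad_()
a = torch.cat([torch.ones(m // 2), -torch.ones(m // 2)]) / m

def net(x):
    return F.leaky_relu(x @ W.t(), negative_slope=0.1) @ a

opt = torch.optim.SGD([W], lr=lr)
for _ in range(steps):
    opt.zero_grad()
    hinge = torch.clamp(1.0 - y_train * net(X), min=0.0).mean()   # hinge loss
    hinge.backward()
    opt.step()

# Benign overfitting: fit the noisy training labels AND classify fresh clean data well.
train_acc = (net(X).sign() == y_train).float().mean().item()
xi_test = torch.randn(n, d); xi_test[:, 0] = 0.0
X_test = y_clean[:, None] * mu + xi_test
test_acc = (net(X_test).sign() == y_clean).float().mean().item()
print(f"fit of noisy train labels: {train_acc:.2f}, clean test accuracy: {test_acc:.2f}")
```

In the high-SNR regime the abstract describes, one would expect such a run to fit the flipped training labels while still classifying fresh clean samples correctly; shrinking the signal norm relative to the noise should push the same run toward harmful overfitting.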

Authors (5)
  1. Kedar Karhadkar (7 papers)
  2. Erin George (6 papers)
  3. Michael Murray (18 papers)
  4. Deanna Needell (155 papers)
  5. Guido Montúfar (40 papers)
Citations (1)

