From Tempered to Benign Overfitting in ReLU Neural Networks (2305.15141v3)
Abstract: Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon has motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the performance is non-optimal yet non-trivial, and degrades as a function of the noise level. However, a theoretical justification of this claim for non-linear NNs has so far been lacking. In this work, we provide several results that aim to bridge these complementary views. We study a simple classification setting with 2-layer ReLU NNs and prove that, under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data to benign in high dimensions. Thus, we show that the input dimension plays a crucial role in the type of overfitting in this setting, which we also validate empirically for intermediate dimensions. Overall, our results shed light on the intricate connections between the dimension, sample size, architecture, and training algorithm on the one hand, and the type of resulting overfitting on the other.
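The empirical validation described above can be pictured with a small experiment: train an overparameterized 2-layer ReLU network to fit noisily labeled data and track the clean test error as the input dimension grows. The sketch below is illustrative only; the data distribution (standard Gaussian inputs with a single relevant coordinate), label-noise rate, network width, and optimizer settings are assumptions chosen for concreteness, not the paper's exact protocol.

```python
# Illustrative sketch (not the paper's exact setup): fit a wide 2-layer ReLU
# network to training labels with a fraction `noise` of flips, then measure
# the clean (noise-free) test error for several input dimensions d.
import torch
import torch.nn as nn

def run_experiment(d, n=200, width=1000, noise=0.2, steps=5000, lr=0.01):
    # Synthetic binary task: x ~ N(0, I_d), clean label = sign of the first
    # coordinate; a `noise` fraction of the training labels is flipped.
    X = torch.randn(n, d)
    y_clean = torch.sign(X[:, 0])
    flip = (torch.rand(n) < noise).float()
    y = y_clean * (1.0 - 2.0 * flip)  # flip labels on a random subset

    # Overparameterized 2-layer ReLU network trained with the logistic loss.
    model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.SoftMarginLoss()  # log(1 + exp(-y * f(x))) for y in {-1, +1}

    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()

    # Train error (should approach 0 once the network interpolates the noisy
    # labels; more steps or a larger width may be needed) and clean test error.
    with torch.no_grad():
        train_err = (torch.sign(model(X).squeeze(-1)) != y).float().mean().item()
        X_te = torch.randn(10000, d)
        y_te = torch.sign(X_te[:, 0])
        test_err = (torch.sign(model(X_te).squeeze(-1)) != y_te).float().mean().item()
    return train_err, test_err

if __name__ == "__main__":
    for d in [1, 2, 10, 100]:
        tr, te = run_experiment(d)
        print(f"d={d:4d}  train error={tr:.3f}  clean test error={te:.3f}")
```

Under the abstract's account, one would expect the clean test error at d = 1 to stay bounded away from zero and to scale with the noise level (tempered overfitting), while for large d it should approach zero even though the flipped training labels are fit (benign overfitting).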