Near-Interpolators: Rapid Norm Growth and the Trade-Off between Interpolation and Generalization (2403.07264v1)
Abstract: We study the generalization capability of nearly-interpolating linear regressors: $\boldsymbol{\beta}$'s whose training error $\tau$ is positive but small, i.e., below the noise floor. Under a random-matrix-theoretic assumption on the data distribution and an eigendecay assumption on the data covariance matrix $\boldsymbol{\Sigma}$, we demonstrate that any near-interpolator exhibits rapid norm growth: for $\tau$ fixed, $\boldsymbol{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|\boldsymbol{\beta}\|_2^2] = \Omega(n^{\alpha})$, where $n$ is the number of samples and $\alpha > 1$ is the exponent of the eigendecay, i.e., $\lambda_i(\boldsymbol{\Sigma}) \sim i^{-\alpha}$. This implies that existing data-independent norm-based bounds are necessarily loose. On the other hand, in the same regime we precisely characterize the asymptotic trade-off between interpolation and generalization. Our characterization reveals that larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization. We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.
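The norm-growth claim lends itself to a quick numerical check. The sketch below is a minimal illustration, not the paper's experimental code: it draws Gaussian data whose covariance has power-law spectrum $\lambda_i = i^{-\alpha}$, tunes a ridge penalty so that the training MSE matches a fixed $\tau$ below the noise floor, and prints the squared norm of the resulting regressor as $n$ grows. All constants ($\alpha = 2$, $\tau = 0.05$, $d = 4000$) and the planted signal are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the norm-growth phenomenon for near-interpolators.
# Assumptions: alpha, tau, d, and the planted signal are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

alpha, tau, d = 2.0, 0.05, 4000                      # eigendecay exponent, target training MSE, ambient dim
lam = np.arange(1, d + 1, dtype=float) ** (-alpha)   # covariance spectrum lambda_i = i^{-alpha}

def near_interpolator_sq_norm(n):
    """Squared l2-norm of the minimum-norm beta whose training MSE equals tau."""
    X = rng.standard_normal((n, d)) * np.sqrt(lam)   # rows ~ N(0, Sigma), Sigma = diag(lam)
    y = X @ np.sqrt(lam) + rng.standard_normal(n)    # arbitrary planted signal + unit noise
    G = X @ X.T                                      # n x n Gram matrix (dual/kernel form)
    eye = np.eye(n)
    # Ridge solutions are the minimum-l2-norm regressors at their training-error level,
    # so binary-search the ridge penalty until the training MSE matches tau.
    lo, hi = 1e-12, 1e6
    for _ in range(60):
        ridge = np.sqrt(lo * hi)                     # geometric midpoint
        a = np.linalg.solve(G + n * ridge * eye, y)  # dual coefficients of ridge solution
        train_mse = np.mean((G @ a - y) ** 2)
        if train_mse > tau:
            hi = ridge                               # over-regularized: shrink the penalty
        else:
            lo = ridge                               # below target error: can afford more penalty
    beta = X.T @ a                                   # primal ridge solution
    return beta @ beta

for n in [100, 200, 400, 800]:
    print(f"n={n:4d}  ||beta||^2 ~ {near_interpolator_sq_norm(n):.1f}")
```

Under these assumptions the printed squared norms should grow super-linearly in $n$, consistent with the $\Omega(n^{\alpha})$ lower bound up to constants and finite-size effects.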