Near-Interpolators: Rapid Norm Growth and the Trade-Off between Interpolation and Generalization (2403.07264v1)

Published 12 Mar 2024 in stat.ML and cs.LG

Abstract: We study the generalization capability of nearly-interpolating linear regressors: $\boldsymbol{\beta}$'s whose training error $\tau$ is positive but small, i.e., below the noise floor. Under a random matrix theoretic assumption on the data distribution and an eigendecay assumption on the data covariance matrix $\boldsymbol{\Sigma}$, we demonstrate that any near-interpolator exhibits rapid norm growth: for $\tau$ fixed, $\boldsymbol{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|\boldsymbol{\beta}\|_{2}^{2}] = \Omega(n^{\alpha})$, where $n$ is the number of samples and $\alpha > 1$ is the exponent of the eigendecay, i.e., $\lambda_i(\boldsymbol{\Sigma}) \sim i^{-\alpha}$. This implies that existing data-independent norm-based bounds are necessarily loose. On the other hand, in the same regime we precisely characterize the asymptotic trade-off between interpolation and generalization. Our characterization reveals that larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization. We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.
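
Below is a minimal simulation sketch of the norm-growth statement above. It is an illustration under assumed settings, not the paper's code or experimental setup: data are drawn with a covariance whose eigenvalues decay as $i^{-\alpha}$, a ridge regressor is tuned by bisection so its training error sits at a fixed $\tau$ below the noise floor, and the squared $\ell_2$-norm of the resulting near-interpolator is recorded as the sample size $n$ grows. All parameter choices (alpha, tau, d, sigma) and helper names are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): squared l2-norm of a near-interpolating
# ridge regressor whose training MSE is held at tau below the noise floor, under
# power-law eigendecay lambda_i ~ i^{-alpha} of the data covariance.
import numpy as np

rng = np.random.default_rng(0)

def near_interpolator_sq_norm(n, d, alpha, tau, sigma=1.0):
    """Return ||beta||_2^2 for a ridge regressor tuned so its training MSE is ~tau."""
    eigvals = np.arange(1, d + 1, dtype=float) ** (-alpha)    # lambda_i ~ i^{-alpha}
    X = rng.standard_normal((n, d)) * np.sqrt(eigvals)        # rows ~ N(0, Sigma)
    beta_star = rng.standard_normal(d) / np.sqrt(d)
    y = X @ beta_star + sigma * rng.standard_normal(n)
    K = X @ X.T                                               # dual (kernel) form since d >> n

    def fit(lam):
        a = np.linalg.solve(K + lam * np.eye(n), y)           # dual coefficients
        return X.T @ a, float(np.mean((y - K @ a) ** 2))      # (beta, training MSE)

    # Bisection on log10(lambda): training MSE increases monotonically with lambda,
    # so search for the lambda whose training error matches the target tau.
    lo, hi = -12.0, 6.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        _, mse = fit(10.0 ** mid)
        if mse > tau:
            hi = mid        # too much regularization: shrink lambda
        else:
            lo = mid        # training error below tau: lambda can grow
    beta, _ = fit(10.0 ** (0.5 * (lo + hi)))
    return float(beta @ beta)

if __name__ == "__main__":
    alpha, tau, d = 1.5, 0.25, 2000     # illustrative values; tau < sigma^2 = 1 (below the noise floor)
    for n in [100, 200, 400, 800]:
        sq_norms = [near_interpolator_sq_norm(n, d, alpha, tau) for _ in range(3)]
        print(f"n={n:4d}  mean ||beta||_2^2 ~ {np.mean(sq_norms):.1f}")
```

For fixed $\tau$, the printed squared norms should increase rapidly with $n$ (on the order of $n^{\alpha}$ per the result above); the exact constants depend on the assumed covariance and noise level.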

