
More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory (2311.14646v4)

Published 24 Nov 2023 in cs.LG and stat.ML

Abstract: In our era of enormous neural networks, empirical progress has been driven by the philosophy that more is better. Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) improves performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in random feature (RF) regression, a class of models equivalent to shallow networks with only the last layer trained. Concretely, we first show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples, provided the ridge penalty is tuned optimally. In particular, this implies that infinite width RF architectures are preferable to those of any finite width. We then proceed to demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory: near-optimal performance can only be achieved when the training error is much smaller than the test error. Grounding our theory in real-world data, we find empirically that standard computer vision tasks with convolutional neural tangent kernels clearly fall into this class. Taken together, our results tell a simple, testable story of the benefits of overparameterization, overfitting, and more data in random feature models.

Citations (12)

Summary

  • The paper establishes that, with an optimally tuned ridge penalty, the test risk of RF regression decreases monotonically with width, so the infinite-width (kernel) limit is optimal.
  • The methodology provides closed-form test-risk estimates showing that more random features and more training samples consistently improve performance.
  • The study reveals that on tasks with powerlaw eigenstructure, overfitting, typically considered undesirable, is in fact required for near-optimal performance, prompting a re-evaluation of model complexity in ML.

More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

James B. Simon and collaborators investigate a prevalent assertion in modern machine learning: that more is always advantageous. Their paper examines the premise that larger models, larger datasets, and more computation generally yield better performance, focusing on random feature (RF) regression, a model class equivalent to shallow networks in which only the last layer is trained. The work provides theoretical backing for the empirical success of overparameterized models by showing how and why infinite width and minimal regularization yield optimal results in RF regression and kernel ridge regression (KRR).

The authors begin with a rigorous analysis of RF regression, showing that the test risk decreases monotonically as the number of features and the number of samples increase, provided the ridge penalty is tuned optimally. An immediate implication is that infinite-width RF models are preferable to any finite-width configuration, reinforcing the claim that "more is better" in terms of feature count. Their analysis further shows that on tasks characterized by powerlaw eigenstructure, RF models achieve near-optimal performance only when the training error is much smaller than the test error, implying that overfitting, traditionally regarded as undesirable, is in fact necessary.
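The monotonicity claim is easy to probe numerically. The sketch below is a toy setup of my own (not the authors' code): ReLU random-feature ridge regression on a synthetic target, with the ridge penalty tuned on a validation split for each width k. Under the paper's result, the tuned test error should only improve as k grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_val, n_test = 20, 300, 300, 2000
w_star = rng.standard_normal(d) / np.sqrt(d)     # fixed (hypothetical) target direction
f = lambda X: np.tanh(X @ w_star)                # toy target function

X_tr, X_va, X_te = (rng.standard_normal((n, d)) for n in (n_train, n_val, n_test))
y_tr, y_va, y_te = f(X_tr), f(X_va), f(X_te)

def rf_features(X, W):
    """ReLU random features: phi(x) = relu(W x) / sqrt(k)."""
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])

def ridge_predict(Phi_tr, y, Phi_out, lam):
    """Fit ridge regression on top of the frozen random features, then predict."""
    k = Phi_tr.shape[1]
    beta = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(k), Phi_tr.T @ y)
    return Phi_out @ beta

for k in (16, 64, 256, 1024):
    W = rng.standard_normal((k, d))              # random, untrained first layer
    Phi_tr, Phi_va, Phi_te = (rf_features(X, W) for X in (X_tr, X_va, X_te))
    # Tune the ridge penalty on the validation split, then report test MSE.
    lams = np.logspace(-8, 2, 25)
    best = min(lams, key=lambda lam: np.mean((ridge_predict(Phi_tr, y_tr, Phi_va, lam) - y_va) ** 2))
    mse = np.mean((ridge_predict(Phi_tr, y_tr, Phi_te, best) - y_te) ** 2)
    print(f"k={k:5d}  tuned lambda={best:.1e}  test MSE={mse:.4f}")
```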

The paper's claims are grounded in real-world data and validated on computer vision tasks using convolutional neural tangent kernels (CNTKs). These experiments align with the theoretical predictions, demonstrating that infinite-feature models deliver their practical benefits by exploiting the structure of powerlaw tasks, in which the kernel eigenvalues and target coefficients decay polynomially. The authors provide evidence that such models attain optimal performance only when the fitting ratio of training error to test error remains substantially less than one, an insight that challenges conventional intuition about overfitting.
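As a rough illustration of what checking for powerlaw eigenstructure involves, the sketch below eigendecomposes an empirical kernel matrix and fits the decay exponent in log-log space. The RBF kernel and Gaussian inputs are placeholders of my own choosing; the paper's experiments use convolutional NTKs on image datasets.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 32))                   # placeholder inputs
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / (2 * X.shape[1]))                    # RBF kernel as a stand-in for a CNTK
eigs = np.sort(np.linalg.eigvalsh(K))[::-1] / len(X)  # empirical kernel eigenvalues
i = np.arange(1, len(eigs) + 1)
keep = slice(4, 300)                                   # fit away from the noisy head and tail
slope, _ = np.polyfit(np.log(i[keep]), np.log(eigs[keep]), 1)
print(f"estimated eigenvalue decay exponent alpha ~ {-slope:.2f}")
```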

The paper makes the case that overfitting is obligatory in certain settings, specifically for tasks with powerlaw eigenstructure. The authors derive a criterion stating that, for near-optimal test performance, the fitting ratio must be much smaller than one, showing that large models with little regularization are demanded by the task's inherent structure rather than being anomalies or exceptions.
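A minimal way to see the criterion in action (again a toy sketch under my own assumptions, not the paper's setup) is to sweep the ridge penalty in kernel ridge regression and record the fitting ratio, train MSE divided by test MSE, alongside the test error. On a sufficiently hard target, the test-optimal ridge tends to sit where this ratio is well below one.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_train, n_test = 10, 400, 2000
w_star = rng.standard_normal(d) / np.sqrt(d)
f = lambda X: np.sign(X @ w_star) * np.abs(X @ w_star) ** 0.5   # hypothetical "rough" target
X_tr, X_te = rng.standard_normal((n_train, d)), rng.standard_normal((n_test, d))
y_tr, y_te = f(X_tr), f(X_te)

def rbf_kernel(A, B, bw=1.0):
    """Gaussian kernel with bandwidth scaled to the input dimension."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bw**2 * d))

K_tr, K_te = rbf_kernel(X_tr, X_tr), rbf_kernel(X_te, X_tr)
for lam in np.logspace(-6, 2, 9):
    alpha = np.linalg.solve(K_tr + lam * np.eye(n_train), y_tr)   # KRR dual weights
    train_mse = np.mean((K_tr @ alpha - y_tr) ** 2)
    test_mse = np.mean((K_te @ alpha - y_te) ** 2)
    print(f"lambda={lam:.1e}  train={train_mse:.4f}  test={test_mse:.4f}  "
          f"fitting ratio={train_mse / test_mse:.3f}")
```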

The authors' contributions include:

  • Derivation of general closed-form estimates for the test risk of RF regression (a toy numerical illustration of the related kernel-regression risk estimate follows this list).
  • Proof that adding features and adding data consistently improve optimally regularized performance.
  • A detailed analysis of obligatory overfitting on powerlaw tasks, revealing that small or zero regularization is optimal.
  • Empirical validation on standard computer vision datasets, confirming that they satisfy the criteria for obligatory overfitting.
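The sketch below evaluates the kind of closed-form "omniscient" risk estimate this line of work builds on, specialized to the kernel (infinite-feature) limit under an assumed powerlaw spectrum. The exponents, truncation, and normalization here are my own assumptions and the notation may differ from the paper's; see the paper for the full RF-regression expressions.

```python
import numpy as np
from scipy.optimize import brentq

alpha, beta = 1.2, 1.2           # hypothetical powerlaw exponents
M = 100_000                      # truncation of the eigenmode sum
idx = np.arange(1, M + 1)
lam = idx ** (-alpha)            # kernel eigenvalues, lambda_i ~ i^{-alpha}
v2 = idx ** (-beta)              # squared target eigencoefficients, v_i^2 ~ i^{-beta}

def kappa(n, delta):
    """Effective regularization: solve n = sum_i lam_i/(lam_i + kappa) + delta/kappa."""
    g = lambda k: np.sum(lam / (lam + k)) + delta / k - n
    return brentq(g, 1e-12, 1e6)

def test_risk(n, delta, noise_var=0.0):
    """Omniscient estimate of KRR test risk for n samples and ridge delta."""
    k = kappa(n, delta)
    L = lam / (lam + k)                   # per-mode "learnabilities"
    E0 = n / (n - np.sum(L ** 2))         # overfitting coefficient
    return E0 * (np.sum((1 - L) ** 2 * v2) + noise_var)

for n in (100, 1000, 10000):
    deltas = np.logspace(-8, 2, 21)
    risks = [test_risk(n, d) for d in deltas]
    best = int(np.argmin(risks))
    print(f"n={n:6d}  best ridge={deltas[best]:.1e}  est. test risk={risks[best]:.4f}")
```

Sweeping the ridge for each sample size gives a quick estimate of how the optimal penalty and the attainable risk move as the data grows.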

The paper situates its findings in the broader machine learning context and argues that, rather than favoring the parsimonious models classical statistics advocated, practitioners should embrace overparameterization as a way of approximating an idealized infinite-width limit. The results suggest a shift in perspective: a model's drive toward interpolation under weak regularization reflects the genuine difficulty of realistic tasks rather than a detour into fitting noise.

Looking forward, these results could inspire new theoretical frameworks for analyzing the behavior and benefits of deep learning models, and there is clear scope for extending these principles to architectures beyond shallow networks. A finer understanding of when and why overfitting becomes necessary could also inform model selection and training practices, guiding the development of models matched to the task structures inherent in realistic data distributions.