- The paper establishes that, with the ridge penalty tuned optimally, the test risk of random feature (RF) regression only decreases as features and samples are added, so the infinite-feature limit, kernel ridge regression, is optimal.
- The methodology rests on closed-form estimates of RF regression test risk, which show that more features and more data never hurt performance once the ridge is tuned.
- The study reveals that overfitting, typically seen as undesirable, is obligatory for near-optimal performance on tasks with powerlaw eigenstructure, prompting a re-evaluation of model complexity in ML.
More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory
James B. Simon and collaborators investigate a prevalent assertion in modern machine learning: more is always advantageous. Their paper articulates the premise that larger models trained on more data with more compute generally achieve superior performance, focusing on random feature (RF) regression as a representative model for shallow networks. The paper provides a theoretical underpinning for the empirical success of overparameterized models by showing how and why infinite width and small regularization are optimal in RF regression and its infinite-width limit, kernel ridge regression (KRR).
The authors begin with a rigorous analysis of RF regression, showing that test risk decreases as the number of features and samples increases, provided the ridge penalty is tuned optimally. An immediate implication is that infinite-width RF models are preferable to any finite-width configuration, reinforcing the assertion that "more is better" in terms of feature count. Moreover, the analysis shows that on tasks with powerlaw eigenstructure, RF models achieve near-optimal performance only when the training error is much smaller than the test error, implying that overfitting, traditionally considered undesirable, is in fact necessary.
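The feature-count claim is easy to probe numerically. Below is a minimal, self-contained sketch (not the authors' code): ReLU random-feature regression on a synthetic teacher task, where for each feature count we sweep the ridge penalty and keep the test risk at the best ridge. The teacher, sizes, and ridge grid are illustrative assumptions; under them, the best-ridge test risk should not worsen as features are added.

```python
# Minimal sketch (not the authors' code): ReLU random-feature regression on a
# synthetic teacher task. For each feature count k we sweep the ridge penalty and
# keep the test risk at the best ridge, illustrating "more features never hurt
# at optimal ridge". All sizes, the teacher, and the ridge grid are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 20, 200, 2000

# Teacher: a fixed wide ReLU network defines the (noise-free) target function.
W_teacher = rng.standard_normal((512, d)) / np.sqrt(d)
a_teacher = rng.standard_normal(512) / np.sqrt(512)

def target(X):
    return np.maximum(X @ W_teacher.T, 0.0) @ a_teacher

def rf_features(X, W):
    """ReLU random features: phi(x) = relu(W x) / sqrt(k)."""
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])

X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train, y_test = target(X_train), target(X_test)

ridges = np.logspace(-8, 2, 21)
for k in [10, 50, 200, 1000, 5000]:
    W = rng.standard_normal((k, d)) / np.sqrt(d)
    Phi_tr, Phi_te = rf_features(X_train, W), rf_features(X_test, W)
    # Dual (kernel) form of ridge regression keeps every solve n_train x n_train:
    # predictions = K_test,train (K_train,train + delta I)^(-1) y_train.
    K_tr = Phi_tr @ Phi_tr.T
    K_te = Phi_te @ Phi_tr.T
    best = np.inf
    for delta in ridges:
        coef = np.linalg.solve(K_tr + delta * np.eye(n_train), y_train)
        best = min(best, np.mean((K_te @ coef - y_test) ** 2))
    print(f"k = {k:5d}   test MSE at optimal ridge = {best:.4f}")
```

The dual (kernel) form keeps every linear solve at the training-set size, so the largest feature counts remain cheap to evaluate.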
The paper's claims are validated on real-world computer vision tasks using convolutional neural tangent kernels. These experiments align with the theoretical predictions, showing that infinite-feature models deliver their benefits by exploiting the structure of powerlaw tasks, in which the kernel eigenvalues and the target's eigencoefficients decay polynomially. The authors also provide evidence that the ridge minimizing test error leaves the fitting ratio, the training error divided by the test error, substantially below one, an observation that challenges conventional intuition about overfitting.
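To make "powerlaw eigenstructure" concrete, here is a rough sketch of how one might estimate the two decay exponents from data: diagonalize an empirical kernel matrix, project the labels onto its eigenvectors, and fit log-log slopes. The Gaussian kernel and synthetic target below are placeholders, not the paper's pipeline, which measures these spectra for convolutional NTKs on vision datasets.

```python
# Rough sketch of estimating powerlaw exponents from data: lambda_i ~ i^(-alpha) for
# the kernel eigenvalues and v_i^2 ~ i^(-beta) for the target's squared
# eigencoefficients. The Gaussian (RBF) kernel and synthetic target are placeholders;
# the paper measures these spectra for convolutional NTKs on vision datasets.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1500, 10
X = rng.standard_normal((n, d))
y = np.tanh(X @ rng.standard_normal(d))            # placeholder target function

# RBF kernel matrix as a stand-in for a neural tangent kernel.
sq_norms = np.sum(X**2, axis=1)
K = np.exp(-(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T) / (2.0 * d))

# Empirical kernel eigenvalues and the target's squared eigencoefficients.
eigvals, eigvecs = np.linalg.eigh(K / n)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending
coeffs_sq = (eigvecs.T @ y) ** 2 / n

# Fit decay exponents by least squares on a log-log plot, over a mid-range of modes
# (the extreme modes are dominated by finite-sample noise).
idx = np.arange(10, 300)
alpha = -np.polyfit(np.log(idx), np.log(eigvals[idx]), 1)[0]
beta = -np.polyfit(np.log(idx), np.log(coeffs_sq[idx]), 1)[0]
print(f"estimated eigenvalue decay exponent  alpha ~ {alpha:.2f}")
print(f"estimated coefficient decay exponent beta  ~ {beta:.2f}")
```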
The paper then argues that overfitting is obligatory in certain settings, specifically for tasks with powerlaw eigenstructure. The authors derive a criterion stating that near-optimal test performance requires a fitting ratio much smaller than one, showing that large models with little regularization are demanded by the task's inherent structure rather than being anomalies or exceptions.
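A small experiment makes the criterion tangible. The sketch below builds a synthetic task with explicit powerlaw eigenvalues and target coefficients, sweeps the ridge in kernel ridge regression, and prints train MSE, test MSE, and their ratio; the exponents, sizes, and noise level are illustrative choices rather than values from the paper.

```python
# Sketch of the fitting-ratio criterion on a synthetic task with explicit powerlaw
# eigenstructure: kernel eigenvalues lambda_i = i^(-alpha), target eigencoefficients
# v_i^2 proportional to i^(-beta). We sweep the ridge delta in kernel ridge regression
# and print train MSE, test MSE, and their ratio. Exponents, sizes, and noise level
# are illustrative choices, not values taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 1.5, 1.2
M, n_train, n_test = 2000, 400, 2000
noise_std = 0.05

i = np.arange(1, M + 1)
lam = i ** (-alpha)                  # kernel eigenvalues
v = i ** (-beta / 2)                 # target eigencoefficients
v /= np.linalg.norm(v)               # normalize target power to 1

def sample(n):
    z = rng.standard_normal((n, M))               # eigenfunction values
    phi = z * np.sqrt(lam)                        # feature map realizing eigenvalues lam
    y = z @ v + noise_std * rng.standard_normal(n)
    return phi, y

phi_tr, y_tr = sample(n_train)
phi_te, y_te = sample(n_test)
K_tr, K_te = phi_tr @ phi_tr.T, phi_te @ phi_tr.T

print(f"{'ridge':>10} {'train MSE':>10} {'test MSE':>10} {'ratio':>8}")
for delta in np.logspace(-6, 1, 8):
    coef = np.linalg.solve(K_tr + delta * np.eye(n_train), y_tr)
    train_mse = np.mean((K_tr @ coef - y_tr) ** 2)
    test_mse = np.mean((K_te @ coef - y_te) ** 2)
    print(f"{delta:10.1e} {train_mse:10.4f} {test_mse:10.4f} {train_mse / test_mse:8.3f}")
```

On tasks like this, the ridge that minimizes test MSE typically sits near the small end of the grid, where the train MSE, and hence the fitting ratio, is far below the test MSE.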
The authors' contributions include:
- Derivation of general closed-form estimates for RF regression test risk (a numerical sketch of the infinite-feature limit of such an estimate follows this list).
- Establishing that, at the optimal ridge, adding features or data never increases test risk.
- Detailed investigation into obligatory overfitting for powerlaw tasks, revealing that small or zero regularization is optimal.
- Empirical validation using standard computer vision datasets, affirming the criteria for obligatory overfitting.
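As a companion to the first bullet above, here is a hedged numerical sketch of a KRR-style "omniscient" risk estimate in the spirit of the eigenlearning line of work the paper builds on. The paper's finite-feature RF estimate involves additional self-consistent quantities and is not reproduced here; the formulas in the comments reflect my reading of that framework, not a verbatim statement from the paper, and the powerlaw exponents and constants are illustrative assumptions.

```python
# Hedged sketch of a KRR-style "omniscient" risk estimate, of the kind the paper's
# closed-form RF estimates reduce to in the infinite-feature limit. Treat the exact
# constants as assumptions, not as the paper's statement.
#   1. solve  sum_i lambda_i / (lambda_i + kappa) + delta / kappa = n  for kappa > 0
#   2. learnabilities  L_i = lambda_i / (lambda_i + kappa)
#   3. test MSE ~ E0 * (sum_i (1 - L_i)^2 v_i^2 + noise_var),  E0 = n / (n - sum_i L_i^2)
#   4. train MSE ~ (delta / (n * kappa))^2 * test MSE   (the "fitting ratio")
import numpy as np
from scipy.optimize import brentq

def krr_risk_estimate(lam, v_sq, n, delta, noise_var=0.0):
    """Closed-form (omniscient) risk estimate for kernel ridge regression: sketch."""
    def constraint(kappa):
        return np.sum(lam / (lam + kappa)) + delta / kappa - n
    # constraint is decreasing in kappa; bracket the root generously.
    kappa = brentq(constraint, 1e-15, np.sum(lam) + delta + 1.0)
    L = lam / (lam + kappa)
    e0 = n / (n - np.sum(L**2))                    # overfitting coefficient
    test_mse = e0 * (np.sum((1.0 - L) ** 2 * v_sq) + noise_var)
    train_mse = (delta / (n * kappa)) ** 2 * test_mse
    return test_mse, train_mse

# Illustrative powerlaw task: lambda_i ~ i^(-1.5), v_i^2 ~ i^(-1.2), n = 500 samples.
i = np.arange(1, 10_001)
lam, v_sq = i ** -1.5, i ** -1.2
for delta in [1e-6, 1e-3, 1e-1, 1e1]:
    te, tr = krr_risk_estimate(lam, v_sq, n=500, delta=delta, noise_var=0.01)
    print(f"delta={delta:8.0e}  test MSE~{te:.4f}  train MSE~{tr:.4f}  fitting ratio~{tr / te:.3f}")
```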
The paper positions its findings within the broader context of machine learning and argues that, rather than favoring the parsimonious models classical statistics advocated, practitioners should embrace overparameterization as a way of approximating an idealized infinite-width limit. The results suggest a shift in perspective: a model's drive toward complexity under weak regularization reflects the genuine difficulty of realistic tasks rather than a detour into fitting noise.
Looking forward, these results could inspire new theoretical frameworks for analyzing the behavior and benefits of deep learning models, and there is scope for exploring how the principles extend to architectures beyond shallow networks. Understanding precisely when and why overfitting becomes necessary can also inform model selection and training practices, potentially guiding the development of models matched to the task structures found in realistic data distributions.