
No Free Lunch From Random Feature Ensembles

Published 6 Dec 2024 in cs.LG, cond-mat.dis-nn, and stat.ML | (2412.05418v1)

Abstract: Given a budget on total model size, one must decide whether to train a single, large neural network or to combine the predictions of many smaller networks. We study this trade-off for ensembles of random-feature ridge regression models. We prove that when a fixed number of trainable parameters are partitioned among $K$ independently trained models, $K=1$ achieves optimal performance, provided the ridge parameter is optimally tuned. We then derive scaling laws which describe how the test risk of an ensemble of regression models decays with its total size. We identify conditions on the kernel and task eigenstructure under which ensembles can achieve near-optimal scaling laws. Training ensembles of deep convolutional neural networks on CIFAR-10 and a transformer architecture on C4, we find that a single large network outperforms any ensemble of networks with the same total number of parameters, provided the weight decay and feature-learning strength are tuned to their optimal values.

Summary

  • The paper demonstrates that, under a fixed parameter budget, a single large random-feature ridge regression model with an optimally tuned ridge parameter achieves lower test risk than any ensemble of smaller models.
  • The paper derives scaling laws that quantify how test risk decays with total model size, and identifies kernel and task eigenstructure conditions under which ensembles approach near-optimal rates.
  • The paper validates its theory with random-feature experiments and with deep networks trained on CIFAR-10 and C4, showing that a single, well-tuned model consistently outperforms ensemble predictors of the same total size.

No Free Lunch from Random Feature Ensembles

The paper provides a comprehensive examination of the trade-off between ensemble size and model dimensionality within the framework of random-feature ridge regression (RFRR). The authors rigorously address whether ensembles offer tangible benefits under a fixed parameter budget and derive theoretical insights applicable to both random-feature models and deep learning. The salient conclusion is that optimal performance, as measured by test risk, is achieved by a single model that uses all available parameters, provided the ridge parameter is optimally tuned. This challenges the conventional wisdom surrounding ensembles: for a fixed total parameter budget, a larger model outperforms any ensemble once hyperparameters are optimally configured.
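The following minimal sketch (not the authors' code) illustrates the core comparison under assumptions chosen purely for illustration: ReLU random features, Gaussian inputs, and a tanh teacher function. A total budget of $N$ features is split among $K$ independently drawn RFRR members whose predictions are averaged, and the ridge parameter is swept for each $K$, mirroring the optimal-tuning condition in the theory.

```python
# Minimal sketch (illustrative, not the authors' code): RFRR ensembles under
# a fixed feature budget N, comparing K = 1 against K > 1 members.
import numpy as np

rng = np.random.default_rng(0)

def rfrr_fit_predict(X_tr, y_tr, X_te, n_feat, ridge, rng):
    """Fit ridge regression on freshly drawn ReLU random features."""
    d = X_tr.shape[1]
    W = rng.standard_normal((d, n_feat)) / np.sqrt(d)  # random projection
    phi_tr = np.maximum(X_tr @ W, 0.0)                 # ReLU random features
    phi_te = np.maximum(X_te @ W, 0.0)
    # Closed-form ridge solution: (Phi^T Phi + lam I)^{-1} Phi^T y
    coef = np.linalg.solve(phi_tr.T @ phi_tr + ridge * np.eye(n_feat),
                           phi_tr.T @ y_tr)
    return phi_te @ coef

d, n_train, n_test, N = 20, 512, 2048, 1024   # N = total feature budget
X_tr = rng.standard_normal((n_train, d))
X_te = rng.standard_normal((n_test, d))
w_star = rng.standard_normal(d)
target = lambda X: np.tanh(X @ w_star / np.sqrt(d))   # toy teacher function
y_tr = target(X_tr) + 0.1 * rng.standard_normal(n_train)
y_te = target(X_te)

for K in (1, 2, 4, 8):                    # split the N-feature budget K ways
    best_mse = np.inf
    for lam in np.logspace(-6, 2, 9):     # sweep the ridge: theory assumes optimal tuning
        # Average the predictions of K independently drawn members
        preds = np.mean([rfrr_fit_predict(X_tr, y_tr, X_te, N // K, lam, rng)
                         for _ in range(K)], axis=0)
        best_mse = min(best_mse, np.mean((preds - y_te) ** 2))
    # NOTE: tuning on the test set is a shortcut for brevity; a real
    # experiment would use a held-out validation split.
    print(f"K={K:>2}  best test MSE ~ {best_mse:.4f}")
```

Under the paper's result, the $K=1$ configuration should attain the lowest optimally tuned test risk, though the gap narrows when the ridge is well tuned for every $K$.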

Core Contributions and Findings

  1. Model Ensemble Trade-offs: The authors address the trade-off between the number of predictors and the size of each predictor in model ensembles, particularly for RFRR. They establish that, for a fixed total parameter count, a single large model is superior to any ensemble of smaller models in terms of minimizing test error.
  2. Scaling Laws: To elucidate the relationship between model size and error, the paper derives scaling laws describing the behavior of RFRR ensembles under source and capacity conditions (sketched in standard notation after this list). These laws delineate regimes in which ensembles can attain near-optimal rates, while confirming the primacy of a single large model at optimal tuning.
  3. Empirical Verification: Experiments with RFRR models on the CIFAR-10 and MNIST datasets validate the theoretical predictions, showing that a single larger model consistently surpasses any ensemble constrained to the same total parameter count, provided the ridge parameter is appropriately tuned.
  4. Deep Learning Generalization: The authors extend their findings to deep convolutional neural networks and transformers, demonstrating empirically that a single large network outperforms ensembles of smaller networks on computer vision (CIFAR-10) and language modeling (C4) tasks when hyperparameters such as weight decay and feature-learning strength (richness) are tuned optimally.
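The source and capacity conditions invoked by the scaling-law analysis are standard in the kernel-regression literature. A sketch in the generic notation follows; the paper's exact exponent and normalization conventions may differ.

```latex
% Generic source/capacity conditions from the kernel-regression literature;
% the paper's exact normalization and exponent conventions may differ.
\begin{align*}
  K(x,x') &= \sum_{k=1}^{\infty} \lambda_k\, \phi_k(x)\, \phi_k(x'),
  \qquad \lambda_k \propto k^{-\alpha}, \ \alpha > 1
  \quad \text{(capacity: spectral decay)} \\
  f^{\ast}(x) &= \sum_{k=1}^{\infty} \bar{w}_k\, \lambda_k^{1/2}\, \phi_k(x),
  \qquad \sum_{k} \bar{w}_k^{2}\, \lambda_k^{\,1-2r} < \infty, \ r > 0
  \quad \text{(source: target smoothness)}
\end{align*}
```

Intuitively, the capacity exponent $\alpha$ measures how quickly the kernel's spectrum decays, while the source exponent $r$ measures how well-aligned the target is with the kernel's top eigenfunctions; faster decay and smoother targets yield steeper power-law decay of the optimally regularized test risk.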

Theoretical and Practical Implications

Theoretically, this paper advances our understanding of variance in learned models and of resource allocation within ensembles. The "no free lunch" result highlights the importance of hyperparameter tuning and reinforces the broader finding that, when the total parameter budget is fixed, increasing individual model size is more effective than ensembling.

Practically, the results offer guidance on model selection, resource allocation, and architecture design in machine learning applications. The finding that a single large model should be preferred under a budget constraint argues for concentrating computational resources on optimizing one model's configuration.
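As a back-of-the-envelope illustration of the budget constraint (a hypothetical helper, not from the paper): if a depth-$L$ MLP's parameter count is dominated by its $L h^2$ hidden-to-hidden weights, splitting a budget $P$ evenly among $K$ members shrinks each member's width by roughly $\sqrt{K}$.

```python
# Hypothetical helper: per-member hidden width for a depth-L MLP when a total
# parameter budget P is split evenly among K ensemble members. The parameter
# count is approximated by L * h**2 (hidden-to-hidden weights only; biases
# and input/output layers ignored).
def member_width(total_params: int, k_members: int, depth: int) -> int:
    return int((total_params / (k_members * depth)) ** 0.5)

for k in (1, 2, 4, 8):
    print(k, member_width(10_000_000, k, depth=4))
# width falls like 1/sqrt(K): 1581, 1118, 790, 559
```

This $\sqrt{K}$ shrinkage is why ensembling is costly at a fixed budget: each member loses capacity faster than prediction averaging can compensate, consistent with the paper's qualitative conclusion.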

Future Directions

The authors discuss extending the analysis to incorporate feature learning, hinting at future work on feature-learning dynamics and their effects on ensemble performance. There is also fertile ground for exploring heterogeneous ensemble strategies in which task specialization among sub-networks might evade the constraints outlined in this work.

In conclusion, this paper presents a thorough analysis of random-feature ensembles, providing valuable insights into model variance and parameter efficiency. The "no free lunch" result is a call for the machine learning community to reevaluate ensemble practices in light of optimal model utilization.
