Optimal Ratio for Data Splitting (2202.03326v1)

Published 7 Feb 2022 in stat.ML and cs.LG

Abstract: It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article we show that the optimal splitting ratio is $\sqrt{p}:1$, where $p$ is the number of parameters in a linear regression model that explains the data well.

Citations (364)

Summary

  • The paper introduces a theoretically derived data splitting ratio (√p:1) that challenges conventional heuristics like the 80:20 rule.
  • It develops a mathematical framework using linear regression and squared error loss to minimize estimation error, validated through simulations and real-world datasets.
  • This approach equips researchers with practical guidance for model validation, endorsing a two-step strategy in which AIC is used to estimate the effective number of parameters.

An Analysis of "Optimal Ratio for Data Splitting"

In the paper "Optimal Ratio for Data Splitting," authors V. Roshan Joseph and Akhil Vakayil address a significant practical question in statistical and machine learning workflows: how much of a dataset should go to training and how much to testing. Practitioners typically resort to heuristics such as the 80:20 split, but Joseph and Vakayil propose a theoretically derived optimal ratio of $\sqrt{p}:1$, where $p$ represents the number of parameters in a linear regression model that adequately explains the data.

Core Findings

To establish this optimal ratio, the authors develop a mathematical framework that minimizes the expected squared estimation error of the generalization error across all possible splits. Unlike previous efforts, which often yield disparate optimal ratios, their derivation leads to the simple $\sqrt{p}:1$ rule. For instance, this translates to a 50:50 split for a model with a single parameter, and it shifts progressively toward more training data as the number of parameters grows, e.g., a 90:10 split for 81 parameters.
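As a concrete illustration (our own sketch, not code from the paper), the $\sqrt{p}:1$ rule converts directly into explicit split sizes; the helper name `sqrt_p_split_sizes` and the use of scikit-learn's `train_test_split` are illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def sqrt_p_split_sizes(n, p):
    """Return (n_train, n_test) for an n-row dataset under the sqrt(p):1 rule."""
    train_frac = np.sqrt(p) / (np.sqrt(p) + 1)  # sqrt(p):1 => train share sqrt(p)/(sqrt(p)+1)
    n_train = int(round(n * train_frac))
    return n_train, n - n_train

print(sqrt_p_split_sizes(1000, 81))  # (900, 100): sqrt(81) = 9 gives a 9:1, i.e. 90:10, split
print(sqrt_p_split_sizes(1000, 1))   # (500, 500): a single parameter gives a 50:50 split

# The same count can drive a random split, e.g. with scikit-learn:
X, y = np.random.randn(1000, 81), np.random.randn(1000)
n_train, _ = sqrt_p_split_sizes(len(X), p=81)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=n_train, random_state=0)
```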

Methodology

The research works in a linear regression setting with the squared error loss to derive the optimal ratio. Proposition 1 lays out the equations for the variance and expectation of the generalization error when a linear model with $p$ parameters is used. The derivations rely on assumptions such as normality of the errors, independence of the rows, and a matched-split criterion.

The optimal split ratio is also examined through an asymptotic analysis, in which the variance terms are reduced to order $\mathcal{O}(1/m^2)$. Using both theoretical analysis and simulation studies, the authors confirm that the proposed ratio holds across various dataset sizes and feature complexities.
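The paper's own simulations are more detailed, but a minimal Monte Carlo sketch of the underlying trade-off can be written as follows. We assume standard normal features, so the true generalization error of a fitted coefficient vector $\hat\beta$ is $\sigma^2 + \|\hat\beta - \beta\|^2$, and we score each split fraction by how closely the test-set estimate tracks the generalization error of the model refit on all the data; this criterion and the constants are our own simplifications, not the paper's exact objective:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 9, 1.0
beta = rng.normal(size=p)  # fixed "true" coefficients

def split_criterion(train_frac, reps=2000):
    """Monte Carlo estimate of E[(g_hat - g_full)^2], where g_hat is the
    test-set estimate of the generalization error and g_full is the true
    generalization error of the model refit on all n rows; with standard
    normal features, g(b) = sigma^2 + ||b - beta||^2."""
    errs = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta + sigma * rng.normal(size=n)
        k = int(n * train_frac)
        b_train, *_ = np.linalg.lstsq(X[:k], y[:k], rcond=None)
        b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
        g_hat = np.mean((y[k:] - X[k:] @ b_train) ** 2)   # test-set estimate
        g_full = sigma**2 + np.sum((b_full - beta) ** 2)  # true error of full fit
        errs.append((g_hat - g_full) ** 2)
    return float(np.mean(errs))

# sqrt(p):1 with p = 9 suggests a 3:1 split, i.e. train_frac = 0.75; one can
# check whether the criterion is indeed smallest near that fraction.
for frac in (0.5, 0.65, 0.75, 0.85, 0.95):
    print(frac, round(split_criterion(frac), 4))
```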

Implications and Practical Strategy

From a practical standpoint, the proposed ratio gives empirical researchers a principled, proof-based guideline in place of arbitrary or heuristic splits. Tailoring the splitting ratio to the number of model parameters makes model validation both efficient and unbiased.

Moreover, the authors suggest a two-step strategy for determining $p$: first, expand the dataset into a comprehensive feature set; then use a model selection criterion such as AIC to estimate the effective number of parameters.
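As a hedged sketch of this two-step strategy (our own construction; the paper does not prescribe this exact code), one can run a greedy forward selection on AIC using statsmodels' OLS, whose fitted results expose an `aic` attribute, and take the size of the selected model as the effective $p$:

```python
import numpy as np
import statsmodels.api as sm

def effective_p_by_aic(X, y):
    """Greedy forward selection on AIC; returns the number of selected
    features plus one (for the intercept) as a rough estimate of the
    effective p."""
    remaining, selected = list(range(X.shape[1])), []
    best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic  # intercept-only model
    improved = True
    while improved and remaining:
        improved = False
        for j in remaining:
            Z = sm.add_constant(X[:, selected + [j]])
            aic = sm.OLS(y, Z).fit().aic
            if aic < best_aic:                # track the best single addition
                best_aic, best_j, improved = aic, j, True
        if improved:
            selected.append(best_j)
            remaining.remove(best_j)
    return len(selected) + 1

# Illustrative run: 20 candidate features, only the first 5 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = X[:, :5] @ np.ones(5) + rng.normal(size=300)
print(effective_p_by_aic(X, y))  # roughly 6; feed this p into the sqrt(p):1 rule
```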

Experimental Validation

The authors validate their results using both simulations and real-world data, such as the concrete compressive strength dataset. Through these, they illustrate how different splits affect model performance across a range of algorithms, confirming the robustness of the suggested ratio.

Future Directions

While the paper focuses on linear regression models, there is potential to extend the methodology to nonparametric methods or models with regularization, which effectively reduces their number of parameters. Additionally, the authors hint at the potential incorporation of effective parameter estimation techniques and the treatment of physics-based models, opening avenues for broader application of this method.

Conclusion

This paper defines a clear, mathematically supported approach for determining the dataset split in model training and validation phases, challenging the inertia of conventionally used ratios. The approach is particularly relevant in an era characterized by colossal datasets and models with numerous parameters, offering a balanced path to minimized prediction error and statistical efficiency.