Cross-validation: what does it estimate and how well does it do it? (2104.00673v4)

Published 1 Apr 2021 in stat.ME, math.ST, stat.CO, stat.ML, and stat.TH

Abstract: Cross-validation is a widely used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' Cp. Next, the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and we show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail.

Citations (195)

Summary

  • The paper reveals that cross-validation estimates average prediction error across models fitted on different datasets, rather than the error of the specific model on the current data.
  • It demonstrates that conventional confidence intervals derived from cross-validation are often miscalibrated due to dependencies between training and testing samples.
  • The authors propose a nested cross-validation scheme to correct variance estimates, offering a more robust framework for predictive model evaluation.

An Analysis of Cross-Validation Estimation and Performance

In their paper entitled "Cross-validation: What Does It Estimate and How Well Does It Do It?", Bates, Hastie, and Tibshirani provide a rigorous exploration of the properties and limitations of cross-validation (CV) as a method for estimating prediction error in statistical modeling. The paper addresses both theoretical and empirical perspectives, offering significant insights into the behavior of CV, particularly its estimand and the accuracy of its confidence intervals under certain conditions.

Cross-validation is a fundamental technique in statistics and machine learning for estimating model prediction accuracy. Traditionally, it is valued for its simplicity and utility in improving estimates over simple holdout methods. The authors, however, focus on a nuanced understanding of what CV actually estimates. Specifically, they delineate the distinction between the expected error of the model fit to the given data and the average error over multiple potential datasets drawn from the same distribution.
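
To make this distinction concrete, it can be written out explicitly (notation in the spirit of the paper, adopted here as an assumption of this summary: $\hat{f}_{X,Y}$ denotes the model fit on training data $(X, Y)$, $\ell$ the loss, and $(x_0, y_0)$ an independent test point from the same population):

$$\mathrm{Err}_{XY} \;=\; \mathbb{E}\!\left[\,\ell\!\left(y_0,\ \hat{f}_{X,Y}(x_0)\right) \,\middle|\, X, Y\,\right], \qquad \mathrm{Err} \;=\; \mathbb{E}\!\left[\mathrm{Err}_{XY}\right].$$

The first quantity is the prediction error of the specific fitted model; the second averages that error over training sets drawn from the same population. The paper's central finding is that the cross-validation estimate tracks Err rather than Err_XY.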

Key Findings

  1. Estimands of Cross-Validation:
    • The authors show that, contrary to common belief, CV does not estimate the prediction error of the specific model fit to the available data. Instead, it estimates the average prediction error across models fit on other training sets drawn from the same population. This result, derived for the linear model fit by ordinary least squares, implies that while CV provides a reliable estimate of average error, it may not accurately reflect the error of the model fit to the specific dataset at hand.
  2. Confidence Intervals:
    • The paper examines the coverage of standard confidence intervals derived from CV. The authors show, both empirically and theoretically, that such intervals often fall below the nominal level because of correlations that the usual variance formula ignores: each data point is used for both training and testing, which induces dependence among the fold-level accuracy measurements.
    • To address this, the authors introduce a nested cross-validation scheme that estimates the variance of the CV error estimate more accurately, leading to intervals with approximately correct coverage; a simplified sketch of both constructions appears after this list.
  3. Implications for Other Methods:
    • The paper extends its analysis to other error estimation techniques, such as data splitting, the bootstrap, and analytic methods like Mallows' Cp, finding the same estimand mismatch. This broader critique suggests that common practices across different methodological frameworks deserve reevaluation.
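
To illustrate the interval constructions discussed in point 2, the sketch below builds the naive K-fold CV interval for the OLS setting (mean of per-point squared errors plus or minus 1.96 standard errors, treating the n errors as independent) alongside a deliberately simplified version of the nested idea: hold out each fold, run CV on the remainder, and use the gap between that inner estimate and the error observed on the held-out fold to gauge the true uncertainty of a CV estimate. This is a minimal sketch under squared-error loss; the function names are ours and the paper's bias corrections are omitted, so it should not be read as the authors' exact nested cross-validation algorithm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def cv_point_errors(X, y, n_folds=10, seed=0):
    """Per-observation squared errors from K-fold cross-validation with OLS."""
    errs = np.empty(len(y))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        fit = LinearRegression().fit(X[train_idx], y[train_idx])
        errs[test_idx] = (y[test_idx] - fit.predict(X[test_idx])) ** 2
    return errs


def naive_cv_interval(X, y, n_folds=10, z=1.96):
    """Standard CV estimate with the usual interval: mean +/- z * sd / sqrt(n).
    The paper shows this interval is too narrow because the n errors are
    correlated (each point's prediction shares training data with the others)."""
    e = cv_point_errors(X, y, n_folds)
    est = e.mean()
    se = e.std(ddof=1) / np.sqrt(len(e))
    return est, (est - z * se, est + z * se)


def nested_cv_interval(X, y, n_folds=10, z=1.96):
    """Toy version of the nested idea: for each held-out fold, run CV on the
    remaining data and compare that inner estimate with the error actually
    observed on the held-out fold.  The spread of these gaps gives a crude
    estimate of the MSE of a CV estimate.  The paper's algorithm adds bias
    corrections that are omitted here for brevity."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=1)
    gaps = []
    for train_idx, test_idx in kf.split(X):
        inner_est = cv_point_errors(X[train_idx], y[train_idx], n_folds - 1).mean()
        fit = LinearRegression().fit(X[train_idx], y[train_idx])
        outer_err = np.mean((y[test_idx] - fit.predict(X[test_idx])) ** 2)
        gaps.append(inner_est - outer_err)
    est = cv_point_errors(X, y, n_folds).mean()
    half_width = z * np.sqrt(np.mean(np.square(gaps)))  # crude MSE of the CV estimate
    return est, (est - half_width, est + half_width)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X @ rng.normal(size=10) + rng.normal(size=200)
    print("naive: ", naive_cv_interval(X, y))
    print("nested:", nested_cv_interval(X, y))
```

With this toy version the nested interval comes out wider than the naive one (without the paper's corrections it is, if anything, conservative), which illustrates the direction of the miscoverage identified in the paper: the naive standard error understates the uncertainty of the CV estimate.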

Theoretical Implications

The profound implication of this work is the potential reevaluation of how model validation metrics are employed across statistical and machine learning practices. The notion that CV does not directly provide information about the model's true prediction error on the current data challenges long-held assumptions in model selection and performance evaluation.

Furthermore, the introduction of their nested cross-validation scheme offers a promising direction for future methodologies, particularly in providing robust error estimates that acknowledge and correct for the bias and variability overlooked by traditional methods.

Practical Implications

Practitioners are prompted to reconsider their use of cross-validation and similar techniques, especially when deploying models where accurate quantification of prediction error is critical. The proposed nested scheme is more computationally intensive, but it provides the correction needed for the miscoverage of traditional confidence intervals and thus supports more valid inferential conclusions.

Future Directions

The research lays a foundation for further investigations into the stability and reliability of model evaluation techniques. Future work could expand on:

  • Quantifying the impact of these findings on real-world datasets, especially with the complex models that are ubiquitous in today’s machine learning practice.
  • Extending similar studies to other forms of cross-validation or resampling schemes in different statistical frameworks.
  • Developing evaluation procedures that incorporate the findings of this work into scalable machine learning systems.

In summary, the paper by Bates, Hastie, and Tibshirani not only challenges the status quo surrounding the use of cross-validation but also provides a pathway toward more accurate and reliable inference in predictive modeling. This refinement of understanding stands to meaningfully impact both methodological research and practical applications in the field.