Extended sample size calculations for evaluation of prediction models using a threshold for classification (2406.19673v1)

Published 28 Jun 2024 in stat.ME

Abstract: When evaluating the performance of a model for individualised risk prediction, the sample size needs to be large enough to precisely estimate the performance measures of interest. Current sample size guidance is based on precisely estimating calibration, discrimination, and net benefit, which should be the first stage of calculating the minimum required sample size. However, when a clinically important threshold is used for classification, other performance measures can also be used. We extend the previously published guidance to precisely estimate threshold-based performance measures. We have developed closed-form solutions to estimate the sample size required to target sufficiently precise estimates of accuracy, specificity, sensitivity, PPV, NPV, and F1-score in an external evaluation study of a prediction model with a binary outcome. This approach requires the user to pre-specify the target standard error and the expected value for each performance measure. We describe how the sample size formulae were derived and demonstrate their use in an example. Extension to time-to-event outcomes is also considered. In our examples, the minimum sample size required was lower than that required to precisely estimate the calibration slope, and we expect this would most often be the case. Our formulae, along with corresponding Python code and updated R and Stata commands (pmvalsampsize), enable researchers to calculate the minimum sample size needed to precisely estimate threshold-based performance measures in an external evaluation study. These criteria should be used alongside previously published criteria to precisely estimate the calibration, discrimination, and net-benefit.

Summary

  • The paper presents closed-form sample size formulae for targeting precise estimates of threshold-based performance measures: accuracy, sensitivity, specificity, PPV, NPV, and the F1-score.
  • It extends previously published sample size criteria for binary outcomes and considers an extension to time-to-event outcomes, with the resulting minimum sample sizes often smaller than those needed to precisely estimate the calibration slope.
  • Application of the methodology to the ISARIC 4C Deterioration model illustrates how the new criteria are used in practice alongside the existing calibration, discrimination, and net benefit criteria.

Extended Sample Size Calculations for Evaluation of Prediction Models Using a Threshold for Classification

The paper "Extended Sample Size Calculations for Evaluation of Prediction Models Using a Threshold for Classification" presents an important advancement in the methodology for evaluating the performance of prediction models, particularly when assessing classifications based upon clinically relevant probability thresholds. This work builds upon previously published guidelines for determining sample sizes in prediction model evaluation, introducing new criteria and methodologies that specifically address threshold-based performance measures.

Overview of Contributions

The authors provide closed-form solutions for calculating the required sample size to target precise estimates of several performance measures: accuracy, specificity, sensitivity (recall), Positive Predictive Value (PPV), Negative Predictive Value (NPV), and the F1-score. These solutions are applicable in external validation studies of prediction models with binary outcomes and require pre-specification of target standard errors and expected values for each performance measure.
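Each of these measures is, at its core, a proportion within a reference group: sensitivity among events, specificity among non-events, PPV and NPV among those classified positive or negative. A minimal sketch of that recipe in Python follows, assuming each measure behaves as a simple binomial proportion; it illustrates the general idea rather than reproducing the paper's exact formulae or the pmvalsampsize implementation.

```python
import math

def group_size_for_proportion(expected_value: float, target_se: float) -> int:
    """Smallest group size n such that sqrt(p * (1 - p) / n) <= target_se."""
    return math.ceil(expected_value * (1 - expected_value) / target_se ** 2)

def total_n_for_sensitivity(expected_sens: float, target_se: float,
                            outcome_prevalence: float) -> int:
    # Sensitivity is a proportion among events only, so the required
    # number of events is inflated by the expected outcome prevalence.
    n_events = group_size_for_proportion(expected_sens, target_se)
    return math.ceil(n_events / outcome_prevalence)

def total_n_for_specificity(expected_spec: float, target_se: float,
                            outcome_prevalence: float) -> int:
    # Specificity is the analogous proportion among non-events.
    n_nonevents = group_size_for_proportion(expected_spec, target_se)
    return math.ceil(n_nonevents / (1 - outcome_prevalence))
```

With illustrative inputs, say an expected sensitivity of 0.9, a target standard error of 0.0255, and an outcome prevalence of 0.4 (assumed values, not the paper's), this yields 139 events and 348 participants in total. The F1-score, being a composite of PPV and sensitivity rather than a single proportion, does not fit this simple recipe; the paper's dedicated formula should be used for it.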

Key contributions include:

  • Threshold-Based Performance Calculation: Extending existing sample size guidance to cover performance measures evaluated at a particular probability threshold. This is especially relevant given the increasing use of machine learning techniques, which often classify at such thresholds (a sketch of the predictive-value case appears after this list).
  • Formulae Derivation and Example Implementation: Derivation of the necessary formulae for these threshold-based performance measures, along with illustrative examples that guide researchers in applying the methods to real-world prediction models.
  • Time-to-Event Extensions: While the primary focus is on binary outcomes, the paper also considers extensions to time-to-event data, accommodating scenarios with and without censoring before the time horizon for prediction.
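For the predictive values, the reference group is defined by the classification itself. Continuing the binomial sketch above (again an illustration under stated assumptions, not the paper's verbatim derivation), the expected proportion classified positive at the threshold follows from the assumed sensitivity, specificity, and outcome prevalence:

```python
import math

def classified_positive_rate(sens: float, spec: float, prevalence: float) -> float:
    # P(classified positive) = expected true-positive rate + false-positive rate.
    return sens * prevalence + (1 - spec) * (1 - prevalence)

def total_n_for_ppv(expected_ppv: float, target_se: float,
                    sens: float, spec: float, prevalence: float) -> int:
    # PPV is a proportion among those classified positive.
    n_pos = math.ceil(expected_ppv * (1 - expected_ppv) / target_se ** 2)
    return math.ceil(n_pos / classified_positive_rate(sens, spec, prevalence))

def total_n_for_npv(expected_npv: float, target_se: float,
                    sens: float, spec: float, prevalence: float) -> int:
    # NPV is a proportion among those classified negative.
    n_neg = math.ceil(expected_npv * (1 - expected_npv) / target_se ** 2)
    return math.ceil(n_neg / (1 - classified_positive_rate(sens, spec, prevalence)))
```

In keeping with the paper's requirement that the user pre-specify an expected value for each performance measure, expected_ppv and expected_npv are supplied directly here; deriving the classification rates from the assumed sensitivity, specificity, and prevalence is a convenience of this sketch.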

Application and Findings

The authors illustrate their methodology with the ISARIC 4C Deterioration model, which predicts in-hospital clinical deterioration in patients hospitalised with COVID-19, comparing the newly proposed sample size requirements with previous criteria focused on calibration, discrimination, and net benefit. Targeting a confidence interval width of 0.1 for each measure, the minimum required sample size was 933, driven by the NPV criterion.
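One step in this example is worth making explicit: the formulae take a target standard error, whereas the worked example specifies a confidence interval width. Assuming a conventional 95% normal-approximation (Wald) interval, the conversion is a one-liner:

```python
# A 95% Wald interval spans about +/- 1.96 standard errors, so a
# target CI width w implies a target SE of w / (2 * 1.96).
target_se = 0.1 / (2 * 1.96)   # ~0.0255 for the example's width of 0.1
```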

The authors highlight that sample sizes required by the new criteria are often smaller than those driven by the precision of the calibration slope. In their example, the NPV criterion required 933 participants, whereas targeting a precise calibration slope required 949.

Implications and Future Directions

The practical implications of this research are substantial: it gives those evaluating prediction models refined tools for determining sample sizes large enough to estimate threshold-based performance measures with adequate precision. This adds robustness to model validation efforts, particularly for models built with machine learning approaches, where classification thresholds are common.

Theoretically, the work promotes a more nuanced understanding of classification accuracy and its components in predictive settings, encouraging precise estimation of threshold-based measures alongside traditional metrics. It strengthens the rigor of predictive modeling and helps bridge the gap between statistical innovation and clinical applicability.

Future work might refine these techniques for models with more complex structures, such as those leveraging deep learning methods. Expanding the calculations to probabilistic models with varying levels of uncertainty and to more complex outcome distributions could further enhance the generalizability of these results across predictive modeling domains.

Through its methodological advances, the paper paves the way for more precise and relevant evaluation of prediction models, helping ensure they meet the demands of clinical and operational settings without compromising scientific rigor. Researchers are encouraged to adopt these extended sample size criteria, particularly in settings where a clinical threshold is pivotal to a model's utility.