- The paper presents closed-form sample size formulas for targeting precise estimates of threshold-based performance measures such as sensitivity, specificity, PPV, NPV, and the F1-score.
- It extends existing sample size criteria for binary outcomes to threshold-based measures, outlines extensions to time-to-event outcomes, and often yields smaller required sample sizes than calibration-based criteria.
- Applying the methodology to the ISARIC 4C Deterioration model for COVID-19 illustrates its practical value for clinical model validation.
Extended Sample Size Calculations for Evaluation of Prediction Models Using a Threshold for Classification
The paper "Extended Sample Size Calculations for Evaluation of Prediction Models Using a Threshold for Classification" presents an important advancement in the methodology for evaluating the performance of prediction models, particularly when assessing classifications based upon clinically relevant probability thresholds. This work builds upon previously published guidelines for determining sample sizes in prediction model evaluation, introducing new criteria and methodologies that specifically address threshold-based performance measures.
Overview of Contributions
The authors provide closed-form solutions for the sample size required to target precise estimates of several performance measures: accuracy, sensitivity (recall), specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), and the F1-score. These solutions apply to external validation studies of prediction models with binary outcomes and require pre-specifying a target standard error and an expected value for each performance measure.
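As a rough illustration of how such precision-based, closed-form criteria work, the sketch below computes required sample sizes from the standard binomial variance of a proportion, var(p̂) = p(1 − p)/m, where m is the number of participants who actually contribute to a given measure. The function names and exact expressions are illustrative assumptions, not reproduced from the paper.

```python
# Minimal sketch of closed-form, precision-based sample size
# calculations for threshold-based performance measures, assuming the
# standard binomial variance of a proportion: var(p) = p(1-p)/m, where
# m is the number of participants contributing to the measure.
import math


def n_accuracy(acc: float, se: float) -> int:
    # Every participant contributes to accuracy.
    return math.ceil(acc * (1 - acc) / se**2)


def n_sensitivity(sens: float, prev: float, se: float) -> int:
    # Only participants with the outcome (a fraction `prev` of the
    # total) contribute, inflating the basic formula by 1/prev.
    return math.ceil(sens * (1 - sens) / (se**2 * prev))


def n_specificity(spec: float, prev: float, se: float) -> int:
    # Only participants without the outcome contribute.
    return math.ceil(spec * (1 - spec) / (se**2 * (1 - prev)))


def n_ppv(ppv: float, sens: float, spec: float, prev: float, se: float) -> int:
    # Only participants classified positive contribute; their expected
    # fraction follows from sensitivity, specificity, and prevalence.
    p_pos = sens * prev + (1 - spec) * (1 - prev)
    return math.ceil(ppv * (1 - ppv) / (se**2 * p_pos))


def n_npv(npv: float, sens: float, spec: float, prev: float, se: float) -> int:
    # Only participants classified negative contribute.
    p_neg = (1 - sens) * prev + spec * (1 - prev)
    return math.ceil(npv * (1 - npv) / (se**2 * p_neg))
```

In practice, the largest value across all targeted measures sets the minimum sample size for the validation study.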
Key contributions include:
- Threshold-Based Performance Measures: Extending existing sample size guidance to cover performance measures evaluated at specific probability thresholds. This is especially relevant given the growing use of machine learning techniques, which often classify via such thresholds.
- Formulae Derivation and Example Implementation: Derivation of the necessary formulae for these threshold-based performance measures, along with illustrative examples that guide researchers in applying these methods to real-world prediction models.
- Time-to-Event Extensions: While the primary focus is on binary outcomes, the paper also considers extensions to time-to-event data, accommodating scenarios with and without censoring before the time horizon for prediction.
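For intuition on the time-to-event case: when no one is censored before the prediction horizon t, event status at t is effectively binary, so the binary formulas above can be reused with the prevalence taken as 1 − S(t), the expected cumulative incidence by the horizon. The snippet below is a hypothetical sketch of that reduction, reusing `n_sensitivity` from the earlier block; the censored case requires the paper's adjusted expressions, which are not reproduced here.

```python
# Hypothetical time-to-event example with no censoring before the
# horizon t: event status at t is binary, so the binary formula
# applies with prevalence = 1 - S(t).
surv_at_horizon = 0.80            # hypothetical survival probability S(t)
prev_t = 1 - surv_at_horizon      # expected event fraction by the horizon
print(n_sensitivity(sens=0.85, prev=prev_t, se=0.0255))
```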
Application and Findings
The authors illustrate their methodology with the ISARIC 4C Deterioration model, which predicts in-hospital deterioration of patients with COVID-19, comparing the newly proposed sample size requirements with previous criteria focused on calibration, discrimination, and net benefit. Targeting a confidence interval width of 0.1 for each measure, the minimum required sample size was 933, driven by the NPV.
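To make the link between confidence interval width and target standard error concrete, the hypothetical snippet below reuses `n_npv` from the sketch above. Assuming the 0.1 width refers to a 95% interval, it corresponds to a target standard error of roughly 0.1 / (2 × 1.96) ≈ 0.0255. The sensitivity, specificity, and prevalence values are placeholders rather than the ISARIC 4C Deterioration inputs, so the result will not reproduce the paper's 933.

```python
# Hypothetical inputs; not the ISARIC 4C Deterioration values.
sens, spec, prev = 0.90, 0.50, 0.40

# NPV implied by the assumed sensitivity, specificity, and prevalence.
p_neg = (1 - sens) * prev + spec * (1 - prev)
npv = spec * (1 - prev) / p_neg

# A 95% CI of total width 0.1 implies SE ~= 0.1 / (2 * 1.96).
target_se = 0.1 / (2 * 1.96)
print(n_npv(npv, sens, spec, prev, se=target_se))  # minimum n for the NPV criterion
```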
The authors highlight that the sample sizes required by the new criteria are often smaller than those driven by the precision of the calibration slope. In the worked example, the NPV criterion required 933 participants, whereas targeting a precise calibration slope required 949, so calibration remained the binding requirement.
Implications and Future Directions
The practical implications of this research are substantial: it gives prediction model evaluators refined tools for determining sample sizes that yield sufficiently precise estimates of threshold-based performance measures. This adds robustness to model validation efforts, particularly for models built with machine learning approaches.
Theoretically, this work promotes a more nuanced understanding of classification accuracy and its components in predictive settings, encouraging meaningful estimation of threshold-based measures alongside traditional metrics. It advances rigor in predictive modeling and helps bridge the gap between statistical innovation and clinical applicability.
For future developments, work might focus on refining these techniques for models with more complex structures, such as those built on deep learning methods. Extending the calculations to probabilistic models with varying levels of uncertainty, and to more complex outcome distributions, could further broaden the generalizability and applicability of these results across predictive modeling domains.
Through its methodological advances, this paper paves the way for greater precision and relevance in the evaluation of prediction models, ensuring they meet the demands of clinical and operational settings without compromising scientific rigor. Researchers are encouraged to adopt these extended sample size criteria, particularly in settings where clinical thresholds are pivotal to model utility.