
Performance evaluation of predictive AI models to support medical decisions: Overview and guidance (2412.10288v1)

Published 13 Dec 2024 in cs.LG, stat.ME, and stat.ML

Abstract: A myriad of measures to illustrate performance of predictive AI models have been proposed in the literature. Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs. We aim to assess the merits of classic and contemporary performance measures when validating predictive AI models for use in medical practice. We focus on models with a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall, classification, and clinical utility) along with accompanying graphical assessments. The first four domains cover statistical performance, the fifth domain covers decision-analytic performance. We explain why two key characteristics are important when selecting which performance measures to assess: (1) whether the measure's expected value is optimized when it is calculated using the correct probabilities (i.e., a "proper" measure), and (2) whether the measure reflects either purely statistical performance or decision-analytic performance by properly considering misclassification costs. Seventeen measures exhibited both characteristics, fourteen exhibited one characteristic, and one measure (the F1 measure) possessed neither. All classification measures (such as classification accuracy and F1) are improper for clinically relevant decision thresholds other than 0.5 or the prevalence. We recommend the following measures and plots as essential to report: AUROC, calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot with probability distributions per outcome category.

Summary

  • The paper evaluates 32 performance measures for binary predictive AI models in medicine, categorizing them into five domains: discrimination, calibration, overall performance, classification, and clinical utility.
  • It recommends a core set of measures, including AUROC for discrimination, calibration plots for assessing how well predicted probabilities match observed outcomes, and net benefit with decision curve analysis for clinical utility.
  • The guidance emphasizes using proper statistical and decision-analytic measures to ensure trust, improve clinical decision-making, and enhance reproducibility in medical AI evaluation.

Overview of Performance Evaluation Measures for Predictive AI Models in Medical Practice

The paper "Performance Evaluation of Predictive AI Models to Support Medical Decisions: Overview and Guidance," provides a comprehensive evaluation of performance measures for AI-based predictive models in medical practice. The authors focus on binary outcome models, which estimate the probability of disease diagnosis or prognosis in patients. The paper systematically assesses various statistical and decision-analytic performance measures and highlights essential considerations for selecting these measures in medical contexts, ultimately providing guidance on the most informative indicators of model performance.

Summary of Performance Measures

The authors categorize performance measures into five broad domains: discrimination, calibration, overall performance, classification, and clinical utility. They assess a total of 32 measures across these domains (a minimal computational sketch of representative measures follows the list):

  1. Discrimination: This domain evaluates the model's ability to distinguish between patients who will experience an event and those who will not. AUROC, AUPRC, and partial AUROC measures are discussed. AUROC is preferred, despite criticisms regarding its behavior under class imbalance, because of its clear interpretation and semi-proper status.
  2. Calibration: Calibration measures assess whether predicted probabilities reflect actual outcome frequencies. Tools such as the calibration plot, the observed-to-expected (O:E) ratio, and the calibration slope are crucial for understanding how well a model's predictions correspond to reality.
  3. Overall Performance: Overall measures, such as the log-likelihood, Brier score, and various R-squared statistics, examine how closely estimated probabilities approach actual outcomes. These measures integrate elements of both discrimination and calibration.
  4. Classification: Measures such as classification accuracy, the Youden index, and the F1 score evaluate the model's ability to correctly classify outcomes at a given threshold. However, these measures are improper at clinically relevant decision thresholds other than 0.5 or the outcome prevalence.
  5. Clinical Utility: Clinical utility examines whether model use leads to beneficial clinical decision-making. The authors recommend using net benefit and decision curve analysis to quantify the improvement in decision-making conferred by the model.
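
To make the five domains concrete, here is a minimal sketch (our illustration, not code from the paper) that computes one representative measure per statistical domain on simulated data with scikit-learn and statsmodels; all data and variable names (p_true, p_hat, and so on) are hypothetical. Net benefit, the clinical-utility measure, is sketched later alongside the recommended plots.

```python
# Minimal, self-contained sketch: one representative measure per statistical
# domain, computed on simulated data. Not the authors' code.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score, brier_score_loss, accuracy_score

rng = np.random.default_rng(0)
p_true = rng.beta(2, 5, size=1000)                       # hypothetical true risks
y = rng.binomial(1, p_true)                              # observed binary outcomes
p_hat = np.clip(p_true + rng.normal(0, 0.05, size=1000), 0.01, 0.99)  # model output

# Discrimination: area under the ROC curve
auroc = roc_auc_score(y, p_hat)

# Calibration: O:E ratio and calibration slope (logistic fit on the logit scale)
oe_ratio = y.sum() / p_hat.sum()                         # ideal value: 1
logit = np.log(p_hat / (1 - p_hat))
cal_slope = sm.Logit(y, sm.add_constant(logit)).fit(disp=0).params[1]  # ideal: 1

# Overall performance: Brier score (a proper scoring rule; lower is better)
brier = brier_score_loss(y, p_hat)

# Classification: accuracy at a clinical threshold t (improper unless t is 0.5
# or the prevalence, as the paper notes)
t = 0.2
acc = accuracy_score(y, (p_hat >= t).astype(int))

print(f"AUROC={auroc:.3f}  O:E={oe_ratio:.2f}  slope={cal_slope:.2f}  "
      f"Brier={brier:.3f}  accuracy@{t}={acc:.3f}")
```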

Key Findings and Recommendations

The paper emphasizes the importance of using proper performance measures and maintaining a clear focus on either statistical or decision-analytic evaluation. A measure is proper when its expected value is optimized only if the correct probabilities are used, which enhances trust in conclusions drawn from the model. Among the 32 measures evaluated, 17 satisfied both criteria, 14 satisfied one, and one (the F1 measure) satisfied neither, being improper and mixing statistical with decision-analytic evaluation.
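
To make "proper" concrete, consider a standard textbook derivation (our illustration, not taken from the paper). For a binary outcome $Y \sim \mathrm{Bernoulli}(p)$ and a reported probability $q$, the expected Brier score is

$$\mathbb{E}\big[(Y-q)^2\big] = p(1-q)^2 + (1-p)q^2 = p(1-p) + (p-q)^2,$$

which is minimized exactly at $q = p$, so the Brier score is proper. By contrast, classification accuracy at a threshold $t$ rewards any $q$ on the "correct" side of $t$: with true risk $p = 0.3$ and threshold $t = 0.2$, truthfully reporting $q = 0.3$ gives expected accuracy $0.3$ (predict positive, correct only when $Y = 1$), while reporting $q = 0$ gives $0.7$. Truthful reporting is not optimal, which illustrates why accuracy is improper at thresholds other than 0.5 or the prevalence.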

Core Set of Recommended Measures:

  • AUROC: Essential for evaluating discrimination.
  • Calibration Plot (with smoothing): Offers visual insight into how predicted probabilities compare with observed outcome frequencies across the risk range.
  • Net Benefit (Decision Curve Analysis): Evaluates clinical utility across a range of decision thresholds.
  • Probability Distribution by Outcome Category: Shows the spread of predicted risks among patients with and without the event, which is crucial for understanding model behavior (a plotting sketch for these graphics follows).
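
The following is a minimal matplotlib sketch of the three recommended graphics, reusing the hypothetical y and p_hat arrays from the earlier sketch. The net_benefit helper implements the standard formula NB(t) = TP/n - (FP/n) * t/(1-t) (Vickers and Elkin's decision curve analysis); it is our own addition, not the authors' code.

```python
# Sketch of the recommended plots: smoothed calibration plot, decision curve,
# and predicted-risk distributions per outcome category. Assumes y and p_hat
# from the previous sketch are in scope.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def net_benefit(y, p_hat, t):
    """Net benefit at threshold t: TP/n - (FP/n) * t/(1-t)."""
    pred_pos = p_hat >= t
    n = len(y)
    tp = np.sum(pred_pos & (y == 1))
    fp = np.sum(pred_pos & (y == 0))
    return tp / n - (fp / n) * t / (1 - t)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(13, 4))

# 1. Calibration plot with lowess smoothing; ideal is the 45-degree line.
smooth = lowess(y, p_hat, frac=0.3)   # column 0: sorted p_hat, column 1: fit
ax1.plot(smooth[:, 0], smooth[:, 1], label="smoothed")
ax1.plot([0, 1], [0, 1], "k--", label="ideal")
ax1.set(xlabel="predicted probability", ylabel="observed proportion",
        title="Calibration plot")
ax1.legend()

# 2. Decision curve: model vs. treat-all vs. treat-none across thresholds.
thresholds = np.linspace(0.01, 0.5, 50)
prev = y.mean()
ax2.plot(thresholds, [net_benefit(y, p_hat, t) for t in thresholds], label="model")
ax2.plot(thresholds, [prev - (1 - prev) * t / (1 - t) for t in thresholds],
         label="treat all")
ax2.axhline(0, color="k", lw=0.8, label="treat none")
ax2.set(xlabel="decision threshold", ylabel="net benefit", title="Decision curve")
ax2.legend()

# 3. Distribution of predicted risks per observed outcome category.
ax3.hist(p_hat[y == 0], bins=30, alpha=0.5, label="no event")
ax3.hist(p_hat[y == 1], bins=30, alpha=0.5, label="event")
ax3.set(xlabel="predicted probability", title="Risk distributions")
ax3.legend()

plt.tight_layout()
plt.show()
```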

Implications and Future Directions

By identifying and recommending a subset of key measures, the authors aim to streamline the evaluation process for predictive AI models in medical practice. They highlight the distinction between statistical performance (discrimination and calibration) and decision-analytic performance (clinical utility), urging consideration of both aspects in model assessment.

The paper's insights have both theoretical and practical implications. Theoretically, the work contributes to the ongoing development of more sophisticated performance metrics. Practically, the guidance helps medical practitioners and researchers select appropriate models for clinical decision support, potentially leading to improved patient outcomes and better resource allocation.

As AI in medicine continues to evolve, future research will likely focus on refining these measures, incorporating patient and clinical variability, and assessing impacts across different medical contexts and populations. The authors also stress the importance of transparency, encouraging adherence to reporting standards to avoid performance hacking and ensure reproducibility and validity of results.