The harms of class imbalance corrections for machine learning based prediction models: a simulation study (2404.19494v1)
Abstract: Risk prediction models are increasingly used in healthcare to aid clinical decision making. In most clinical contexts, model calibration (i.e., the reliability of risk estimates) is critical. Data available for model development are often not perfectly balanced with respect to the modeled outcome (i.e., individuals with and without the event of interest are not equally represented in the data). Researchers commonly correct this class imbalance, yet the effect of such corrections on the calibration of machine learning models is largely unknown. We studied the effect of imbalance corrections on model calibration for a variety of machine learning algorithms. Using extensive Monte Carlo simulations, we compared the out-of-sample predictive performance of models developed with an imbalance correction to that of models developed without one, across different data-generating scenarios (varying sample size, number of predictors, and event fraction). Our findings were illustrated in a case study using MIMIC-III data. In all simulation scenarios, prediction models developed without a correction for class imbalance had equal or better calibration performance than prediction models developed with a correction. The miscalibration introduced by correcting for class imbalance was characterized by an over-estimation of risk and could not always be corrected with re-calibration. Correcting for class imbalance is not always necessary and may even be harmful for clinical prediction models that aim to produce reliable risk estimates on an individual basis.
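The over-estimation of risk described in the abstract can be illustrated with a minimal sketch. It is not the study's simulation design: the single-predictor data-generating model, the plain logistic regression fitted by Newton-Raphson, the choice of random oversampling as the imbalance correction, and all function names (`simulate`, `fit_logistic`, `random_oversample`) are assumptions made here for illustration only. The sketch compares the mean predicted risk on held-out data with the observed event fraction, a simple check of calibration-in-the-large.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate(n, rng, intercept=-3.0, slope=1.0):
    """Draw one predictor and a rare binary outcome (hypothetical scenario)."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])  # first column is the intercept
    y = rng.binomial(1, sigmoid(intercept + slope * x))
    return X, y

def fit_logistic(X, y, n_iter=25):
    """Unpenalized logistic regression fitted by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def random_oversample(X, y, rng):
    """Resample events with replacement until both classes are equally large.

    Assumes events (y == 1) are the minority class, as in this scenario.
    """
    events = np.flatnonzero(y == 1)
    non_events = np.flatnonzero(y == 0)
    extra = rng.choice(events, size=len(non_events) - len(events), replace=True)
    idx = np.concatenate([non_events, events, extra])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X_train, y_train = simulate(5000, rng)
X_test, y_test = simulate(20000, rng)

# Model developed without an imbalance correction
beta_plain = fit_logistic(X_train, y_train)

# Model developed on randomly oversampled (balanced) training data
X_bal, y_bal = random_oversample(X_train, y_train, rng)
beta_bal = fit_logistic(X_bal, y_bal)

observed = y_test.mean()
mean_plain = sigmoid(X_test @ beta_plain).mean()
mean_bal = sigmoid(X_test @ beta_bal).mean()

print(f"observed event fraction:        {observed:.3f}")
print(f"mean predicted risk, no fix:    {mean_plain:.3f}")
print(f"mean predicted risk, oversample:{mean_bal:.3f}")
```

In this toy setting the uncorrected model's mean predicted risk should sit close to the observed event fraction, while the model trained on balanced data should predict far higher risks on average, because balancing the training data changes the event fraction the model learns from. How this plays out for other algorithms and corrections is what the study itself examines.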