On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets (2403.14687v1)
Abstract: Missing values are a common characteristic of real-world datasets, especially healthcare data. This is frustrating when applying machine learning algorithms to such datasets, because most machine learning models perform poorly in the presence of missing values. This study compares the performance of seven imputation techniques on three healthcare datasets: mean imputation, median imputation, Last Observation Carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, interpolation imputation, MissForest imputation, and Multiple Imputation by Chained Equations (MICE). Missing values were introduced into each dataset at rates of 10%, 15%, 20% and 25%, and each technique was then used to impute them. Performance was compared using root mean squared error (RMSE) and mean absolute error (MAE). The results show that MissForest imputation performs best, followed by MICE imputation. Additionally, we investigate whether it is better to perform feature selection before imputation or vice versa, using recall, precision, F1-score and accuracy as metrics. Because there is little literature on this question and some debate about it among researchers, we hope that the results of this experiment will encourage data scientists and researchers to perform imputation before feature selection when dealing with data containing missing values.
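The evaluation protocol described in the abstract (mask a known fraction of values, impute them, then score the imputed values against the held-back truth with RMSE and MAE) can be illustrated with a minimal sketch. This is not the authors' code: it covers only mean and median imputation, it assumes values are removed uniformly at random, it uses a synthetic column rather than any of the three healthcare datasets, and the helper names (`mask_values`, `impute_constant`, `evaluate`) are hypothetical.

```python
import math
import random
import statistics


def mask_values(column, rate, rng):
    """Copy `column`, set round(rate * n) random entries to None, return (copy, masked indices)."""
    n_missing = round(rate * len(column))
    idx = sorted(rng.sample(range(len(column)), n_missing))
    masked = list(column)
    for i in idx:
        masked[i] = None
    return masked, idx


def impute_constant(masked, fill_fn):
    """Fill None entries with fill_fn(observed values), e.g. the mean or the median."""
    observed = [v for v in masked if v is not None]
    fill = fill_fn(observed)
    return [fill if v is None else v for v in masked]


def rmse(true, pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mae(true, pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def evaluate(column, rates=(0.10, 0.15, 0.20, 0.25), seed=0):
    """Score mean and median imputation at each missingness rate on the masked entries only."""
    rng = random.Random(seed)
    results = {}
    for rate in rates:
        masked, idx = mask_values(column, rate, rng)
        truth = [column[i] for i in idx]
        for name, fn in (("mean", statistics.mean), ("median", statistics.median)):
            imputed = impute_constant(masked, fn)
            preds = [imputed[i] for i in idx]
            results[(name, rate)] = (rmse(truth, preds), mae(truth, preds))
    return results


# Synthetic stand-in column (assumption: illustrative data, not from the paper).
data_rng = random.Random(42)
column = [data_rng.gauss(120.0, 15.0) for _ in range(200)]

results = evaluate(column)
for (name, rate), (r, m) in sorted(results.items()):
    print(f"{name:6s} {rate:.0%} missing  RMSE={r:.2f}  MAE={m:.2f}")
```

Scoring only the masked positions keeps the error from being diluted by values that were never missing. The same loop could wrap any other imputer, such as KNN or MICE from a library, by swapping out `impute_constant`.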