ITI-IQA: a Toolbox for Heterogeneous Univariate and Multivariate Missing Data Imputation Quality Assessment (2407.11767v1)

Published 16 Jul 2024 in cs.LG

Abstract: Missing values are a major challenge in most data science projects working on real data. To avoid losing valuable information, imputation methods are used to fill in missing values with estimates, allowing the preservation of samples or variables that would otherwise be discarded. However, if the process is not well controlled, imputation can generate spurious values that introduce uncertainty and bias into the learning process. The abundance of univariate and multivariate imputation techniques, along with the complex trade-off between data reliability and preservation, makes it difficult to determine the best course of action to tackle missing values. In this work, we present ITI-IQA (Imputation Quality Assessment), a set of utilities designed to assess the reliability of various imputation methods, select the best imputer for any feature or group of features, and filter out features that do not meet quality criteria. Statistical tests are conducted to evaluate the suitability of every tested imputer, ensuring that no new biases are introduced during the imputation phase. The result is a trainable pipeline of filters and imputation methods that streamlines the process of dealing with missing data, supporting different data types: continuous, discrete, binary, and categorical. The toolbox also includes a suite of diagnosing methods and graphical tools to check measurements and results during and after handling missing data.
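The per-feature selection idea described in the abstract can be illustrated with a minimal sketch. This is not the ITI-IQA API: the function names (`score_imputers`, `select_and_impute`), the two candidate imputers, the mean-absolute-error score, and the `max_error` threshold are all assumptions made for illustration, and the statistical tests mentioned in the abstract are omitted. The sketch hides a fraction of the known values, measures how well each candidate recovers them, keeps the best candidate, and drops the feature if even the best one is too inaccurate.

```python
import random
import statistics

# Hypothetical candidate univariate imputers: each maps the observed
# values of a feature to a single fill value.
CANDIDATES = {
    "mean": statistics.fmean,
    "median": statistics.median,
}


def score_imputers(column, holdout_fraction=0.2, n_rounds=20, seed=0):
    """Estimate each candidate's mean absolute error by masking known values."""
    observed = [x for x in column if x is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values to score imputers")
    rng = random.Random(seed)
    # Hide at least one value, but always keep at least one to fit on.
    n_holdout = min(len(observed) - 1,
                    max(1, int(len(observed) * holdout_fraction)))
    errors = {name: [] for name in CANDIDATES}
    for _ in range(n_rounds):
        shuffled = observed[:]
        rng.shuffle(shuffled)
        hidden, kept = shuffled[:n_holdout], shuffled[n_holdout:]
        for name, fit in CANDIDATES.items():
            fill = fit(kept)
            errors[name].extend(abs(fill - truth) for truth in hidden)
    return {name: statistics.fmean(errs) for name, errs in errors.items()}


def select_and_impute(column, max_error=None):
    """Pick the lowest-error candidate; return (imputed_column, best, scores).

    If max_error is given and even the best candidate exceeds it, the
    imputed column is None, meaning the feature is filtered out.
    """
    scores = score_imputers(column)
    best = min(scores, key=scores.get)
    if max_error is not None and scores[best] > max_error:
        return None, best, scores
    observed = [x for x in column if x is not None]
    fill = CANDIDATES[best](observed)
    return [fill if x is None else x for x in column], best, scores


if __name__ == "__main__":
    # None marks a missing value; 100.0 is an outlier that skews the mean.
    feature = [1.0, 2.0, None, 3.0, 100.0, None, 2.5, 1.5, 2.0, 3.5]
    imputed, best, scores = select_and_impute(feature)
    print("selected imputer:", best)
    print("scores:", scores)
    print("imputed feature:", imputed)
```

Scoring on artificially hidden values is one common way to compare imputers when the true missing values are unknown. A fuller version would add multivariate candidates and a distributional check on the imputed values, which is the role the abstract assigns to its statistical tests.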

