Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records (2110.09680v3)
Abstract: It has long been recognized that many datasets contain significant levels of missing numerical data. Addressing this problem is a potentially critical prerequisite for applying machine learning methods to such datasets, but it is a challenging task. In this paper, we apply a recently developed multi-level stochastic optimization approach to the problem of imputation in massive medical records. The approach is based on computational applied mathematics techniques and is highly accurate. In particular, for the Best Linear Unbiased Predictor (BLUP) this multi-level formulation is exact, and is significantly faster and more numerically stable. This permits practical application of Kriging methods to data imputation problems for massive datasets. We test this approach on data from the National Inpatient Sample (NIS), Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality. Numerical results show that the multi-level method significantly outperforms current approaches and is numerically robust. It is more accurate than the methods recommended in the recent report from HCUP, with benchmark tests showing error reductions of up to 75%. Furthermore, the results are also superior to those of recent state-of-the-art methods such as discriminative deep learning.