In-Database Data Imputation (2401.03359v1)
Abstract: Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates (e.g., mean), are computationally efficient but may introduce bias and disrupt variable relationships, leading to inaccurate analyses. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time, limiting their applicability to small datasets. This work enables efficient, high-quality, and scalable data imputation within a database system using the widely adopted Multivariate Imputation by Chained Equations (MICE) method. We adapt this method to exploit computation sharing and a ring abstraction for faster model training. To impute both continuous and categorical values, we develop techniques for in-database learning of stochastic linear regression and Gaussian discriminant analysis models. Our MICE implementations in PostgreSQL and DuckDB outperform alternative MICE implementations and model-based imputation techniques by up to two orders of magnitude in terms of computation time, while maintaining high imputation quality.
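To make the chained-equations idea concrete, the following is a minimal in-memory sketch of MICE with a simplified stochastic linear regression for continuous columns. It is not the paper's in-database implementation: the function name `mice_impute`, its parameters, the small ridge term, and the mean-based initialization are all assumptions made for illustration, and the regression parameters are not sampled as a full Bayesian variant would do. The one idea it borrows from the abstract is that each regression is fit from aggregates (the Gram matrix and the feature-target products), which is the kind of computation that can be shared across models.

```python
import numpy as np

def mice_impute(X, n_iter=5, seed=0):
    """Hypothetical helper: impute NaNs in a 2D float array by chained regressions."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Start from a simple mean imputation (assumes no column is entirely missing).
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])

    n, d = X.shape
    for _ in range(n_iter):
        for j in range(d):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            obs_j = ~miss_j

            # Features: intercept plus the current values of all other columns.
            A = np.column_stack([np.ones(n), np.delete(filled, j, axis=1)])
            A_obs, y_obs = A[obs_j], filled[obs_j, j]

            # Fit by normal equations from two aggregates: A^T A and A^T y.
            gram = A_obs.T @ A_obs
            rhs = A_obs.T @ y_obs
            theta = np.linalg.solve(gram + 1e-8 * np.eye(d), rhs)

            # Residual standard deviation drives the stochastic part.
            resid = y_obs - A_obs @ theta
            dof = max(int(obs_j.sum()) - d, 1)
            sigma = np.sqrt(resid @ resid / dof)

            # Impute as prediction plus Gaussian noise; observed cells stay untouched.
            pred = A[miss_j] @ theta
            filled[miss_j, j] = pred + rng.normal(0.0, sigma, size=pred.shape)
    return filled
```

A call such as `mice_impute(X_with_nans, n_iter=3)` returns a completed copy of the input. In a database setting, the Gram matrix and feature-target products would be computed as aggregate queries rather than over a materialized array, which this sketch does not attempt.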