Data Cleaning and Machine Learning: A Systematic Literature Review (2310.01765v2)
Abstract: Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning, creating a dual relationship between ML and data cleaning. To the best of our knowledge, no study comprehensively reviews this relationship. Objective: This paper's objectives are twofold. First, it aims to summarize the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides future work recommendations. Method: We conduct a systematic literature review of the papers published between 2016 and 2022, inclusive. We identify different types of data cleaning activities with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. Results: We summarize the content of 101 papers covering various data cleaning activities and provide 24 future work recommendations. Our review highlights many promising data cleaning techniques that can be further extended. Conclusion: We believe that our review of the literature will help the community develop better approaches to clean data.
Authors: Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed, Dmytro Humeniuk, Foutse Khomh