Data Cleaning and Machine Learning: A Systematic Literature Review (2310.01765v2)

Published 3 Oct 2023 in cs.LG and cs.DB

Abstract: Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning; hence creating a dual relationship between ML and data cleaning. To the best of our knowledge, there is no study that comprehensively reviews this relationship. Objective: This paper's objectives are twofold. First, it aims to summarize the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides future work recommendations. Method: We conduct a systematic literature review of the papers published between 2016 and 2022 inclusively. We identify different types of data cleaning activities with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. Results: We summarize the content of 101 papers covering various data cleaning activities and provide 24 future work recommendations. Our review highlights many promising data cleaning techniques that can be further extended. Conclusion: We believe that our review of the literature will help the community develop better approaches to clean data.
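Two of the data cleaning activities the review surveys, outlier detection and imputation, can be illustrated with a minimal sketch. This is not code from the paper: the functions, the IQR (Tukey) rule, and the mean-fill strategy are illustrative choices on made-up data, shown only to make the activity types concrete.

```python
# Illustrative sketch of two data cleaning activities: outlier detection
# (Tukey's 1.5*IQR rule) and imputation (mean fill). Hypothetical data.
from statistics import mean, quantiles

def iqr_outliers(values):
    """Flag points outside 1.5 * IQR beyond the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

def mean_impute(values):
    """Replace missing entries (None) with the mean of observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

data = [10, 12, 11, 13, 12, 95]       # 95 is an injected error
print(iqr_outliers(data))             # [95]
print(mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```

In practice, as the review discusses, such rule-based steps are increasingly replaced or tuned by ML models, which is the dual relationship the paper examines.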

Authors (5)
  1. Pierre-Olivier Côté
  2. Amin Nikanjam
  3. Nafisa Ahmed
  4. Dmytro Humeniuk
  5. Foutse Khomh