
Data Cleaning Using Large Language Models (2410.15547v1)

Published 21 Oct 2024 in cs.DB

Abstract: Data cleaning is a crucial yet challenging task in data analysis, often requiring significant manual effort. To automate data cleaning, previous systems have relied on statistical rules derived from erroneous data, resulting in low accuracy and recall. This work introduces Cocoon, a novel data cleaning system that uses large language models (LLMs) to derive rules based on semantic understanding and combines them with statistical error detection. However, data cleaning is still too complex a task for current LLMs to handle in one shot. To address this, Cocoon decomposes complex cleaning tasks into manageable components in a workflow that mimics human cleaning processes. Our experiments show that Cocoon outperforms state-of-the-art data cleaning systems on standard benchmarks.
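The idea of pairing statistical error detection with a semantic judgment, one small step at a time, can be illustrated with a minimal sketch. This is not Cocoon's implementation: the function names, the `max_share` threshold, and the `stub_judge` callable (a stand-in for an LLM call) are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical signature: given a suspicious value and the frequent values
# of the column, return a corrected value or None to leave it unchanged.
Judge = Callable[[str, List[str]], Optional[str]]


def detect_rare_values(column: List[str], max_share: float = 0.05) -> List[str]:
    """Statistical step: flag values whose share of the column is at most max_share."""
    counts = Counter(column)
    total = len(column)
    return [v for v, n in counts.items() if total and n / total <= max_share]


def stub_judge(candidate: str, frequent: List[str]) -> Optional[str]:
    """Stand-in for a semantic (LLM) decision: match ignoring case and whitespace."""
    for value in frequent:
        if candidate.strip().lower() == value.strip().lower():
            return value
    return None


def clean_column(column: List[str], judge: Judge, max_share: float = 0.05) -> List[str]:
    """Run the two steps in sequence: detect candidates, then ask the judge about each."""
    rare = detect_rare_values(column, max_share)
    # Preserve first-seen order of the values that were not flagged.
    frequent = [v for v in dict.fromkeys(column) if v not in rare]
    mapping = {}
    for candidate in rare:
        fix = judge(candidate, frequent)
        if fix is not None:
            mapping[candidate] = fix
    return [mapping.get(v, v) for v in column]


column = ["USA"] * 20 + ["Canada"] * 18 + ["usa ", "canada", "Mexico"]
cleaned = clean_column(column, stub_judge)
```

In this sketch the statistical step only nominates candidates, and the judge decides whether each one is a genuine error. "Mexico" is rare but has no frequent counterpart, so it is left untouched rather than being forced into an existing category.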
