CleanAgent: Automating Data Standardization with LLM-based Agents (2403.08291v4)
Abstract: Data standardization is a crucial part of the data science life cycle. While tools like Pandas offer robust functionalities, their complexity and the manual effort required to customize code for diverse column types pose significant challenges. Although LLMs like ChatGPT have shown promise in automating this process through natural language understanding and code generation, this approach still demands expert-level programming knowledge and continuous interaction for prompt refinement. To address these challenges, our key idea is to propose a Python library with declarative, unified APIs for standardizing different column types, simplifying the LLM's code generation with concise API calls. We first propose Dataprep.Clean, a component of the Dataprep Python Library, which significantly reduces coding complexity by enabling the standardization of a specific column type with a single line of code. Then, we introduce the CleanAgent framework, which integrates Dataprep.Clean and LLM-based agents to automate the data standardization process. With CleanAgent, data scientists only need to provide their requirements once, allowing for a hands-free process. To demonstrate the practical utility of CleanAgent, we developed a user-friendly web application that lets users interact with it using real-world datasets.
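To make the "declarative, unified API" idea concrete, the following is a minimal sketch in plain Python. It is not Dataprep.Clean's actual API: the names `clean_column`, `standardizer`, and the `_clean` suffix are hypothetical, chosen only to illustrate how one call per column type could replace hand-written, type-specific cleaning code, and why such calls are easy for an LLM to generate.

```python
import re
from datetime import datetime

# Hypothetical registry: maps a column-type name to its standardizer function.
_STANDARDIZERS = {}


def standardizer(kind):
    """Decorator that registers a standardizer under a column-type name."""
    def register(fn):
        _STANDARDIZERS[kind] = fn
        return fn
    return register


@standardizer("email")
def _std_email(value):
    # Trim and lowercase; return None when the value is not email-shaped.
    value = value.strip().lower()
    return value if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) else None


@standardizer("date")
def _std_date(value):
    # Try a few assumed input formats and emit ISO 8601 (YYYY-MM-DD).
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean_column(rows, column, kind):
    """Unified entry point: return new rows with a '<column>_clean' field added."""
    fn = _STANDARDIZERS[kind]
    return [{**row, f"{column}_clean": fn(row[column])} for row in rows]


rows = [
    {"email": "  Alice@Example.COM ", "joined": "03/02/2024"},
    {"email": "not-an-email", "joined": "March 5, 2024"},
]

# One declarative call per column type, regardless of the type's internals.
rows = clean_column(rows, "email", "email")
rows = clean_column(rows, "joined", "date")
```

Because every column type goes through the same call shape, an agent only has to decide which column maps to which type, rather than writing the parsing logic itself.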