LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs (2404.18681v1)

Published 29 Apr 2024 in cs.DB

Abstract: Machine learning's influence is expanding rapidly, now integral to decision-making processes from corporate strategy to the advancements in Industry 4.0. The efficacy of Artificial Intelligence broadly hinges on the caliber of data used during its training phase; optimal performance is tied to exceptional data quality. Data cleaning tools, particularly those that exploit functional dependencies within ontological frameworks or context models, are instrumental in augmenting data quality. Nevertheless, crafting these context models is a demanding task, both in terms of resources and expertise, often necessitating specialized knowledge from domain experts. In light of these challenges, this paper introduces an innovative approach, called LLMClean, for the automated generation of context models, utilizing LLMs to analyze and understand various datasets. LLMClean encompasses a sequence of actions, starting with categorizing the dataset, extracting or mapping relevant models, and ultimately synthesizing the context model. To demonstrate its potential, we have developed and tested a prototype that applies our approach to three distinct datasets from the Internet of Things, healthcare, and Industry 4.0 sectors. The results of our evaluation indicate that our automated approach can achieve data cleaning efficacy comparable with that of context models crafted by human experts.

Summary

  • The paper presents an automated method using LLMs to generate context models for OFD-based cleaning.
  • It introduces prompt ensembling to overcome LLM variability, achieving higher F1 scores on diverse datasets.
  • The method outperforms baselines in error detection and repair for both IoT and non-IoT tabular data.

Data cleaning is a critical prerequisite for effective machine learning and data analysis, but real-world data often suffers from inaccuracies and inconsistencies. Traditional data cleaning tools rely on static rules or metadata, which can struggle with context-dependent errors. Context-aware cleaning methods, which leverage information about the data's origin and relationships, have shown promise, particularly through the use of Ontological Functional Dependencies (OFDs) derived from context models. However, manually creating and maintaining these context models is a complex, resource-intensive task requiring significant domain expertise.

The paper "LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs" (2404.18681) addresses this challenge by proposing an automated approach to generate context models from tabular data using LLMs. LLMClean aims to automate the creation of context models, making OFD-based data cleaning more practical and scalable.

LLMClean's architecture takes a dirty dataset as input and outputs flagged erroneous instances. The process involves several key steps:

  1. Context Model Generation: This is the core step, automated using LLMs.
    • The dataset's column names are extracted.
    • The dataset is categorized as either IoT or Non-IoT relational data.
    • For IoT Datasets: LLMs map dataset columns to concepts within a predefined meta-context model designed for IoT data; missing concepts may be synthetically generated. The data is then transformed through sensor splitting (disaggregating multiple sensor readings per row into separate rows; see the pandas sketch after this list), column renaming (standardizing names to match the meta-model), and further column generation (adding missing but necessary columns such as min/max sensor values). Automated sensor information extraction, using LLMs and external sources such as Wikipedia/Wikidata, helps establish Capability Dependencies. Finally, the refined dataset is structured into an RDF graph representing a concrete instance of the context model.
    • For Non-IoT Datasets: LLMs analyze pairs of column names to identify semantic relationships and determine concepts. Relationships are established hierarchically in an RDF graph, linking narrower concepts to broader ones. This process focuses on identifying relationships relevant to Matching and Denial dependencies.
  2. OFD Rule Extraction: OFD rules (Denial, Matching, Device-Link, Temporal, Location, Monitoring, and Capability dependencies for IoT data, with a subset of these for Non-IoT data) are extracted from the automatically generated context model.
  3. Error Detection: The extracted OFD rules are used to validate the input data, and violations (e.g., missing values, functional dependency violations) are flagged as erroneous instances. The paper describes specific methods for detecting missing values (mapping placeholder strings to NaN) and FD violations (statistical modal-value comparison); both checks are sketched after this list.
  4. Error Correction (Optional but shown): Flagged erroneous instances can be fed into external error correction tools (like Baran, statistical imputation, or ML-based imputation) to generate repair candidates.
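
The paper does not ship the transformation code; the following is a minimal pandas sketch of the sensor-splitting step, assuming a hypothetical wide-format IoT table with one column per sensor and illustrative column names:

```python
import pandas as pd

# Hypothetical wide-format IoT table: one row per timestamp,
# with multiple sensor readings packed into separate columns.
wide = pd.DataFrame({
    "timestamp": ["2024-01-01T00:00", "2024-01-01T00:05"],
    "device_id": ["dev-1", "dev-1"],
    "temperature": [21.5, 21.7],
    "humidity": [48.0, 47.5],
})

# Sensor splitting: disaggregate the per-sensor columns so that each
# row carries exactly one (sensor, value) reading, renaming the new
# columns to standardized names of the kind the meta-context model
# expects (names here are illustrative).
long = wide.melt(
    id_vars=["timestamp", "device_id"],
    var_name="sensor_name",
    value_name="sensor_value",
)
print(long)
```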

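The detection step itself is straightforward once the OFD rules exist. A minimal sketch of the two checks the paper names, assuming pandas and a hypothetical set of placeholder tokens:

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder tokens that should count as missing values.
PLACEHOLDERS = ["", "n/a", "N/A", "null", "-", "?"]

def flag_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Map placeholder strings to NaN, then flag every NaN cell."""
    return df.replace(PLACEHOLDERS, np.nan).isna()

def flag_fd_violations(df: pd.DataFrame, lhs: str, rhs: str) -> pd.Series:
    """Flag rows violating the dependency lhs -> rhs by modal-value
    comparison: within each lhs group, the most frequent rhs value is
    assumed correct, and deviating rows are flagged."""
    modes = df.groupby(lhs)[rhs].transform(lambda s: s.mode().iloc[0])
    return df[rhs].ne(modes)

df = pd.DataFrame({
    "zip":  ["10115", "10115", "10115", "80331"],
    "city": ["Berlin", "Berlin", "Munich", "Munich"],
})
# Row 2 deviates from the modal city of zip 10115 and is flagged.
print(flag_fd_violations(df, lhs="zip", rhs="city"))
```
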
A significant aspect of LLMClean's implementation is its approach to leveraging LLMs, which can sometimes produce inconsistent results. To enhance stability and accuracy, LLMClean employs a Prompt Ensembling technique. This involves:

  1. Generating a variety of prompts from a baseline prompt.
  2. Enhancing prompts with few-shot examples from a training dataset.
  3. Evaluating individual prompts on training and validation sets.
  4. Exploring ensembles (combinations) of prompts.
  5. Finding a consensus among results from prompts within an ensemble based on a threshold.
  6. Selecting the ensemble configuration (combination of prompts and consensus threshold) that performs best (e.g., highest F1 score) on the training data and validating it on a separate validation set.

This ensembling approach helps mitigate the instability of individual LLM queries.
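
The paper's concrete prompts and consensus rule are not reproduced here; a minimal sketch of consensus voting over a prompt ensemble, with `query_llm` standing in for a hypothetical model call, might look like this:

```python
from collections import Counter
from typing import Callable, Optional

def ensemble_answer(
    prompts: list[str],
    query_llm: Callable[[str], str],  # hypothetical LLM wrapper
    threshold: float = 0.5,
) -> Optional[str]:
    """Query each prompt in the ensemble and return the answer that a
    sufficient fraction of prompts agree on; return None when no answer
    reaches the consensus threshold."""
    answers = [query_llm(p) for p in prompts]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / len(prompts) >= threshold else None

# Toy usage with a stub "LLM" that answers deterministically.
stub = lambda prompt: "Berlin" if "capital" in prompt else "unknown"
prompts = [
    "What is the capital of Germany?",
    "Name the capital city of Germany.",
    "Germany's capital is?",
]
print(ensemble_answer(prompts, stub, threshold=0.6))  # -> Berlin
```

In LLMClean, the prompt subset and threshold are not fixed by hand: the best-performing combination is selected on the training set and then checked on a held-out validation set, per steps 4 to 6 above.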

Implementation Considerations:

  • LLM Selection: The evaluation tested models like GPT-4, GPT-3.5, and various Llama2 configurations (7b, 13b, 70b). The choice impacts performance and computational cost.
  • Hardware: Running the larger LLMs requires a GPU with at least 40 GB of VRAM.
  • Data Specificity: The framework includes specific workflows and meta-models tailored for IoT data, acknowledging its distinct structural patterns compared to general relational data.
  • Synthetic Data Generation: When concepts or columns necessary for the context model are missing in the input data, LLMClean attempts to generate them synthetically, although this is not always possible for all dependencies.
  • External Data Dependency: Generating certain dependencies, such as Capability Dependencies for IoT sensors, relies on accessing external information sources (e.g., Wikipedia or Wikidata) based on sensor names found in the data; a lookup sketch follows this list.
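
As an illustration of that external lookup, here is a minimal sketch that queries Wikidata's public search API for a sensor name, using the `requests` library; the paper's actual extraction pipeline also involves LLMs and is more elaborate:

```python
import requests

def search_wikidata(sensor_name: str) -> list[dict]:
    """Return candidate Wikidata entities (id, label, description)
    matching a sensor name found in the dataset."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": sensor_name,
            "language": "en",
            "format": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {
            "id": hit["id"],
            "label": hit.get("label", ""),
            "description": hit.get("description", ""),
        }
        for hit in resp.json().get("search", [])
    ]

# Candidate entities for a sensor model mentioned in the data, e.g. "DHT22".
print(search_wikidata("DHT22"))
```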

Evaluation and Results:

LLMClean was evaluated on three real-world datasets: IoT, Hospital (non-IoT relational), and CONTEXT (Industry 4.0). An additional dataset, LM-KBC, was used specifically for evaluating the Prompt Ensembling technique.

  • Prompt Ensembling: Experiments on the LM-KBC dataset showed that increasing the number of few-shot examples and using Prompt Ensembling significantly improved prediction accuracy (F1 score). Larger LLMs generally performed better, but ensembling allowed smaller models (e.g., Llama2 7b with ensembling) to outperform larger models relying on their single best prompt (e.g., Llama2 13b). Non-fine-tuned Llama2 with the best ensemble achieved competitive results.
  • Error Detection: LLMClean achieved higher detection F1 scores than the baselines (HoloClean, ED2, Raha, dBoost, MVD) across the IoT, Hospital, and CONTEXT datasets. Its runtime was slightly higher than that of some baselines due to the cost of validating numerous column pairs and enforcing the generated OFDs.
  • Error Repair: When combined with external repair tools, LLMClean's detected errors led to competitive or superior repair accuracy (lower RMSE for numerical attributes, higher F1 for categorical ones; see the metric sketch after this list) compared to other detection methods, particularly when paired with advanced repair techniques such as Baran or ML-based imputation.
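
For reference, the repair metrics mentioned above can be computed as follows; a minimal sketch assuming scikit-learn and hypothetical ground-truth values (the paper's exact evaluation protocol may differ):

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Numerical repairs: RMSE between repaired and ground-truth values.
truth_num  = np.array([21.5, 47.5, 18.2])
repair_num = np.array([21.5, 46.9, 18.0])
rmse = np.sqrt(mean_squared_error(truth_num, repair_num))

# Categorical repairs: F1 between repaired labels and ground truth.
truth_cat  = ["Berlin", "Munich", "Berlin"]
repair_cat = ["Berlin", "Berlin", "Berlin"]
f1 = f1_score(truth_cat, repair_cat, average="micro")

print(f"RMSE={rmse:.3f}  F1={f1:.3f}")
```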

Discussion:

The evaluation suggests that automated context generation is effective, particularly for structured data like IoT datasets with predictable schemas and dependency types. LLMClean maintains effectiveness even on non-IoT datasets, although the complexity of generating context models for such diverse data can be higher. The prompt ensembling technique is crucial for improving the reliability of LLM outputs. While the method shows strong performance, its adaptability to complex, rapidly changing contexts depends on the scope and frequency of these changes. Reliance on external data for sensor capabilities can also be a factor.

In summary, LLMClean offers a practical, automated approach to generating context models using LLMs, enabling context-aware data cleaning through OFDs. This addresses a major bottleneck in applying advanced data cleaning techniques that rely on domain-specific context, showing promising results in terms of error detection and facilitating accurate repairs across different dataset types.