Column Type Annotation using ChatGPT (2306.00745v2)
Abstract: Column type annotation is the task of annotating the columns of a relational table with the semantic type of the values contained in each column. Column type annotation is an important pre-processing step for data search and data integration in the context of data lakes. State-of-the-art column type annotation methods either rely on matching table columns to properties of a knowledge graph or fine-tune pre-trained language models such as BERT for column type annotation. In this work, we take a different approach and explore using ChatGPT for column type annotation. We evaluate different prompt designs in zero- and few-shot settings and experiment with providing task definitions and detailed instructions to the model. We further implement a two-step table annotation pipeline which first determines the class of the entities described in the table and depending on this class asks ChatGPT to annotate columns using only the relevant subset of the overall vocabulary. Using instructions as well as the two-step pipeline, ChatGPT reaches F1 scores of over 85% in zero- and one-shot setups. To reach a similar F1 score a RoBERTa model needs to be fine-tuned with 356 examples. This comparison shows that ChatGPT is able to deliver competitive results for the column type annotation task given no or only a minimal amount of task-specific demonstrations.
Summary
- The paper investigates using ChatGPT's in-context learning capabilities for Column Type Annotation (CTA) through various prompting strategies, including zero-shot, few-shot, and a novel two-step pipeline.
- Experiments on the SATO dataset show that ChatGPT, especially with instructions and the two-step pipeline in zero/one-shot settings, achieves over 85% F1, competitive with a RoBERTa baseline requiring significantly more labeled data for fine-tuning.
- Leveraging LLMs like ChatGPT offers a strong alternative for CTA in low-resource settings, although practical implementation requires careful prompt engineering, API management, vocabulary handling, and cost control.
Overview of ChatGPT for Column Type Annotation
The task of Column Type Annotation (CTA) involves assigning semantic types from a predefined vocabulary (e.g., schema.org, DBpedia ontology) to the columns of relational tables. This is a crucial step for enhancing data discovery, integration, and semantic understanding within data lakes and knowledge graphs. Traditionally, CTA methods have employed techniques like knowledge graph property matching or fine-tuning pre-trained language models (PLMs) like BERT on task-specific labeled data. The paper "Column Type Annotation using ChatGPT" (2306.00745) investigates an alternative approach that leverages the in-context learning capabilities of LLMs, specifically ChatGPT, for CTA, thereby potentially reducing or eliminating the need for extensive fine-tuning datasets. The paper evaluates various prompting strategies and introduces a two-step pipeline to enhance annotation accuracy, particularly for tables covering diverse domains.
Methodology: Prompting Strategies and Pipeline Design
The core methodology revolves around formulating effective prompts to guide ChatGPT in performing CTA. Several prompting configurations were explored:
- Zero-Shot Learning: The model is provided only with the table data (column header and values) and the target type vocabulary, without any examples. The prompt essentially asks the model to assign the most appropriate type from the vocabulary to each column.
- Few-Shot Learning (One-Shot): In addition to the table data and vocabulary, the prompt includes a single example of a correctly annotated table. This demonstration aims to guide the model's inference process by providing a template for the desired output format and annotation logic.
- Task Definition and Instructions: The prompts were augmented with explicit definitions of the CTA task and detailed instructions on how to approach the annotation. This includes specifying the format of the input table, the expected output format, and potentially heuristics or guidelines for selecting types (e.g., consider column header semantics, value patterns, and relationships between columns); an illustrative prompt combining these elements is shown after this list.
- Two-Step Annotation Pipeline: Recognizing that the relevant semantic types for a table often depend on the primary entity type described by the table (e.g., 'Person', 'Organization', 'Product'), a two-step pipeline was implemented:
- Step 1 (Table Class Determination): ChatGPT is first prompted to identify the main entity class represented by the table, given the table data. This classification uses a predefined set of high-level entity classes.
- Step 2 (Targeted Column Annotation): Based on the determined entity class, the full type vocabulary is filtered to retain only the types relevant to that class. ChatGPT is then prompted again to perform CTA for the table columns, but using only this reduced, relevant subset of types. This aims to reduce ambiguity and constrain the model's output space, leading to more accurate annotations.
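To make these strategies concrete, the following is an illustrative zero-shot prompt combining instructions, a serialized table, and a type vocabulary. The wording and the example table are a sketch in the spirit of the strategies above, not the paper's exact prompt:

```
Instructions: You are given a table serialized as comma-separated values.
Assign to each column exactly one semantic type from the list of types below.
Answer with one line per column in the format "column index: type".

Table:
Berlin, Germany, 3645000
Paris, France, 2161000

Types: city, country, population, currency, language

Answer:
```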
The implementation involves constructing text prompts incorporating the table data (serialized, often as comma-separated values or similar plain text formats), the type vocabulary (or a subset thereof), instructions, and optionally, few-shot examples. These prompts are then submitted to the ChatGPT API.
```python
def annotate_table_pipeline(table, full_type_vocabulary, entity_classes, type_relevance_map):
    # Step 1: Determine the table's entity class.
    class_prompt = construct_class_determination_prompt(table, entity_classes)
    predicted_class = call_chatgpt_api(class_prompt)

    # Step 2: Filter the vocabulary and annotate the columns.
    relevant_types = get_relevant_types(predicted_class, type_relevance_map, full_type_vocabulary)
    annotation_prompt = construct_annotation_prompt(table, relevant_types, instructions="...")
    column_annotations = call_chatgpt_api(annotation_prompt)
    return column_annotations


def construct_class_determination_prompt(table, entity_classes):
    # Format the table data and ask the model to choose the best class from entity_classes.
    return (
        f"Table:\n{format_table(table)}\n\n"
        f"Entity Classes: {', '.join(entity_classes)}\n\n"
        "What is the main entity class described by this table? Choose one from the list."
    )


def construct_annotation_prompt(table, relevant_types, instructions):
    # Format the table data and provide the relevant types and instructions;
    # a one-shot example could optionally be appended here.
    return (
        f"Instructions: {instructions}\n\n"
        f"Table:\n{format_table(table)}\n\n"
        f"Relevant Types: {', '.join(relevant_types)}\n\n"
        "Annotate each column with the best type from the list."
    )


def get_relevant_types(predicted_class, type_relevance_map, full_type_vocabulary):
    # Look up the type subset for the predicted class; fall back to the full vocabulary.
    return type_relevance_map.get(predicted_class, full_type_vocabulary)
```
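In the sketch above, `format_table` and `call_chatgpt_api` are left abstract. Below is a minimal way to fill them in, assuming the OpenAI Python client and a table represented as a dict with `"columns"` and `"rows"` keys; the model name, row limit, and temperature are illustrative assumptions, not the paper's configuration:

```python
from openai import OpenAI  # assumes the `openai` package is installed

client = OpenAI()  # reads the OPENAI_API_KEY environment variable


def format_table(table, max_rows=5):
    # Serialize the header plus a sample of rows as comma-separated text,
    # truncating to keep the prompt within the model's context window.
    header = ", ".join(table["columns"])
    rows = [", ".join(str(value) for value in row) for row in table["rows"][:max_rows]]
    return "\n".join([header] + rows)


def call_chatgpt_api(prompt, model="gpt-3.5-turbo"):
    # Single chat-completion call; temperature 0 favors deterministic output.
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```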
Experimental Evaluation
The evaluation was primarily conducted using the SATO dataset, a common benchmark for CTA derived from public data sources and annotated with semantic types from schema.org, the DBpedia ontology, and the T2Dv2 gold standard. Performance was measured using standard classification metrics: micro F1-score and macro F1-score, calculated over the column annotations.
The primary baseline for comparison was a RoBERTa model fine-tuned specifically for the CTA task on the SATO dataset. This represents a standard supervised learning approach using PLMs. The comparison aimed to assess how ChatGPT's zero-shot and few-shot performance, potentially enhanced by instructions and the two-step pipeline, measures up against a model requiring significant labeled data for fine-tuning.
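Given one gold and one predicted type label per column, both scores can be computed directly. A minimal sketch using scikit-learn, with illustrative label values:

```python
from sklearn.metrics import f1_score

# One gold and one predicted semantic type per annotated column (illustrative values).
y_true = ["city", "country", "population", "city", "country"]
y_pred = ["city", "country", "population", "country", "country"]

micro_f1 = f1_score(y_true, y_pred, average="micro")  # aggregated over all columns
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean over types
print(f"micro F1: {micro_f1:.2f}, macro F1: {macro_f1:.2f}")
```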
Results and Performance Analysis
The experiments demonstrated that ChatGPT can achieve competitive performance on the CTA task with minimal or no task-specific training examples. Key findings include:
- Zero-Shot Performance: Even without examples, ChatGPT achieved non-trivial F1 scores, indicating inherent capabilities for understanding table semantics and mapping them to types.
- Impact of Instructions: Providing detailed instructions significantly boosted performance in the zero-shot setting. Explicit guidance on the task and expected output format proved crucial.
- Few-Shot Learning: Adding just a single example (one-shot) further improved results compared to zero-shot, particularly when combined with instructions.
- Two-Step Pipeline: The two-step approach yielded substantial improvements, especially on datasets with diverse entity types. By first identifying the table's main entity class and then using a filtered type vocabulary, the pipeline effectively reduced the complexity of the annotation task for the LLM.
- Comparison with Fine-Tuning: ChatGPT, particularly when utilizing instructions and the two-step pipeline in zero-shot and one-shot settings, achieved F1 scores exceeding 85%. Notably, reaching a similar F1 score with the RoBERTa baseline required fine-tuning on 356 labeled examples. This highlights ChatGPT's potential for CTA in low-resource scenarios where labeled data is scarce.
The results suggest that LLMs like ChatGPT possess strong capabilities for semantic table understanding and can be effectively prompted for CTA. The performance is highly sensitive to prompt design, with instructions and few-shot examples providing significant benefits. The two-step pipeline offers a practical method to handle large type vocabularies and improve accuracy by contextualizing the annotation task based on the table's primary subject.
Implementation Considerations
Deploying ChatGPT for CTA involves several practical considerations:
- Prompt Engineering: Crafting effective prompts is critical. This includes clear instructions, appropriate formatting of table data (handling large tables, special characters), selection of few-shot examples (if used), and managing the length constraints of the LLM's context window.
- API Interaction: Integration requires managing API calls and handling rate limits, latency, and the costs associated with commercial LLM APIs. Error handling and retry mechanisms are necessary for robust operation; a minimal retry sketch follows this list.
- Vocabulary Management: For large type vocabularies, passing the entire list in each prompt might be inefficient or exceed context limits. The two-step pipeline provides one mitigation strategy. Other approaches could involve embedding-based retrieval of relevant candidate types before prompting the LLM.
- Reproducibility and Consistency: LLM outputs can exhibit variability even with fixed prompts (due to sampling parameters like temperature). Achieving deterministic results might require setting temperature to zero, but this can sometimes reduce output quality or creativity. Evaluating consistency across multiple runs is advisable.
- Scalability: Processing large datasets requires efficient batching of requests or parallel API calls. The latency of API responses will influence overall throughput.
- Cost: Utilizing commercial LLM APIs incurs costs based on token usage (input prompt length + output generation length). Optimizing prompt length and minimizing unnecessary tokens is important for cost-effectiveness, especially at scale. The two-step pipeline might increase the number of API calls but potentially reduces token usage per call in the second step due to the smaller vocabulary.
- Domain Adaptation: While showing strong zero/few-shot performance, adaptation to highly specific or niche domains might still require carefully crafted prompts, domain-specific few-shot examples, or potentially fine-tuning if suitable LLMs become available.
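As one concrete example of the robustness concerns raised under API Interaction above, requests can be wrapped in a retry loop with exponential backoff and jitter. A minimal sketch; the retry count, delays, and the broad exception handling are illustrative assumptions (in practice one would catch the client's specific rate-limit and timeout errors):

```python
import random
import time


def call_with_retries(call_fn, prompt, max_retries=5, base_delay=1.0):
    # Retry transient API failures (rate limits, timeouts) with exponential backoff.
    for attempt in range(max_retries):
        try:
            return call_fn(prompt)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)  # add jitter
            time.sleep(delay)


# Usage: column_annotations = call_with_retries(call_chatgpt_api, annotation_prompt)
```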
Conclusion
The paper demonstrates that ChatGPT is a viable and effective tool for column type annotation, capable of achieving high accuracy with minimal or no task-specific training data through careful prompt engineering and strategies like the proposed two-step pipeline. It presents a compelling alternative to traditional methods requiring significant labeled data for fine-tuning, particularly advantageous in scenarios with limited annotation resources. The practical implementation hinges on effective prompt design, efficient API interaction, and potentially strategies to manage large type vocabularies.
Related Papers
- Annotating Columns with Pre-trained Language Models (2021)
- ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models (2023)
- Semantic Annotation for Tabular Data (2020)
- ColNet: Embedding the Semantics of Web Tables for Column Type Prediction (2018)
- KGLink: A column type annotation method that combines knowledge graph and pre-trained language model (2024)