Annotating Columns with Pre-trained Language Models (2104.01785v2)
Abstract: Inferring meta information about tables, such as column headers or relationships between columns, is an active research topic in data management as we find many tables are missing some of this information. In this paper, we study the problem of annotating table columns (i.e., predicting column types and the relationships between columns) using only information from the table itself. We develop a multi-task learning framework (called Doduo) based on pre-trained language models, which takes the entire table as input and predicts column types/relations using a single model. Experimental results show that Doduo establishes new state-of-the-art performance on two benchmarks for the column type prediction and column relation prediction tasks with up to 4.0% and 11.9% improvements, respectively. We report that Doduo can already outperform the previous state-of-the-art performance with a minimal number of tokens, only 8 tokens per column. We release a toolbox (https://github.com/megagonlabs/doduo) and confirm the effectiveness of Doduo on a real-world data science problem through a case study.
Summary
- The paper introduces Doduo, a multi-task learning framework using pre-trained language models for automatically predicting column types and relationships in tables.
- Doduo achieves state-of-the-art results on column type and relation prediction, demonstrating significant F1 score improvements with minimal cell data per column.
- The paper presents Doduo's open-source implementation, offering a practical framework for automated table column annotation to enhance data management tasks.
Doduo: Multi-Task Column Annotation with Pre-trained Language Models
The paper "Annotating Columns with Pre-trained LLMs" (2104.01785) presents Doduo, a framework for annotating table columns by predicting column types and identifying relationships between columns using pre-trained LLMs (PLMs) in a multi-task learning (MTL) setting. The primary motivation is to address the common issue of missing metadata in relational tables, which hinders effective data management and analysis. Doduo leverages the contextual understanding capabilities of PLMs by processing the entire table content to infer these annotations.
Methodology
Doduo employs an MTL architecture built upon a PLM, typically BERT or its variants. The core idea is to train a single model to perform both column type prediction and column relation prediction simultaneously, leveraging shared representations learned from the table data.
Input Representation
To process tabular data with a sequence-based PLM, the table must be serialized. Doduo adopts a column-wise serialization strategy: for each column, a fixed number k of representative tokens (cell values) is selected and concatenated, along with the column header (if available). Special tokens, such as [CLS] and [SEP], are used to structure the input sequence. For a table with N columns, the input sequence might look like:
[CLS] col_header_1 [SEP] cell_1_1 [SEP] ... [SEP] cell_1_k [SEP] [CLS] col_header_2 [SEP] ... [SEP] cell_N_k [SEP]
The [CLS] token preceding each column's serialized representation serves as the aggregate representation for that column after passing through the PLM encoder. The paper explores the impact of k, finding that even a small number of tokens per column (e.g., k=8) yields strong performance.
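As a concrete illustration, the following minimal sketch serializes a table column-wise with a HuggingFace BERT tokenizer. The serialize_table helper and its exact format are illustrative assumptions based on the description above, not the released toolbox's API:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def serialize_table(columns, k=8):
    """Column-wise serialization: each column opens with [CLS] and contributes
    up to k cell values separated by [SEP]. `columns` is a list of columns,
    each a list of cell strings. Hypothetical helper, not the toolbox API."""
    pieces = ["[CLS] " + " [SEP] ".join(cells[:k]) for cells in columns]
    text = " [SEP] ".join(pieces) + " [SEP]"
    # The sequence already carries its own special tokens, so don't let the
    # tokenizer wrap it in another [CLS]/[SEP] pair.
    return tokenizer(text, add_special_tokens=False, return_tensors="pt")

enc = serialize_table([["Tokyo", "Paris", "Berlin"],
                       ["Japan", "France", "Germany"]])
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# ['[CLS]', 'tokyo', '[SEP]', 'paris', ..., '[CLS]', 'japan', ...]
```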
Model Architecture
The Doduo architecture consists of a shared PLM encoder and two task-specific prediction heads:
- Shared Encoder: A standard PLM (e.g., BERT-base-uncased) takes the serialized table representation as input and computes contextualized embeddings for each token. The embedding corresponding to the [CLS] token for column $j$, denoted $h_j$, is used as that column's representation.
- Column Type Prediction Head: This head takes the column representation $h_j$ as input and feeds it through a linear layer followed by a softmax function to predict the probability distribution over a predefined set of column types (e.g., person name, location, date). The task is formulated as a multi-class classification problem for each column independently. The loss for this task, $L_{\text{type}}$, is the sum of cross-entropy losses over all columns.
- Column Relation Prediction Head: This head predicts the semantic relationship between a pair of columns $(i, j)$. It takes the representations $h_i$ and $h_j$ of the two columns, concatenates them as $[h_i; h_j]$, and passes the result through a linear layer followed by a softmax function, predicting the probability distribution over a set of predefined binary relations (e.g., PK-FK, Superclass-Subclass, Same-Meaning). The task is formulated as a multi-class classification problem for the relevant column pairs. The loss for this task, $L_{\text{rel}}$, is the sum of cross-entropy losses over all considered column pairs.
The total loss for the multi-task model is a weighted sum of the individual task losses:

$$L_{\text{total}} = \alpha L_{\text{type}} + (1 - \alpha) L_{\text{rel}}$$

where $\alpha$ is a hyperparameter balancing the two tasks. The entire model is trained end-to-end via backpropagation.
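The pieces above fit together in a few dozen lines. The following PyTorch sketch is a hedged reconstruction assuming HuggingFace's BertModel; pairing every column with every later column, and the head dimensions, are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class DoduoStyleModel(nn.Module):
    """Shared BERT encoder with a type head and a relation head (sketch)."""
    def __init__(self, num_types, num_relations):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size
        self.type_head = nn.Linear(hidden, num_types)         # h_j -> type logits
        self.rel_head = nn.Linear(2 * hidden, num_relations)  # [h_i; h_j] -> relation logits

    def forward(self, input_ids, attention_mask, cls_positions):
        """cls_positions: indices of the per-column [CLS] tokens (batch size 1,
        at least two columns assumed)."""
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state[0, cls_positions]           # (num_cols, hidden)
        type_logits = self.type_head(h)                       # one prediction per column
        # Pair each column with every later column (one simple pairing scheme).
        idx_i, idx_j = zip(*[(i, j) for i in range(len(h))
                             for j in range(i + 1, len(h))])
        rel_logits = self.rel_head(
            torch.cat([h[list(idx_i)], h[list(idx_j)]], dim=-1))
        return type_logits, rel_logits

# Multi-task objective: L_total = alpha * L_type + (1 - alpha) * L_rel
ce = nn.CrossEntropyLoss()

def total_loss(type_logits, type_labels, rel_logits, rel_labels, alpha=0.5):
    return alpha * ce(type_logits, type_labels) + (1 - alpha) * ce(rel_logits, rel_labels)
```

Concatenating $[h_i; h_j]$ rather than, say, averaging keeps the relation head order-sensitive, which matters for asymmetric relations such as PK-FK.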
Experimental Evaluation
Doduo was evaluated on established benchmarks for column type annotation and relation prediction.
Datasets and Metrics
- SATO: A large dataset derived from VizNet, commonly used for column type prediction. It contains tables extracted from web sources and visualizations. Performance is typically measured using macro F1-score over a set of 78 semantic types.
- TURL: A dataset designed for table understanding tasks, including column type annotation and column property annotation (which implicitly relates to relation prediction). It includes relational tables extracted from Wikipedia. Both column type prediction (using accuracy or F1) and relation prediction (often framed as predicting properties such as isPrimaryKey) are evaluated. The paper focuses on the column relation prediction task within this benchmark, evaluating against a set of binary relations. Performance is measured using macro F1-score (a short example of this metric follows the list).
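Macro F1 averages per-type F1 scores with equal weight, so rare semantic types count as much as frequent ones. A quick check with scikit-learn on toy labels (the labels are illustrative):

```python
from sklearn.metrics import f1_score

# Toy predictions over three semantic types with imbalanced support.
y_true = ["city", "city", "city", "country", "date"]
y_pred = ["city", "city", "country", "country", "date"]

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-type F1
print(f1_score(y_true, y_pred, average="weighted"))  # support-weighted, for comparison
```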
Baselines
Doduo was compared against several state-of-the-art methods existing at the time of publication, including:
- Column Type Prediction: Sherlock, a deep model over per-column statistical and embedding features, and Sato, which extends Sherlock with table-level topic features and structured prediction across columns.
- Column Relation Prediction: methods based on distributional semantics and schema matching, as well as TURL, a prior PLM-based table representation model evaluated on the same benchmark.
Implementation Details
The authors used BERT-base-uncased as the primary PLM. Key hyperparameters included the number of tokens per column (k), the batch size, learning rate, and the MTL loss weight α. Training was performed using standard optimization techniques like AdamW.
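A hedged sketch of one fine-tuning step under this setup; the learning rate, weight decay, and label-set sizes shown are common defaults or benchmark-dependent assumptions, not the paper's reported values. It reuses DoduoStyleModel and total_loss from the architecture sketch above:

```python
import torch
from torch.optim import AdamW

# 78 types matches SATO; 121 relations is an assumed size for the TURL relation set.
model = DoduoStyleModel(num_types=78, num_relations=121)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

# One illustrative step on a dummy two-column table of sequence length 12.
input_ids = torch.randint(1000, 2000, (1, 12))
attention_mask = torch.ones(1, 12, dtype=torch.long)
cls_positions = torch.tensor([0, 6])  # where each column's [CLS] token sits
type_labels = torch.tensor([3, 17])   # one type label per column
rel_labels = torch.tensor([42])       # one relation label for the single pair

optimizer.zero_grad()
type_logits, rel_logits = model(input_ids, attention_mask, cls_positions)
loss = total_loss(type_logits, type_labels, rel_logits, rel_labels, alpha=0.5)
loss.backward()
optimizer.step()
```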
Results and Analysis
The experimental results demonstrated that Doduo achieved new state-of-the-art performance on both benchmark datasets.
- Column Type Prediction (SATO): Doduo outperformed previous SOTA methods, achieving up to a 4.0% improvement in macro F1-score. This highlights the effectiveness of leveraging contextual information across the entire table via PLMs, compared to methods relying solely on individual column statistics or limited context.
- Column Relation Prediction (TURL): Doduo showed a more substantial improvement, achieving up to an 11.9% increase in macro F1-score over prior SOTA. This suggests that PLMs are particularly adept at capturing the semantic relationships between columns, a task that often requires understanding the interplay of values across different columns.
- Effect of Input Length (k): A significant finding was that Doduo could achieve strong performance even with a minimal number of tokens per column. Using just k=8 tokens per column was sufficient to surpass previous SOTA results on both tasks. This has practical implications for computational efficiency, as shorter input sequences reduce processing time and memory requirements. Increasing k further generally yielded marginal improvements, indicating a saturation point.
- Multi-Task Learning: The MTL framework proved beneficial. Training both tasks jointly allowed the model to learn richer column representations by sharing information between the type and relation prediction objectives, likely leading to better generalization compared to training separate models for each task.
Practical Implementation and Tooling
The authors released an open-source toolbox implementing the Doduo framework (https://github.com/megagonlabs/doduo), which facilitates the application of the proposed method in practical scenarios.
Implementation Considerations
- Computational Cost: Fine-tuning BERT-based models requires significant computational resources (GPUs). Inference time depends on table size (number of columns) and the chosen value of k. The finding that small k works well is advantageous for deployment.
- Scalability: Processing very wide tables (large number of columns) can lead to excessively long input sequences, potentially exceeding the maximum sequence length limits of standard PLMs (e.g., 512 tokens for BERT). Strategies like table splitting or hierarchical processing might be needed for extremely large tables; a simple chunking sketch follows this list.
- Domain Adaptation: Pre-trained models like BERT might require further fine-tuning on domain-specific tabular data to achieve optimal performance if the target domain differs significantly from the pre-training corpus (web text, Wikipedia).
- Type/Relation Schema: Doduo requires a predefined set of column types and relation types. Defining an appropriate schema is crucial for practical applications and depends on the specific use case.
- Handling Heterogeneity: Real-world tables often exhibit noise, missing values, and diverse data formats within columns. Robust preprocessing steps are necessary before applying Doduo. The selection strategy for the k tokens per column can influence robustness to outliers or sparse columns.
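On the scalability point above, one simple workaround is to greedily pack columns into groups whose serialized length fits the encoder's limit and annotate each group separately. This is an illustrative strategy, not one prescribed by the paper, and it sacrifices cross-group column context:

```python
def chunk_columns(columns, tokenizer, k=8, max_len=512):
    """Greedily pack columns into groups whose serialized length stays within
    the PLM's limit. Illustrative workaround, not from the paper."""
    chunks, current, used = [], [], 0
    for cells in columns:
        text = "[CLS] " + " [SEP] ".join(str(c) for c in cells[:k]) + " [SEP]"
        n = len(tokenizer.tokenize(text))
        if current and used + n > max_len:
            chunks.append(current)  # close the full chunk and start a new one
            current, used = [], 0
        current.append(cells)
        used += n
    if current:
        chunks.append(current)
    return chunks  # annotate each chunk separately; cross-chunk context is lost
```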
Case Study
The paper includes a case study demonstrating Doduo's application to a real-world data science problem, confirming its practical utility beyond benchmark datasets. Such a deployment typically involves integrating Doduo into a data preparation or data discovery pipeline to automatically infer missing metadata, thereby accelerating downstream tasks like data integration, cleaning, or feature engineering.
Conclusion
Doduo provides an effective framework for table column annotation by leveraging pre-trained LLMs within a multi-task learning setup. It demonstrates significant improvements over previous state-of-the-art methods for both column type prediction and column relation prediction. Key strengths include its ability to capture inter-column context and its efficiency, achieving strong results with minimal input tokens per column. The availability of an open-source implementation further enhances its practical applicability in various data management tasks.
Related Papers
- AdaTyper: Adaptive Semantic Column Type Detection (2023)
- Graph Neural Network Approach to Semantic Type Detection in Tables (2024)
- Column Type Annotation using ChatGPT (2023)
- ColNet: Embedding the Semantics of Web Tables for Column Type Prediction (2018)
- KGLink: A column type annotation method that combines knowledge graph and pre-trained language model (2024)