
Valentine: Evaluating Matching Techniques for Dataset Discovery (2010.07386v2)

Published 14 Oct 2020 in cs.DB

Abstract: Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods.

This paper introduces Valentine, an extensible, open-source experiment suite designed to evaluate schema matching techniques specifically for the needs of dataset discovery. Dataset discovery, the process of finding relevant datasets in large data repositories, heavily relies on schema matching to identify relationships like joinability or unionability between tables. Despite its importance, there has been a lack of standardized evaluation frameworks, benchmark datasets with ground truth, and readily available implementations of schema matching algorithms tailored for this context.

The core contributions of the Valentine project address this gap by:

  1. Defining Schema Matching Scenarios for Dataset Discovery: Based on a survey of dataset discovery literature, the authors formalize four key relatedness scenarios encountered in practice: Unionable, View-Unionable, Joinable, and Semantically-Joinable relations. These scenarios capture different ways tables can be related, varying in schema overlap, instance overlap, and the presence of noise or semantic differences.
  2. Developing a Principled Dataset Fabrication Process: To provide sufficient data with ground truth for evaluation, Valentine includes a module to fabricate dataset pairs from existing tables. This process involves horizontal and vertical splitting, and introducing noise in schema names and data instances, simulating realistic data heterogeneity challenges. They augment this with a small set of human-curated, real-world dataset pairs from WikiData, Magellan Data, and a proprietary source (ING Bank).
  3. Implementing and Adapting Schema Matching Algorithms: Valentine integrates implementations of six prominent schema matching algorithms (Cupid, Similarity Flooding, COMA, Distribution-based Matching, SemProp, EmbDI) and a simple Jaccard-Levenshtein baseline. Crucially, these methods are adapted to the needs of dataset discovery by producing a ranked list of potential column matches rather than just a set of 1-1 matches, which is more useful for interactive data exploration.
  4. Creating an Open-Source Experimentation Suite: Valentine provides a unified framework to execute and organize large-scale automated matching experiments, combining different methods, parameter configurations, and dataset pairs. The suite and all experimental data (datasets, ground truth, results) are made openly available.
  5. Conducting a Comprehensive Evaluation: The authors performed an extensive evaluation (~75,000 experiments) using the fabricated and curated datasets, comparing the implemented methods across the defined scenarios.
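The overall workflow of such an experiment suite can be pictured as a loop over methods and dataset pairs. The sketch below is illustrative only: it assumes each matcher exposes a match(source_df, target_df) method returning a score-ranked list of column pairs, which is a hypothetical interface rather than Valentine's actual API.

```python
import itertools
import time

import pandas as pd

# Illustrative experiment loop: run every matcher on every dataset pair and
# record the ranked matches and the runtime. The matcher interface assumed
# here (match(source_df, target_df) -> ranked list of ((src_col, tgt_col),
# score)) is hypothetical, not Valentine's actual API.

def run_experiments(matchers, dataset_pairs):
    records = []
    for (name, matcher), (pair_id, source, target) in itertools.product(
        matchers.items(), dataset_pairs
    ):
        start = time.perf_counter()
        ranked = matcher.match(source, target)  # ranked column correspondences
        records.append({
            "method": name,
            "pair": pair_id,
            "matches": ranked,
            "runtime_sec": time.perf_counter() - start,
        })
    return pd.DataFrame(records)
```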

Schema Matching in Dataset Discovery

The paper highlights that schema matching in dataset discovery differs from its traditional use (finding 1-1 correspondences for schema integration). For discovery, matching serves as a building block to identify and rank inter-dataset relationships, often requiring finding multiple potential matches for a single column. Dataset discovery methods utilize various types of matchers:

  • Attribute Overlap: Based on syntactic similarity of column names.
  • Value Overlap: Based on overlap between sets of data instances.
  • Semantic Overlap: Using external knowledge bases to link columns to semantic concepts.
  • Data Type: Filtering based on compatible data types.
  • Distribution: Comparing statistical distributions of data values.
  • Embeddings: Using word/value embeddings to capture semantic similarity.

Valentine includes methods covering these different matcher types, enabling comparison of various approaches.
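To make these matcher families concrete, the sketch below combines two of them, attribute (name) overlap and value overlap, into a naive ranked matcher in the spirit of the Jaccard-Levenshtein baseline. The equal weighting and the use of difflib as an edit-distance stand-in are illustrative assumptions, not the paper's exact formulation.

```python
from difflib import SequenceMatcher

import pandas as pd

# Naive baseline in the spirit of the Jaccard-Levenshtein matcher: score every
# source/target column pair by a mix of name similarity (edit-distance ratio)
# and value overlap (Jaccard on instance sets), and return a ranked list.

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_jaccard(s: pd.Series, t: pd.Series) -> float:
    a, b = set(s.dropna().astype(str)), set(t.dropna().astype(str))
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_column_matches(source: pd.DataFrame, target: pd.DataFrame):
    scores = []
    for sc in source.columns:
        for tc in target.columns:
            score = 0.5 * name_similarity(sc, tc) + 0.5 * value_jaccard(source[sc], target[tc])
            scores.append(((sc, tc), score))
    return sorted(scores, key=lambda x: x[1], reverse=True)
```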

Dataset Relatedness Scenarios

The four scenarios formalized in the paper are:

  • Unionable: Two tables have the same columns (possibly different names) and can be unioned. They have varying percentages of row overlap, simulating horizontal partitioning.
  • View-Unionable: Two tables share a subset of columns and can be unioned only after projecting onto those common columns. They have vertical splits and no row overlap, posing a challenge for instance-based methods.
  • Joinable: Two tables can be joined on at least one pair of columns which have overlapping instances. This simulates finding augmenting attributes for existing data. Variants have varying column overlap and high row overlap.
  • Semantically-Joinable: Similar to Joinable, but the instances in the joining columns might be noisy or semantically equivalent but syntactically different, requiring methods that capture semantic similarity beyond exact value matches (e.g., "USA" and "United States").
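A rough illustration of how such related pairs can be fabricated from a single source table is sketched below. The split ratios and the crude column-renaming "noise" are assumptions made for the example; Valentine's actual fabrication process also perturbs instance values and varies overlaps systematically.

```python
import pandas as pd

# Illustrative fabrication of related table pairs from one source table, in the
# spirit of Valentine's fabrication module. Parameters and noise model are
# assumptions for this sketch, not the suite's actual configuration.

def noisy_name(col: str) -> str:
    return col.upper() + "_1"  # simple schema noise: rename columns

def unionable_pair(df: pd.DataFrame, row_overlap: float = 0.5):
    """Horizontal split: same columns, a controlled fraction of shared rows."""
    cut = int(len(df) * (0.5 + row_overlap / 2))
    left = df.iloc[:cut]
    right = df.iloc[len(df) - cut:].rename(columns=noisy_name)
    return left, right

def joinable_pair(df: pd.DataFrame, key: str):
    """Vertical split: both halves keep the join key and all rows."""
    other = [c for c in df.columns if c != key]
    half = len(other) // 2
    left = df[[key] + other[:half]]
    right = df[[key] + other[half:]].rename(columns=noisy_name)
    return left, right
```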

Evaluation Methodology

Effectiveness is measured using Recall@ground truth: the fraction of ground-truth matches that appear among the top-k ranked results, where k is the total number of ground-truth matches. This metric reflects how well a method ranks the correct matches highly, which is crucial for user-assisted discovery. Traditional precision/recall metrics are deemed less suitable because they apply to unranked 1-1 outputs produced with a score threshold. Efficiency is measured by the average execution time per dataset pair.
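A minimal sketch of the metric, assuming the matcher's output is a score-ranked list of column pairs and the ground truth is a set of correct pairs:

```python
def recall_at_ground_truth(ranked_matches, ground_truth):
    """Recall@ground truth: fraction of ground-truth column pairs that appear
    among the top-k ranked matches, where k = |ground truth|."""
    k = len(ground_truth)
    top_k = {pair for pair, _score in ranked_matches[:k]}
    return len(top_k & set(ground_truth)) / k if k else 0.0
```

For example, with two ground-truth pairs and a ranking whose top two entries contain only one of them, the score is 0.5.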

Key Findings

The extensive evaluation led to several insights:

  • Schema-based methods: Perform well when attribute names are consistent but struggle significantly with noisy schemata or when match signals are primarily in data instances. They are generally the most efficient.
  • Instance-based methods: More robust to noisy schemata and often necessary for finding matches based on data content. However, they struggle with view-unionable data (no row overlap) and semantically-joinable data where instance values differ syntactically. They are significantly less efficient than schema-based methods.
  • Hybrid methods (SemProp, EmbDI): Show inconsistent results. SemProp underperforms, suggesting pre-trained embeddings might not be suitable for specific data domains. EmbDI's effectiveness is inconsistent, potentially due to randomness in training data generation and dependence on value overlap. They are often the least efficient, particularly EmbDI due to embedding training.
  • Parameter Sensitivity: Many methods are highly sensitive to parameter choices (especially thresholds), which can dramatically impact effectiveness, particularly on challenging datasets with noise or low overlap. Manually finding optimal parameters is difficult in practice.
  • Simple Baselines: The simple Jaccard-Levenshtein baseline demonstrates surprisingly good performance in some scenarios, highlighting the need for robust comparison against straightforward methods.
  • Human-Curated Data: Real-world datasets confirm the challenges, showing that Distribution-based matching can perform well on datasets with similar value distributions, while schema-based methods are strong when names align.

Lessons Learned

The authors conclude with several practical takeaways for practitioners and future research:

  • No Single Best Method: Different methods excel in different scenarios. A composite approach (like COMA's strategy) combining various matchers is likely necessary for robust dataset discovery systems.
  • Embeddings Need Improvement: While promising, current embedding-based methods need further research to be consistently effective, especially for specific data domains or when external knowledge is limited.
  • Parameterization is a Challenge: The need for complex, data-dependent parameter tuning hinders the practical applicability of many methods. Research should focus on "self-driving" or machine learning-based approaches that require less manual configuration.
  • Ranked Results are Key: Schema matching should be treated as a search problem, providing ranked lists of candidates for human users to explore and provide feedback. Humans-in-the-loop are essential for challenging cases.
  • Efficiency is Crucial: Instance-based methods are computationally expensive, particularly for large datasets. Future work needs to explore approximate methods or optimizations for scalability.

Valentine provides a valuable resource for researchers and practitioners by open-sourcing benchmark datasets, method implementations, and an evaluation framework, facilitating future advancements in schema matching for dataset discovery.

Authors (9)
  1. Christos Koutras
  2. George Siachamis
  3. Andra Ionescu
  4. Kyriakos Psarakis
  5. Jerry Brons
  6. Marios Fragkoulis
  7. Christoph Lofi
  8. Angela Bonifati
  9. Asterios Katsifodimos