This paper introduces Valentine, an extensible, open-source experiment suite designed to evaluate schema matching techniques specifically for the needs of dataset discovery. Dataset discovery, the process of finding relevant datasets in large data repositories, heavily relies on schema matching to identify relationships like joinability or unionability between tables. Despite its importance, there has been a lack of standardized evaluation frameworks, benchmark datasets with ground truth, and readily available implementations of schema matching algorithms tailored for this context.
The core contributions of the Valentine project address this gap by:
- Defining Schema Matching Scenarios for Dataset Discovery: Based on a survey of dataset discovery literature, the authors formalize four key relatedness scenarios encountered in practice: Unionable, View-Unionable, Joinable, and Semantically-Joinable relations. These scenarios capture different ways tables can be related, varying in schema overlap, instance overlap, and the presence of noise or semantic differences.
- Developing a Principled Dataset Fabrication Process: To provide sufficient data with ground truth for evaluation, Valentine includes a module that fabricates dataset pairs from existing tables by splitting them horizontally and vertically and injecting noise into column names and data instances, simulating realistic data heterogeneity (a simplified sketch of this splitting process follows the list). The authors augment the fabricated data with a small set of human-curated, real-world dataset pairs from WikiData, Magellan Data, and a proprietary source (ING Bank).
- Implementing and Adapting Schema Matching Algorithms: Valentine integrates implementations of six prominent schema matching algorithms (Cupid, Similarity Flooding, COMA, Distribution-based Matching, SemProp, EmbDI) and a simple Jaccard-Levenshtein baseline. Crucially, these methods are adapted to the needs of dataset discovery by producing a ranked list of potential column matches rather than just a set of 1-1 matches, which is more useful for interactive data exploration.
- Creating an Open-Source Experimentation Suite: Valentine provides a unified framework to execute and organize large-scale automated matching experiments, combining different methods, parameter configurations, and dataset pairs. The suite and all experimental data (datasets, ground truth, results) are made openly available.
- Conducting a Comprehensive Evaluation: The authors performed an extensive evaluation (75K experiments) using the fabricated and curated datasets, comparing the implemented methods across the defined scenarios.
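To make the fabrication idea concrete, here is a minimal sketch of how dataset pairs with known ground truth could be produced from a single source table. The function names, splitting ratios, and noise model are illustrative assumptions; the actual Valentine fabricator is more elaborate.

```python
import random
import pandas as pd

def add_name_noise(name: str, rng: random.Random) -> str:
    """Perturb a column name, e.g. 'customer_id' -> 'customerid' or 'customer_idX'."""
    return name.replace("_", "") if rng.random() < 0.5 else name + rng.choice("XYZ")

def fabricate_unionable(df: pd.DataFrame, row_overlap: float = 0.3, seed: int = 0):
    """Horizontally split df into two tables that share roughly `row_overlap` of
    their rows and carry noisy column names, simulating a Unionable pair."""
    rng = random.Random(seed)
    mid = len(df) // 2
    overlap = int(len(df) * row_overlap)
    left = df.iloc[: mid + overlap].copy()
    right = df.iloc[mid:].copy()
    right.columns = [add_name_noise(c, rng) for c in right.columns]
    # Ground truth: each original column matches its noisy counterpart.
    ground_truth = list(zip(df.columns, right.columns))
    return left, right, ground_truth

def fabricate_joinable(df: pd.DataFrame, key: str, seed: int = 0):
    """Vertically split df around a shared key column, simulating a Joinable pair."""
    rng = random.Random(seed)
    other = [c for c in df.columns if c != key]
    rng.shuffle(other)
    half = len(other) // 2
    left = df[[key] + other[:half]].copy()
    right = df[[key] + other[half:]].copy()
    return left, right, [(key, key)]
```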
Schema Matching in Dataset Discovery
The paper highlights that schema matching in dataset discovery differs from its traditional use (finding 1-1 correspondences for schema integration). For discovery, matching serves as a building block for identifying and ranking inter-dataset relationships, which often requires multiple candidate matches per column rather than a single best one. Dataset discovery methods rely on several types of matchers:
- Attribute Overlap: Based on syntactic similarity of column names.
- Value Overlap: Based on overlap between sets of data instances.
- Semantic Overlap: Using external knowledge bases to link columns to semantic concepts.
- Data Type: Filtering based on compatible data types.
- Distribution: Comparing statistical distributions of data values.
- Embeddings: Using word/value embeddings to capture semantic similarity.
Valentine includes methods covering these different matcher types, enabling comparison of various approaches.
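As a concrete illustration of the first two matcher types, and of the ranked output format that Valentine's adapted methods produce, the following sketch scores every column pair with a name-based and an instance-based signal and ranks the pairs. It is a simplified stand-in, not Valentine's actual implementation.

```python
import pandas as pd

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def name_similarity(c1: str, c2: str) -> float:
    """Attribute-overlap matcher: syntactic similarity of column names."""
    d = levenshtein(c1.lower(), c2.lower())
    return 1.0 - d / max(len(c1), len(c2), 1)

def value_overlap(s1: pd.Series, s2: pd.Series) -> float:
    """Value-overlap matcher: Jaccard similarity of the two columns' value sets."""
    v1, v2 = set(s1.dropna().astype(str)), set(s2.dropna().astype(str))
    return len(v1 & v2) / len(v1 | v2) if v1 | v2 else 0.0

def ranked_matches(df1: pd.DataFrame, df2: pd.DataFrame, w: float = 0.5):
    """Score every column pair with a weighted mix of both signals and rank them,
    rather than committing to a single 1-1 assignment."""
    scores = {
        (c1, c2): w * name_similarity(c1, c2) + (1 - w) * value_overlap(df1[c1], df2[c2])
        for c1 in df1.columns for c2 in df2.columns
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```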
Dataset Relatedness Scenarios
The four scenarios formalized in the paper are:
- Unionable: Two tables have the same columns (possibly different names) and can be unioned. They have varying percentages of row overlap, simulating horizontal partitioning.
- View-Unionable: Two tables share a subset of columns and can be unioned only after projecting onto those common columns. They have vertical splits and no row overlap, posing a challenge for instance-based methods.
- Joinable: Two tables can be joined on at least one pair of columns which have overlapping instances. This simulates finding augmenting attributes for existing data. Variants have varying column overlap and high row overlap.
- Semantically-Joinable: Similar to Joinable, but the instances in the joining columns might be noisy or semantically equivalent but syntactically different, requiring methods that capture semantic similarity beyond exact value matches (e.g., "USA" and "United States").
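A toy illustration of why the Semantically-Joinable scenario defeats exact value overlap: the Jaccard similarity of {"USA"} and {"United States"} is zero, while a semantic signal such as cosine similarity over value embeddings can still link the columns. The tiny vectors below are illustrative stand-ins for real pre-trained embeddings.

```python
import math

# Illustrative stand-in embeddings; real methods use pre-trained word/value vectors.
toy_embeddings = {
    "USA":           [0.91, 0.40, 0.05],
    "United States": [0.88, 0.45, 0.10],
    "Germany":       [0.15, 0.92, 0.35],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

exact = len({"USA"} & {"United States"}) / len({"USA"} | {"United States"})
semantic = cosine(toy_embeddings["USA"], toy_embeddings["United States"])
print(f"exact value overlap: {exact:.2f}")    # 0.00 -- syntactic matchers see nothing
print(f"embedding cosine:    {semantic:.2f}") # high -- a semantic matcher still finds the link
```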
Evaluation Methodology
Effectiveness is measured using Recall at the size of the ground truth (Recall@GT), defined as the fraction of correct matches among the top-N ranked results, where N is the total number of ground-truth matches. This metric reflects how well a method ranks the correct matches highly, which is crucial for user-assisted discovery. Traditional precision/recall are deemed less suitable because they apply to unranked, threshold-based 1-1 outputs. Efficiency is measured as the average execution time per dataset pair.
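A minimal sketch of this metric, assuming the matcher output is a ranked list of ((column-from-table-1, column-from-table-2), score) pairs as in the earlier sketch:

```python
def recall_at_ground_truth(ranked_pairs, ground_truth):
    """Recall at the size of the ground truth: take the top-N ranked matches,
    where N = |ground truth|, and count how many of them are correct."""
    n = len(ground_truth)
    top_n = {pair for pair, _score in ranked_pairs[:n]}
    return len(top_n & set(ground_truth)) / n

# Example: 2 of the top-3 ranked matches are correct -> recall of ~0.67.
ranked = [(("id", "ID"), 0.95), (("name", "title"), 0.80),
          (("city", "zip"), 0.55), (("city", "town"), 0.50)]
truth = [("id", "ID"), ("name", "title"), ("city", "town")]
print(recall_at_ground_truth(ranked, truth))  # 0.666...
```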
Key Findings
The extensive evaluation led to several insights:
- Schema-based methods: Perform well when attribute names are consistent but struggle significantly with noisy schemata or when match signals are primarily in data instances. They are generally the most efficient.
- Instance-based methods: More robust to noisy schemata and often necessary for finding matches based on data content. However, they struggle with view-unionable data (no row overlap) and semantically-joinable data where instance values differ syntactically. They are significantly less efficient than schema-based methods.
- Hybrid methods (SemProp, EmbDI): Show inconsistent results. SemProp underperforms, suggesting pre-trained embeddings might not be suitable for specific data domains. EmbDI's effectiveness is inconsistent, potentially due to randomness in training data generation and dependence on value overlap. They are often the least efficient, particularly EmbDI due to embedding training.
- Parameter Sensitivity: Many methods are highly sensitive to parameter choices (especially thresholds), which can dramatically impact effectiveness, particularly on challenging datasets with noise or low overlap. Manually finding optimal parameters is difficult in practice.
- Simple Baselines: The simple Jaccard-Levenshtein baseline demonstrates surprisingly good performance in some scenarios, highlighting the need for robust comparison against straightforward methods.
- Human-Curated Data: Real-world datasets confirm the challenges, showing that Distribution-based matching can perform well on datasets with similar value distributions, while schema-based methods are strong when names align.
Lessons Learned
The authors conclude with several practical takeaways for practitioners and future research:
- No Single Best Method: Different methods excel in different scenarios. A composite approach (like COMA's strategy) combining various matchers is likely necessary for robust dataset discovery systems.
- Embeddings Need Improvement: While promising, current embedding-based methods need further research to be consistently effective, especially for specific data domains or when external knowledge is limited.
- Parameterization is a Challenge: The need for complex, data-dependent parameter tuning hinders the practical applicability of many methods. Research should focus on "self-driving" or machine learning-based approaches that require less manual configuration.
- Ranked Results are Key: Schema matching should be treated as a search problem that returns ranked lists of candidate matches for human users to explore and give feedback on. A human in the loop remains essential for challenging cases.
- Efficiency is Crucial: Instance-based methods are computationally expensive, particularly for large datasets. Future work needs to explore approximate methods or optimizations for scalability.
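As one example of the kind of approximation the last point calls for, value-overlap scores can be estimated from compact signatures instead of full value sets. MinHash is a standard technique for this; it is offered here as a hedged sketch, not something the paper prescribes.

```python
import hashlib

def minhash_signature(values, num_perm: int = 64):
    """One minimum hash per seeded hash function; the signature summarizes the value set."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.sha1(f"{seed}|{v}".encode()).hexdigest(), 16)
            for v in values
        ))
    return sig

def estimated_jaccard(sig1, sig2) -> float:
    """The fraction of matching signature positions estimates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

# Example: two overlapping value sets compared via 64-entry signatures
# instead of materializing and intersecting the full sets.
s1 = minhash_signature({"alice", "bob", "carol", "dave"})
s2 = minhash_signature({"bob", "carol", "dave", "erin"})
print(estimated_jaccard(s1, s2))  # roughly 0.6 (true Jaccard = 3/5)
```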
Valentine provides a valuable resource for researchers and practitioners by open-sourcing benchmark datasets, method implementations, and an evaluation framework, facilitating future advancements in schema matching for dataset discovery.