
OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories (2403.07653v1)

Published 12 Mar 2024 in cs.DB

Abstract: How can we discover join relationships among columns of tabular data in a data repository? Can this be done effectively when metadata is missing? Traditional column matching works mainly rely on similarity measures based on exact value overlaps, hence missing important semantics or failing to handle noise in the data. At the same time, recent dataset discovery methods that focus on deep table representation learning do not take into consideration the rich set of column similarity signals found in prior matching and discovery methods. Finally, existing methods heavily depend on user-provided similarity thresholds, hindering their deployability in real-world settings. In this paper, we propose OmniMatch, a novel join discovery technique that detects equi-joins and fuzzy-joins between columns by combining column-pair similarity measures with Graph Neural Networks (GNNs). OmniMatch's GNN can capture column relatedness leveraging graph transitivity, significantly improving the recall of join discovery tasks. At the same time, OmniMatch also increases the precision by augmenting its training data with negative column join examples through an automated negative example generation process. Most importantly, compared to the state-of-the-art matching and discovery methods, OmniMatch exhibits up to 14% higher effectiveness in F1 score and AUC without relying on metadata or user-provided thresholds for each similarity metric.


Summary

  • The paper introduces OmniMatch, which leverages diverse similarity metrics and graph neural networks to overcome the limitations of traditional join discovery methods.
  • It employs a self-supervised training process with fabricated join pairs and multi-hop message passing to enhance precision and recall in join predictions.
  • The method’s metadata independence and robust performance make it highly applicable for enriching large, uncurated tabular data repositories in real-world settings.

OmniMatch: Advancements in Self-Supervised Join Discovery

The paper presents OmniMatch, an innovative approach to self-supervised any-join discovery in tabular data repositories. The challenge of discovering join relationships among table columns is well-recognized, particularly when traditional similarity measures and metadata are insufficient or noisy. OmniMatch addresses these issues by focusing on both equi-joins and fuzzy-joins, utilizing an enhanced set of similarity metrics combined with Graph Neural Networks (GNNs).

Innovation in Join Discovery

OmniMatch leverages a wide array of similarity metrics between column pairs, moving beyond previous methods that rely on a single static similarity measure and fixed, user-provided thresholds. By employing a GNN, the approach benefits from graph transitivity, capturing relational semantics between column pairs that might otherwise be missed: if column A is similar to B and B to C, multi-hop message passing can surface the A–C relationship even when their direct similarity is weak. This GNN-based design is key to handling data noise and irregular value representations that standard methods struggle with.
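The relational message-passing idea can be sketched in a few lines of NumPy. This is a minimal illustration of an RGCN-style layer, not the paper's implementation: the toy graph, feature dimensions, and weight matrices below are all hypothetical, with one adjacency matrix per similarity signal (edge type) and one weight matrix per relation, as in a standard RGCN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy column-similarity graph: 4 column nodes, two edge types, each with its
# own adjacency matrix (1 = the pair scored highly on that similarity signal).
adj = {
    "jaccard":   np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]], float),
    "embedding": np.array([[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 1, 0]], float),
}
h = rng.normal(size=(4, 8))                    # initial node (column) features
W = {r: rng.normal(size=(8, 8)) for r in adj}  # one weight matrix per relation
W_self = rng.normal(size=(8, 8))               # self-loop weight

def rgcn_layer(h, adj, W, W_self):
    """One RGCN-style layer: per-relation neighbor aggregation plus self-loop."""
    out = h @ W_self
    for rel, A in adj.items():
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)  # mean aggregation
        out += (A / deg) @ h @ W[rel]
    return np.maximum(out, 0.0)  # ReLU

# After one layer, node 0 only sees its direct neighbors; after two layers,
# node 0 receives a transitive signal from node 2 via node 1, even though
# there is no direct 0-2 edge in either relation.
h1 = rgcn_layer(h, adj, W, W_self)
h2 = rgcn_layer(h1, adj, W, W_self)
```

Stacking k layers propagates similarity evidence across k-hop paths, which is how graph transitivity boosts recall: related columns reinforce each other's representations even without a strong direct edge.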

Key Components and Methodology

  • Diverse Similarity Signals: OmniMatch incorporates multiple types of similarity measures, including Jaccard Similarity, Set Containment, embedding similarities using FastText, and Jensen-Shannon divergence. This comprehensive set allows for capturing both value overlaps and deeper semantic connections. Each of these metrics provides essential signals for constructing a robust similarity graph that models potential join relationships.
  • Graph Neural Networks: By representing columns and their relationships as graph nodes and edges, OmniMatch takes advantage of the GNN’s ability to aggregate neighboring data and learn from multi-hop relationships. The RGCN variant employed allows for capturing and sharing signals across different similarity metrics, enabling more nuanced join prediction through message passing.
  • Self-supervised Training Process: OmniMatch automates the generation of training data by fabricating join pairs. It creates positive examples by perturbing data values (simulating the noise and formatting variation found in real repositories) and pairs them with automatically generated negative examples, so the model is trained under realistic expectations of data variability without any manual labeling.
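The value-overlap and distributional signals above can be computed with standard definitions. The following is an illustrative sketch (not OmniMatch's exact feature code, and the FastText embedding similarity is omitted to keep it self-contained) of three of the named signals over two columns' values:

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two columns' value sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def set_containment(a, b):
    """Fraction of column a's values that also appear in column b (asymmetric)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0

def char_distribution(values):
    """Character-frequency distribution of a column, for distributional signals."""
    counts = Counter(ch for v in values for ch in str(v))
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2, in [0, 1])."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(x):
        return sum(x.get(k, 0.0) * math.log2(x.get(k, 0.0) / m[k])
                   for k in keys if x.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

col_a = ["Alice", "Bob", "Carol"]
col_b = ["alice", "Bob", "Carol", "Dave"]
sims = {
    "jaccard": jaccard(col_a, col_b),
    "containment": set_containment(col_a, col_b),
    "js_divergence": jensen_shannon(char_distribution(col_a),
                                    char_distribution(col_b)),
}
```

Note how the signals complement each other: exact-overlap measures miss the "Alice"/"alice" fuzzy match entirely, while the character-distribution divergence still reports the columns as close, which is precisely why combining multiple signals in one graph helps with fuzzy joins.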

Results and Implications

In performance evaluations against state-of-the-art methods, OmniMatch consistently demonstrates superior effectiveness—achieving up to 14% higher F1 and AUC scores than existing methods. Such significant enhancements are particularly evident in precision-recall benchmarks for datasets derived from diverse sources, proving the method's robustness and applicability in real-world settings.

The practical implications of OmniMatch are far-reaching. Its metadata independence and self-supervised nature make it applicable to extensive, uncurated repositories where metadata might be unreliable or unavailable. Additionally, the method provides organizations with a tool for enriching datasets through improved join discovery, which can lead to enhanced analytics and model training endeavors.

Conclusion and Future Prospects

OmniMatch represents a significant step forward in the landscape of data discovery and integration. Its capacity to self-supervise and adapt to multiple signals positions it as a versatile tool in data-centric organizations, whether dealing with structured databases or semi-structured data lakes. Future developments could explore the integration of OmniMatch with existing data retrieval and management systems, further augmenting its utility. Moreover, the methodology could expand to incorporate additional signals or adapt to evolving data representation technologies, maintaining relevance as data repositories grow in size and complexity.

In summary, OmniMatch introduces a robust, flexible approach to join discovery that bypasses the limitations of conventional methods, opening new pathways for efficient data integration and application.