ReMatch: Retrieval Enhanced Schema Matching with LLMs (2403.01567v2)
Abstract: Schema matching is a crucial task in data integration, involving the alignment of a source schema with a target schema to establish correspondence between their elements. This task is challenging due to textual and semantic heterogeneity, as well as differences in schema sizes. Although machine-learning-based solutions have been explored in numerous studies, they often suffer from low accuracy, require manual mapping of the schemas for model training, or need access to source schema data, which might be unavailable due to privacy concerns. In this paper, we present a novel method, named ReMatch, for matching schemas using retrieval-enhanced LLMs. Our method avoids the need for predefined mapping, any model training, or access to data in the source database. Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher. By eliminating the requirement for training data, ReMatch becomes a viable solution for real-world scenarios.
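The abstract names the general idea (retrieval-enhanced LLM matching without training or source data) but not its mechanics. The following is only a minimal sketch of that general pattern, not the ReMatch implementation: it assumes a retrieval step that narrows the target schema to a few candidates per source attribute, followed by a prompt that an LLM could be asked to rank. The bag-of-words cosine similarity stands in for whatever embedding model a real system would use, and every schema name, function name, and prompt wording here is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Split identifiers such as "birth_date" or "birthDate" into lowercase word tokens.
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z]+", spaced.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters (a stand-in for embeddings).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_candidates(source_attr, target_attrs, k=2):
    # Retrieval step (assumed): keep the k target attributes most similar to the source attribute.
    query = Counter(tokenize(source_attr))
    scored = [(cosine(query, Counter(tokenize(t))), t) for t in target_attrs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


def build_prompt(source_attr, candidates):
    # Prompt step (assumed): only schema names are used, no rows from the source database.
    lines = [f"Source attribute: {source_attr}", "Candidate target attributes:"]
    lines += [f"{i}. {c}" for i, c in enumerate(candidates, start=1)]
    lines.append("Rank the candidates from most to least likely match.")
    return "\n".join(lines)


# Hypothetical schemas, used only to exercise the sketch.
target = ["person.date_of_birth", "person.gender", "visit.start_date"]
candidates = retrieve_candidates("patient.birth_date", target, k=2)
prompt = build_prompt("patient.birth_date", candidates)
print(prompt)
```

The sketch stops at building the prompt, since the choice of LLM and how its ranking is parsed are outside what the abstract states. Restricting the prompt to a few retrieved candidates is one way to cope with the differences in schema size that the abstract mentions.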
Authors: Eitam Sheetrit, Menachem Brief, Moshik Mishaeli, Oren Elisha