ReMatch: Retrieval Enhanced Schema Matching with LLMs (2403.01567v2)

Published 3 Mar 2024 in cs.DB and cs.AI

Abstract: Schema matching is a crucial task in data integration, involving the alignment of a source schema with a target schema to establish correspondence between their elements. This task is challenging due to textual and semantic heterogeneity, as well as differences in schema sizes. Although machine-learning-based solutions have been explored in numerous studies, they often suffer from low accuracy, require manual mapping of the schemas for model training, or need access to source schema data which might be unavailable due to privacy concerns. In this paper we present a novel method, named ReMatch, for matching schemas using retrieval-enhanced LLMs. Our method avoids the need for predefined mapping, any model training, or access to data in the source database. Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher. By eliminating the requirement for training data, ReMatch becomes a viable solution for real-world scenarios.

Authors (4)
  1. Eitam Sheetrit (6 papers)
  2. Menachem Brief (3 papers)
  3. Moshik Mishaeli (3 papers)
  4. Oren Elisha (8 papers)