DeepJoin: Joinable Table Discovery with Pre-trained Language Models (2212.07588v2)
Abstract: Owing to its usefulness for data enrichment in data analysis tasks, joinable table discovery has become an important operation in data lake management. Existing approaches target equi-joins, the most common way of combining tables to create a unified view, or semantic joins, which tolerate misspellings and different formats to deliver more join results. They are either exact solutions whose running time is linear in the sizes of the query column and the target table repository, or approximate solutions that lack precision. In this paper, we propose DeepJoin, a deep learning model for accurate and efficient joinable table discovery. Our solution is an embedding-based retrieval method that employs a pre-trained language model (PLM) and is designed as a single framework serving both equi-joins and semantic joins. We propose a set of contextualization options to transform column contents into a text sequence. The PLM reads the sequence and is fine-tuned to embed columns into vectors such that columns close to each other in the vector space are expected to be joinable. Since the output of the PLM has a fixed length, the subsequent search procedure is independent of the column size. With a state-of-the-art approximate nearest neighbor search algorithm, the search time is logarithmic in the repository size. To train the model, we devise techniques for preparing training data as well as for data augmentation. Experiments on real datasets demonstrate that, by training on a small subset of a corpus, DeepJoin generalizes to large datasets and its precision consistently outperforms that of other approximate solutions. DeepJoin is even more accurate than an exact solution to semantic joins when evaluated with labels from experts. Moreover, when equipped with a GPU, DeepJoin is up to two orders of magnitude faster than existing solutions.
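The abstract outlines a three-stage pipeline: serialize each column to a text sequence, embed it with a (fine-tuned) PLM into a fixed-length vector, and retrieve joinable candidates via approximate nearest neighbor search. The sketch below illustrates this flow under stated assumptions: the contextualization pattern, the `all-mpnet-base-v2` checkpoint (a generic sentence encoder standing in for the fine-tuned DeepJoin model), and the HNSW index parameters are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of a DeepJoin-style embedding-based retrieval pipeline.
# Assumptions: contextualization format, model checkpoint, and index
# parameters are placeholders, not the paper's reported settings.
import faiss                                   # ANN search (HNSW index)
from sentence_transformers import SentenceTransformer


def contextualize(table_name: str, column_name: str, cells: list[str]) -> str:
    """One possible contextualization: concatenate table/column metadata
    with the cell values into a single text sequence."""
    return f"{table_name}. {column_name}. " + ", ".join(cells)


# Embed repository columns with a pre-trained sentence encoder.
model = SentenceTransformer("all-mpnet-base-v2")

repo_columns = [
    ("countries", "capital", ["Paris", "Berlin", "Rome"]),
    ("cities", "name", ["Paris", "Lyon", "Berlin"]),
]
texts = [contextualize(t, c, v) for t, c, v in repo_columns]
embs = model.encode(texts, normalize_embeddings=True)   # fixed-length vectors

# HNSW index over inner product (cosine similarity after normalization);
# search cost grows roughly logarithmically with the repository size.
index = faiss.IndexHNSWFlat(embs.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(embs)

# Query: embed the query column the same way and retrieve top-k candidates.
query = contextualize("capitals", "city", ["Paris", "Madrid", "Rome"])
q_emb = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(q_emb, 2)
print(ids, scores)
```

Because every column is reduced to a fixed-length vector before indexing, the online search cost depends only on the index, not on how many cells each column contains, which is the property the abstract highlights.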